Can we learn a model for an object category from textual descriptions?


The aim of this research is to investigate the use of textual descriptions for learning visual object category recognition. More specifically, given a textual description describing the appearance of an object category, can we learn a model for this object category without using any training images? The work contributes primarily to the recognition of fine-grained object categories, such as animal and plant species, where it may be difficult to collect many images for training, but where textual descriptions are readily available, for example from online nature guides.


The prime motivation for using textual descriptions for learning is to address the problem of collecting large training datasets. Conventional approaches to machine learning require many example images, and manually labelling such datasets is an onerous task. Textual descriptions could therefore reduce the need for manual annotation. Another motivation is that information learnt from text can be shared across categories, allowing recognition of previously unseen object categories. For example, a 'white spot' detector can be learnt independently of any particular butterfly category, and subsequently used to recognise a new butterfly without requiring any new training images for that butterfly.

The research also provides opportunities for industrial exploitation; for example, a "mobile field guide" application (take a picture of a butterfly with your phone and be told its name and related information) could be derived from this work.


Our proposed framework comprises three components:

  1. natural language processing to build object category models from textual descriptions
  2. visual processing to extract visual attributes from test images
  3. a generative model to connect textual terms from textual descriptions and visual attributes from images

Figure: Our proposed framework, consisting of three components: natural language processing, visual processing, and a generative model.
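The three components above can be illustrated with a minimal sketch. All names and data here are hypothetical, not the authors' implementation: textual "models" are bags of attribute terms parsed from a description, the visual component is assumed to supply a bag of detected attributes for a test image, and a simple smoothed bag-of-attributes likelihood (naive-Bayes style) connects the two.

```python
# Illustrative sketch of the three-component framework.
# The attribute vocabulary, parser, and scoring rule are all assumptions
# for demonstration, not the method described in the paper.
import math
import re
from collections import Counter

# A toy attribute vocabulary (assumed; a real system would be far richer).
VOCAB = {"orange", "black", "white", "blue", "spot", "spots", "band", "bands"}

def parse_attributes(description: str) -> Counter:
    """(1) NLP component: extract attribute terms from a textual description."""
    tokens = re.findall(r"[a-z]+", description.lower())
    return Counter(t for t in tokens if t in VOCAB)

# (2) Visual component: assumed to output a Counter of attributes detected
# in a test image (simulated directly in the usage example below).

def score(category_model: Counter, detected: Counter) -> float:
    """(3) Generative connection: log-likelihood of the detected attributes
    under a category model, with add-one smoothing over the vocabulary."""
    total = sum(category_model.values()) + len(VOCAB)
    return sum(
        n * math.log((category_model[attr] + 1) / total)
        for attr, n in detected.items()
    )

def classify(descriptions: dict, detected: Counter) -> str:
    """Build a model per category from its description; pick the best-scoring one."""
    models = {cat: parse_attributes(text) for cat, text in descriptions.items()}
    return max(models, key=lambda cat: score(models[cat], detected))
```

For instance, an image yielding the detected attributes `{orange, black, spots}` would be assigned to a category described as "orange wings with black bands and white spots" rather than one described as "blue wings with white bands". No images of either category are needed to build the models.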


The Leeds Butterfly Dataset used in this work is available here. The dataset contains the butterfly images, ground truth segmentations for each image, and textual descriptions for each butterfly category.

Related Publications

  • Josiah Wang, Katja Markert, and Mark Everingham. "Learning Models for Object Recognition from Natural Language Descriptions". In Proceedings of the 20th British Machine Vision Conference (BMVC2009), September 2009.
     @inproceedings{wang2009learning,
       title = "Learning Models for Object Recognition from Natural Language Descriptions",
       author = "Josiah Wang and Katja Markert and Mark Everingham",
       booktitle = "Proceedings of the British Machine Vision Conference",
       year = "2009"
     }