There is increasing interest in the problem of grounding language in vision, in particular video, i.e. building a system that can determine which objects, object properties, or activities the words or phrases of a language correspond to.

This project will build on existing work at Leeds on this problem, which shows how simple textual descriptions can be grounded in loosely aligned short video clips, and how a grammar corresponding to the textual input can be learned. The project will aim to extend that work in several directions. One such direction is to enable longer video clips and longer textual descriptions to be handled; another is to allow more complex objects and object descriptions, as well as more complex activity descriptions. Further possible avenues of exploration are the incorporation of existing world knowledge and extension to modalities other than vision, e.g. touch or sound.

