A.I. Deep Learning word embeddings for under-resourced languages and dialects

Project description

Machine Learning language models should be learnable from any language, but the training resources and linguistic models are well-developed only for some languages. Deep Learning applied to a text corpus produces word embedding models; these are widely used in Natural Language Processing, as a way to capture the meaning of a word, phrase or short text as a vector of numbers. Deep Learning usually requires a large corpus to train the word embeddings; for under-resourced languages and dialects, we may not have enough data for standard Deep Learning. Adaptation or Transfer Learning may improve the models of lesser-resourced languages by taking into account the resources available for closely related languages (Rios and Sharoff 2016).

This research project will explore methods to adapt or transfer word embedding models learnt from “big languages” to handle related “small languages” (Sharoff 2018, Adams et al 2017). Transfer Learning may be informed by linguistic knowledge of the target language and mappings from source to target, formalized into Transfer Learning representations (Yang et al 2016, Ruder et al 2017, Alosaimy and Atwell 2017), which may make use of morphological or sub-word patterns (Soricut and Och 2015, Bojanowski et al 2017). Examples include Transfer Learning from Russian to Ukrainian (Babych 2017) or from Arabic to Mehri (Watson 2012), and from a standard language to an under-resourced dialect, such as Classical Arabic (Alosaimy and Atwell 2017), Sudanese Arabic (Dickins 2010), New Zealand English (Vine 2017) or Scottish English (Douglas 2009).
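As a rough illustration of why sub-word patterns can help with related languages, the sketch below extracts character n-grams from a word and measures how many are shared between two words. It is not the project's method: the function names `char_ngrams` and `ngram_overlap`, the n-gram range of 3 to 5, and the Jaccard overlap measure are all assumptions chosen for this example, loosely following the character n-gram idea used in sub-word embedding models.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Return the set of character n-grams of a word.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from word-internal sequences.
    """
    marked = f"<{word}>"
    return {
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    }


def ngram_overlap(word_a, word_b, n_min=3, n_max=5):
    """Jaccard overlap between the n-gram sets of two words (0.0 to 1.0)."""
    a = char_ngrams(word_a, n_min, n_max)
    b = char_ngrams(word_b, n_min, n_max)
    union = a | b
    if not union:
        # Both words are too short to yield any n-grams.
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Russian "работа" and Ukrainian "робота" (both "work") differ by one
    # letter, so they share some, but not all, of their character n-grams.
    print(ngram_overlap("работа", "робота"))
```

Shared n-grams of this kind are one possible route by which information learnt for a well-resourced language could carry over to a closely related one, since a model that represents words through their sub-word units can reuse those units across the two vocabularies.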

This research will enable researchers and industry to extend Deep Learning NLP and Text Analytics methods and tools to a wider range of minority languages and dialects, for tasks such as language/dialect identification (Tulkens et al 2016), semantic tagging and parsing of texts (Bordes et al 2012), clustering or classification of texts (Wang et al 2015), learning and understanding the Quran and the Bible (Alturayeif 2017) and Lexical Computing (SketchEngine 2017).


Entry requirements

Applications are invited from candidates with or expecting a minimum of a UK upper second class honours degree (2:1), and/or a Master's degree in a relevant subject.

How to apply

Formal applications for research degree study should be made online through the university's website. In the research information section of your application, please state clearly that the PhD you wish to apply for is 'A.I. Deep Learning word embeddings for under-resourced languages and dialects', and name Dr Eric Atwell as your proposed supervisor.

If English is not your first language, you must provide evidence that you meet the University’s minimum English Language requirements.

We welcome scholarship applications from all suitably qualified candidates, but UK black and minority ethnic (BME) researchers are currently under-represented in our Postgraduate Research community, and we would therefore particularly encourage applications from UK BME candidates. All scholarships will be awarded on the basis of merit.

If you require any further information, please contact the Graduate School Office
e: phd@engineering.leeds.ac.uk, t: +44 (0)113 343 8000.