A.I. Deep Learning word embeddings for under-resourced languages and dialects
- Self-funded PhD students only
- Number of awards: 4
- Deadline: 01/07/2018
- Supervisor: Contact Dr Eric Atwell to discuss this project further informally.
Machine Learning language models should be learnable from any language, but the training resources and linguistic models are well-developed only for some languages. Deep Learning applied to a text corpus produces word embedding models; these are widely used in Natural Language Processing, as a way to capture the meaning of a word, phrase or short text as a vector of numbers. Deep Learning usually requires a large corpus to train the word embeddings; for under-resourced languages and dialects, we may not have enough data for standard Deep Learning. Adaptation or Transfer Learning may improve the models of lesser-resourced languages by taking into account the resources available for closely related languages (Rios and Sharoff 2016).
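To illustrate the core idea that a word's meaning can be captured as a vector of numbers, the toy sketch below builds count-based co-occurrence vectors from a tiny corpus and compares them with cosine similarity. This is a distributional illustration only, not a Deep Learning model; the corpus and function names are invented for the example.

```python
import math
from collections import defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Build simple count-based word vectors from a tokenized corpus."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    vecs = {w: [0.0] * len(vocab) for w in vocab}
    for s in sentences:
        for i, w in enumerate(s):
            # Count every word within `window` positions of w as context
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    vecs[w][index[s[j]]] += 1.0
    return vecs

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat chased a dog".split(),
]
vecs = cooccurrence_vectors(corpus)
# "cat" and "dog" appear in similar contexts, so their vectors end up close
print(cosine(vecs["cat"], vecs["dog"]))
```

Real word embeddings are learnt by neural networks over millions of tokens, but the same principle applies: words with similar contexts receive similar vectors.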
This research project will explore methods to adapt or transfer word embedding models learnt from “big languages” to handle related “small languages” (Sharoff 2018, Adams et al 2017). Transfer Learning may be informed by linguistic knowledge of the target language and mappings from source to target, formalized into Transfer Learning representations (Yang et al 2016, Ruder et al 2017, Alosaimy and Atwell 2017), which may make use of morphological or sub-word patterns (Soricut and Och 2015, Bojanowski et al 2017). Examples include Transfer Learning from Russian to Ukrainian (Babych 2017) or from Arabic to Mehri (Watson 2012), or from a standard language to an under-resourced dialect or variety, such as Classical Arabic (Alosaimy and Atwell 2017), Sudanese Arabic (Dickins 2010), New Zealand English (Vine 2017) or Scottish English (Douglas 2009).
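The sub-word approach mentioned above can be sketched in a few lines: following the character n-gram idea of Bojanowski et al (2017), a word is represented by composing the vectors of its character n-grams, so an unseen word in a low-resource dialect still receives a representation if it shares sub-words with the better-resourced training data. The function names and the toy 2-dimensional n-gram table below are invented for illustration; they are not the project's method.

```python
def char_ngrams(word, n_min=3, n_max=5):
    # Character n-grams with word-boundary markers, as in Bojanowski et al (2017)
    w = "<" + word + ">"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def subword_vector(word, ngram_vecs, dim):
    # Average the vectors of the word's known n-grams; an unseen word
    # still gets a nonzero vector if it shares sub-words with known words.
    grams = [g for g in char_ngrams(word) if g in ngram_vecs]
    if not grams:
        return [0.0] * dim
    vec = [0.0] * dim
    for g in grams:
        for k in range(dim):
            vec[k] += ngram_vecs[g][k]
    return [x / len(grams) for x in vec]

# Toy 2-dimensional n-gram table standing in for vectors learnt on a big language
ngram_vecs = {"<wr": [1.0, 0.0], "wri": [0.5, 0.5],
              "ing": [0.0, 1.0], "ng>": [0.2, 0.8]}
# "writing" itself was never seen, but its n-grams were
print(subword_vector("writing", ngram_vecs, 2))
```

In fastText-style models these n-gram vectors are themselves trained on a large corpus; the composition step is what lets them transfer to out-of-vocabulary dialect forms.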
This will enable researchers and industry to extend Deep Learning NLP and Text Analytics methods and tools to a wider range of minority languages and dialects, for tasks such as language/dialect identification (Tulkens et al. 2016), semantic tagging and parsing of texts (Bordes et al 2012), clustering or classification of texts (Wang et al 2015), learning and understanding the Quran and the Bible (Alturayeif 2017) and Lexical Computing (SketchEngine 2017).
- Adams, O. et al. (2017). Cross-Lingual Word Embeddings for Low-Resource Language Modeling.
- Alosaimy, A. and Atwell, E. (2017). Tagging Classical Arabic Text using Available Morphological Analysers and Part of Speech Taggers.
- Alturayeif, N. (2017). Text Mining and Similarity Measures of Quran and Bible.
- Babych, B. (2017). Unsupervised induction of morphological lexicon for Ukrainian. To appear in Proc CAMRL’2017
- Bojanowski, P. et al. (2017). Enriching Word Vectors with Subword Information.
- Bordes, A. et al. (2012). Joint Learning of Words and Meaning Representations for Open-Text Semantic Parsing.
- Dickins, J. (2010). Basic Sentence Structure in Sudanese Arabic.
- Douglas, F. (2009). Scottish Newspapers, Language and Identity.
- Rios, M. and Sharoff, S. (2016). Language adaptation for extending post-editing estimates for closely related languages.
- Ruder, S. et al. (2017). A Survey of Cross-lingual Word Embedding Models.
- Sharoff, S. (2018). Language adaptation experiments via cross-lingual embeddings for related languages.
- SketchEngine. (2017). Embedding Viewer.
- Soricut, R. and Och, F. (2015). Unsupervised morphology induction using word embeddings.
- Tulkens, S. et al. (2016). Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource.
- Vine, B. (2017). Archive of New Zealand English.
- Wang, P. et al. (2015). Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification.
- Watson, J. (2012). The structure of Mehri.
- Yang, Z. et al. (2016). Multi-Task Cross-Lingual Sequence Tagging from Scratch.
Applications are invited from candidates with, or expecting, a minimum of a UK upper second class honours degree (2:1), and/or a Master's degree in a relevant subject.
How to apply
Formal applications for research degree study should be made online through the university's website. Please state clearly in the research information section of your application that the PhD you wish to apply for is 'A.I. Deep Learning word embeddings for under-resourced languages and dialects', and name Dr Eric Atwell as your proposed supervisor.
If English is not your first language, you must provide evidence that you meet the University’s minimum English Language requirements.
We welcome scholarship applications from all suitably-qualified candidates, but UK black and minority ethnic (BME) researchers are currently under-represented in our Postgraduate Research community, and we would therefore particularly encourage applications from UK BME candidates. All scholarships will be awarded on the basis of merit.
If you require any further information, please contact the Graduate School Office
e: email@example.com, t: +44 (0)113 343 8000.