
Arabella Jane Sinclair

Lecturer, Computer Science

University of Aberdeen

BSc Honours Thesis Topics

If you are a current 4th year Honours student at the University of Aberdeen and are interested in one of these projects, or have ideas for a project related to my research (see publications and past supervision projects), please feel free to contact me via email or Blackboard.

Interpreting Transformer Attention During Next Utterance Dialogue Generation from Human Contexts

Evaluation, Interpretability, Dialogue, Language Generation, Alignment

Transformer language models are increasingly proficient at generating convincingly human-like language, even in dialogue. Humans have been found to adapt and align their language to one another in dialogue, so a convincing model of dialogue should do the same. This project aims to explore which features of the dialogue context provided to the model most influence its generation of a proposed next utterance, and to compare this with the influence the context has on its comprehension of the human utterance that actually came next in the true dialogue data. From this comparison, and building on work measuring the degree of vocabulary overlap between human vs. machine 'speakers', we can analyse and evaluate the degree to which model generation behaviour is human-like, and where it diverges, through the lens of linguistic alignment. Interpretability techniques will allow a more fine-grained analysis: exploring to what extent the model attends to different speakers in the preceding dialogue context, and answering questions about between- vs. within-speaker alignment differences between humans and language models.

Expectations: The student will have strong programming skills, a good working knowledge of Python, and a willingness to learn and work with transformer language models such as GPT-2, T5, or DialoGPT. This project will require the use of multiple libraries for analysis, and potentially the Macleod server for some experiments, depending on the ability/ambition of the student.
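As a starting point, attention weights over a dialogue context can be read directly from the model outputs. Below is a minimal sketch using Hugging Face's `transformers` with GPT-2; the toy two-speaker exchange and the averaging over layers and heads are illustrative assumptions, not a fixed design:

```python
# A minimal sketch: extract attention over a dialogue context with GPT-2.
# The toy dialogue and the averaging over layers/heads are illustrative choices.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_attentions=True)
model.eval()

# Invented two-speaker context; the project would use Switchboard or MapTask turns.
context = "A: how was your weekend? B: pretty good, we went hiking. A: oh nice, where?"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer.
attentions = torch.stack(outputs.attentions)        # (layers, batch, heads, seq, seq)

# Average over layers and heads: how strongly does the final position (where the
# next utterance would be generated) attend to each token of the context?
avg_attention = attentions.mean(dim=(0, 2))[0, -1]  # (seq,)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, weight in zip(tokens, avg_attention.tolist()):
    print(f"{tok:>12}  {weight:.4f}")
```

From per-token weights like these, one could aggregate attention by speaker to start comparing between- vs. within-speaker attention.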

Datasets
  • Conversational open-domain dialogue: Switchboard
  • Task-based closed-domain dialogue: MapTask
Models
References

Exploring Bilingual vs. Language Learner Code Switching Patterns in Dialogue

Evaluation, Analysis, Multilingual Models, Code switching, Language Learning, Clustering, Alignment, Dialogue

Code switching, the choice of a speaker to change language during a single interaction, is typical of bilingual speakers and can be useful to both teachers and learners in second language (L2) tuition. Understanding and predicting appropriate use of code switching in this setting has the potential to greatly enhance current computer-aided language learning technologies, as well as provide an evaluation of existing multilingual representations. This project will make use of the multilingual representations afforded by models such as BLOOM, using them to explore and compare the code switching behaviour of a) native Spanish/Catalan learners of English, and b) bilingual speakers of Spanish and English, in a dialogue setting. Initially, some exploratory data analysis of the two corpora will be carried out, followed by clustering of utterance embeddings extracted from BLOOM based on cosine similarity (to measure semantic similarity) for both languages in the dataset. From this, the project aims to answer open research questions such as: a) how semantically similar are the two languages used in these dialogues? b) does this vary with learner ability level? c) do clusters represent different uses of the second language in these learner dialogues, and does this differ from its purpose in bilingual dialogue? This work will build on existing work (in press; it will be shared in confidence with the student working on this project) examining the role of code switching and alignment in learner dialogue, applying modern techniques to the analysis and uses of code switching in second language learner dialogue. Depending on the ability and ambition of the student, this project can also attempt to predict the incidence of code switching in an L2 setting.
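As a sketch of the clustering step, utterance embeddings can be mean-pooled from BLOOM's hidden states and clustered under cosine geometry. The model size (bloom-560m) and the mixed Spanish/English utterances below are illustrative assumptions; the project would use the real learner and bilingual corpora:

```python
# A minimal sketch: mean-pool BLOOM hidden states into utterance embeddings,
# then cluster them under cosine geometry (unit-normalised k-means).
# bloom-560m and the invented utterances are illustrative stand-ins.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModel.from_pretrained("bigscience/bloom-560m")
model.eval()

# Invented mixed Spanish/English utterances; the project would use real corpora.
utterances = [
    "I think it is, como se dice, the library?",
    "Sí, the library, you go straight and then turn left.",
    "We had dinner together anoche, it was really fun.",
    "Can you repeat that more slowly, por favor?",
]

def embed(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq, dim)
    return hidden.mean(dim=1).squeeze(0)            # mean-pool over tokens

embeddings = torch.stack([embed(u) for u in utterances]).numpy()
embeddings = normalize(embeddings)  # unit length, so Euclidean k-means ~ cosine

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for utterance, label in zip(utterances, labels):
    print(label, utterance)
```

Cluster labels produced this way could then be inspected against learner ability level or the function each switch serves in the dialogue.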

Datasets
Models
References

Towards Evaluating Creativity in Language

Evaluation, Linguistic features, Prediction, Creativity, Time series

Humans use language creatively in many settings and for many purposes: for humour, for creating written works of art such as poetry, creative writing, or theatre, or even in their everyday language use. Some language is created with the intent to convey factual information; in this setting, use of abstraction, metaphor, or ambiguous language is undesirable, which potentially limits the degree to which we categorise such language as creative. This project aims to explore what linguistic markers can robustly characterise language that humans find creative and appealing, using a variety of measures drawn from psycholinguistics and evaluating them on texts selected both for their creativity and for their relative lack thereof. We can then use these markers to evaluate and compare the extent to which machine-generated texts display these creative traits, independent of their success in displaying other desirable properties of generation such as fluency or coherence. We aim to address the hypothesis that creativity in language is independent of fluency, and that it is a quantity that varies both with context and with individual human opinion. We will explore how these features vary across texts of different genres, using a feature-based and a neural classifier to score texts for creativity. We will then (depending on the ability & ambitions of the student) design and conduct a small human evaluation of language snippets from human- and model-generated texts, to investigate to what extent the properties discovered correlate with human judgements of creative language use.
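To make the feature-based route concrete, here is a minimal sketch of scoring texts with hand-crafted features and a logistic regression classifier. The three features (type-token ratio, mean word length, long-word rate) are crude stand-ins for the psycholinguistic measures the project would actually use, and the labelled snippets are invented:

```python
# A minimal sketch of the feature-based classifier. The three features are
# crude stand-ins for proper psycholinguistic measures, and the labelled
# snippets (1 = creative, 0 = not) are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(text):
    words = text.lower().split()
    ttr = len(set(words)) / len(words)                       # lexical diversity
    mean_len = sum(len(w) for w in words) / len(words)       # mean word length
    long_rate = sum(len(w) > 7 for w in words) / len(words)  # crude rarity proxy
    return [ttr, mean_len, long_rate]

texts = [
    ("the moon unbuttoned its silver coat over the sleeping harbour", 1),
    ("the meeting is scheduled for three pm on tuesday in room four", 0),
    ("her laughter spilled like marbles down the marble stairs", 1),
    ("please submit the completed form to the front office by friday", 0),
]

X = np.array([features(t) for t, _ in texts])
y = np.array([label for _, label in texts])

clf = LogisticRegression().fit(X, y)
# predict_proba gives a graded creativity score rather than a hard label
print(clf.predict_proba(X)[:, 1])
```

Using the predicted probability rather than a hard label fits the project's framing of creativity as a graded, context-dependent quantity.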

Datasets
Models
  • TBD: will depend on approach taken
References

Evaluating Multilingual Semantic Representations of Colour and Emotion

Evaluation, Multilingual, Psycholinguistics, Semantic Representation

Multilingual models capture semantics about words in different languages, granting them increased semantic capabilities over lower-resource languages. However, semantic associations for words can vary across languages and the cultures that speak them; for example, colour and emotion associations have been shown to vary by culture. Do multilingual language models capture the same language-specific, and thus culturally specific, associations? This project will use as a case study a dataset in English containing human data on emotion-word associations, and evaluate the degree of difference between a model's semantic representations of these words and their corresponding emotions across selected languages. We will then compare this to human data on colour-emotion relationships in Japanese, as well as colour-emotion relationships (gathered in English) for Indian speakers of English. We thus have three distinct cultural populations across which to compare representations. Understanding cultural variation in the semantic differences between words can be very important for downstream applications such as sentiment classification in reviews, or simply for evaluating texts in terms of coherence.
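As a sketch of the comparison, colour and emotion words in different languages can be embedded with a multilingual model and compared via cosine similarity. The choice of `xlm-roberta-base` as the multilingual model and the hand-picked colour-emotion word pairs are assumptions for illustration; the real evaluation would use the human association datasets described above:

```python
# A minimal sketch: embed colour and emotion words with a multilingual model
# and compare them by cosine similarity. xlm-roberta-base and the word pairs
# are illustrative assumptions, not the project's fixed model or stimuli.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def embed(word):
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq, dim)
    return hidden.mean(dim=1).squeeze(0)            # mean-pool over tokens

# Illustrative colour-emotion pairs in English and Japanese.
pairs = [
    ("red", "anger"),
    ("赤", "怒り"),      # red / anger
    ("blue", "sadness"),
    ("青", "悲しみ"),    # blue / sadness
]

cos = torch.nn.CosineSimilarity(dim=0)
for colour, emotion in pairs:
    print(f"{colour} ~ {emotion}: {cos(embed(colour), embed(emotion)).item():.3f}")
```

Comparing such similarity scores across the English, Japanese, and Indian-English populations against the human association data would quantify how far the model tracks culture-specific associations.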

Datasets
Models
References