Current PhD students:

Document Length-Normalization and Re-Weighting in Representation-Based Methods for Neural Information Retrieval

The main purpose of ad-hoc retrieval in Information Retrieval (IR) is to fulfil users’ information needs. In traditional retrieval models, this is usually achieved using ranking methods such as exact keyword matching. However, this frequently does not fit the users’ requirements, as several factors can influence the search. One such factor is the precision of the query terms describing the information need, or the length of the query. Basic problems, like misspelled or short queries, can be solved relatively easily; however, it becomes challenging for the retrieval system when queries are phrased in terms that differ semantically from those in the relevant documents. This often results in relevant documents being ranked lower due to keyword mismatch and limited semantic knowledge. Matching relevance and matching semantic properties are two sides of the same story: relevance is about finding exact matches to the query, whereas semantics is about the similarity between two words or sentences. More advanced language models are considered the best choice for such constraints, involving both semantic similarity and relevance matching. The current generation of language models excels at learning these semantic relationships, from words to sentences, from large amounts of unlabelled text data. Given these capabilities, neural network-based approaches such as representation-based ranking are increasingly applied to various tasks in IR. This method extensively addresses the semantic mismatch problem during retrieval. However, it is sensitive to basic IR characteristics such as the length, word occurrence, and noise in candidate documents. Together, these issues challenge the language modelling-based paradigm in IR. To address them, this research makes several contributions to representation-based ad-hoc retrieval, significantly improving existing methods while also proposing new approaches.
Firstly, in embedding settings, it proposes to normalize document length using traditional length-normalization methods to improve retrieval effectiveness. Long documents are likely to skew the notion of similarity when averaging over word vectors, resulting in a lower ranking; length-normalization prevents the unfair ranking of relevant documents of different lengths and shows significant improvement over the defined baselines. Secondly, it proposes a novel approach to re-weighting word vectors based on contextual information extracted from language models. This outperforms existing re-weighting methods such as Inverse Document Frequency (IDF) and Smooth Inverse Frequency (SIF), with an average Mean Average Precision (MAP) increase of 6.67%. In other scenarios, combined with traditional rankers, it also outperforms learning-to-rank baselines with an average increase of 2.93% in terms of Normalized Discounted Cumulative Gain (NDCG). This research further shows that the proposed re-weighting method can also benefit other language modelling tasks, such as Semantic Textual Similarity (STS).
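The two ideas above, weighted averaging of word vectors with length-normalization, can be sketched in a few lines. This is a minimal illustration, assuming a pre-built embedding table and a plain IDF weighting scheme; it is not the thesis's actual implementation or its contextual re-weighting method.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Inverse-document-frequency weight for every term in a tokenized corpus."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each term once per doc
    return {term: math.log(n / count) for term, count in df.items()}

def doc_vector(tokens, embeddings, weights):
    """Weighted average of word vectors, normalized by the summed weights
    (so long documents do not dominate) and scaled to unit length for cosine ranking."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    total = 0.0
    for t in tokens:
        if t in embeddings:
            w = weights.get(t, 0.0)
            total += w
            for i, x in enumerate(embeddings[t]):
                vec[i] += w * x
    if total > 0:
        vec = [x / total for x in vec]  # length-normalized weighted average
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]      # unit vector: document length no longer skews cosine
```

With unit-length document vectors, cosine similarity reduces to a dot product, so two relevant documents of very different lengths compete on content rather than length.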

Data science methods for drug repurposing and hypothesis expansion using literature-based discovery

In 2011, Emily Whitehead was at the Children’s Hospital of Philadelphia suffering from acute lymphoblastic leukaemia when, by chance, a member of the medical team recognized that an elevated protein blocking her defensive cells is also involved in rheumatoid arthritis (RA), and that there is an RA drug that stops production of that protein. Emily went on to fully recover, and her case became a prominent example of a serendipitous finding creating a positive outcome. Ultimately, matching a stratified disease profile with a stratified patient profile, and then aligning a targeted treatment strategy, is the goal of precision medicine. For the pharmaceutical industry, the opportunity lies in finding ways to manage costs and increase productivity in an environment where drug development is lengthy (15-20 years) and requires significant investment ($500m-$2+bn). Repurposing existing drugs for new diseases can reduce costs by a factor of 7.5, and data is a key enabler.

It is these types of situations that lead to the hypothesis that computational drug repositioning can have an influential impact: the kinds of searches previously described can be accelerated and improved by a computational approach. For example, text mining could be used to mine the world’s research and clinical literature for relevant connections between drugs and diseases, and thus empower doctors and scientists to make faster, more informed decisions. The most important problem to solve is one of scale; it is not possible for a single person, or even a team, to conduct a thorough review of thousands of relevant documents. The challenge is to process the varied document types and the distributed nature of the literature in order to combine all the available evidence for biomedical associations.

Computational approaches to drug repurposing have developed as we progress through the era of Big Data. As scientific publication rates increase, knowledge acquisition and the research development process have become more complex and time-consuming. Literature-Based Discovery (LBD), supporting automated knowledge discovery, helps facilitate this process by eliciting novel knowledge from the analysis of existing scientific literature. The traditional hypothesis generation models of Swanson and Vos involved connecting disjoint concepts from the literature to create new hypotheses, though these were produced manually. Hypothesis generation involves processing large datasets, in this case from the literature, in order to find novel connections between biomedical entities. Where a novel linkage is found, it forms the basis of a new hypothesis that can be tested in a research or clinical setting, or reviewed for relevance by subject matter experts. These novel hypotheses form the basis of a data-driven approach to drug repurposing and thus precision medicine.

This LBD PhD research aims to use a hybrid of data science and natural language processing methods to implement the ANC discovery model. The objective of the ANC model is to unify the classic hypothesis generation models of Swanson and Vos with a modern application of machine learning methods. The research aims to develop an approach for combining natural language processing-based machine learning, multiple biological entity types, and custom evaluation metrics into a methodology to predict and rank biomedical discoveries.

The dataset is a large literature corpus from the biomedical literature database PubMed. The data are pre-processed sentences containing various biomedical entity co-occurrences. The project has defined stages for data input, processing, output, and evaluation. The input stage applies classification models to biomedical entity pairs to express the strength of each relation. Once sentences are scored, the processing stage aggregates sentence data to the relation level. To produce the output, that is, a predicted set of A-C relations, weighting schemes will be developed that express pathways through the model; these are tested against a set of evaluation metrics, which make up the evaluation stage.
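Under stated assumptions (per-sentence relation scores already produced by a classifier in the input stage), the processing and output stages could be sketched as below. The mean aggregation, the threshold, and the weakest-link path score are illustrative choices, not the project's actual weighting schemes.

```python
from collections import defaultdict

def aggregate(sentence_scores):
    """Processing stage: average per-sentence scores up to the relation level."""
    totals, counts = defaultdict(float), defaultdict(int)
    for pair, score in sentence_scores:
        totals[pair] += score
        counts[pair] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}

def predict_ac(relations, threshold=0.5):
    """Output stage: rank unseen A-C candidates linked through a shared B term
    (Swanson-style discovery), scoring each path by its weaker relation."""
    neighbours = defaultdict(dict)
    for (a, b), score in relations.items():
        if score >= threshold:
            neighbours[a][b] = score
    candidates = defaultdict(float)
    for a, bs in neighbours.items():
        for b, s_ab in bs.items():
            for c, s_bc in neighbours.get(b, {}).items():
                if c != a and (a, c) not in relations:  # keep only novel links
                    candidates[(a, c)] = max(candidates[(a, c)], min(s_ab, s_bc))
    return sorted(candidates.items(), key=lambda kv: -kv[1])
```

The ranked A-C list is what a subject matter expert would then review, so the choice of path-scoring function directly shapes which hypotheses surface first.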

A Deeper Analysis of Text Misclassifications

This study presents a Python library developed to gain understanding of, and insight into, the occurrence of misclassified instances in text classification tasks. The principle is that this insight can in turn be used to reduce the number of misclassifications. The library, called py text misclass, produces a meaningful, comprehensive analysis of binary text misclassifications. It is structured as one principal module that generates the analysis, plus a supplementary, optional module that pre-processes and classifies the raw text, transforming it into the formats required by the analysis phase. In this study, we use a sample binary text data set to demonstrate the library in use and to illustrate the analysis it produces.
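The core of such an analysis, grouping binary predictions into confusion buckets while retaining the misclassified texts for inspection, can be sketched independently of the library. The function below is an illustrative stand-in, not py text misclass's actual API.

```python
def misclassification_report(texts, y_true, y_pred):
    """Bucket binary predictions into TP/FP/FN/TN, keeping the misclassified
    texts so they can be inspected for shared patterns."""
    buckets = {"tp": [], "fp": [], "fn": [], "tn": []}
    label = {(1, 1): "tp", (0, 1): "fp", (1, 0): "fn", (0, 0): "tn"}
    for text, true, pred in zip(texts, y_true, y_pred):
        buckets[label[(true, pred)]].append(text)
    summary = {k: len(v) for k, v in buckets.items()}
    summary["error_rate"] = (summary["fp"] + summary["fn"]) / len(texts)
    # Return the summary plus the raw misclassified texts for deeper analysis.
    return summary, buckets["fp"], buckets["fn"]
```

Reading the false-positive and false-negative texts side by side is often the quickest way to spot systematic causes (shared vocabulary, label noise, class imbalance) behind the error rate.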

An Analysis of Grammatical Classes from the Signs of Ireland Corpus Using Association Rule Learning

Completed PhDs:




An application of machine learning to explore relationships between factors of organisational silence and culture, with specific focus on predicting silence behaviours

Research indicates that there are many individual reasons why people do not speak up when confronted with situations that may concern them within their working environment. One of the areas that requires more focused research is the role culture plays in why a person may remain silent when such situations arise. The purpose of this study is to use data science techniques to explore the patterns in a data set that would lead a person to engage in organisational silence. The main research question the thesis asks is: is machine learning a tool that social scientists can use, with respect to organisational silence and culture, to augment the statistical analysis approaches commonly used in this domain?

This study forms part of a larger study being run by the third supervisor of this thesis. A questionnaire was developed by organisational psychologists within this group to collect data covering six traits of silence as well as cultural and individual attributes that could be used to determine if someone would engage in silence or not. This thesis explores three of those cultures to find main effects and interactions between variables that could influence silence behaviours.

Data analysis was carried out on data collected in three European countries, Italy, Germany and Poland (n=774). The analysis comprised (1) exploring the characteristics of the data and determining the validity and reliability of the questionnaire; (2) identifying a suitable classification algorithm which displayed good predictive accuracy and modelled the data well, based on eight already-confirmed hypotheses from the organisational silence literature; and (3) investigating newly discovered patterns and interactions within the data, previously undocumented in the silence literature, on how culture plays a role in predicting silence.

It was found that all the silence constructs showed good validity, with the exception of Opportunistic Silence and Disengaged Silence. Validation of the cultural dimensions was found to be poor for all constructs when aggregated to the individual level, with the exception of Humane Orientation Organisational Practices, Power Distance Organisational Practices, Humane Orientation Societal Practices and Power Distance Societal Practices. In addition, not all constructs were invariant across countries. For example, a number of constructs showed invariance across the German and Polish samples, but failed for the Italian sample.

Ten models were trained to identify predictors of a binary variable: engagement in Organisational Silence. The two most accurate models were chosen for further analysis of the main effects and interactions within the dataset, namely Random Forest (AUC = 0.655) and Conditional Inference Forests (AUC = 0.647). The models confirmed nine of the 16 known relationships, and identified three additional potential interactions within the data, previously undocumented in the silence literature, on how culture plays a role in predicting silence. For example, Climate for Authenticity was discovered to moderate the effect of both Power Distance Societal Practices and Diffident Silence in reducing the probability of someone engaging in silence.

This is the first time this instrument was validated via statistical techniques for suitability for use across cultures. Modelling the silence data using classification algorithms with Partial Dependence Plots is a novel and previously unexplored method of exploring organisational silence. In addition, the results identified new information on how culture plays a role in silence behaviours. The results also highlighted that models such as ensembles, which identify non-linear relationships without making assumptions about the data, together with visualisations depicting the interactions identified by such models, can offer new insights over and above the current toolbox of analysis techniques prevalent in social science research.
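The partial-dependence technique behind those plots is model-agnostic and can be sketched in a few lines. The toy linear model in the usage example is an illustrative assumption, not one of the thesis's trained Random Forest models.

```python
def partial_dependence(predict, X, feature, grid):
    """Model-agnostic 1-D partial dependence: for each grid value, set the chosen
    feature to that value in every row and average the model's predictions.
    Plotting grid against the returned averages gives a Partial Dependence Plot."""
    values = []
    for v in grid:
        preds = []
        for row in X:
            modified = list(row)       # copy so the original data is untouched
            modified[feature] = v      # force the feature to the grid value
            preds.append(predict(modified))
        values.append(sum(preds) / len(preds))
    return values
```

For example, with a toy model `predict = lambda row: 2 * row[0]` and data `[[0, 5], [1, 5], [2, 5]]`, the dependence on feature 0 over the grid `[0, 1, 2]` rises linearly, which the plot would show as a straight line; for an ensemble such as a Random Forest, the same routine traces out the non-linear and interaction effects discussed above.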