Document Length-Normalization and Re-Weighting in Representation-Based Methods for Neural Information Retrieval
The main purpose of ad-hoc retrieval in Information Retrieval (IR) is to fulfil users’ information needs. In traditional retrieval models, this is usually achieved through ranking methods such as exact keyword matching. However, this frequently does not fit the users’ requirements, as several factors can influence the search. One such factor is the precision of the query terms describing the information need, or the length of the query. Basic problems, such as misspelled or short queries, can be solved relatively easily; it becomes challenging for the retrieval system, however, when distinct queries express the same need in semantically different terms. This often results in relevant documents being ranked lower due to keyword mismatch and limited semantic knowledge. Matching relevance and matching semantic properties are two sides of the same story: relevance is about finding exact matches to the query, whereas semantics is about the similarity between two words or sentences. Advanced language models are well suited to constraints that involve both semantic similarity and relevance, as the current generation of language models excels at learning semantic relationships between words and sentences from large amounts of unlabelled text data. Given these capabilities, neural network-based approaches such as representation-based ranking are increasingly applied to various tasks in IR. This approach substantially mitigates the semantic mismatch problem during retrieval. However, it is sensitive to basic IR characteristics such as the length of, word occurrence in, and noise in candidate documents. Together, these issues challenge the language-modelling-based paradigm in IR. To address them, this research makes several contributions to representation-based ad-hoc retrieval, significantly improving existing methods while proposing new approaches.
Firstly, in embedding settings, it proposes normalizing document length using traditional length-normalization methods to improve retrieval effectiveness, because long documents are likely to skew the notion of similarity when word vectors are averaged, resulting in a lower ranking. Length-normalization prevents the unfair ranking of relevant documents of different lengths and shows significant improvement over the defined baselines. Secondly, it proposes a novel approach to re-weighting word vectors based on contextual information extracted from language models. This outperforms existing re-weighting methods such as Inverse Document Frequency (IDF) and Smoothed Inverse Frequency (SIF), with an average Mean Average Precision (MAP) increase of 6.67%. In other scenarios, combined with traditional rankers, it also outperforms learning-to-rank baselines with an average improvement of 2.93% in Normalized Discounted Cumulative Gain (NDCG). This research further shows that the proposed re-weighting method can also benefit other language-modelling tasks, such as Semantic Textual Similarity (STS).
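The ideas above can be sketched in a few lines. The snippet below is a minimal illustration of the IDF-weighted averaging baseline combined with L2 length-normalization of the resulting document vector; it is not the thesis’s proposed contextual re-weighting method, and the function names are illustrative.

```python
import math
import numpy as np

def idf_weights(corpus_tokens):
    """Compute inverse-document-frequency weights over a tokenised corpus."""
    n_docs = len(corpus_tokens)
    df = {}
    for doc in corpus_tokens:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {t: math.log(n_docs / c) for t, c in df.items()}

def doc_embedding(tokens, vectors, idf, dim=50):
    """IDF-weighted average of word vectors, L2-normalised so that long
    and short documents are directly comparable under cosine similarity."""
    vecs = [idf.get(t, 0.0) * vectors[t] for t in tokens if t in vectors]
    if not vecs:
        return np.zeros(dim)
    avg = np.mean(vecs, axis=0)
    norm = np.linalg.norm(avg)
    return avg / norm if norm > 0 else avg
```

Because every document vector ends up on the unit sphere, ranking by dot product is then insensitive to raw document length, which is the effect the length-normalization step is meant to achieve.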
Data science methods for drug repurposing and hypothesis expansion using literature-based discovery
In 2011, Emily Whitehead was at the Children’s Hospital of Philadelphia suffering from acute lymphoblastic leukaemia when, by chance, a member of the medical team recognized that an elevated protein blocking defensive cells is also involved in rheumatoid arthritis (RA), and that an RA drug stops production of that protein. Emily went on to fully recover, and her case became a prominent example of serendipity creating a positive outcome. Ultimately, the goal of precision medicine is to match a stratified disease profile with a stratified patient profile and then align a targeted treatment strategy. For the pharmaceutical industry, the opportunity lies in finding ways to manage costs and increase productivity in an environment where drug development is lengthy (15-20 years) and requires significant investment ($500m-$2+bn). Repurposing existing drugs for new diseases can reduce costs by a factor of 7.5, and data is a key enabler.
It is situations like these that motivate the hypothesis that computational drug repositioning can have an influential impact: the types of searches previously described can be accelerated and improved by a computational approach. For example, text mining could be used to mine the world’s research and clinical literature for relevant connections between drugs and diseases, and thus empower doctors and scientists to make faster, more informed decisions. The most important problem to solve is one of scale; it is not possible for a single person, or even a team, to conduct a thorough review of thousands of relevant documents. The challenge is to process the different document types and the distributed nature of the literature in order to combine all the available evidence for biomedical associations.
Computational approaches to drug repurposing have developed as we progress through the era of Big Data. As scientific publication rates increase, knowledge acquisition and the research development process have become more complex and time-consuming. Literature-Based Discovery (LBD), which supports automated knowledge discovery, helps facilitate this process by eliciting novel knowledge from analysis of the existing scientific literature. The traditional hypothesis generation models of Swanson and Vos involved connecting disjoint concepts from the literature to create new hypotheses, though these were produced manually. Hypothesis generation involves processing large datasets, in this case from the literature, to find novel connections between biomedical entities. Where a novel linkage is found, it forms the basis of a new hypothesis that can be tested in a research or clinical setting, or reviewed for relevance by subject matter experts. These novel hypotheses form the basis of a data-driven approach to drug repurposing and thus precision medicine.
This LBD PhD research aims to use a hybrid of data science and natural language processing methods to implement the ANC discovery model. The objective of the ANC model is to unify the classic hypothesis generation models of Swanson and Vos, with a modern application of machine learning methods. The research aims to develop an approach for combining natural language processing-based machine learning, multiple biological entity types, and custom evaluation metrics into a methodology to predict and rank biomedical discoveries.
The dataset is a large literature corpus from the biomedical research publisher PubMed. The data are pre-processed sentences containing various biomedical entity co-occurrences. The project has defined stages for data input, processing, output, and evaluation. The input stage applies classification models to biomedical entity pairs to express the strength of each relation. Once sentences are scored, the processing stage aggregates the sentence-level data to the relation level. To produce the output, that is, a predicted set of A-C relations, weighting schemes will be developed that express pathways through the model; these are tested against a set of evaluation metrics, which make up the evaluation stage.
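The output stage can be illustrated with a small sketch of the classic open-discovery ABC pattern that the Swanson-style models are built on: given relation-level scores for A-B and B-C pairs, candidate A-C hypotheses are ranked by aggregating over the shared B terms. The weighting scheme here (sum of score products) and all names are purely illustrative assumptions, not the project’s actual weighting schemes.

```python
from collections import defaultdict

def predict_ac_relations(ab_scores, bc_scores):
    """Rank candidate A-C hypotheses from scored A-B and B-C relations.
    ab_scores / bc_scores map (entity, entity) pairs to relation strengths."""
    by_b = defaultdict(list)
    for (b, c), s in bc_scores.items():
        by_b[b].append((c, s))
    ac = defaultdict(float)
    for (a, b), s_ab in ab_scores.items():
        for c, s_bc in by_b.get(b, []):
            if c != a:  # skip trivial self-links
                ac[(a, c)] += s_ab * s_bc  # illustrative pathway weighting
    return sorted(ac.items(), key=lambda kv: kv[1], reverse=True)

# Toy example with hypothetical entities:
ranked = predict_ac_relations(
    {("drugX", "proteinP"): 0.9, ("drugX", "proteinQ"): 0.4},
    {("proteinP", "diseaseD"): 0.8, ("proteinQ", "diseaseD"): 0.5},
)
```

In this toy case the two B pathways (proteinP, proteinQ) both contribute evidence to the single hypothesis (drugX, diseaseD), which is exactly the kind of aggregated, rankable output the evaluation stage would then assess.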
A Deeper Analysis of Text Misclassifications
This study presents a Python library developed to gain understanding of, and insight into, the occurrence of misclassified instances in text classification tasks. The principle is that this insight can in turn be used to reduce the number of misclassifications. The library, called py text misclass, produces a comprehensive analysis of binary text misclassifications. It is structured as one principal module that generates the analysis and a supplementary, optional module that pre-processes and classifies the raw text, transforming it into the formats required by the analysis phase. In this study, we use a sample binary text data set to demonstrate the library in use and to illustrate the analysis it produces.
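As a minimal sketch of the kind of analysis such a tool performs (this is not the py text misclass API; the function and field names are assumptions), a binary misclassification report can split documents into false positives and false negatives and surface the most frequent terms in each group:

```python
from collections import Counter

def misclassification_report(texts, y_true, y_pred):
    """Split binary predictions into false positives / false negatives
    and report the most frequent terms in each error group."""
    fp, fn = [], []
    for text, t, p in zip(texts, y_true, y_pred):
        if p == 1 and t == 0:
            fp.append(text)   # predicted positive, actually negative
        elif p == 0 and t == 1:
            fn.append(text)   # predicted negative, actually positive

    def top_terms(docs, k=5):
        counts = Counter(w for d in docs for w in d.lower().split())
        return [w for w, _ in counts.most_common(k)]

    return {"false_positives": fp, "false_negatives": fn,
            "fp_terms": top_terms(fp), "fn_terms": top_terms(fn)}
```

Inspecting the dominant terms in each error group is one simple way to turn raw misclassification counts into actionable insight, for example by revealing vocabulary that systematically confuses the classifier.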
An Analysis of Grammatical Classes from the Signs of Ireland Corpus Using Association Rules Learning