Automatically reading biomedical papers

A stunning amount of work has been done on biochemical reactions: literally millions of experiments costing billions of dollars. These experiments produce little pieces of much larger puzzles, such as how cancer develops and how drugs interact with the pathways to cancer. Unfortunately, this knowledge is not in a unified database that you could do statistical inference over. Instead, it is spread out over millions of journal articles that use inconsistent technical language and require domain-specific knowledge to read.

The Reach project (Reach on GitHub) is a system that uses hand-written rules to extract entities and events from text automatically. The rules are written in Odin, a rule-based language for information extraction. Odin is part of Processors, an open-source library of Natural Language Processing tools on GitHub. Because the rules are hand-written, it is easy to interpret how you got a certain result, and relatively easy to debug. You can explore the results of our work or try out your own sentences at the Reach project site.

This is an annotation by the Reach Visualizer of a sentence from a real biomedical paper, displayed using brat annotation format.

It shows a single reaction, a binding, resulting in a complex of Raf and Ras. (Binding is one of the many kinds of biochemical reactions. It's when two proteins stick together to make a complex protein.) By building lots of little links like this one, and linking those in turn (again, using text as our guide), we will make much larger directed graphs that are closer to what we want.
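As a toy illustration of how small extracted links can accumulate into a directed graph, here is a minimal Python sketch. It is not Odin syntax and not one of Reach's actual rules: the regular expression, the function names `extract_bindings` and `build_graph`, the `A/B` naming of complexes, and the example sentence are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical surface pattern for "<A> binds to/with <B>".
# Real Reach rules are written in Odin; this regex only mimics the idea.
BINDING_PATTERN = re.compile(
    r"\b([A-Z][A-Za-z0-9]*) binds (?:to|with) ([A-Z][A-Za-z0-9]*)\b"
)


def extract_bindings(sentence):
    """Return one (participant, participant) pair per matched binding."""
    return [(m.group(1), m.group(2)) for m in BINDING_PATTERN.finditer(sentence)]


def build_graph(sentences):
    """Add a directed edge from each participant to the complex it forms."""
    graph = defaultdict(set)
    for sentence in sentences:
        for a, b in extract_bindings(sentence):
            complex_name = f"{a}/{b}"  # illustrative naming convention
            graph[a].add(complex_name)
            graph[b].add(complex_name)
    return dict(graph)


graph = build_graph(["Raf binds to Ras at the membrane."])
```

Because each edge comes from a single readable pattern, it stays easy to trace any edge back to the rule and sentence that produced it, which is the interpretability argument made above.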

One of the many aspects of biomedical text that differentiate it from “normal” text is the way it uses coreference relations. Coreference resolution is linking different mentions in a text (like she and the Speaker of the House in a sentence about Nancy Pelosi) as referring to the same real-world referent, something that we as readers do constantly, but that's hard to do automatically. Some of the cues that open-domain coreference resolution uses to great effect, such as gender and animacy, are less important here, since almost all pronouns and other deictic expressions in biomedical texts are third-person inanimate (it, they, its, etc.) in English. On the other hand, we get a lot of additional information constraining how we link our mentions that isn't available in the open domain, like what kinds of molecules, and how many, can interact in a given reaction. In order to improve our automatic reading as much as possible, my colleagues and I adapted an open-domain coreference resolution system to take advantage of this information (Bell, Hahn-Powell, Valenzuela-Escárcega, & Surdeanu, 2016). Like Reach, this system is rule-based and easily interpreted and modified.
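To make the idea of domain constraints concrete, here is a minimal Python sketch of one way such a filter could work: candidate antecedents are kept only if they agree in number with the anaphor and have an entity type the reaction allows, and the nearest survivor wins. This is a simplification for illustration, not the published system. The `Mention` class, the `resolve` function, the type labels, and the nearest-candidate heuristic are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    position: int      # token offset in the document
    entity_type: str   # illustrative labels such as "Protein" or "Chemical"
    plural: bool


def resolve(anaphor_plural, anaphor_position, allowed_types, mentions):
    """Pick the nearest preceding mention that satisfies the constraints.

    allowed_types stands in for what the reaction permits as a participant.
    Returns None when no candidate survives the filters.
    """
    candidates = [
        m for m in mentions
        if m.position < anaphor_position
        and m.plural == anaphor_plural
        and m.entity_type in allowed_types
    ]
    return max(candidates, key=lambda m: m.position, default=None)


mentions = [
    Mention("Ras", 0, "Protein", False),
    Mention("GTP", 3, "Chemical", False),
]
# "it" at token 6 is the participant of a reaction that only takes proteins,
# so the closer mention "GTP" is ruled out by its type.
antecedent = resolve(False, 6, {"Protein"}, mentions)
```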

The language of food on Twitter

Social media sites are treasure troves of information about their users, updated by the minute. The sites themselves constantly leverage this information to market to users more effectively, but the same data can be used for a wide variety of applications, including public health. Colleagues at the University of Arizona and I use the language of food in a project called Twitter for Food to predict demographic data including obesity rate, rate of diabetes, political party, and geographical location (city and state).

Age-corrected obesity rates in US adults according to CDC data, with diabetes prevalence also from the CDC divided over US Census data (1, 2, 3, 4). Here, overweight means a Body Mass Index (BMI) ≥ 25.0, and obese ≥ 30.0. Although Body Mass Index is not a direct measure of fatness or of healthiness, it is very simple to collect, and does correlate with risk for health problems.
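For reference, the BMI thresholds above can be written out as a small Python sketch. BMI is weight in kilograms divided by height in meters squared. The function names and category labels are illustrative assumptions. Since an obese BMI also meets the overweight threshold, the function returns the more specific label.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms over height in meters squared."""
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Return the most specific category for a BMI value."""
    if value >= 30.0:
        return "obese"
    if value >= 25.0:
        return "overweight"
    return "not overweight"


example = bmi_category(bmi(80.0, 1.75))  # 80 kg at 1.75 m is a BMI of about 26.1
```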

Obesity and related health problems such as Type II Diabetes are epidemic in the United States. An estimated 86 million Americans over the age of 20 exhibit signs of pre-diabetes (2014 National Diabetes Statistics Report), and as many as 70% of these pre-diabetic individuals will eventually develop Type II Diabetes (Nathan et al., 2007), a chronic and debilitating disease associated with heart disease, stroke, blindness, kidney failure, and amputations. However, previous work (Ashrafian et al., 2014) has demonstrated that intervention by social media has modest but significant success in decreasing obesity, so what is lacking is an efficient way to identify obese individuals.

In one study (Fried, Surdeanu, Kobourov, Hingle, & Bell, 2014), we found that we could classify a whole state as being more or less overweight than the median with 80% accuracy (chance would be 51%, because we included the District of Columbia) based on the words in its tweets. But when we attempted this classification on individuals using a classifier trained on state-level data, the result was at chance, meaning that individuals are not strongly represented by or representative of their state on average. Since we want to classify individuals, this is a serious problem. So we developed a system (Bell, Fried, Huangfu, Surdeanu, & Kobourov, 2016) that takes a classifier trained on the state-level data and makes a 20-questions quiz out of it. The idea was to have a fun game people could play, and to get some more training data if they were willing to share their Twitter usernames with us as well as their BMI. That would help us build a training set for hopefully more accurate classification of individuals. In fact, hundreds of people did take the quiz, although the accuracy of the quiz was not good relative to chance. We're also pursuing other methods of building such a training set and making a more sophisticated classifier for individuals.
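The 51% chance figure follows from having an odd number of regions: with 50 states plus the District of Columbia, a split at the median leaves 26 regions on one side, so always guessing the larger class is right 26 times out of 51. The Python sketch below works through that arithmetic. The made-up rates, the function names, and the choice to label a region positive only when it is strictly above the median are assumptions for illustration, not details from the study.

```python
from statistics import median


def label_against_median(rates):
    """Map each region to True if its rate is strictly above the median."""
    middle = median(rates.values())
    return {region: rate > middle for region, rate in rates.items()}


def majority_baseline(labels):
    """Accuracy of always guessing the more common label."""
    positives = sum(labels.values())
    return max(positives, len(labels) - positives) / len(labels)


# 51 made-up, distinct rates standing in for 50 states plus DC.
fake_rates = {f"region_{i}": 20.0 + 0.1 * i for i in range(51)}
baseline = majority_baseline(label_against_median(fake_rates))
```

With these inputs, 25 regions fall above the median and 26 do not, so `baseline` is 26/51, or about 0.51.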