Big Questions in Science: 7) How can artificial intelligence decipher complex data?

It's been said that you can learn a lot about someone by the company they keep.

Computer Science and UMIACS Associate Professor Lise Getoor is tapping into the abundance of data-rich networks, using computers to analyze the relational information they contain and to work out how it can be applied. For example, biologists can infer a protein's cellular function by examining its relationships in a protein-to-protein interaction network. Marketers can predict whether someone will buy a product based on whether their friends have bought it. And political scientists can infer a person's views by analyzing membership in online groups.

The biggest challenge: Data sets are frequently jumbled, redundant and filled with noise that must be filtered. In 2000, Getoor’s doctoral research launched a new area of artificial intelligence called “statistical relational machine learning,” which combines statistical approaches with relational machine learning strategies to make sense out of messy data sets. Today, her research is funded by a range of government agencies, including NSF and the Defense Advanced Research Projects Agency, and technology giants Google, Microsoft and Yahoo!

Another common problem: inconsistencies in data sets. “How do you figure out whether two similar references refer to the same entity?” Getoor poses. For example, in bibliographic information, do J. Smith, Jonathan Smith and John Smith all refer to the same person? Getoor has developed “entity resolution” strategies that tackle this problem by examining relational information. If J. Smith and Jonathan Smith have several co-authors in common, they are more likely to be the same entity. Getoor and her students have developed new algorithms that make use of relational information and other contextual information to improve the accuracy of entity resolution. With fellow researchers at the university’s Human-Computer Interaction Laboratory, Getoor developed D-Dupe, a tool for eliminating data duplication that is available as open-source software.
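To make the co-author idea concrete, here is a minimal sketch in Python. It is not Getoor's algorithm or the D-Dupe implementation. It assumes a simple rule: two references are candidates for the same person if their names are compatible and their co-author sets overlap enough, measured by Jaccard similarity. The co-author names, the function names and the 0.3 threshold are all hypothetical, chosen only for illustration.

```python
def names_compatible(a: str, b: str) -> bool:
    """Same surname, and one first name is a prefix of the other
    (so "J." can match "Jonathan", but "John" does not match "Jonathan")."""
    first_a, last_a = a.replace(".", "").lower().split(maxsplit=1)
    first_b, last_b = b.replace(".", "").lower().split(maxsplit=1)
    return last_a == last_b and (
        first_a.startswith(first_b) or first_b.startswith(first_a)
    )


def coauthor_similarity(coauthors_a: set, coauthors_b: set) -> float:
    """Jaccard similarity: shared co-authors divided by all co-authors."""
    union = coauthors_a | coauthors_b
    return len(coauthors_a & coauthors_b) / len(union) if union else 0.0


def likely_same_entity(ref_a, ref_b, coauthors, threshold=0.3) -> bool:
    """Assumed decision rule: compatible names AND enough shared co-authors."""
    return names_compatible(ref_a, ref_b) and (
        coauthor_similarity(coauthors[ref_a], coauthors[ref_b]) >= threshold
    )


# Hypothetical bibliographic data: each reference mapped to its co-authors.
coauthors = {
    "J. Smith": {"A. Lee", "M. Garcia", "R. Patel"},
    "Jonathan Smith": {"A. Lee", "M. Garcia", "S. Kim"},
    "John Smith": {"T. Brown", "L. Chen"},
}

for a, b in [("J. Smith", "Jonathan Smith"), ("J. Smith", "John Smith")]:
    print(a, "/", b, "->", likely_same_entity(a, b, coauthors))
```

In this toy data, J. Smith and Jonathan Smith share two of four distinct co-authors and are flagged as a likely match, while J. Smith and John Smith have compatible names but no co-authors in common and are not. A real system would weigh many more signals than this single threshold.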


Among Getoor’s crowning achievements is a data-cleaning approach called graph identification that combines three techniques: 1) entity resolution, which weeds out duplicate information; 2) collective classification, in which nodes are identified and labeled based on their relationships with other nodes in the network; and 3) link prediction, in which the model predicts relationships between data. “This is the first time that these strategies have been integrated,” says Getoor. “The result is an improved model that ensures that you have more accurate information.” Getoor plans to apply her algorithms to specific areas, including personalized medicine, in which extensive data sets can be used to tailor medical treatment to each patient’s individual characteristics.
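The second and third techniques can be illustrated with a small Python sketch on a toy network. This is not graph identification itself, which integrates the techniques in one model. Here the two steps run separately, using two simple stand-ins: a majority vote over labeled neighbors for collective classification, and a count of shared neighbors for link prediction. The people, labels and function names are hypothetical.

```python
from collections import Counter


def build_graph(edges):
    """Undirected graph as a dict mapping each node to its set of neighbors."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def collective_classify(graph, seed_labels, rounds=10):
    """Label unlabeled nodes by repeated majority vote over labeled neighbors.
    Seed labels are treated as known and never changed."""
    labels = dict(seed_labels)
    for _ in range(rounds):
        updates = {}
        for node in sorted(graph):
            if node in seed_labels:
                continue
            votes = Counter(
                labels[n] for n in sorted(graph[node]) if n in labels
            )
            if votes:
                updates[node] = votes.most_common(1)[0][0]
        # Stop once a full pass changes nothing.
        if all(labels.get(n) == lab for n, lab in updates.items()):
            break
        labels.update(updates)
    return labels


def predict_links(graph, min_common=2):
    """Suggest edges between unconnected nodes that share enough neighbors."""
    nodes = sorted(graph)
    suggestions = []
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v not in graph[u]:
                common = len(graph[u] & graph[v])
                if common >= min_common:
                    suggestions.append((u, v, common))
    return suggestions


# Hypothetical network: two separate clusters of acquaintances.
graph = build_graph([
    ("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("bob", "dan"),
    ("cat", "dan"), ("eve", "fay"), ("eve", "gus"), ("fay", "gus"),
    ("gus", "hal"),
])
labels = collective_classify(graph, {"ann": "biology", "eve": "physics"})
links = predict_links(graph)
print(labels)
print(links)
```

Starting from two known labels, the vote spreads "biology" through one cluster and "physics" through the other, and the link step suggests a missing tie between ann and dan, who share two neighbors.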

Writer: Beth Panitz

About the College of Computer, Mathematical, and Natural Sciences

The College of Computer, Mathematical, and Natural Sciences at the University of Maryland educates more than 8,000 future scientific leaders in its undergraduate and graduate programs each year. The college's 10 departments and six interdisciplinary research centers foster scientific discovery with annual sponsored research funding exceeding $250 million.