Computer Science and UMIACS Associate Professor Lise Getoor is tapping into the abundance of data-rich networks, using computers to analyze the relational information they contain and to work out how it can be applied. For example, biologists can infer a protein's cellular function by examining its relationships in a protein-to-protein interaction network. Marketers can predict whether someone will buy a product based on whether their friends have bought it. And political scientists can infer a person's views by analyzing membership in online groups.
The biggest challenge: Data sets are frequently jumbled, redundant and filled with noise that must be filtered. In 2000, Getoor’s doctoral research launched a new area of artificial intelligence called “statistical relational machine learning,” which combines statistical approaches with relational machine learning strategies to make sense of messy data sets. Today, her research is funded by a range of government agencies, including NSF and the Defense Advanced Research Projects Agency, and by technology giants Google, Microsoft and Yahoo!
Another common problem: inconsistencies in data sets. “How do you figure out whether two similar references refer to the same entity?” Getoor asks. For example, in bibliographic information, do J. Smith, Jonathan Smith and John Smith all refer to the same person? Getoor has developed “entity resolution” strategies that tackle this problem by examining relational information. If J. Smith and Jonathan Smith have several co-authors in common, they are more likely to be the same entity. Getoor and her students have developed new algorithms that use relational and other contextual information to improve the accuracy of entity resolution. With fellow researchers at the university’s Human-Computer Interaction Laboratory, Getoor developed D-Dupe, a tool for eliminating data duplication that is available as open-source software.
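The co-author intuition can be illustrated with a minimal Python sketch. It is not Getoor's algorithm or the D-Dupe implementation; the function names, the co-author lists, the two-word name format and the 0.5 threshold are all assumptions made up for illustration. Two references are treated as a likely match only if their names are compatible and their co-author sets overlap strongly, with overlap measured as Jaccard similarity.

```python
def name_compatible(a, b):
    """True if surnames match and first names share the same initial.

    Assumes every reference has exactly two parts: first name (or initial) and surname.
    """
    first_a, last_a = a.replace(".", "").split()
    first_b, last_b = b.replace(".", "").split()
    return (last_a.lower() == last_b.lower()
            and first_a[0].lower() == first_b[0].lower())


def coauthor_overlap(a, b, coauthors):
    """Jaccard similarity of the two references' co-author sets."""
    set_a = coauthors.get(a, set())
    set_b = coauthors.get(b, set())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def likely_same(a, b, coauthors, threshold=0.5):
    """Hypothetical decision rule: compatible names plus strong co-author overlap."""
    return name_compatible(a, b) and coauthor_overlap(a, b, coauthors) >= threshold


# Invented bibliographic data, for illustration only.
coauthors = {
    "J. Smith": {"A. Lee", "M. Chen", "R. Patel"},
    "Jonathan Smith": {"A. Lee", "M. Chen", "R. Patel", "K. Ono"},
    "John Smith": {"T. Brown"},
}

print(likely_same("J. Smith", "Jonathan Smith", coauthors))  # shares 3 of 4 co-authors
print(likely_same("J. Smith", "John Smith", coauthors))      # shares no co-authors
```

In this toy data, the name alone cannot separate the three references, since all are compatible with "J. Smith"; the shared co-authors are what distinguish Jonathan Smith from John Smith.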
Among Getoor’s crowning achievements is a data-cleaning approach called graph identification that combines three techniques: 1) entity resolution, which weeds out duplicate information; 2) collective classification, in which nodes are identified and labeled based on their relationships with other nodes in the network; and 3) link prediction, in which the model predicts relationships between nodes. “This is the first time that these strategies have been integrated,” says Getoor. “The result is an improved model that ensures that you have more accurate information.” Getoor plans to apply her algorithms to specific areas, including personalized medicine, in which extensive data sets can be used to tailor medical treatment to each patient’s individual characteristics.
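To make the second and third techniques concrete, here is a rough Python sketch using simple textbook baselines rather than Getoor's integrated graph identification model, which the article does not describe in detail. Collective classification is approximated by repeated majority votes among a node's labeled neighbors, and link prediction by counting common neighbors. The toy graph, the labels and the `min_common` threshold are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def collective_classify(graph, seed_labels, rounds=5):
    """Label each unlabeled node by majority vote of its labeled neighbors, repeated."""
    labels = dict(seed_labels)
    for _ in range(rounds):
        updated = dict(labels)
        for node in sorted(graph):
            if node in seed_labels:
                continue  # known labels are never overwritten
            votes = Counter(labels[n] for n in graph[node] if n in labels)
            if votes:
                # Sorting first makes tie-breaking deterministic (alphabetical).
                updated[node] = max(sorted(votes), key=votes.get)
        if updated == labels:
            break  # labels have stopped changing
        labels = updated
    return labels


def predict_links(graph, min_common=2):
    """Suggest a link between unconnected nodes that share enough neighbors."""
    return [(u, v) for u, v in combinations(sorted(graph), 2)
            if v not in graph[u] and len(graph[u] & graph[v]) >= min_common]


# Invented undirected network: each node maps to the set of its neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d", "f", "g"},
    "f": {"e", "g"},
    "g": {"e", "f"},
}
seeds = {"a": "biology", "f": "physics", "g": "physics"}

labels = collective_classify(graph, seeds)
links = predict_links(graph)
print(labels)
print(links)
```

Here the two steps run independently of each other. The point of graph identification, as described above, is that such steps are integrated, so that a resolved duplicate or a predicted link can inform how neighboring nodes are labeled.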
Writer: Beth Panitz