Colloquium: Human-in-the-loop Data Integration
Title: Human-in-the-loop Data Integration
Speaker: Guoliang Li, Associate Professor, Department of Computer Science, Tsinghua University, Beijing, China
Location: Northeastern University, 440 Huntington Avenue, West Village H, 3rd Floor, Room #366, Boston, Massachusetts 02115
Data integration aims to combine data from different sources and provide users with a unified view. However, data integration cannot be completely addressed by purely automated methods. In this talk, I present a hybrid human-machine data integration framework that harnesses human ability to address this problem, focusing especially on entity matching. The framework first uses rule-based algorithms to identify possible matching pairs and then uses the crowd to refine these candidate pairs in order to compute the actual matching pairs. In the first step, I introduce similarity-based rules and knowledge-based rules to obtain candidate matching pairs, and develop effective algorithms to learn these rules from given positive and negative examples. I also introduce our distributed in-memory system DIMA, which applies these rules efficiently. In the second step, I present a selection-inference-refine framework that uses the crowd to verify the candidate pairs: it first selects some “beneficial” tasks to ask the crowd and then uses transitivity and partial order to infer the answers of unasked tasks from the crowdsourcing results of the asked tasks. I introduce our crowd-powered database system CDB, which allows users to process crowd-based queries in a SQL-like language. Lastly, I discuss emerging challenges in human-in-the-loop data integration.
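As a rough illustration of the two-step pipeline described in the abstract, the sketch below generates candidate pairs with a single similarity rule and then verifies them, asking the crowd only when transitivity cannot supply the answer. It is a minimal sketch under stated assumptions, not the speaker's algorithms and not the DIMA or CDB interfaces: the token-level Jaccard rule, the function names (`jaccard`, `candidate_pairs`, `infer`, `resolve`), and the `ask_crowd` callback are all hypothetical, tasks are processed in the order given rather than selected by benefit, and partial-order inference is omitted.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed similarity rule)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def candidate_pairs(records, threshold):
    """Step 1: keep index pairs whose similarity reaches the threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(records[i], records[j]) >= threshold
    ]


class UnionFind:
    """Tracks clusters of records already known to match."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)


def infer(uf, non_matches, i, j):
    """Return True/False if transitivity decides the pair, else None."""
    ri, rj = uf.find(i), uf.find(j)
    if ri == rj:
        return True  # i = k and k = j imply i = j
    for x, y in non_matches:
        if {uf.find(x), uf.find(y)} == {ri, rj}:
            return False  # i = x, x != y, y = j imply i != j
    return None


def resolve(n_records, candidates, ask_crowd):
    """Step 2: verify candidates, asking the crowd only when needed."""
    uf = UnionFind(n_records)
    non_matches = []
    answers, asked = {}, 0
    for i, j in candidates:
        answer = infer(uf, non_matches, i, j)
        if answer is None:
            answer = ask_crowd(i, j)  # hypothetical crowd call
            asked += 1
            if answer:
                uf.union(i, j)
            else:
                non_matches.append((i, j))
        answers[(i, j)] = answer
    return answers, asked
```

In this sketch the order of the candidate list determines how many questions are saved; choosing that order well is what a task-selection step would be responsible for.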
About the Speaker
Guoliang Li is an Associate Professor in the Department of Computer Science, Tsinghua University, Beijing, China. His research interests include crowdsourced data management, big spatio-temporal data analytics, and large-scale data cleaning and integration. He has published more than 100 papers in premier conferences and journals, such as SIGMOD, VLDB, ICDE, SIGKDD, SIGIR, TODS, VLDB Journal, and TKDE. He served as a PC co-chair of WAIM 2014, WebDB 2014, and NDBC 2016. He serves as an associate editor for IEEE Transactions on Knowledge and Data Engineering, VLDB Journal, IEEE Data Engineering Bulletin, and Big Data Research. He has regularly served as a PC member of many premier conferences, such as SIGMOD, VLDB, KDD, ICDE, WWW, IJCAI, and AAAI. His papers have been cited more than 4500 times. He received the VLDB Early Research Contribution Award 2017, the IEEE TCDE Early Career Award 2017, the National Youth Talent Support Program 2017, ChangJiang Young Scholar 2016, the NSFC Excellent Young Scholars Award 2014, and CCF Young Scientist 2014.
Friday, January 12 at 12:00pm