Entity Resolution: Past, Present and Yet-to-Come: From Structured to Heterogeneous, to Crowd-sourced, to Deep Learned
Entity Resolution (ER) lies at the core of data integration, with the bulk of research focusing on its effectiveness and time efficiency. Most past works were crafted to address Veracity over structured (relational) data. They typically rely on schema, expert and external knowledge to maximize accuracy. Some of these methods have recently been extended to process large volumes of data through massive parallelization techniques, such as the MapReduce paradigm. With the advent of Big Web Data, the scope has moved towards Variety, aiming to handle semi-structured data collections with noisy and highly heterogeneous information. The relevant works adopt a novel, loosely schema-aware functionality that emphasizes scalability and robustness to noise. Another line of current research focuses on Velocity, i.e., processing data collections of continuously increasing volume.
In this tutorial, we present the ER generations by discussing past, present, and yet-to-come mechanisms. For each generation, we outline the corresponding ER workflow along with the state-of-the-art methods per workflow step. Thus, we provide the participants with a deep understanding of the broad field of ER, highlighting the recent advances in crowd-sourcing and deep learning applications in this active research domain. We also equip them with practical skills in applying ER workflows through a hands-on session that involves our publicly available ER toolbox and data.
1. Introduction and motivation
2. The four generations of Entity Resolution
   a. The 1st ER Generation: Tackling the Veracity of structured data
   b. The 2nd ER Generation: Tackling the Volume and Veracity of structured data
   c. The 3rd ER Generation: Tackling the Variety, Volume and Veracity of (semi-)structured data
   d. The 4th ER Generation: Tackling the Velocity, Variety, Volume and Veracity of (semi-)structured data
3. Entity Resolution Revisited: Leveraging External Knowledge
   a. Deep Learning for Entity Resolution
   b. Crowd-sourced Entity Resolution
      b.1. Generating HITs
      b.2. Formulating HITs
      b.3. Balancing accuracy and monetary cost
      b.4. Restricting the labour cost
4. Hands-on Session: ER tools
   o The JedAI Open Source Toolkit
5. Challenges and Final Remarks
George Papadakis is an internal auditor of information systems and a research fellow at the University of Athens. He has also worked at NCSR "Demokritos", the National Technical University of Athens (NTUA), the L3S Research Center and the "Athena" Research Center. He holds a PhD in Computer Science from the University of Hanover and a Diploma in Computer Engineering from NTUA. His research focuses on web data mining.
Ekaterini Ioannou is an Assistant Professor at Tilburg University, the Netherlands. Previously, she worked as an Assistant Professor at Eindhoven University of Technology, a Lecturer at the Open University of Cyprus, an adjunct faculty member at EPFL in Switzerland, a research collaborator at the Technical University of Crete, and an Independent Expert for the European Commission. Her research focuses on information integration with an emphasis on the challenges of managing data with uncertainties, heterogeneity or correlations, and, more recently, on achieving a deeper integration of information extraction tasks within databases, and on efficiently retrieving analytics over graphs/hypergraphs with evolving data.
Themis Palpanas is a Senior Member of the French University Institute (IUF) and Professor of computer science at the University of Paris (France), where he is the director of diNo, the data management group. He is the author of nine US patents, three of which have been implemented in world-leading commercial data management products. He is the recipient of three Best Paper awards and the IBM Shared University Research (SUR) Award. He serves as Editor in Chief for the BDR journal, Associate Editor for PVLDB 2019 and the TKDE journal, and Editorial Advisory Board member for the IS journal.