On generating benchmark data for entity matching

Ekaterini Ioannou, Nataliya Rassadko, Yannis Velegrakis

Research output: Contribution to journal; Article; Scientific; peer-review

Abstract

Entity matching has been a fundamental task in every major integration and data cleaning effort. It aims at identifying whether two different pieces of information are referring to the same real world object. It can also form the basis of entity search by finding the entities in a repository that best match a user specification. Despite the many different entity matching techniques that have been developed over time, there is still no widely accepted benchmark for evaluating and comparing them. This paper introduces EMBench, a principled system for the evaluation of entity matching systems. In contrast to existing similar efforts, EMBench offers a unique test case generation approach that combines different levels of types, complexity, and scales, allowing a complete and accurate evaluation of the different aspects of a matching system. After presenting the basic principles of EMBench and its functionality, a comprehensive evaluation is performed on some existing matching systems that showcases its discriminative power in highlighting their capabilities and limitations. EMBench has all the characteristics of a benchmark and can serve as a standard evaluation methodology provided that it gains popularity and wide acceptance.
Original language: English
Pages (from-to): 37-56
Journal: Journal on Data Semantics
Volume: 2
Issue number: 1
DOI: 10.1007/s13740-012-0015-8
Publication status: Published - 1 Mar 2013
Externally published: Yes

Keywords

  • data integration
  • matching benchmark
  • entity matching

Cite this

Ioannou, Ekaterini; Rassadko, Nataliya; Velegrakis, Yannis. On generating benchmark data for entity matching. In: Journal on Data Semantics. 2013; Vol. 2, No. 1, pp. 37-56.
@article{8a927f5c15e44d72a5aaf6190d7f9bde,
title = "On generating benchmark data for entity matching",
keywords = "data integration, matching benchmark, entity matching",
author = "Ekaterini Ioannou and Nataliya Rassadko and Yannis Velegrakis",
year = "2013",
month = "3",
day = "1",
doi = "10.1007/s13740-012-0015-8",
language = "English",
volume = "2",
pages = "37--56",
journal = "Journal on Data Semantics",
issn = "1861-2032",
number = "1",

}
