Efficient semantic-aware detection of near duplicate resources

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > Scientific > peer-review

Abstract

Efficiently detecting near duplicate resources is an important task when integrating information from various sources and applications. Once detected, near duplicate resources can be grouped together, merged, or removed, in order to avoid repetition and redundancy, and to increase the diversity in the information provided to the user. In this paper, we introduce an approach for efficient semantic-aware near duplicate detection, by combining an indexing scheme for similarity search with the RDF representations of the resources. We provide a probabilistic analysis for the correctness of the suggested approach, which allows applications to configure it for satisfying their specific quality requirements. Our experimental evaluation on the RDF descriptions of real-world news articles from various news agencies demonstrates the efficiency and effectiveness of our approach.
Original language: English
Title of host publication: The Semantic Web: Research and Applications
Subtitle of host publication: 7th Extended Semantic Web Conference 2010, Proceedings part II
Place of Publication: Berlin
Publisher: Springer
Pages: 136-150
Number of pages: 15
ISBN (Print): 978-3-642-13488-3
Publication status: Published - 2010
Externally published: Yes
Event: 7th Extended Semantic Web Conference - Heraklion, Greece
Duration: 30 May 2010 to 2 Jun 2010

Publication series

Name: Lecture Notes in Computer Science
Volume: 6089

Conference

Conference: 7th Extended Semantic Web Conference
Abbreviated title: ESWC 2010
Country/Territory: Greece
City: Heraklion
Period: 30/05/10 to 2/06/10
