We present TISC, a language-independent and context-sensitive spelling checking and correction system designed to facilitate the automatic removal of non-word spelling errors in large corpora. Its lexicon is derived from a very large corpus of raw text, without supervision, and contains word unigrams and word bigrams. The lexicon is stored in a novel representation based on a purpose-built hashing function, which provides a fast and computationally tractable way of checking whether a particular word form likely constitutes a spelling error and of retrieving correction candidates. When there is insufficient information for an unambiguous decision on a single correction, the system uses input context and lexicon evidence to automatically propose a limited number of ranked correction candidates. We describe the implemented prototype and evaluate it on English and Dutch text containing real-world errors in more or less limited contexts. The results are compared with those of the isolated-word spelling checking programs Ispell and the Microsoft Proofing Tools (MPT).
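The abstract does not specify the hashing function, so the following is only a minimal sketch of the general idea, not the TISC implementation. It assumes an order-independent numeric key (the sum of each character's code point raised to the fifth power), so that candidates whose character bag differs from the input by at most one character can be looked up by simple key arithmetic. The names `anagram_key`, `Lexicon`, `candidates`, and `rank`, the exponent, and the toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict

POWER = 5  # assumed exponent; the abstract does not state the actual function


def char_value(ch: str) -> int:
    """Numeric value of a single character (assumption for this sketch)."""
    return ord(ch) ** POWER


def anagram_key(word: str) -> int:
    """Order-independent key: all anagrams of a word share the same key."""
    return sum(char_value(ch) for ch in word)


class Lexicon:
    """Unigram and bigram counts derived from raw tokens, indexed by key."""

    def __init__(self, tokens):
        tokens = list(tokens)
        self.unigrams = Counter(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.by_key = defaultdict(set)
        for word in self.unigrams:
            self.by_key[anagram_key(word)].add(word)
        self.alphabet = sorted({ch for word in self.unigrams for ch in word})

    def is_known(self, word: str) -> bool:
        """A word form absent from the lexicon is a likely non-word error."""
        return word in self.by_key.get(anagram_key(word), ())

    def candidates(self, word: str) -> set:
        """Lexicon words whose character bag differs by at most one character."""
        key = anagram_key(word)
        keys = {key}  # same characters in another order (e.g. transpositions)
        for a in self.alphabet:
            keys.add(key + char_value(a))  # one character added
        for ch in set(word):
            keys.add(key - char_value(ch))  # one character removed
            for a in self.alphabet:
                keys.add(key - char_value(ch) + char_value(a))  # one replaced
        found = set()
        for k in keys:
            found |= self.by_key.get(k, set())
        found.discard(word)
        return found

    def rank(self, word: str, previous=None, limit: int = 3) -> list:
        """Rank candidates by bigram count with the previous word, then unigram count."""
        return sorted(
            self.candidates(word),
            key=lambda c: (-self.bigrams[(previous, c)], -self.unigrams[c], c),
        )[:limit]


# Toy corpus standing in for the "very large corpus of raw text".
tokens = "the cat sat on the mat and the dog sat on the cat".split()
lex = Lexicon(tokens)
print(lex.is_known("teh"))
print(lex.rank("teh", previous="on"))
```

Because the key ignores character order, a lookup can return words that are further away in edit distance than a single operation (any anagram of a one-character variant matches), which is why a ranking step that uses context and frequency evidence is needed after retrieval.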
|Title of host publication|Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004)|
|Place of Publication|Geneva|
|Number of pages|7|
|Publication status|Published - 2004|