Abstract
For philosophers to take a computational turn in their field, especially one that relies heavily on historical material, they need high-quality, freely and easily accessible corpora in a sustainable format, composed of multi-language, multi-script books from different historical periods. At present, corpora meeting these needs are virtually non-existent. Within the CLARIN-NL project @PhilosTEI, we have addressed this problem by developing an open-source, web-based, user-friendly workflow from textual images to TEI. The workflow is based on the state-of-the-art open-source OCR software Tesseract and a multi-language version of TICCL, a powerful OCR post-correction tool. We have demonstrated the utility of the @PhilosTEI tool by applying it to a multilingual, multi-script corpus of important 18th- to 20th-century European philosophical texts.
| Original language | English |
|---|---|
| Title of host publication | CLARIN-NL in the Low Countries |
| Editors | Jan Odijk, Arjan van Hessen |
| Place of Publication | London |
| Publisher | Ubiquity Press |
| Chapter | 32 |
| Pages | 379-392 |
| Number of pages | 13 |
| ISBN (Electronic) | 9781911529255 |
| ISBN (Print) | 9781911529248 |
| DOIs | |
| Publication status | Published - 28 Dec 2017 |
Keywords
- History and philosophy of science and technology
- History of ideas and intellectual history
- Software for humanities
- Textual and linguistic corpora
- TICCL
- OCR post-correction
- @PhilosTEI
Fingerprint
Dive into the research topics of '@PhilosTEI: Building Corpora for Philosophers'. Together they form a unique fingerprint.