@PhilosTEI

Building Corpora for Philosophers

Arianna Betti, Martin Reynaert, Hein van den Berg

    Research output: Chapter in Book/Report/Conference proceeding › Chapter › Scientific › peer-review

    Abstract

    For philosophers to take a computational turn in their field, especially if that field relies heavily on historical material, it is crucial to be able to build high-quality, easily and freely accessible corpora in a sustainable format, composed of multi-language, multi-script books from different historical periods. At the moment, corpora matching these needs are virtually non-existent. Within the CLARIN-NL project @PhilosTEI, we have addressed the problem of building corpora of this kind by developing an open-source, web-based, user-friendly workflow from textual images to TEI, based on the state-of-the-art open-source OCR software Tesseract and a multi-language version of TICCL, a powerful OCR post-correction tool. We have demonstrated the utility of the @PhilosTEI tool by applying it to a multilingual, multi-script corpus of important 18th- to 20th-century European philosophical texts.
    Original language: English
    Title of host publication: CLARIN-NL in the Low Countries
    Editors: Jan Odijk, Arjan van Hessen
    Place of publication: London
    Publisher: Ubiquity Press
    Chapter: 32
    Pages: 379-392
    Number of pages: 13
    ISBN (Electronic): 9781911529255
    ISBN (Print): 9781911529248
    DOI: https://doi.org/10.5334/bbi.32
    Publication status: Published - 28 Dec 2017

    Keywords

    • History and philosophy of science and technology
    • History of ideas and intellectual history
    • Software for humanities
    • Textual and linguistic corpora
    • TICCL
    • OCR post-correction
    • @PhilosTEI

    Cite this

    Betti, A., Reynaert, M., & van den Berg, H. (2017). @PhilosTEI: Building Corpora for Philosophers. In J. Odijk & A. van Hessen (Eds.), CLARIN-NL in the Low Countries (pp. 379-392). London: Ubiquity Press. https://doi.org/10.5334/bbi.32
    @inbook{cfaa58b45aed40808aec4125424cb95c,
    title = "@PhilosTEI: Building Corpora for Philosophers",
    abstract = "For philosophers to take a computational turn in their field, especially if that field relies heavily on historical material, it is crucial to be able to build high-quality, easily and freely accessible corpora in a sustainable format, composed of multi-language, multi-script books from different historical periods. At the moment, corpora matching these needs are virtually non-existent. Within the CLARIN-NL project @PhilosTEI, we have addressed the problem of building corpora of this kind by developing an open-source, web-based, user-friendly workflow from textual images to TEI, based on the state-of-the-art open-source OCR software Tesseract and a multi-language version of TICCL, a powerful OCR post-correction tool. We have demonstrated the utility of the @PhilosTEI tool by applying it to a multilingual, multi-script corpus of important 18th- to 20th-century European philosophical texts.",
    keywords = "History and philosophy of science and technology, History of ideas and intellectual history, Software for humanities, Textual and linguistic corpora, TICCL, OCR post-correction, @PhilosTEI",
    author = "Arianna Betti and Martin Reynaert and {van den Berg}, Hein",
    year = "2017",
    month = "12",
    day = "28",
    doi = "10.5334/bbi.32",
    language = "English",
    isbn = "9781911529248",
    pages = "379--392",
    editor = "Jan Odijk and {van Hessen}, Arjan",
    booktitle = "CLARIN-NL in the Low Countries",
    publisher = "Ubiquity Press",
    address = "London",
    }
