The project proposed here is preparatory: it aims to produce a blueprint for the construction of a 500-million-word corpus of contemporary written Dutch. This will entail the design of the corpus and the development (or adaptation) of the protocols, procedures and tools needed for data sampling, clean-up, file format conversion, markup, annotation, post-editing, and validation. To support these developments, a 50-million-word pilot corpus will be compiled, parts of which will be enriched with linguistic annotations.
The pilot corpus is intended to demonstrate the feasibility of the approach. It will provide a testing ground for obtaining feedback on the adequacy and practicability of the various annotation schemes and procedures, and on how successfully the tools can be applied. Moreover, it will serve to establish the usefulness of this type of resource and its annotations for different types of HLT research and for the development of applications.
The Danish Center for Sprogteknologi (CST) will undertake the evaluation of the protocols and procedures. At the end of the project, the pilot corpus together with all other results obtained within the project will be made available through the Flemish-Dutch HLT Agency (TST-centrale).