Data processing
The corpus texts are transcribed and annotated with part-of-speech tags, morphological information and lemmas. Transcription and annotation follow guidelines that were developed in the project and refined during corpus creation. These guidelines are available in German only. Each corpus version contains the guidelines that were current at the time of its publication (cf. version ReN 1.1).
The most recent guidelines are available here:
- Guidelines for the transcription (PDF)
- Annotation guidelines
Transcription
The texts of the project “Reference Corpus Middle Low German/ Low Rhenish (1200-1650)” are collected either as full texts or as excerpts of about 20,000 words. They are transcribed true to the letter. Abbreviations are marked, and the beginnings of lines, columns, pages and sheets are always tagged. Punctuation marks and upper- and lower-case letters are transcribed as they appear in the manuscripts and prints.
After transcription, the texts are prepared for grammatical annotation (pre-editing). This includes defining sentence boundaries and normalising whether words are written separately or as one word.
The transcripts can be found using the document browser in ANNIS (here you can find a short introduction to ANNIS). Depending on the text, loading the transcripts may take a long time. This is a known problem and we are currently working on a solution.
Annotation
In the project, the grammatical annotation consists of a part-of-speech (PoS) tag and information about the inflectional morphology. Both are produced semi-automatically, i.e. the output of an automatic tagger is corrected manually.
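The semi-automatic workflow described above can be illustrated with a minimal sketch. It is not the project's actual pipeline: the `Token` class, the `apply_corrections` helper, the three example word forms and the tags assigned to them are all assumptions made for illustration. The idea shown is only that the tagger's proposal is kept alongside any manual correction, and the correction takes precedence where one exists.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    """One token with an automatic tag and an optional manual correction (hypothetical structure)."""
    form: str
    auto_pos: str                      # tag proposed by the automatic tagger
    manual_pos: Optional[str] = None   # set only when an annotator corrects the tag

    @property
    def pos(self) -> str:
        # The manual correction, if present, overrides the automatic proposal.
        return self.manual_pos if self.manual_pos is not None else self.auto_pos


def apply_corrections(tokens: List[Token], corrections: Dict[int, str]) -> List[Token]:
    """Record manual corrections, given as a mapping from token index to corrected tag."""
    for index, tag in corrections.items():
        tokens[index].manual_pos = tag
    return tokens


# Invented example: the tagger mislabels the third token, an annotator fixes it.
tokens = [Token("de", "DDART"), Token("man", "NA"), Token("sprak", "NA")]
apply_corrections(tokens, {2: "VVFIN"})
final_tags = [t.pos for t in tokens]
```

Keeping the automatic proposal next to the correction, rather than overwriting it, makes it possible to see later which tags were changed by hand.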
To keep search queries comparable with those of the other projects of the “Corpus of Historical German Texts” (“Old German”, “Middle High German” and “Early New High German”), the tagset of ReN uses HiTS as a template. HiTS is a tagset for historical stages of German (Dipper et al. 2013) (PDF), which is based on STTS (Stuttgart-Tübingen-Tagset).
Furthermore, the data of the project “Reference Corpus Middle Low German/ Low Rhenish (1200-1650)” is lemmatised. Lemmatisation is computer-assisted and based on a lemma list that was digitised in Münster.
The manual correction of the annotation is carried out with CorA, a tool developed in Bochum (Bollmann et al. 2014) (PDF).
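As a rough illustration of what computer-assisted lemmatisation on the basis of a lemma list can look like, the following sketch looks up each word form in a small dictionary and flags every case that still needs a human decision. The format of the lemma list, the function names and all entries are hypothetical; the project's actual list and procedure are not described here.

```python
from typing import Dict, List


def suggest_lemmas(form: str, lemma_list: Dict[str, List[str]]) -> List[str]:
    """Return the candidate lemmas for a word form (case-insensitive lookup)."""
    return lemma_list.get(form.lower(), [])


def lemmatise(forms: List[str], lemma_list: Dict[str, List[str]]) -> List[dict]:
    """Attach candidates to each form and mark forms that need manual review."""
    results = []
    for form in forms:
        candidates = suggest_lemmas(form, lemma_list)
        results.append({
            "form": form,
            "candidates": candidates,
            # Zero or several candidates: an annotator has to decide.
            "needs_review": len(candidates) != 1,
        })
    return results


# Invented lemma list and word forms, for illustration only.
lemma_list = {"sprak": ["spreken"], "man": ["man", "men"]}
results = lemmatise(["sprak", "Man", "vnde"], lemma_list)
review_queue = [r["form"] for r in results if r["needs_review"]]
```

In this sketch only unambiguous matches pass without review. Forms that are missing from the list or that have several candidate lemmas are collected in `review_queue` for manual treatment.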