Reference : UMR5596-FRASEI0-001
Workplace : LYON 07
Date of publication : Monday, January 07, 2019
Type of Contract : FTC Scientist
Contract Period : 36 months
Expected date of employment : 15 March 2019
Proportion of work : Full time
Remuneration : approx. 2,500 € gross/month
Desired level of education : PhD
Experience required : No preference
The researcher will be centrally involved in creating a cross-linguistic corpus of spontaneously spoken language (at least 10,000 words per language) drawn from fieldwork-based documentation of about 50 small and often endangered languages (DoReCo). S/he will also investigate universal vs. language-specific patterns in the temporal distribution of morphemes, looking at (i) information rate in terms of morphemes per second and (ii) the number of morphemes in inter-pausal units, in order to gain insights into cognitive constraints on language use. The work will be carried out in collaboration with project partners in Germany, where the corpora will be automatically time-aligned at the phoneme level, and with the creators of the original corpora; project work will be supported by two student assistants.
1) Corpus creation (months 1-12): conversion of original annotation files into a TEI-conformant format, including:
- Specification of a subset of TEI fields used for DoReCo data
- Development of two-way conversion scripts to import and export data in the DoReCo TEI format from and to: the EAF format used by the ELAN software, the NXT format of the Switchboard Corpus, CSV tab-delimited text, and the Toolbox/Shoebox format (*.txt / *.tbt)
- Incorporation of metadata (information on speakers, collectors' and annotators' names, date and location of recording, etc.) in the TEI files
- Carrying out file conversions into TEI, during which (i) annotation tiers (transcription, translation, morpheme breaks, etc.) will be consistently labelled and (ii) inconsistencies, e.g. in morpheme-gloss associations, will be resolved
- Archiving all annotation files with persistent identifiers in the NAKALA repository
- Close collaboration with project partners in Germany (where the time-alignment will be carried out), with the creators of the original corpora and with other project members; supervision of two student assistants.
2) Carry out, in collaboration with other project members, two cross-linguistic studies on information rate and packaging using the DoReCo corpus (months 13-36): one study on information rate in terms of morphemes per second and another on the number of morphemes in inter-pausal units. Present results at scientific conferences and in two co-authored publications.
- Experience in corpus linguistics, computational linguistics and language archiving
- Experience in or knowledge of (cross-linguistic) research on information rate and packaging
- Prior knowledge of XML/TEI is not necessary but would be an advantage
- Knowledge of French is not necessary but would be an advantage
The position is part of a joint German/French project, funded by DFG and ANR, called “Cross-linguistic phonetics and morphology using a time-aligned multilingual reference corpus built from documentations of 50 languages: Big data on small languages” (DoReCo). In Lyon, the project PI is Frank Seifart, in collaboration with François Pellegrino and Laurent Romary. In Germany, the project PI is Manfred Krifka, in collaboration with Susanne Fuchs. The German sister project will study phonetic lengthening as a function of segment type and as an indicator of phrase boundaries. An important goal of the project is to make the DoReCo reference corpus available for future research.