Redefining Estonian parts of speech: a corpus-driven approach
The project aims to contribute to solving open problems in the categorization of parts of speech in Estonian. The focus is on nominal adverbs, nominal adpositions, and nominal and verbal adjectives, the classes that pose the most categorization problems for linguists and lexicographers. The theoretical goal is to identify the extent and status of the partly productive non-compositional patterns and to work out a comprehensive account of the fuzzy areas. The methodological goal is to develop a corpus search application, especially for lexicographic purposes, that supports the lexicographer’s judgment about the lexical category of a word.
Team
- Geda Paulsen, Geda.Paulsen@eki.ee
- Ene Vainik, Ene.Vainik@eki.ee
- Maria Tuulik, Maria.Tuulik@eki.ee
- Ahti Lohk, Ahti.Lohk@eki.ee
Tools
The distribution index calculator
We have extended the scope of the calculator to four corpora (2013–2021). The tool is available here: http://teenus.eki.ee/d-index/
Activities
We concentrated on the morphosyntactic profile of adjectives and worked out parameters that help determine the relative adjectiveness of an ambiform. The algorithm that applies these parameters to corpus searches is described here.
We have created the distribution index calculator, a lexicographic tool for detecting ambiforms that are emerging as potentially independent lexemes.
We have organized an online workshop titled “Quantitative answers to qualitative questions? The challenge of ambiguity in corpus data”. The workshop presentations can be seen here.
We participated in the seventh biennial conference on electronic lexicography, eLex 2021, with two papers: “Distribution Index Calculator” and “Catching lexemes. The case of Estonian noun-based ambiforms”. In addition, we took part in the nineteenth EURALEX International Congress with the presentation “Typology of lexical ambiforms in Estonian”. We are continuing the development of the PoS evaluator and the study of ambiforms.
Our goals for this year were:
- to elaborate the typology of the Estonian ambiforms based on the data gathered from lexicographic databases and the metalexicographic survey;
- to ascertain the distributional organization of ambiforms and to model different types of PoS combinations statistically. The main question we seek to answer is: when does frequency become significant in PoS classification?
- to develop a prototype of a corpus-driven application that produces a statistical estimate of an ambiform’s morphological distribution.
We have carried out a study of Estonian lexicographers’ needs with respect to parts of speech and charted the most problematic areas.
We have created a database that currently contains ca. 3500 cases that are ambiguous with respect to PoS; we call these cases ambiforms.
Related publications
- Vainik, Ene; Paulsen, Geda; Lohk, Ahti (2022). Distributsiooniindeksi kalkulaator eesti keele jaoks. DOI: 10.15155/3-00-0000-0000-0000-08D15L.
- Tuulik, Maria; Vainik, Ene; Paulsen, Geda; Lohk, Ahti (2022). Kuidas ära tunda adjektiivi? Korpuskäitumise mustrite analüüs. Eesti Rakenduslingvistika Ühingu aastaraamat = Estonian papers in applied linguistics, 18, 279−302. DOI: 10.5128/ERYa18.16.
- Paulsen, Geda; Tuulik, Maria; Lohk, Ahti; Vainik, Ene (2022). From verbal to adjectival. Evaluating the lexicalization of participles in an Estonian corpus. Slovenščina 2.0, 10 (1), 65−97. DOI: 10.4312/slo2.0.2022.1.65-97.
- Vainik, Ene; Paulsen, Geda; Lohk, Ahti (2021). Käändevormist sõnaks: mida näitab sagedus? Eesti Rakenduslingvistika Ühingu aastaraamat = Estonian papers in applied linguistics, 17, 285–307. DOI: 10.5128/ERYa17.16.
- Paulsen, Geda; Vainik, Ene; Lohk, Ahti; Tuulik, Maria (2021). Catching lexemes. The case of Estonian noun-based ambiforms. Electronic lexicography in the 21st century. Proceedings of the eLex 2021 conference. eLex 2021 conference: Post-editing lexicography; 5–7 July 2021, virtual. Ed. Kosem, I., Cukr, M., Jakubíček, M., Kallas, J., Krek, S. & Tiberius, C. Brno: Lexical Computing CZ, s.r.o, 288−311.
- Vainik, Ene; Lohk, Ahti; Paulsen, Geda (2021). The Distribution Index Calculator for Estonian. Electronic lexicography in the 21st century. Proceedings of the eLex 2021 conference. eLex 2021 conference: Post-editing lexicography; 5–7 July 2021, virtual. Ed. Kosem, I., Cukr, M., Jakubíček, M., Kallas, J., Krek, S. & Tiberius, C. Brno: Lexical Computing CZ, s.r.o, 121−138.
- Paulsen, Geda 2021. Connectives and order. A Tiernet analysis of the Estonian causative connective seeläbi ‘through that’. In: Heikkola, L., Paulsen, G., Wojciechowicz, K., Rosenberg, J. (Eds.). Språkets funktion: Juhlakirja Urpo Nikanteen 60-vuotispäivän kunniaksi, 138−167. Turku: Åbo Akademi University Press.
- Lohk, Ahti; Vainik, Ene; Paulsen, Geda; Rebane, Martin; Bond, Francis (2021). Extended Clusters of Vertical Polysemy: An Explorative Study of Eleven Wordnets. Estonian papers in applied linguistics, 17, 193−210. DOI: 10.5128/ERYa17.11.
- Paulsen, Geda; Vainik, Ene; Tuulik, Maria 2020. Sõnaliik leksikograafi töölaual: sõnaliikide roll tänapäeva leksikograafias [On word classes in contemporary lexicography: The lexicographers’ view]. Estonian Papers in Applied Linguistics, 16, 177−202. DOI: http://dx.doi.org/10.5128/ERYa16.11
- Vainik, Ene; Paulsen, Geda; Lohk, Ahti 2020. A typology of lexical ambiforms in Estonian. Proceedings of XIX EURALEX Congress: Lexicography for Inclusion, Vol. 1. Ed. Gavriilidou, Z.; Mitsiaki, M.; Fliatouras, A. Alexandroupolis, Greece: Democritus University of Thrace, 119−130.
- Vainik, Ene; Tuulik, Maria; Koppel, Kristina 2020. Comparison of collocations and word associations in Estonian from the perspective of parts of speech. Slovenščina 2.0, 8(2), 139−167. DOI: https://doi.org/10.4312/slo2.0.2020.2.139-167
- Bond, Francis; Morgado da Costa, Luis; Goodman, Michael Wayne; McCrae, John P.; Lohk, Ahti 2020. Some issues with building a multilingual wordnet. LREC 2020 Conference Proceedings: 12th International Conference on Language Resources and Evaluation, May 11-16, 2020, Marseille, France. Ed. Calzolari, Nicoletta; Béchet, Frédérick; Blache, Philippe; et al. Marseille, France: The European Language Resources Association (ELRA), 3189−3197.
- Tuulik, Maria 2020. Eesti temperatuuriadjektiivide polüseemiamallid. Eesti Rakenduslingvistika Ühingu aastaraamat = Estonian Papers in Applied Linguistics, 16, 223−240. DOI: http://dx.doi.org/10.5128/ERYa16.13
- Paulsen, Geda 2019. Sõnaliigipiiridest kollokatsioonide vaatenurgast: erikäändelised noomenadverbid [Word class boundaries and collocations: The Estonian nominal adverbs in special cases]. Eesti Rakenduslingvistika Ühingu aastaraamat = Estonian Papers in Applied Linguistics, 15, 121−137. DOI: http://dx.doi.org/10.5128/ERYa15.07
- Vainik, Ene; Brzozowska, Dorota 2019. The use of positively valued adjectives and adverbs in Polish and Estonian casual conversations. Journal of Pragmatics, 153, 103−115. DOI: https://doi.org/10.1016/j.pragma.2019.02.001
- Lohk, Ahti; Orav, Heili; Vare, Kadri; Bond, Francis; Vaik, Rasmus 2019. New polysemy structures in Wordnets induced by vertical polysemy. Proceedings of the Tenth Global Wordnet Conference: July 23-27, 2019, Wroclaw (Poland). Ed. Fellbaum, Christiane; Vossen, Piek; Rudnicka, Ewa; Maziarz, Marek; Piasecki, Maciej. Wrocław: Wrocław University of Science and Technology, 394−403.
- Paulsen, Geda; Vainik, Ene; Tuulik, Maria; Lohk, Ahti 2019. The Lexicographer’s Voice: Word Classes in the Digital Era. Electronic lexicography in the 21st century. Proceedings of eLex 2019 conference, 319−337.
Funded by the Estonian Research Council (project No. PSG227), 2019−2022.