
Redefining Estonian parts of speech: a corpus-driven approach

The project aims to contribute to solving open problems in the categorization of parts of speech in Estonian. The focus is on nominal adverbs, nominal adpositions, and nominal and verbal adjectives, the classes that pose the most categorization problems for linguists and lexicographers. The theoretical goal of the project is to identify the extent and status of the partly productive non-compositional patterns and to work out a comprehensive account of the fuzzy areas. The methodological goal is to develop a corpus search application, primarily for lexicographic purposes, that supports the lexicographer’s judgment about the lexical category of a word.


  • We have carried out a study of Estonian lexicographers’ needs with respect to parts of speech and charted the most problematic areas.
  • We have created a database that currently includes ca. 3,500 cases that are ambiguous with respect to part of speech (PoS); we call these cases ambiforms.


Our goals for this year were:

  • to elaborate the typology of the Estonian ambiforms based on the data gathered from lexicographic databases and the metalexicographic survey;
  • to ascertain the distributional organization of ambiforms through statistical modeling of different types of PoS combinations. The main question we seek to answer is: when does frequency become significant in PoS classification?
  • to develop a prototype of a corpus-driven application that produces a statistical estimate of an ambiform’s morphological distribution.
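
As a rough illustration of what estimating a morphological distribution could involve, the sketch below computes each case form's share of a lemma's occurrences and flags forms that are strongly overrepresented relative to a reference distribution. It is not the project's application: the function names, the reference shares, the `min_ratio` threshold, and the counts are all hypothetical and chosen only for the example.

```python
def form_shares(form_counts):
    """Return each form's share of the lemma's total occurrences."""
    total = sum(form_counts.values())
    if total == 0:
        return {}
    return {form: count / total for form, count in form_counts.items()}


def overrepresented(form_counts, reference_shares, min_ratio=2.0):
    """List forms whose share is at least min_ratio times the reference share.

    reference_shares and min_ratio are illustrative assumptions, not
    values taken from the project.
    """
    shares = form_shares(form_counts)
    flagged = []
    for form, share in shares.items():
        ref = reference_shares.get(form)
        # Skip forms with no (or zero) reference share to avoid division by zero.
        if ref and share / ref >= min_ratio:
            flagged.append((form, round(share / ref, 2)))
    # Strongest overrepresentation first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Invented counts for illustration only, not corpus data.
counts = {"nominative": 20, "genitive": 30, "adessive": 150}
reference = {"nominative": 0.4, "genitive": 0.4, "adessive": 0.2}
print(overrepresented(counts, reference))
```

In this toy setup, a case form that dominates its paradigm far beyond the reference share would be a candidate for closer inspection as a possible independent adverb or adposition; where to set the threshold is exactly the kind of question the frequency goal above raises.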



  • We concentrated on the morphosyntactic profile of adjectives and worked out parameters that help determine the relative adjectiveness of an ambiform. The algorithm that applies these parameters to corpus searches is described here.


  • Paulsen, Geda 2019. Sõnaliigipiiridest kollokatsioonide vaatenurgast: erikäändelised noomenadverbid [Word class boundaries and collocations: The Estonian nominal adverbs in special cases]. Eesti Rakenduslingvistika Ühingu aastaraamat Estonian Papers in Applied Linguistics, 15, 121−137. DOI:
  • Vainik, Ene; Brzozowska, Dorota 2019. The use of positively valued adjectives and adverbs in Polish and Estonian casual conversations. Journal of Pragmatics, 153, 103−115. DOI:
  • Lohk, Ahti; Orav, Heili; Vare, Kadri; Bond, Francis; Vaik, Rasmus 2019. New polysemy structures in Wordnets induced by vertical polysemy. Proceedings of the Tenth Global Wordnet Conference: July 23-27, 2019, Wroclaw (Poland). Ed. Fellbaum, Christiane; Vossen, Piek; Rudnicka, Ewa; Maziarz, Marek; Piasecki, Maciej. Wrocław: Wrocław University of Science and Technology, 394−403.




  • Paulsen, Geda; Vainik, Ene; Tuulik, Maria 2020. Sõnaliik leksikograafi töölaual: sõnaliikide roll tänapäeva leksikograafias [On word classes in contemporary lexicography: The lexicographers’ view]. Estonian Papers in Applied Linguistics, 16, 177−202. DOI: 
  • Vainik, Ene; Paulsen, Geda; Lohk, Ahti 2020. A typology of lexical ambiforms in Estonian. Proceedings of XIX EURALEX Congress: Lexicography for Inclusion, Vol. 1. Ed. Gavriilidou, Z.; Mitsiaki, M.; Fliatouras, A. Alexandroupolis, Greece: Democritus University of Thrace, 119−130. 
  • Vainik, Ene; Tuulik, Maria; Koppel, Kristina 2020. Comparison of collocations and word associations in Estonian from the perspective of parts of speech. Slovenščina 2.0, 8(2), 139−167. DOI:
  • Bond, Francis; Morgado da Costa, Luis; Goodman, Michael Wayne; McCrae, John P.; Lohk, Ahti 2020. Some issues with building a multilingual wordnet. LREC 2020 Conference Proceedings: 12th International Conference on Language Resources and Evaluation, May 11-16, 2020, Marseille, France. Ed. Calzolari, Nicoletta; Béchet, Frédérick; Blache, Philippe; et al. Marseille, France: The European Language Resources Association (ELRA), 3189−3197.
  • Tuulik, Maria 2020. Eesti temperatuuriadjektiivide polüseemiamallid. Eesti Rakenduslingvistika Ühingu aastaraamat Estonian Papers in Applied Linguistics, 16, 223−240. DOI:


  • Vainik, Ene; Paulsen, Geda; Lohk, Ahti 2021. Käändevormist sõnaks: mida näitab sagedus? Eesti Rakenduslingvistika Ühingu aastaraamat Estonian Papers in Applied Linguistics, 17, 285−307.
  • Paulsen, Geda; Vainik, Ene; Lohk, Ahti; Tuulik, Maria 2021. Catching lexemes. The case of Estonian noun-based ambiforms. Electronic lexicography in the 21st century. Proceedings of the eLex 2021 conference: Post-editing lexicography; 5–7 July 2021, virtual. Ed. Kosem, I.; Cukr, M.; Jakubíček, M.; Kallas, J.; Krek, S.; Tiberius, C. Brno: Lexical Computing CZ, s.r.o., 288−311.
  • Vainik, Ene; Lohk, Ahti; Paulsen, Geda 2021. The Distribution Index Calculator for Estonian. Electronic lexicography in the 21st century. Proceedings of the eLex 2021 conference: Post-editing lexicography; 5–7 July 2021, virtual. Ed. Kosem, I.; Cukr, M.; Jakubíček, M.; Kallas, J.; Krek, S.; Tiberius, C. Brno: Lexical Computing CZ, s.r.o., 121−138.
  • Lohk, Ahti; Vainik, Ene; Paulsen, Geda; Rebane, Martin; Bond, Francis 2021. Extended Clusters of Vertical Polysemy: An Explorative Study of Eleven Wordnets. Eesti Rakenduslingvistika Ühingu aastaraamat Estonian Papers in Applied Linguistics, 17, 193−210.


Funded by the Estonian Research Council (project No PSG227) 

