Workshop on Parts of Speech
Quantitative answers to qualitative questions? The challenge of ambiguity in corpus data
Date: 18 May 2021
Abstract submission deadline: 25 April 2021
Notification of acceptance: 30 April 2021
Contact: sonaliigid@gmail.com
Location: Institute of the Estonian Language (Tallinn). Virtual: https://zoom.us/j/98800620442
The Institute of the Estonian Language hosts a one-day online workshop on Parts-of-Speech called “Quantitative answers to qualitative questions? The challenge of ambiguity in corpus data”. The workshop is organized by the Parts-of-Speech working group of the institute.
When analyzing large amounts of linguistic data, we cannot manage without the specification of parts of speech. In the workshop on PoS, we would like to discuss topics related to the role of PoS in corpus analysis and other digital tools processing language. The questions we address are among others:
- What kinds of problems arise in connection with PoS when creating automatic systems? How do languages differ with respect to the role of PoS in corpus analysis?
- As PoS combine syntactic, morphological, and semantic information, what would be the best way to unite these levels for the best possible outcome when dealing with PoS in corpus analysis and lexicographic work?
- In what ways can corpus data be preprocessed to facilitate the specification of a word’s lexical category?
We invite researchers working in the area of large corpora processing, parsing of morphologically rich languages, word sense disambiguation, e-lexicography, etc. to participate in the workshop and to contribute to the discussions related to the role of PoS in language technology.
Form of presentations: 20 minutes for presentation and 10 minutes for discussion.
- The workshop will be held online; presentations may be pre-recorded or live.
- Abstracts (max. 250 words) should be sent to sonaliigid@gmail.com.
- The abstract should include up to 5 keywords as well as the names, email addresses, and affiliations of the contributing authors.
- The workshop language is English.
Keynote speakers:
Kairit Sirts (Institute of Computer Science at University of Tartu, NLP Research Group):
What do neural models know about PoS tags?
Miloš Jakubíček (Lexical Computing Ltd, Sketch Engine, NLP Centre at Masaryk University):
Morphology is an open problem of NLP
Program
11.00 AM (GMT+3:00 Helsinki)
Opening words and introduction to WoPoS
11.15
Keynote speech:
Morphology is an open problem of NLP
Miloš Jakubíček (Lexical Computing Ltd, Sketch Engine, NLP Centre at Masaryk University)
Chair: Kristina Koppel
12.15
Part-of-Speech tagging and lemmatization of Estonian learner language
Kais Allkivi-Metsoja (School of Digital Technologies, Tallinn University)
Kaisa Norak (School of Digital Technologies, Tallinn University)
Chair: Ene Vainik
12.45
The Creation of Siberian Ingrian Finnish Speech Corpus: Part-of-Speech Tagging
Ivan Ubaleht (Omsk State Technical University, Russia)
13.15 – 14.00 BREAK
14.00
About quantitative answers to the questions of lexical categorization
Geda Paulsen, Ene Vainik, Maria Tuulik, Ahti Lohk
Chair: Kais Allkivi-Metsoja
14.30
Towards developing a statistic of case form emancipation: D-index and its calculus
Ene Vainik, Ahti Lohk, Geda Paulsen, Maria Tuulik
15.00
How to capture the PoS-prototypical morphosyntactic behaviour: the case of Estonian adjectives
Maria Tuulik, Ene Vainik, Ahti Lohk, Geda Paulsen
15.30
Keynote speech:
What do neural models know about PoS tags?
Kairit Sirts (Institute of Computer Science at University of Tartu, NLP Research Group)
Chair: Geda Paulsen
16.30 – 17.00
Discussion and conclusion
Keynote speech:
Morphology is an open problem of NLP
Miloš Jakubíček
Lexical Computing Ltd, Sketch Engine, NLP Centre at Masaryk University
Morphology was one of the first NLP areas ever researched, with the first attempts at part-of-speech tagging dating back decades. In theory it is largely a solved problem: for very many languages, state-of-the-art PoS tagging accuracy is reported to be over 95%. Despite this, taggers are often found difficult to use in practice, with regard to both their accuracy and other parameters such as interoperability or speed. In my talk I will try to explain where this discrepancy comes from, claiming that, as in many other NLP tasks, it relates to the way tasks like PoS tagging are typically evaluated.
Keynote speech:
What do neural models know about PoS tags?
Kairit Sirts
Institute of Computer Science at University of Tartu, NLP Research Group
Deep neural models obtain state-of-the-art results in many NLP tasks, including automatic PoS tagging. In this talk, I will first briefly explain the basic building blocks of neural models and introduce the most important high-level model architectures used for processing textual data. Then we will look at how neural network-based PoS tagging models are trained, how good their predictions are, and where the models err. Finally, I will offer some insights based on recent literature about what the models internally learn about PoS information even when they have not been explicitly trained to predict PoS tags.
Part-of-Speech tagging and lemmatization of Estonian learner language
Kais Allkivi-Metsoja
School of Digital Technologies, Tallinn University, kais@tlu.ee
Kaisa Norak
School of Digital Technologies, Tallinn University, kaisa.norak@tlu.ee
Non-standard language varieties pose a challenge for existing natural language processing (NLP) tools, mostly developed based on standard language. Our presentation explores the accuracy of PoS tagging and lemmatization of texts written by learners of Estonian as a second language.
Automated analysis of learner language is useful for second language acquisition research, e.g., modelling language use at different proficiency levels. It is also needed for building language learning applications that give individual feedback to the learner. Therefore, it is vital to seek linguistic tagging solutions with a minimal error rate.
We have compared the performance of two off-the-shelf NLP toolkits, EstNLTK and Stanza, in PoS tagging and lemmatizing A2–C1-level Estonian learner texts. Stanza (formerly StanfordNLP) is a neural pipeline for multilingual text analysis. EstNLTK uses the morphological analyzer and lemmatizer Vabamorf which combines rule-based and statistical models. The test corpus contained approx. 2,000–3,000 tokens representing each proficiency level – 9,431 tokens in total.
The tagging accuracy increased with the proficiency level, ranging from 96% to 99% for PoS tagging and from 92% to 98% for lemmatization. At the levels B1, B2 and C1, Stanza performed better in PoS tagging. Lemmatization accuracy did not differ significantly, although Vabamorf’s output included ambiguities that are problematic for fully automated linguistic analysis.
We will discuss:
– advantages and disadvantages of the compared taggers in learner language analysis;
– causes of PoS tagging and lemmatization errors which often co-occur;
– ways to improve automated tagging: normalizing texts with a spelling corrector, and using a key-value dictionary based on frequent lemmatization errors.
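The evaluation and the dictionary-based correction mentioned above can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names, the tag labels and the toy data (including the single correction entry) are assumptions made for illustration only.

```python
from collections import defaultdict


def accuracy_by_level(rows):
    """Compute tagging accuracy per proficiency level.

    rows: iterable of (level, gold_tag, predicted_tag) triples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, gold, predicted in rows:
        total[level] += 1
        correct[level] += int(gold == predicted)
    return {level: correct[level] / total[level] for level in total}


def apply_lemma_corrections(lemmas, corrections):
    """Replace lemmas that appear as keys in a key-value correction dictionary."""
    return [corrections.get(lemma, lemma) for lemma in lemmas]


# Invented toy data: (proficiency level, gold tag, tagger output).
rows = [
    ("A2", "S", "S"), ("A2", "V", "S"), ("A2", "A", "A"), ("A2", "D", "D"),
    ("C1", "S", "S"), ("C1", "V", "V"),
]
scores = accuracy_by_level(rows)

# Hypothetical correction entry: a frequent wrong lemma mapped to the right one.
corrections = {"kooli": "kool"}
fixed = apply_lemma_corrections(["kooli", "maja"], corrections)
```

Such a dictionary only repairs errors that have already been observed, so it complements spelling normalization rather than replacing it.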
Keywords: learner language, lemmatization, PoS tagging, tagger evaluation, Estonian language
The Creation of Siberian Ingrian Finnish Speech Corpus:
Part-of-Speech Tagging
Ivan Ubaleht
Omsk State Technical University, Russia, ubaleht@gmail.com
Siberian Ingrian Finnish is a language based on the Lower Luga Ingrian Finnish and the Lower Luga Izhorian varieties with influences of Estonian and Russian. This language is used by the descendants of the settlers from the Rosona river area, which is also known as Estonian Ingria. The language has existed in Siberia for over 200 years. Daria Sidorkevich researched and documented Siberian Ingrian Finnish in 2008-2014 [1,2]. Natalia Kuznetsova [3] and Mehmet Muslimov continue to research this language to the present.
We started to create the open Siberian Ingrian Finnish speech corpus in 2019. This corpus is based on our own audio data from our expeditions and telephone interviews with speakers [4, subsection 3.1]. Approximately 5 hours of audio data and part of the annotations are available on GitHub and licensed under a Creative Commons Attribution 4.0 license (CC BY 4.0) [5].
Currently, we manually annotate the files with speech data, using ELAN [6]. We have also started part-of-speech tagging in ELAN. The part-of-speech tags are located on a separate tier in the ELAN file. This tier is associated with the words’ tier and with several tiers for storing morphological information. In addition, the part-of-speech tier is associated with the tier of phrase structure trees. Thus, in our corpus, the part-of-speech tags are associated with words, with morphological information and with syntactic structure. We are planning to create a treebank and some applications for Siberian Ingrian Finnish and for other Lower Luga dialects based on these annotations.
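As a rough illustration of how tags on a separate tier can be linked back to words, the following Python sketch reads a miniature ELAN (.eaf) fragment with the standard library. The tier names "words" and "pos", the placeholder tokens and the assumption that the part-of-speech tier refers to the word tier through REF_ANNOTATION elements are hypothetical; the actual tier layout of the corpus may differ.

```python
import xml.etree.ElementTree as ET

# Invented miniature EAF fragment; real files also carry a header,
# a time order and linguistic type definitions.
EAF_SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="words">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>w1</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>w2</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="pos" PARENT_REF="words">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a3" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>NOUN</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a4" ANNOTATION_REF="a2">
        <ANNOTATION_VALUE>VERB</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def tier_annotations(root, tier_id):
    """Return (annotation_id, referenced_id_or_None, value) for one tier."""
    result = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for annotation in tier.iter("ANNOTATION"):
            inner = annotation[0]  # ALIGNABLE_ANNOTATION or REF_ANNOTATION
            value = inner.findtext("ANNOTATION_VALUE", default="")
            result.append(
                (inner.get("ANNOTATION_ID"), inner.get("ANNOTATION_REF"), value)
            )
    return result


def words_with_pos(root, word_tier, pos_tier):
    """Pair each word with the tag whose annotation refers to it, if any."""
    pos_by_word_id = {
        ref: value for _, ref, value in tier_annotations(root, pos_tier)
    }
    return [
        (value, pos_by_word_id.get(annotation_id))
        for annotation_id, _, value in tier_annotations(root, word_tier)
    ]


root = ET.fromstring(EAF_SAMPLE)
pairs = words_with_pos(root, "words", "pos")
```

Words without a tag come back paired with None, which makes untagged stretches of a partially annotated file easy to spot.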
Keywords: Open Speech Corpora, Siberian Ingrian Finnish, Lower Luga Ingrian Finnish, Lower Luga Ingrian, Part-of-Speech Tagging
References:
1. Sidorkevich, D.V. (2014). Yazyk ingermanlandskih pereselentsev v Sibiri (struktura, dialektnye osobennosti, kontaktnye yavleniya), [Язык ингерманландских переселенцев в Сибири (структура, диалектные особенности, контактные явления)], Diss. ILIRAN.
2. Sidorkevich, D.V. (2011). On domains of adessive-allative in Siberian Ingrian Finnish, In Proceedings of Institute for Linguistic Studies Vol. 7, pp. 575-607.
3. Kuznetsova, N. (2016). Evolution of the non-initial vocalic length contrast across the Finnic varieties of Ingria and adjacent areas, Linguistica Uralica, Vol. 52, pp. 1-25.
4. Ubaleht, I. (2021, March). Lexeme: The Concept of System and the Creation of Speech Corpora for Two Endangered Languages. In Proceedings of the Workshop on Computational Methods for Endangered Languages Vol. 2, pp. 20-23.
5. The repository of Siberian Ingrian Finnish speech corpus https://github.com/ubaleht/SiberianIngrianFinnish
6. Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., Sloetjes, H. (2006). ELAN: A Professional Framework for Multimodality Research, In Proceedings of Language Resource and Evaluation 2006, pp. 1557–1559.
About quantitative answers to the questions of lexical categorization
Geda Paulsen
Ene Vainik
Maria Tuulik
Ahti Lohk
Institute of the Estonian Language
Parts of speech provide a categorial frame not only in theoretical linguistics but also in applied fields such as lexicography and language technology. Even if PoS-tagging is not explicitly needed in the compilation of a dictionary, categorization issues are present in the background of lexicographic work (Paulsen et al. 2019). Automatic PoS-tagging resources (used by corpus processing systems), in particular the rule-based systems, depend on the work of lexicographers, since automatic corpus processing tools base their morphological and PoS analysis on the information available in existing dictionaries (see e.g. Orasmaa et al. 2016; Milintsevich & Sirts 2020). The automatic and non-automatic (based on lexicographers’ decisions) PoS-labelling processes are hence intertwined and even interdependent, as modern lexicographers rarely work without the support of corpus data.
In our presentation, we discuss the PoS-tagging and language technological issues from the perspective of Estonian lexicographers. The main questions are:
– What are the most problematic areas for lexicographers regarding the PoS-categorization?
– How to decide about the PoS of a word (form) in cases of decategorization?
– How can language technology ease the lexicographic work in PoS-tagging?
The presentation will give an overview of the research results of the PoS working group of the Institute of the Estonian Language.
Keywords: parts of speech; morphology; decategorization; lexicography; language technology; Estonian
References:
Milintsevich, Kirill; Sirts, Kairit (2020). Lexicon-Enhanced Neural Lemmatization for Estonian. In: Human Language Technologies – The Baltic Perspective (158−165). IOS Press. (Frontiers in Artificial Intelligence and Applications). DOI: 10.3233/FAIA200618.
Orasmaa, S., Petmanson, T., Tkatšenko, A., Laur, S. & Kaalep, H-J. (2016). EstNLTK – NLP Toolkit for Estonian. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Calzolari N., Khalid Choukri, Declerck, T., Grobelnik, M., Maegaard, B., Mariani, J., Moreno, A., Odijk J. & Stelios P. (eds). Portorož, Slovenia: ELRA, 2460−2466.
Paulsen, G., Vainik, E., Tuulik, M. & Lohk, A. 2019. The lexicographer’s voice: word classes in the digital era. Proceedings of eLex 2019 conference. 1−3 October 2019, Sintra, Portugal. Brno: Lexical Computing CZ, s.r.o., pp. 319–337.
Towards developing a statistic of case form emancipation:
D-index and its calculus
Ene Vainik
Ahti Lohk
Geda Paulsen
Maria Tuulik
Institute of the Estonian Language
In Estonian, one encounters word forms such as sõnul (word-PL-ADE ’by (someone’s) saying’) and südamest (heart-ELA ’sincerely’), which look like case forms of nouns (sõna ’word’, süda ’heart’) but can behave like function words in usage. Such word forms involve two dimensions of lexicographic ambiguity: first, what is their lexicographic status (i.e., should they be included in a dictionary or not, and if included, as headwords or subheadwords?), and second, what is the PoS affiliation of those forms? In our presentation, we will focus only on the first type of ambiguity. We will try to answer the question: how frequent is frequent enough to treat a case form as an autonomous lexeme and to consider its inclusion in the dictionary? We will explain our idea of estimating the degree of relative overrepresentation of such case forms by comparing their distribution rate in the corpus to the normative rate of case forms in general.
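The comparison of a form's own rate with a normative rate can be pictured with a small Python sketch. The abstract does not give the formula of the D-index, so the plain ratio below is only an assumed stand-in for "relative overrepresentation", and all counts are invented.

```python
def overrepresentation(form_count, lemma_count, case_count, noun_count):
    """Ratio of a case form's share within its lemma to the normative share.

    form_count:  corpus frequency of the case form in question
    lemma_count: corpus frequency of all forms of its lemma
    case_count:  corpus frequency of that case across all nouns
    noun_count:  corpus frequency of all noun forms
    A value above 1 means the form is more frequent than the case is in general.
    """
    if min(lemma_count, case_count, noun_count) <= 0:
        raise ValueError("lemma, case and noun counts must be positive")
    observed_rate = form_count / lemma_count
    normative_rate = case_count / noun_count
    return observed_rate / normative_rate


# Invented counts: the form makes up 50% of its lemma's occurrences,
# while the same case accounts for 6.25% of noun forms overall.
ratio = overrepresentation(
    form_count=500, lemma_count=1_000, case_count=62_500, noun_count=1_000_000
)
```

Under these invented counts the ratio is 8; what threshold would count as "frequent enough" is the open question the presentation addresses.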
Keywords: form distribution; parts of speech; morphology; lexicography; language technology; Estonian
References:
Blensenius, K. & von Martens, M. (2019). Improving Dictionaries by Measuring Atypical Relative Word-form Frequencies. Proceedings of eLex 2019 conference. 1−3 October 2019. Sintra, Portugal. Brno: Lexical Computing CZ, s.r.o., pp. 660–675.
Grünthal, R. (2003). Finnic Adpositions and Cases in Change. Suomalais-Ugrilaisen Seuran toimituksia 244. Helsinki: Finno-Ugrian Society.
Tavast A., Koppel K., Langemets M., Kallas J. (2020) Towards the Superdictionary: Layers, Tools and Unidirectional Meaning Relations. Proceedings of XIX EURALEX Congress: Lexicography for Inclusion, Vol. 1. Ed. Gavriilidou, Z.; Mitsiaki, M.; Fliatouras, A. Alexandroupolis, Greece: Democritus University of Thrace, 215−223.
Vainik, E.; Paulsen, G. & Lohk, A. (2021). Käändevormist sõnaks: mida näitab sagedus? [From inflected form to a word: the role of frequency]. Accepted by Estonian Papers in Applied Linguistics, 17.
How to capture PoS-prototypical morphosyntactic behaviour:
the case of Estonian adjectives
Maria Tuulik
Ene Vainik
Ahti Lohk
Geda Paulsen
Institute of the Estonian Language
The survey of Estonian lexicographers (Paulsen et al. 2019) showed the need for a new digital tool that would facilitate word class identification for ambiguous cases. According to the ideas suggested by the respondents, the solution could be a corpus-driven application presenting statistics on the morphosyntactic distribution of an ambiguous word. Of the problematic examples of a “slippery” word class highlighted by the lexicographers, 26% were adjectives. In the case of adjectives, the Estonian lexicographers emphasized the difficulty of determining whether a verb participle has sufficient adjectival use to be included in a dictionary as an adjective (existing morphological tagging cannot help lexicographers as all v-participles are already tagged as adjectives).
In the present study we examine the morphosyntactic features characteristic of the adjective class and explore what kind of parameters can be tested in the corpora and enable the differentiation of adjectives from other word classes. To what extent are the morphosyntactic features of prototypical adjectives (e.g. occurrence before noun, agreement) specific to the adjective word class? Could these features be used to help distinguish word classes in ambiguous cases in corpora?
In the presentation, we introduce the results of our pilot study. We provide an overview of the test results of five parameters. In the study we analysed 12 groups of 10 words each. The test groups and test words were chosen by hand, with consideration given to the problematic cases outlined by the lexicographers. We compare different types of adjectives as well as different word classes. Since adjectives can behave differently in the corpora depending on whether they are, for example, adjective-nouns (e.g. teismeline ’teenage/teenager’), indeclinable adjectives (e.g. katoliku ’catholic’), or adjective-adverbs (e.g. alasti ’naked/nakedly’), these groups are analysed separately. The first groups serve to determine the features of prototypical adjectives, that is, the statistical baseline of the adjective class; the latter groups are meant to capture the differences between word classes.
For parameter testing, we used the largest existing corpus of the Estonian Language – Estonian National Corpus 2019. The future aim of the work is to create a digital tool which, using defined parameters, would show a word’s deviation from prototypical word class representatives.
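One way to picture a word's deviation from prototypical class representatives is the Python sketch below. It covers a single assumed parameter (the share of occurrences directly before a noun), uses placeholder lemmas and invented sentences, and does not reproduce the five parameters or the data of the pilot study.

```python
def prenominal_share(sentences, lemma, noun_tag="S"):
    """Share of a lemma's occurrences that directly precede a noun.

    sentences: list of sentences, each a list of (lemma, tag) pairs.
    Returns None if the lemma does not occur at all.
    """
    hits = total = 0
    for sentence in sentences:
        for i, (token_lemma, _tag) in enumerate(sentence):
            if token_lemma != lemma:
                continue
            total += 1
            if i + 1 < len(sentence) and sentence[i + 1][1] == noun_tag:
                hits += 1
    return hits / total if total else None


def deviation_from_baseline(sentences, lemma, prototypical_lemmas):
    """Difference between a lemma's share and the mean share of a baseline group."""
    shares = [prenominal_share(sentences, p) for p in prototypical_lemmas]
    shares = [s for s in shares if s is not None]
    baseline = sum(shares) / len(shares)
    return prenominal_share(sentences, lemma) - baseline


# Invented toy corpus with placeholder lemmas and tags.
sentences = [
    [("adj1", "A"), ("noun1", "S")],
    [("noun1", "S"), ("verb1", "V"), ("adj1", "A")],
    [("adj2", "A"), ("noun2", "S")],
    [("word_x", "A"), ("verb1", "V")],
]
deviation = deviation_from_baseline(sentences, "word_x", ["adj1", "adj2"])
```

A strongly negative value, as for the placeholder word_x here, would suggest that the word rarely shows the prenominal behaviour of the baseline group.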
References:
Paulsen, Geda; Vainik, Ene; Tuulik, Maria; Lohk, Ahti (2019). The Lexicographer’s Voice: Word Classes in the Digital Era. Electronic lexicography in the 21st century. Proceedings of eLex 2019 conference, 319−337.
A source of ambiguity: the network of combined PoS categories in Estonian