Beserman multimedia corpus


This is the start page of the multimedia (audio and video) corpus of the language of the Besermans, an indigenous people living mainly in the northwest of Udmurtia. Beserman belongs to the Permic branch of the Uralic family and is closely related to Udmurt.

You can find Udmurt corpora here.

Beserman language

The language of the Beserman belongs to the Permic branch of Uralic languages. It is spoken by about 2000 people, who live mainly in the northwest of Udmurtia. Unfortunately, the number of speakers is rapidly decreasing, as the transmission of the language to the younger generation stopped completely between 2000 and 2005.

Beserman has traditionally been regarded as a supradialect (dialectal group, narechiye) of the Udmurt language (as well as the only dialect within this supradialect). The linguistic difference between Beserman and Udmurt is small, especially if Beserman is compared to the Northern Udmurt dialects. Nevertheless, the Besermans distinguish their language from Udmurt and consider it an important factor of national identity. Beserman is de facto recognized in Udmurtia as a language different from Udmurt. The Day of Beserman language and writing is celebrated in Udmurtia on October 21. There is no official Beserman orthography at the moment. Those who write in Beserman use slightly different spellings, generally based on the Udmurt Cyrillic script. So far, two books have been published in Beserman: Vortčʼa madʼjos (by Vyacheslav Ar-Sergi and Rafail Dyukin) and Pičʼi princ (The Little Prince by Antoine de Saint-Exupéry, translated by Rafail Dyukin).

All morphological grammatical categories are expressed suffixally and agglutinatively; only indefinite and negative pronouns have prefixes. Beserman shows no traces of vowel harmony, which is presumed to have existed in Proto-Uralic. Nominal grammatical categories include number, case and possessiveness. Verbs distinguish four morphological tenses (direct and evidential past, present and future) and index the person and number of the subject. The direct object is marked by the nominative or the accusative, depending on animateness, referential status, and other factors (differential object marking). The word order in the clause is relatively free, with SOV (subject, direct object, verb) as the default.

Corpus characteristics

Language: Beserman (previously classified as a dialect of Udmurt); Russian (code-switching and some utterances by linguists)
Size: The corpus contains full transcripts of recordings, including fragments in Russian. The volume of the corpus as of March 2025 is:
- only words in Beserman by native speakers, not counting code-switching: 235 thousand words;
- all words by native speakers: 256 thousand words;
- total size, including utterances of Udmurt speakers, Besermans who are not native speakers, and linguists: 289 thousand words.
By default, only utterances by Beserman native speakers are searched.
Texts: Aligned transcripts of audio and video recordings. These were mostly recorded during field trips to the village of Shamardan, which began in 2003. A few recordings made in several villages in the first half of the 2000s were provided by Nadezhda Lyukina.
40% of the texts (in terms of word count) are free dialogues, 35.9% are dialogues recorded during experiments on referential communication, 24% are monologues (mainly interviews in which the linguist acts as a listener, but also narratives about events or oral translations from Russian), 0.1% are songs.
94% of texts were recorded in Shamardan, the rest were recorded in Vorcha, Pyshkizh, Ozhyar, Bagurt and Yezhgurt Pichinka.
Annotation
  • Translations of sentences into Russian, including comments necessary to understand the context.
  • Translations of sentences into English. These are produced from the Russian translations with the automatic translator DeepL. At the moment, only a small part of them has been manually verified.
  • Automatic morphological annotation (lemmatization, part of speech, all inflectional categories) with uniparser_beserman_lat. At least one analysis is available for 97% of word forms (only words that do not contain digits or Latin characters are counted). Since the analyzer is rule-based, there is homonymy, i.e. one word form can have several different parsing options.
  • Partial disambiguation using Constraint Grammar rules.
  • Annotation of Russian loanwords.
  • Annotation of several lexical/semantic classes: animateness/humanness, body parts, means of transport, different classes of proper names.
  • Annotation of the transitivity of verbs and (partially) their subcategorization frames.
  • Glossing.
  • Translations of lemmas into Russian and English.
Metadata
  • title
  • date (at least the year) of recording
  • place of recording
  • genre and subgenre
  • speaker codes
  • codes of the linguists who participated in recording and transcribing
  • sex of the speaker
  • birth place of the speaker
  • birth year of the speaker
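
The three size figures listed under Size are nested, so two further counts follow from them by subtraction. The sketch below does that arithmetic in Python, using only numbers quoted in this section (in thousands of words); the variable names are illustrative and not part of the corpus.

```python
# Word counts in thousands, as quoted under "Size" (March 2025).
beserman_native = 235  # Beserman words by native speakers, no code-switching
all_native = 256       # all words by native speakers
total = 289            # whole corpus, all speakers and linguists

# Words by native speakers that are not counted as Beserman
# (code-switching and other Russian material).
non_beserman_native = all_native - beserman_native

# Words by Udmurt speakers, non-native Besermans and linguists.
other_speakers = total - all_native

# Share of the corpus covered by the default search (native speakers only).
default_share = all_native / total

# Genre shares in percent of word count, as quoted under "Texts".
genres = {
    "free dialogues": 40.0,
    "referential communication dialogues": 35.9,
    "monologues": 24.0,
    "songs": 0.1,
}

print(non_beserman_native, other_speakers, round(default_share * 100, 1))
print(round(sum(genres.values()), 1))
```

Under these figures, about 21 thousand words by native speakers are not Beserman, about 33 thousand words come from other participants, and the default search covers roughly 88.6% of the corpus.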

The Latin-based transcription system, used in the transcripts due to a tradition established in our field trips and enabled by default, is somewhat different from the standard ones. Characters with diacritics can be typed using the virtual keyboard. You can choose another transcription system in the corpus settings (the button with the cogwheel on the top left): UPA (Uralic phonetic alphabet / Finno-Ugric transcription), Cyrillic and IPA (International Phonetic Alphabet). Transliteration into these systems is automatic and, as a consequence, may contain inaccuracies.

Utterances in Russian, as well as fragments of utterances that the corpus authors considered code-switching, are transcribed in standard Russian orthography.

Below we present the correspondence between the transcription system used in the corpus, UPA (in the variant traditionally used in Udmurt studies), IPA and Cyrillic-based phonetic transcription (also in the variant traditionally used in Udmurt studies).

Consonants

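Corpus | UPA | IPA | Cyrillic
m | m | m | м
n | n | n | н
  | ń |   | н'
p | p | p | п
b | b | b | б
t | t | t | т
d | d | d | д
  | tʼ / t́ |   | т'
  | dʼ / d́ |   | д'
k | k | k | к
g | g | g | г
s | s | s | с
z | z | z | з
š | š | ʂ | ш
ž | ž | ʐ | ж
šʼ | ś | ɕ | с'
žʼ | ź | ʑ | з'
w |   | w | ў
f | f | f | ф
x | χ | x | х
j | j | j | й
r | r | r | р
l | l | l | л
  | lʼ / ĺ |   | л'
c | c | t͡s | ц
čʼ | čʼ / č́ | t͡ɕ | ч
(č) | (č) | (t͡ʂ) | (ӵ)
ǯʼ | ǯʼ / ǯ́ | d͡ʑ | ӟ
(ǯ) | (ǯ) | (d͡ʐ) | (ӝ)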

In other palatalized consonants, which marginally occur in spontaneous Russian loanwords, palatalization is also marked with the character ʼ.

Vowels

Corpus | UPA | IPA | Cyrillic
a | a | a | а
ə | ə̑ | ʌ | ъ
o | o | o | о
u | u | u | у
ɤ |   | ɘ | ӧ
e | e | e | э
ɨ |   | ɨ | ы
i | i | i | и
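
To illustrate how an automatic transliteration based on such correspondences might work, here is a minimal Python sketch that converts the corpus transcription into the Cyrillic phonetic transcription. It is not the corpus's own converter: the function name `to_cyrillic` and the greedy longest-match strategy are assumptions, and the mapping includes only those rows of the tables above in which both the corpus symbol and the Cyrillic symbol can be read off unambiguously. Any other symbol is copied unchanged.

```python
import unicodedata

# Corpus symbol -> Cyrillic phonetic symbol (subset of the tables above).
CORPUS_TO_CYRILLIC = {
    "m": "м", "n": "н", "p": "п", "b": "б", "t": "т", "d": "д",
    "k": "к", "g": "г", "s": "с", "z": "з", "š": "ш", "ž": "ж",
    "šʼ": "с'", "žʼ": "з'", "f": "ф", "x": "х", "j": "й", "r": "р",
    "l": "л", "c": "ц", "čʼ": "ч", "č": "ӵ", "ǯʼ": "ӟ", "ǯ": "ӝ",
    "a": "а", "ə": "ъ", "o": "о", "u": "у", "e": "э", "i": "и",
}

# Normalize keys so that precomposed and decomposed input behave alike.
TABLE = {unicodedata.normalize("NFC", k): v for k, v in CORPUS_TO_CYRILLIC.items()}
MAX_LEN = max(len(key) for key in TABLE)


def to_cyrillic(text: str) -> str:
    """Transliterate greedily, always trying the longest symbol first."""
    text = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(text):
        for size in range(MAX_LEN, 0, -1):
            chunk = text[i:i + size]
            if chunk in TABLE:
                out.append(TABLE[chunk])
                i += size
                break
        else:
            # Symbol not covered by the mapping: copy it unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(to_cyrillic("kar"))
print(to_cyrillic("šʼašʼka"))
```

Trying the longest symbol first matters because palatalized consonants such as šʼ are written with two characters, and a character-by-character lookup would transliterate š and ʼ separately.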

Sections of the recordings that we could not decipher unambiguously are labeled [нрзб] (a Russian abbreviation for "inaudible"). Fragments whose transcription we are not completely sure of are enclosed in brackets. False starts (unfinished words) are marked with the = sign. They are displayed in the search hits, but cannot be searched. In some places, names and other personal data have been replaced by <NAME> for privacy reasons. In translations, information missing in the original is enclosed in brackets (if inserted as part of a sentence) or in parentheses.
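
For readers who process exported transcripts, the conventions above could be recognized along the following lines. This is a rough Python sketch under several assumptions that the description does not confirm: tokens are separated by whitespace, uncertain fragments are wrapped in square brackets one token at a time, and the = sign is attached to the unfinished word. The function name and the category labels are hypothetical.

```python
INAUDIBLE = "[нрзб]"    # stretch of the recording that could not be deciphered
ANONYMIZED = "<NAME>"   # placeholder for removed personal data


def classify_token(token: str) -> str:
    """Assign a transcript token to one of the special categories."""
    if token == INAUDIBLE:
        return "inaudible"
    if token == ANONYMIZED:
        return "anonymized"
    if "=" in token:
        # False start (unfinished word); shown in hits but not searchable.
        return "false_start"
    if token.startswith("[") and token.endswith("]"):
        # Assumed form of a fragment with uncertain transcription.
        return "uncertain"
    return "word"


# Made-up token sequence, for illustration only.
sample = "ka= [kare] <NAME> [нрзб] pənə".split()
print([classify_token(t) for t in sample])
```

The check for [нрзб] comes before the generic bracket check, since otherwise inaudible stretches would be reported as uncertain fragments.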

What is a corpus?

A language corpus is a collection of texts in a language, provided with additional linguistic information (annotation) and a search engine.

— Who needs corpora?

First of all, corpora are needed by linguists, i.e., researchers studying specific languages or language in general. The search engine and annotation of corpora are designed in such a way that they can be used for linguistic queries like “find all nouns in the genitive case” or “find all forms of the word pənə before verbs”. In addition, corpora can be useful for language teachers (for example, they can be used to find examples for exercises), as well as for language learners and heritage speakers.

— Can the corpus be used as a library?

The default mode of interaction with the corpus is different: you define a query (a word, phrase or construction to search for), and the corpus returns all sentences in which the searched items occur. By default, the search hits are returned in a randomized order. If needed, the context of each example can be expanded, i.e., its neighboring sentences can be shown. Nevertheless, there is a button with a book above each example, which opens the full text.

— Can the corpus be used as a dictionary?

Not really. Each Beserman word in the corpus has translations into Russian and English. However, this is only auxiliary information for those who do not speak Beserman. The translations of words in the corpus are designed to be short; they do not reflect all nuances of meaning and do not contain carefully selected usage examples.

— What is morphological annotation and how is it done?

The corpus presented here has lemmatization and morphological annotation. Lemmatization means that for each word form its lemma, i.e. initial form (citation form, as found in a dictionary), is specified. Morphological annotation means that for each word form its grammatical characteristics are specified: part of speech, number, case, tense, etc. The annotation was performed automatically with the help of a piece of software called a morphological analyzer. The analyzer uses a manually compiled grammatical dictionary and a formalized description of Beserman inflection. The analyzer together with the dictionary is freely distributed and available on GitHub. Unfortunately, the use of rule-based annotation means, first, that words missing in the dictionary will remain unannotated, and second, that there can be ambiguous analyses. For example, when the analyzer sees the form kare, it cannot tell whether it is the 1sg possessive form of the word kar (“my town”), the lative-case form of the same word (“to town”), or a form of the verb karənə “do”. The ambiguity is partially removed by manually created contextual rules. Russian sentences (translations) were annotated automatically using the mystem analyzer.
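
The ambiguity of kare can be pictured as one word form carrying several analyses at once. The Python sketch below is an illustration only, not the analyzer's actual data format: the `Analysis` class, the exact tag combinations and the `matches` helper are assumptions, and the verbal reading carries only the tag V because the specific verb form is not stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    lemma: str
    tags: frozenset


# The three readings of "kare" described above (tag sets are assumed).
KARE = [
    Analysis("kar", frozenset({"N", "sg", "nom", "1sg"})),  # "my town"
    Analysis("kar", frozenset({"N", "sg", "lat"})),         # "to town"
    Analysis("karənə", frozenset({"V"})),                   # some form of "do"
]


def matches(analyses, required):
    """True if at least one analysis carries all the required tags."""
    required = set(required)
    return any(required <= a.tags for a in analyses)


def lemmas(analyses):
    """All distinct lemmas proposed for the word form."""
    return sorted({a.lemma for a in analyses})


print(matches(KARE, {"N", "lat"}))
print(matches(KARE, {"N", "gen"}))
print(lemmas(KARE))
```

With this kind of representation, an ambiguous word form is returned by any query that fits at least one of its analyses, which is why search results may include readings that are wrong in context.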

Annotation

Lemmatization

The lemma for nouns, relational nouns, pronouns and adjectives is the morphologically unmarked form, i.e. the non-possessive singular nominative form. The lemma for verbs is the infinitive.

Word forms containing productive derivations are lemmatized without these derivations if the corresponding lemma exists. For nouns, these are the proprietives on -o and on -em and the caritive attributivizer on -tem. For example, šʼašʼkajo 'with flower / flowers' is considered a form of the lexeme šʼašʼka 'flower' and is marked as a noun. For verbs, these are the iterative (-əl/-lʼlʼa), the detransitive (-(i)šʼk) and the productive causative (on -(ə)t, but not on -et and not in -t in verbs of the non-a conjugation), as well as the multiplicative (-ja) when it follows a causative.
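
The condition "if the corresponding lemma exists" can be stated as a simple lookup. The following Python sketch assumes that the base candidate has already been obtained by removing the productive derivation; the function name `choose_lemma` and the one-word toy dictionary are hypothetical.

```python
def choose_lemma(derived: str, base: str, lexicon: set) -> str:
    """Prefer the underived lemma when the dictionary contains it."""
    return base if base in lexicon else derived


# Toy dictionary containing the noun from the example above.
lexicon = {"šʼašʼka"}

print(choose_lemma("šʼašʼkajo", "šʼašʼka", lexicon))
```

If the base is absent from the dictionary, the derived form itself stays the lemma.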

Tagset

Grammatical values expressed in each word are indicated with tags. Below is a complete list of tags used for annotating words in Beserman in alphabetical order (within each of the three categories) with explanations.

Parts of speech

  • A — adjective
  • ADV — adverb
  • CONJ — conjunction
  • IDEO — ideophone
  • INTERJ — interjection
  • N — noun
  • NUM — numeral
  • PART — particle
  • POST — postposition (uninflected)
  • PRED — predicative
  • PRO — pronoun
  • RELN — relational noun (inflected postposition)
  • V — verb

Lexical / semantic classes, unproductive derivation

  • I — 1st conjugation (not in -a)
  • II — 2nd conjugation (in -a)
  • PN — proper noun (subclass of nouns)
  • act_prs_0 — verb whose actional characteristics prevent it from being used in the present tense, apart from habitual contexts and historical present (only annotated for a sample of verbs)
  • act_prs_mp — verb whose only actional interpretation in the present tense is MP (multiplicative process) (only annotated for a sample of verbs)
  • act_prs_p — verb whose only actional interpretation in the present tense is P (process) (only annotated for a sample of verbs)
  • act_prs_s — verb whose only actional interpretation in the present tense is S (state) (only annotated for a sample of verbs)
  • act_pst_es — verb whose only actional interpretation in the past tense is ES (entering a state) (only annotated for a sample of verbs)
  • act_pst_es_mp — verb whose only actional interpretations in the past tense are ES (entering a state) and MP (multiplicative process) (only annotated for a sample of verbs)
  • act_pst_es_mp_ep — verb whose only actional interpretations in the past tense are ES (entering a state), MP (multiplicative process) and EP (entering a process) (only annotated for a sample of verbs)
  • act_pst_es_p — verb whose only actional interpretations in the past tense are ES (entering a state) and P (process) (only annotated for a sample of verbs)
  • act_pst_es_s — verb whose only actional interpretations in the past tense are ES (entering a state) and S (state) (only annotated for a sample of verbs)
  • act_pst_mp — verb whose only actional interpretation in the past tense is MP (multiplicative process) (only annotated for a sample of verbs)
  • act_pst_p — verb whose only actional interpretation in the past tense is P (process) (only annotated for a sample of verbs)
  • act_pst_s — verb whose only actional interpretation in the past tense is S (state) (only annotated for a sample of verbs)
  • anim — animate noun
  • body — body part
  • famn — last name or cognomen that does not coincide with the personal name of one's ancestor
  • hum — human noun
  • impers — impersonal verb
  • indef — indefinite pronoun in -ke or olo-/o-
  • indef_ke — indefinite pronoun in -ke
  • indef_olo — indefinite pronoun in olo-/o-
  • intr — intransitive verb
  • nation — noun that denotes a nation
  • neg — negative verb (an element from a small closed list; part of the negative construction that expresses tense, person and, sometimes, number of the subject)
  • oblin — oblinative (adjective in -ešʼ, which means 'covered / smeared in X')
  • occupation — noun that denotes an occupation or a societal role
  • patrn — patronymic
  • persn — personal (first) name
  • refl — reflexive pronoun
  • rel_adj — relational adjective
  • rus — lexical borrowing from or through Russian
  • supernat — noun that denotes a supernatural being. Such a category inevitably arises when classifying nouns by animateness/humanness. Since it is not clear whether to classify such cases as human-denoting nouns, we introduce a separate category for them, thus leaving the choice to the user.
  • time_meas — noun that denotes a time measurement unit
  • topn — place name
  • tr — transitive verb
  • transport — noun that denotes transport
  • with_dat — verb with a dative argument (only annotated for a sample of verbs)
  • with_el — verb with an elative argument (only annotated for a sample of verbs)
  • with_gen2 — verb with an argument in the second genitive (only annotated for a sample of verbs)
  • with_inf — verb with an infinitive argument (only annotated for a sample of verbs)
  • with_ins — verb with an instrumental argument (only annotated for a sample of verbs)
  • with_lat — verb with a lative argument (only annotated for a sample of verbs)
  • with_loc — verb with a locative argument (only annotated for a sample of verbs)

Inflection and productive derivation

  • 1 — 1st person of verbs
  • 1pl — 1st person plural possessive
  • 1sg — 1st person singular possessive
  • 2 — 2nd person of verbs
  • 2pl — 2nd person plural possessive
  • 2sg — 2nd person singular possessive
  • 3 — 3rd person of verbs
  • 3pl — 3rd person plural possessive
  • 3sg — 3rd person singular possessive
  • acc — accusative case
  • adv — adverbial case (in -ja)
  • advloc — locative adverbial numeral ('in N places', -etʼ)
  • advtemp — temporal adverbial numeral ('N days', -oj)
  • app — approximative (the case in -lanʼ)
  • attr — any productive attributivizer (noun-to-adjective derivation)
  • car — caritive (the case in -tek, a.k.a. abessive)
  • car_attr — caritive attributivizer (-tem)
  • case_comp — case compounding
  • caus — causative
  • comp — comparative/attenuative clitic =ges
  • cond — conditional (subjunctive) mood
  • cvb — general converb (-(ə/i)sa)
  • cvb_ku — converb of simultaneity in -ku (considered Udmurt, but sporadically occurs in texts)
  • cvb_lim — limitative converb ('until / rather than X') in -tčʼožʼ
  • cvb_neg — negative converb (-tek)
  • cvb_onja — converb of simultaneity in -(o)nʼnʼa-
  • cvb_sim — any converb of simultaneity
  • dat — dative case
  • deb — debitive (finite form) or debitive participle in -(o)no
  • delim — delimitative derivation of nouns and numerals ('in some period') in -skən
  • detr — detransitive (valency-decreasing derivation in -(i)šʼk; a.k.a. passive)
  • egr — egressive (the case in -išʼen)
  • el — elative (the case in -əšʼ/-išʼ)
  • exhst — exhaustive / aggregative numeral ('all N', -na)
  • fut — future tense
  • gen — genitive case
  • gen2 — 2nd genitive case (in -lešʼ/-ləšʼ, a.k.a. ablative)
  • imp — imperative
  • inf — infinitive
  • inf_cess — cessative infinitive (-(e/ə)məšʼ, from NMLZ-EL)
  • ins — instrumental
  • iter — iterative (verbal derivation in -əl/-lʼlʼa, a.k.a. frequentative)
  • lat — lative (the case in -e/-ə, a.k.a. illative)
  • loc — locative (the case in -ən, a.k.a. inessive)
  • mult — multiplicative (verbal derivation in -ja; only productive when following causative)
  • nmlz — any nominalization (or homonymous participle)
  • nmlz_em — nominalization in -(e)m (or homonymous past participle)
  • nmlz_neg — negative nominalization in -(ə/e)mte (or homonymous negative participle)
  • nmlz_on — nominalization in -(o)n (or homonymous habitual/purposive participle)
  • nom — nominative case (including unmarked direct objects, a.k.a. unmarked accusative)
  • ord — ordinal numeral
  • pl — plural
  • poss_comp — outer possessive suffix in forms with case compounding
  • prol — prolative (the case in -tʼi)
  • prop_em — proprietive derivation in -jem
  • prop_o — proprietive derivation in -o
  • prs — present tense
  • pst — (default/direct) past tense
  • pst2 — second (evidential) past tense
  • ptcp_act — active participle (in -(i/ə)šʼ)
  • ptcp_act_neg — negative active participle (in -(i/ə)šʼtem)
  • ptcp_hab_neg — negative habitual/purposive participle (in -(o)ntem)
  • rcs — recessive (the case in -lašʼen)
  • res — resultative (finite verbal form in -(e/ə)mən)
  • sg — singular
  • term — terminative (the case in -ožʼ)
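
When working with exported analyses, it can be convenient to sort the tags of one word form into groups. The Python sketch below does this for four groups taken from the lists above; the grouping itself, the function name `group_tags` and the assumption that an analysis is available as a list of tag strings are illustrative and not part of the corpus. Treating ins as a case tag is likewise an assumption. Tags are case-sensitive: ADV is the adverb part of speech, whereas adv is the adverbial case.

```python
POS_TAGS = {"A", "ADV", "CONJ", "IDEO", "INTERJ", "N", "NUM",
            "PART", "POST", "PRED", "PRO", "RELN", "V"}
CASE_TAGS = {"acc", "adv", "app", "car", "dat", "egr", "el", "gen", "gen2",
             "ins", "lat", "loc", "nom", "prol", "rcs", "term"}
TENSE_TAGS = {"prs", "pst", "pst2", "fut"}
POSSESSIVE_TAGS = {"1sg", "2sg", "3sg", "1pl", "2pl", "3pl"}


def group_tags(tags):
    """Sort the tags of one analysis into labeled groups, keeping their order."""
    groups = {"pos": [], "case": [], "tense": [], "possessive": [], "other": []}
    for tag in tags:
        if tag in POS_TAGS:
            groups["pos"].append(tag)
        elif tag in CASE_TAGS:
            groups["case"].append(tag)
        elif tag in TENSE_TAGS:
            groups["tense"].append(tag)
        elif tag in POSSESSIVE_TAGS:
            groups["possessive"].append(tag)
        else:
            groups["other"].append(tag)
    return groups


print(group_tags(["N", "pl", "lat", "1sg"]))
```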

The tagset for the Russian language (in Russian translations) can be found in the Russian National Corpus.

Authors

Starting in 2003, the corpus texts were recorded and transcribed in the field by numerous participants of the field trips. The overwhelming majority of the corpus texts (about 80%) were recorded by Maria Usacheva and Timofey Arkhangelskiy (in some cases, together with other linguists). They, as well as Maria Berseneva, a native speaker of Beserman, prepared the vast majority of transcriptions and translations of the texts into Russian. Olga Biryuk, Ruslan Idrisov, Maria Cheremisinova, Nikolai Filippov and Iuliia Zubova have also significantly contributed to the recording and transcription of the texts. Timofey Arkhangelskiy provides technical support for the corpus and is responsible for correcting earlier transcriptions. Sound-alignment (ELAN) of texts that were transcribed before 2015 and did not have one was performed by Marina Pankova. Most of the alignment of the remaining texts with sound was done by Timofey Arkhangelskiy.

Acknowledgments

The authors of the corpus express their deep gratitude to the Beserman community and to all consultants from the village of Shamardan. The creation of this corpus would not have been possible without their decades-long participation and engagement.

The collection and processing of the corpus data was partially funded by grants, in particular, Russian State Foundation for the Humanities #16-24-17003 “Integral analysis of the nominal group in Finno-Ugric languages: maintenance of referentiality and encoding of the information structure of utterance” and Russian Foundation for Basic Research #20-512-14003 “Linguistic diversity in the Volga-Kama Sprachbund. Typology of grammatical phenomena and language contacts”.

The preparation of the version of the corpus published in 2025 was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) grant — project no. 428175960 (Timofey Arkhangelskiy).

Contact


If you have any questions, would like to propose a collaboration, or have noticed an error in the corpus, please email Timofey Arkhangelskiy. In addition, you may use the freely available Beserman morphological analyzer and corpus platform tsakorpus at your discretion.