Beserman multimedia corpus

Discovered a unique new phenomenon based on a couple of corpus examples? Read this: We do our best to transcribe recordings accurately, and we regularly double-check transcripts and correct any inaccuracies we notice. Nevertheless, typos and errors in the transcripts are bound to be present. If you find a word, morphological form or construction in the corpus that has never been mentioned in papers and descriptions, there is always a chance that it is an error. If you have found only a few examples, be sure to listen to the corresponding sound and check that the transcription accurately reflects what is being said. If in doubt, contact us and we will help you.

Beserman language

The language of the Beserman belongs to the Permic branch of Uralic languages. It is spoken by about 2000 people, who live mainly in the northwest of Udmurtia. Unfortunately, the number of speakers is rapidly decreasing, as the transmission of the language to the younger generation stopped completely between 2000 and 2005.

Beserman has traditionally been regarded as a supradialect (dialectal group, narechiye) of the Udmurt language (as well as the only dialect within this supradialect). The linguistic difference between Beserman and Udmurt is small, especially if Beserman is compared to the Northern Udmurt dialects. Nevertheless, the Besermans distinguish their language from Udmurt and consider it an important factor of national identity. Beserman is de facto recognized in Udmurtia as a language different from Udmurt. The Day of Beserman language and writing is celebrated in Udmurtia on October 21. There is no official Beserman orthography at the moment. Those who write in Beserman use slightly different spellings, generally based on the Udmurt Cyrillic script. So far, two books have been published in Beserman: Vortčʼa madʼjos (by Vyacheslav Ar-Sergi and Rafail Dyukin) and Pičʼi princ (The Little Prince by Antoine de Saint-Exupéry, translated by Rafail Dyukin).

All morphological grammatical categories are expressed suffixally and agglutinatively; only indefinite and negative pronouns have prefixes. Beserman shows no traces of the vowel harmony that is presumed to have existed in Proto-Uralic. Nominal grammatical categories include number, case and possessiveness. Verbs distinguish four morphological tenses (direct and evidential past, present and future) and index the person and number of the subject. The direct object is marked by the nominative or the accusative, depending on animateness, referential status, and other factors (differential object marking). The word order in the clause is relatively free, with SOV (subject – direct object – verb) as the default.

Corpus characteristics

Language	Beserman (previously classified as a dialect of Udmurt); Russian (code-switching and some utterances by linguists)
Size	The corpus contains full transcripts of recordings, including fragments in Russian. As of March 2025, the corpus size is: (1) Beserman words by native speakers, not counting code-switching: 235 thousand words; (2) all words by native speakers: 256 thousand words; (3) total size, including utterances of Udmurt speakers, Besermans who are not native speakers, and linguists: 289 thousand words. By default, only utterances by Beserman native speakers are searched.
Texts	Aligned transcripts of audio and video recordings. These were mostly recorded during field trips to the village of Shamardan, which began in 2003. A few recordings made in several villages in the first half of the 2000s were provided by Nadezhda Lyukina. 40% of the texts (in terms of word count) are free dialogues, 35.9% are dialogues recorded during experiments on referential communication, 24% are monologues (mainly interviews in which the linguist acts as a listener, but also narratives about events or oral translations from Russian), 0.1% are songs. 94% of texts were recorded in Shamardan, the rest were recorded in Vorcha, Pyshkizh, Ozhyar, Bagurt and Yezhgurt Pichinka.
Annotation	Translations of sentences into Russian, including comments necessary to understand the context. Translations of sentences into English, produced from the Russian translation with the help of the automatic translator DeepL; at the moment only a small part of these translations has been manually verified. Automatic morphological annotation (lemmatization, part of speech, all inflectional categories) with uniparser_beserman_lat; 97% of word forms have at least one analysis (only words that do not contain digits or Latin characters are counted). Since the analyzer is rule-based, there is homonymy, i.e. one word form can have several different analyses. Partial disambiguation using Constraint Grammar rules. Annotation of Russian loanwords. Annotation of several lexical/semantic classes: animateness/humanness, body parts, means of transport, different classes of proper names. Annotation of the transitivity of verbs and (partially) their subcategorization frames. Glossing. Translations of lemmas into Russian and English.
Metadata	title; date (at least the year) of recording; place of recording; genre and subgenre; speaker codes; codes of the linguists who participated in recording and transcribing; sex of the speaker; birth place of the speaker; birth year of the speaker
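
The size figures above imply a few derived numbers that are not stated explicitly. The following is a minimal Python sketch of that arithmetic; the variable and function names are ours, and the interpretation of the differences (material by native speakers that is not Beserman, and material by everyone else) is an assumption based on how the three counts are described.

```python
# Word counts in thousands, as reported for March 2025.
BESERMAN_NATIVE = 235  # Beserman words by native speakers, no code-switching
ALL_NATIVE = 256       # all words by native speakers
TOTAL = 289            # all speakers, including linguists

# Assumed interpretation: the differences between the reported counts.
non_beserman_native = ALL_NATIVE - BESERMAN_NATIVE  # code-switching etc.
other_speakers = TOTAL - ALL_NATIVE                 # non-native speakers, linguists


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Genre shares in tenths of a percent, to avoid floating-point rounding.
GENRE_SHARES = {
    "free dialogues": 400,
    "referential communication dialogues": 359,
    "monologues": 240,
    "songs": 1,
}

print(non_beserman_native, "thousand words by native speakers are not Beserman")
print(other_speakers, "thousand words come from other speakers")
print(share(ALL_NATIVE, TOTAL), "% of the corpus is searched by default")
print(sum(GENRE_SHARES.values()) / 10, "% covered by the four genres")
```

The genre shares are only checked for summing to 100%; the source does not say which of the three totals they refer to, so no per-genre word counts are computed.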

The Latin-based transcription system, used in the transcripts due to a tradition established in our field trips and enabled by default, is somewhat different from the standard ones. Characters with diacritics can be typed using the virtual keyboard. You can choose another transcription system in the corpus settings (the button with the cogwheel on the top left): UPA (Uralic phonetic alphabet / Finno-Ugric transcription), Cyrillic and IPA (International Phonetic Alphabet). Transliteration into these systems is automatic and, as a consequence, may contain inaccuracies.

Utterances in Russian, as well as fragments of utterances that the corpus authors considered code-switching, are transcribed in standard Russian orthography.

Below we present the correspondence between the transcription system used in the corpus, UPA (in the variant traditionally used in Udmurt studies), IPA and Cyrillic-based phonetic transcription (also in the variant traditionally used in Udmurt studies).

Consonants

Corpus	UPA	IPA	Cyrillic
m	m	m	м
n	n	n	н
nʼ	ń	nʲ	н'
p	p	p	п
b	b	b	б
t	t	t	т
d	d	d	д
tʼ	tʼ / t́	tʲ	т'
dʼ	dʼ / d́	dʲ	д'
k	k	k	к
g	g	g	г
s	s	s	с
z	z	z	з
š	š	ʂ	ш
ž	ž	ʐ	ж
šʼ	ś	ɕ	с'
žʼ	ź	ʑ	з'
w	u̯	w	ў
f	f	f	ф
x	χ	x	х
j	j	j	й
r	r	r	р
l	l	l	л
lʼ	lʼ / ĺ	lʲ	л'
c	c	t͡s	ц
čʼ	čʼ / č́	t͡ɕ	ч
(č)	(č)	(t͡ʂ)	(ӵ)
ǯʼ	ǯʼ / ǯ́	d͡ʑ	ӟ
(ǯ)	(ǯ)	(d͡ʐ)	(ӝ)

In other palatalized consonants, which marginally occur in spontaneous Russian loanwords, palatalization is also marked with the character ʼ.

Vowels

Corpus	UPA	IPA	Cyrillic
a	a	a	а
ə	ə̑	ʌ	ъ
o	o	o	о
u	u	u	у
ɤ	e̮	ɘ	ӧ
e	e	e	э
ɨ	i̮	ɨ	ы
i	i	i	и
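
The two tables can be read as a lookup from corpus transcription to any of the other systems. Below is a minimal Python sketch of such a conversion into IPA. It is not the corpus's own transliteration code: the function name `to_ipa`, the greedy longest-match strategy and the decision to pass unknown characters through unchanged are all assumptions made for illustration. Only the symbol pairs themselves are taken from the tables.

```python
import unicodedata

PAL = "\u02bc"  # modifier letter apostrophe, the palatalization mark

# Corpus transcription -> IPA, copied from the consonant and vowel tables.
_RAW = {
    "m": "m", "n": "n", "n" + PAL: "nʲ",
    "p": "p", "b": "b", "t": "t", "d": "d",
    "t" + PAL: "tʲ", "d" + PAL: "dʲ",
    "k": "k", "g": "g", "s": "s", "z": "z",
    "š": "ʂ", "ž": "ʐ", "š" + PAL: "ɕ", "ž" + PAL: "ʑ",
    "w": "w", "f": "f", "x": "x", "j": "j", "r": "r",
    "l": "l", "l" + PAL: "lʲ",
    "c": "t͡s", "č" + PAL: "t͡ɕ", "č": "t͡ʂ",
    "ǯ" + PAL: "d͡ʑ", "ǯ": "d͡ʐ",
    "a": "a", "ə": "ʌ", "o": "o", "u": "u",
    "ɤ": "ɘ", "e": "e", "ɨ": "ɨ", "i": "i",
}

# Normalize keys so that precomposed and decomposed letters compare equal.
CORPUS_TO_IPA = {unicodedata.normalize("NFC", k): v for k, v in _RAW.items()}
MAX_KEY = max(len(k) for k in CORPUS_TO_IPA)


def to_ipa(text):
    """Convert corpus transcription to IPA by greedy longest match."""
    text = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(text):
        # Try the longest symbol first, so that "šʼ" wins over plain "š".
        for size in range(MAX_KEY, 0, -1):
            chunk = text[i:i + size]
            if chunk in CORPUS_TO_IPA:
                out.append(CORPUS_TO_IPA[chunk])
                i += len(chunk)
                break
        else:
            # Symbol not in the tables (space, punctuation, ...): keep as is.
            out.append(text[i])
            i += 1
    return "".join(out)


print(to_ipa("pənə"))
print(to_ipa("šʼašʼkajo"))
```

Matching the longest symbol first matters because palatalized consonants are written as a base letter followed by ʼ; a character-by-character lookup would turn šʼ into ʂ plus a stray apostrophe instead of ɕ.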

The sections of the recordings that we were unable to decipher unambiguously are labeled as [нрзб] (a Russian abbreviation for inaudible). Fragments whose transcription we are not completely sure of are enclosed in brackets in the transcription. False starts (unfinished words) are marked with the = sign. They are displayed in the search hits, but cannot be searched. In some places, names and other personal data have been replaced by <NAME> for privacy reasons. In translations, information missing in the original is enclosed in brackets (if inserted as part of a sentence) or in parentheses.
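
For readers who process exported transcripts, the conventions above can be recognized mechanically. The following Python sketch is one possible way to do so and is not part of the corpus software. The function name `classify`, the category labels and the sample line are made up for illustration; in particular, the sketch assumes that any token containing = is a false start, since the position of the sign within the word is not specified here.

```python
import re

# One alternative per convention; order matters because [нрзб] would
# otherwise be caught by the more general "uncertain" pattern.
TOKEN_RE = re.compile(
    r"(?P<unintelligible>\[нрзб\])"
    r"|(?P<uncertain>\[[^\]]+\])"
    r"|(?P<anonymized><NAME>)"
    r"|(?P<false_start>[^\s\[\]<>]*=[^\s\[\]<>]*)"
    r"|(?P<word>[^\s\[\]<>]+)"
)


def classify(line):
    """Return (category, text) pairs for each token in a transcript line."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(line)]


# Hypothetical line, constructed only to exercise every category.
sample = "ka= kare [pənə] [нрзб] <NAME>"
for category, text in classify(sample):
    print(category, text)
```

Punctuation is not split off from words in this sketch, so a real transcript would need an extra tokenization step before counting or searching.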

What is a corpus?

A language corpus is a collection of texts in a language, provided with additional linguistic information (annotation) and a search engine.

— Who needs corpora?

First of all, corpora are needed by linguists, i.e., researchers studying specific languages or language in general. The search engine and annotation of corpora are designed in such a way that they can be used for linguistic queries like “find all nouns in the genitive case” or “find all forms of the word pənə before verbs”. In addition, corpora can be useful for language teachers (for example, they can be used to find examples for exercises), as well as for language learners and heritage speakers.

— Can the corpus be used as a library?

The default mode of interaction with the corpus is different: you define a query (a search for a word, phrase or construction) and the corpus returns all sentences in which the searched items occur. By default, the search hits are returned in a randomized order. If needed, the context of each example can be expanded, i.e., its neighboring sentences can be shown. Nevertheless, there is a button with a book above each example, which will open the full text.

— Can the corpus be used as a dictionary?

Not really. Each Beserman word in the corpus has translations into Russian and English. However, this is only auxiliary information for those who do not speak Beserman. The translations of words in the corpus are designed to be short, they do not reflect all nuances of meaning and do not contain carefully selected usage examples.

— What is morphological annotation and how is it done?

The corpus presented here has lemmatization and morphological annotation. Lemmatization means that for each word form its lemma, i.e. initial form (citation form, as found in a dictionary), is specified. Morphological annotation means that for each word form its grammatical characteristics are specified: part of speech, number, case, tense, etc. The annotation was performed automatically with the help of a piece of software called a morphological analyzer. The analyzer uses a manually compiled grammatical dictionary and a formalized description of Beserman inflection. The analyzer together with the dictionary is freely distributed and available on GitHub. Unfortunately, the use of rule-based annotation means, first, that words missing in the dictionary will remain unannotated, and second, that there can be ambiguous analyses. For example, when the analyzer sees the form kare, it cannot tell whether it is the 1sg possessive form of the word kar (“my town”), the lative-case form of the same word (“to town”), or a form of a different word altogether, the verb karənə “do”. The ambiguity is partially removed by manually created contextual rules. Russian sentences (translations) were annotated automatically using the mystem analyzer.
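
The practical consequence of ambiguity for searching can be illustrated with a small Python sketch. The data structure and the function `matches` are hypothetical and do not reflect how the corpus stores its annotation; the tag sets are abbreviated to the tags mentioned in the kare example, and the inflectional tags of the verbal reading are left out because they are not specified above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    lemma: str
    tags: frozenset


# Illustrative analyses for one ambiguous form (abbreviated tag sets).
ANALYSES = {
    "kare": [
        Analysis("kar", frozenset({"N", "1sg"})),  # "my town"
        Analysis("kar", frozenset({"N", "lat"})),  # "to town"
        Analysis("karənə", frozenset({"V"})),      # a form of "do"
    ],
}


def matches(word_form, required_tags):
    """True if at least one analysis of the form has all required tags."""
    required = frozenset(required_tags)
    return any(required <= a.tags for a in ANALYSES.get(word_form, []))


print(matches("kare", {"N", "lat"}))  # hit, even where kare is really a verb
print(matches("kare", {"V"}))         # hit, even where kare is really a noun
print(matches("pənə", {"N"}))         # no analyses listed here, so no hit
```

Under this model a query for lative nouns returns every occurrence of kare that still carries the lative reading, which is why search results for rare phenomena should be checked against the context and the recording.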

Annotation

Lemmatization

The lemma for nouns, relational nouns, pronouns and adjectives is the morphologically unmarked form, i.e. the non-possessive singular nominative form. The lemma for verbs is the infinitive.

Word forms containing productive derivations are lemmatized without these derivations if the corresponding lemma exists. For nouns, these are the proprietives on -o and on -em and the caritive attributivizer on -tem. For example, šʼašʼkajo 'with flower / flowers' is considered a form of the lexeme šʼašʼka 'flower' and is marked as a noun. For verbs, these are the iterative (-əl/-lʼlʼa), the detransitive (-(i)šʼk) and the productive causative (on -(ə)t, but not on -et and not on -t in verbs of the non-a conjugation), as well as the multiplicative (-ja) when it follows a causative.

Tagset

Grammatical values expressed in each word are indicated with tags. Below is a complete list of tags used for annotating words in Beserman in alphabetical order (within each of the three categories) with explanations.

Parts of speech

A — adjective
ADV — adverb
CONJ — conjunction
IDEO — ideophone
INTERJ — interjection
N — noun
NUM — numeral
PART — particle
POST — postposition (uninflected)
PRED — predicative
PRO — pronoun
RELN — relational noun (inflected postposition)
V — verb

Lexical / semantic classes, unproductive derivation

I — 1st conjugation (not in -a)
II — 2nd conjugation (in -a)
PN — proper noun (subclass of nouns)
act_prs_0 — verb whose actional characteristics prevent it from being used in the present tense, apart from habitual contexts and historical present (only annotated for a sample of verbs)
act_prs_mp — verb whose only actional interpretation in the present tense is MP (multiplicative process) (only annotated for a sample of verbs)
act_prs_p — verb whose only actional interpretation in the present tense is P (process) (only annotated for a sample of verbs)
act_prs_s — verb whose only actional interpretation in the present tense is S (state) (only annotated for a sample of verbs)
act_pst_es — verb whose only actional interpretation in the past tense is ES (entering a state) (only annotated for a sample of verbs)
act_pst_es_mp — verb whose only actional interpretations in the past tense are ES (entering a state) and MP (multiplicative process) (only annotated for a sample of verbs)
act_pst_es_mp_ep — verb whose only actional interpretations in the past tense are ES (entering a state), MP (multiplicative process) and EP (entering a process) (only annotated for a sample of verbs)
act_pst_es_p — verb whose only actional interpretations in the past tense are ES (entering a state) and P (process) (only annotated for a sample of verbs)
act_pst_es_s — verb whose only actional interpretations in the past tense are ES (entering a state) and S (state) (only annotated for a sample of verbs)
act_pst_mp — verb whose only actional interpretation in the past tense is MP (multiplicative process) (only annotated for a sample of verbs)
act_pst_p — verb whose only actional interpretation in the past tense is P (process) (only annotated for a sample of verbs)
act_pst_s — verb whose only actional interpretation in the past tense is S (state) (only annotated for a sample of verbs)
anim — animate noun
body — body part
famn — last name or cognomen that does not coincide with the personal name of one's ancestor
hum — human noun
impers — impersonal verb
indef — indefinite pronoun in -ke or olo-/o-
indef_ke — indefinite pronoun in -ke
indef_olo — indefinite pronoun in olo-/o-
intr — intransitive verb
nation — noun that denotes a nation
neg — negative verb (an element from a small closed list; part of the negative construction that expresses tense, person and, sometimes, number of the subject)
oblin — oblinative (adjective in -ešʼ, which means 'covered / smeared in X')
occupation — noun that denotes an occupation or a societal role
patrn — patronymic
persn — personal (first) name
refl — reflexive pronoun
rel_adj — relational adjective
rus — lexical borrowing from or through Russian
supernat — noun that denotes a supernatural being (such a category inevitably arises when classifying nouns by animateness/humanness; since it is not clear whether to classify such cases as human-denoting nouns, we introduce a separate category for them, thus leaving the choice to the user)
time_meas — noun that denotes a time measurement unit
topn — place name
tr — transitive verb
transport — noun that denotes transport
with_dat — verb with a dative argument (only annotated for a sample of verbs)
with_el — verb with an elative argument (only annotated for a sample of verbs)
with_gen2 — verb with an argument in the second genitive (only annotated for a sample of verbs)
with_inf — verb with an infinitive argument (only annotated for a sample of verbs)
with_ins — verb with an instrumental argument (only annotated for a sample of verbs)
with_lat — verb with a lative argument (only annotated for a sample of verbs)
with_loc — verb with a locative argument (only annotated for a sample of verbs)

Inflection and productive derivation

1 — 1st person of verbs
1pl — 1st person plural possessive
1sg — 1st person singular possessive
2 — 2nd person of verbs
2pl — 2nd person plural possessive
2sg — 2nd person singular possessive
3 — 3rd person of verbs
3pl — 3rd person plural possessive
3sg — 3rd person singular possessive
acc — accusative case
adv — adverbial case (in -ja)
advloc — locative adverbial numeral ('in N places', -etʼ)
advtemp — temporal adverbial numeral ('N days', -oj)
app — approximative (the case in -lanʼ)
attr — any productive attributivizer (noun-to-adjective derivation)
car — caritive (the case in -tek, a.k.a. abessive)
car_attr — caritive attributivizer (-tem)
case_comp — case compounding
caus — causative
comp — comparative/attenuative clitic =ges
cond — conditional (subjunctive) mood
cvb — general converb (-(ə/i)sa)
cvb_ku — converb of simultaneity in -ku (considered Udmurt, but sporadically occurs in texts)
cvb_lim — limitative converb ('until / rather than X') in -tčʼožʼ
cvb_neg — negative converb (-tek)
cvb_onja — converb of simultaneity in -(o)nʼnʼa-
cvb_sim — any converb of simultaneity
dat — dative case
deb — debitive (finite form) or debitive participle in -(o)no
delim — delimitative derivation of nouns and numerals ('in some period') in -skən
detr — detransitive (valency-decreasing derivation in -(i)šʼk; a.k.a. passive)
egr — egressive (the case in -išʼen)
el — elative (the case in -əšʼ/-išʼ)
exhst — exhaustive / aggregative numeral ('all N', -na)
fut — future tense
gen — genitive case
gen2 — 2nd genitive case (in -lešʼ/-ləšʼ, a.k.a. ablative)
imp — imperative
inf — infinitive
inf_cess — cessative infinitive (-(e/ə)məšʼ, from NMLZ-EL)
ins — instrumental
iter — iterative (verbal derivation in -əl/-lʼlʼa, a.k.a. frequentative)
lat — lative (the case in -e/-ə, a.k.a. illative)
loc — locative (the case in -ən, a.k.a. inessive)
mult — multiplicative (verbal derivation in -ja; only productive when following causative)
nmlz — any nominalization (or homonymous participle)
nmlz_em — nominalization in -(e)m (or homonymous past participle)
nmlz_neg — negative nominalization in -(ə/e)mte (or homonymous negative participle)
nmlz_on — nominalization in -(o)n (or homonymous habitual/purposive participle)
nom — nominative case (including unmarked direct objects, a.k.a. unmarked accusative)
ord — ordinal numeral
pl — plural
poss_comp — outer possessive suffix in forms with case compounding
prol — prolative (the case in -tʼi)
prop_em — proprietive derivation in -jem
prop_o — proprietive derivation in -o
prs — present tense
pst — (default/direct) past tense
pst2 — second (evidential) past tense
ptcp_act — active participle (in -(i/ə)šʼ)
ptcp_act_neg — negative active participle (in -(i/ə)šʼtem)
ptcp_hab_neg — negative habitual/purposive participle (in -(o)ntem)
rcs — recessive (the case in -lašʼen)
res — resultative (finite verbal form in -(e/ə)mən)
sg — singular
term — terminative (the case in -ožʼ)

The tagset for the Russian language (in Russian translations) can be found in the Russian National Corpus.

Authors

Starting in 2003, the corpus texts were recorded and transcribed in the field by numerous participants of the field trips. The overwhelming majority of the corpus texts (about 80%) were recorded by Maria Usacheva and Timofey Arkhangelskiy (in some cases, together with other linguists). They, as well as Maria Berseneva, a native speaker of Beserman, prepared the vast majority of transcriptions and translations of the texts into Russian. Olga Biryuk, Ruslan Idrisov, Maria Cheremisinova, Nikolai Filippov and Iuliia Zubova have also significantly contributed to the recording and transcription of the texts. Timofey Arkhangelskiy provides technical support for the corpus and is responsible for correcting earlier transcriptions. Sound-alignment (ELAN) of texts that were transcribed before 2015 and did not have one was performed by Marina Pankova. Most of the alignment of the remaining texts with sound was done by Timofey Arkhangelskiy.

Acknowledgments

The authors of the corpus express their deep gratitude to the Beserman community and to all consultants from the village of Shamardan. The creation of this corpus would not have been possible without their decades-long participation and engagement.

The collection and processing of the corpus data was partially funded by grants, in particular, Russian State Foundation for the Humanities #16-24-17003 “Integral analysis of the nominal group in Finno-Ugric languages: maintenance of referentiality and encoding of the information structure of utterance” and Russian Foundation for Basic Research #20-512-14003 “Linguistic diversity in the Volga-Kama Sprachbund. Typology of grammatical phenomena and language contacts”.

The preparation of the version of the corpus published in 2025 was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) grant — project no. 428175960 (Timofey Arkhangelskiy).