A Database of Person Marking in South-Central Trans-Himalayan

Sandra Auderset; Hunter L. Brown; Jonathan Reich; Pascal Gerber; Muhammad Zakaria; Linda Konnerth

doi:10.5334/johd.505

1 Context and motivation

Research on how languages represent and mark the category of person has a long tradition in linguistics and continues to attract scholars working in different subfields, such as language documentation, typology, and historical linguistics. Cross-linguistically, person is most often expressed through free pronouns, nominal possessive markers, and verbal person indexes. Typological approaches have largely focused on paradigmatic and structural features (cf. Cysouw (2003, 2011); Nichols (2017); Siewierska (2010), among others) rather than actual forms. Studies on the latter tend to focus on pronouns and intransitive verb forms, and on person markers in isolation (Bliss & Ritter, 2009; Seržant, 2021). Person markers are also often considered diachronically stable: pronouns are included in Swadesh and other regional word lists, such as CALMSEA (Matisoff, 1978), and bound indexes as inflectional markers have been found to have high inheritance stability cross-linguistically (Seržant, 2021). This has, however, not been systematically investigated, also due to a lack of structured and diverse data on this topic. In general, comparative research on person marking is often limited to well-documented languages with simple person marking systems. Thus, we still know relatively little about the ways in which person markers change over time. This is especially true of complex systems, which are often found in underdocumented languages.

1.1 Innovation through high resolution: A microtypological database

The PMST database is designed to facilitate research on the systematic comparison of person markers of different types across related languages, to understand how they function synchronically and change over time.

The overarching goal is to better understand the diversification of person markers through the high resolution afforded by the comparison of inherited and innovative constructions in related languages. We thus employ what can be referred to as a microtypological approach. We achieve high data quality through a broadly collaborative framework as we work with a large number of language experts, including both native speaker linguists and community-external experts. These collaborations enable us to collect data on variation, double-check data from published sources, and achieve a well-informed morphological segmentation of the collected forms. This is all the more crucial considering the nature of the data: SC languages exhibit complex indexation patterns, especially in transitive forms, outlined in 1.2. The database thus also makes a valuable documentary contribution, since comprehensive person marking data are missing for most SC languages in the existing literature.

1.2 Person marking in South-Central languages

The South-Central (a.k.a. “Kuki-Chin”) branch of the Trans-Himalayan (a.k.a. Sino-Tibetan) language family comprises around 50 languages, which are relatively closely related but exhibit great diversity in their person marking systems. Person is typically marked by free pronouns and verbal indexes for S (sole argument of an intransitive verb), A (actor argument of a transitive verb), and P (undergoer argument of a transitive verb). There are both preverbal markers and postverbal markers. The postverbal markers, which are inherited from Proto-Trans-Himalayan (DeLancey, 2010), are less widespread than the preverbal markers. In languages with both sets, they are often in complementary distribution: the postverbal set may be restricted to negative constructions, subordinated or intransitive clauses, or an informal register (DeLancey, 2023; Henderson, 1965); cf. (1) from Ranglong where 1sg is expressed by a preverbal marker in the affirmative and by a postverbal marker in the negative. In transitive paradigms, the two sets co-occur in some languages, indexing A and P arguments, respectively.

(1)	a.	ka-se
		1sg-go
		‘I am going’
	b.	se-mak-ung
		go-neg-1sg
		‘I am not going’	Ranglong

The systems also differ with respect to the presence or absence of markers for specific persons and person configurations. Indexation of speech-act participant (first and second person; SAP) P arguments draws on a great variety of constructions with various innovative markers. Consider Anal Naga in (2-a), where the configuration of 2sg A and 1sg P is indexed through a first person P prefix and a second person suffix not specified for syntactic role. Compare Monsang in (2-b), where the same constellation of arguments receives inverse and second person A marking; and Lamkang in (3), where the inverse variably combines with either second person A or first person P.

(2)	a.	kaː-tà-tì-nʉ́
		1:p-touch-2-nfut
		‘you (SG) touch me’	Anal Naga
	b.	m̩̀-m̥ú-náː-tʃɘ̀
		inv-see-ipfv:tr-2
		‘you (SG) saw me’	Monsang
(3)	a.	a-t-déé
		2-inv-see
		‘you (SG) see me’
	b.	m-t-déé
		1-inv-see
		‘you (SG) see me’	Lamkang

The South-Central languages thus provide an excellent testing ground for studying the dynamics of person marking as they showcase diverse indexation patterns while typically employing cognate or transparently innovated forms.

2 Methods and design principles

The primary aim of the PMST database is to explore distributions and shifts across person forms, so we collect both personal pronouns and person-indexed intransitive and transitive verbs. We first give an overview of the current language sample (2.1), and then move on to data collection methods (2.2) and annotation practices (2.3).

2.1 Current sample

South-Central Trans-Himalayan comprises approximately 50 languages spoken throughout the higher-elevation regions of the Indo-Bengali-Burmese borderlands. Our sample includes eight South-Central varieties, listed in Table 1. All but Hyow and Pangkhua belong to the tentative Northwestern subgroup (called “Old Kuki” in the older literature, e.g. Grierson (1904); Shafer (1952)), while Hyow and Pangkhua belong to the Peripheral (> Southern > Southeastern) and Central (> Core Central) branches, respectively. All three major subgroups of Peterson’s classification (2017) are thus represented by at least one language in the sample. The geographical location of these languages is presented in Figure 1.

Table 1

Languages included in the first release of PMST, with identifiers, group affiliation, collaborators and sources. Languages are ordered by group. Within Northwestern, languages are ordered by how closely related they are assumed to be following the impressionistic subgroupings in Konnerth (2022).

LANGUAGE	GLOTTOCODE	ISOCODE¹	GROUP	COLLABORATORS & SOURCES
Ranglong	rang1271	(rnl)	Northwestern	Hunter Brown, Jessi Tara
Chiru	chir1283	cdf	Northwestern	Mechek Sampar Awan; Awan (2019)
Anal Naga	anal1239	anm	Northwestern	Pavel Ozerov; Thotson Langhu; Ozerov (2019)
Monsang	mons1234	nmh	Northwestern	Linda Konnerth, Koninglee Wanglar
Lamkang	lamk1238	lmk	Northwestern	Shobhana Chelliah, Rex Rengpu Khullar; Chelliah et al. (2019)
Hmar	hmar1241	hmr	Northwestern	Marina Infimate
Pangkhua	pank1249	pkh	Central	Mohammed Zahid Akter; Akter (2024)
Hyow	khya1239	(csh)	Southern	Muhammad Zakaria; Zakaria (in press)

Figure 1

Location and group affiliation of the sample languages. The inset shows the location of the detailed map within South(east) Asia.

Different kinds of sources were available for different varieties, and coverage of person marking is not comparable across existing materials (grammars, sketches, topical articles, etc.). The majority of SC varieties are best characterized as highly understudied. This is beginning to change, but it remains true of Northwestern: until recently, little descriptive, let alone comparative, work on these languages existed in published form. The two non-Northwestern languages, Pangkhua and Hyow, have only recently come to be comparatively well described, with a grammar of the former in print (Akter, 2024) and one of the latter forthcoming (Zakaria, in press). This diversity of sources is illustrated by Ranglong and Monsang, for which the authors (Brown and Konnerth, respectively) drew on their own fieldwork and had forms checked by native speaker collaborators. For Lamkang, the forms were extracted from published sources in consultation with language experts (Chelliah et al., 2019), while for Hmar, unpublished materials provided by the main researchers on that variety constituted the primary source. When we encountered unclear or potentially missing forms or other apparent errors in the data, native speaker collaborators were consulted directly. Consultations were conducted by the authors and two student assistants via e-mail, WhatsApp, and video calls. Collaboration with language experts is thus a crucial element of our work, enhancing the reliability and consistency of the data (cf. Table 1).

The current sample includes only varieties for which data collection was completed at the time of writing, and so constitutes a convenience sample. We are currently in the process of collecting data from an additional 18 languages from different subgroups, such that future releases of the database will cover a larger and more balanced sample of SC varieties.

2.2 Data collection

We had two goals in mind when delimiting which forms to collect for the database: first, to provide a comprehensive representation of person marking across paradigms for each language; and second, to collect the same types of forms in order to ensure inter-language comparability. To this end, we collected pronouns and intransitive and transitive verb forms with all known person distinctions for each language (cf. 1.2). As a useful compromise between comprehensiveness and comparability, we collected verb forms for four tense-polarity configurations (non-future/future × affirmative/negative). We collected additional subparadigms within these parameters only if the respective forms exhibit differences in terms of person marking, such as the generic non-future forms in Anal Naga. We tried to capture all subparadigms for which this is the case. We do not include reflexive scenarios of transitive verbs, as these are expressed with separate, intransitive constructions. Since the focus of PMST is on person forms, we did not attempt to systematically capture lexeme- and verb stem-based alternations, or inflectional classes that do not affect person markers.

We did include morphophonological variation that directly affects the forms of person markers, such as prefixes with copy vowels or homorganic nasal assimilation, and we included different verb stems in the source forms to capture these processes. We list the lexical verb stem only in the source form and represent it with a ‘Σ’ in the phonological and orthographic representations; but this abstraction notwithstanding, the forms we collected were full verbal forms as given in the sources or provided by collaborators.

2.3 Annotation and standardization of the data

In order to make the collected data comparable and reusable, we standardized and annotated forms based on the guidelines developed for the PMST database. These are explained in more detail in a manual for data curators, which can be found in the supplementary materials (at PMST-Meta/manual_curators.pdf, cf. section 5).

2.3.1 Paradigms and scenarios

For the purposes of the PMST database, a paradigm is defined as a set of forms with different scenarios exhibiting the same features in terms of part of speech (verb vs. pronoun), transitivity, polarity, and tense (if applicable). A subparadigm is a subset of the aforementioned, conditioned by additional grammatical features. In Anal Naga, for example, there are two sets of forms within the non-future paradigm, generic and non-generic (Ozerov, 2019). These are treated as subparadigms.

We use the term ‘scenario’ to refer to the ‘constellation of arguments’ of a given clause, following Witzlack-Makarevich et al. (2016). For example, a 1sg→2sg scenario means that the referent of a first person singular A argument acts on the referent of a second person singular P argument. For pronouns, scenario refers to the person and number of the referent.

The current version of PMST thus contains nine paradigms (not counting subparadigms) for each dataset: pronouns, and non-future and future affirmative and negative forms for intransitive and transitive verbs.
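The count of nine paradigms follows from crossing the three verbal parameters and adding the pronouns. The sketch below spells this out; the label strings (e.g. `verb.tr.fut.neg`) are hypothetical names chosen for illustration and are not the identifiers used in the PMST datasets.

```python
from itertools import product

# Parameter values as described in the text; the short labels are illustrative only.
TRANSITIVITIES = ("intr", "tr")
TENSES = ("nfut", "fut")
POLARITIES = ("aff", "neg")


def list_paradigms():
    """Return one label per paradigm: pronouns plus all verbal combinations."""
    verbal = [
        f"verb.{transitivity}.{tense}.{polarity}"
        for transitivity, tense, polarity in product(TRANSITIVITIES, TENSES, POLARITIES)
    ]
    return ["pronoun"] + verbal


paradigms = list_paradigms()
print(len(paradigms))  # 1 pronoun paradigm + 2 x 2 x 2 verbal paradigms = 9
```

Subparadigms (such as the generic and non-generic non-future forms of Anal Naga) would sit below this level and are not enumerated here.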

2.3.2 Standardization to IPA

The entries were first collected either in a practical orthography (if one exists), a language-specific version of the IPA, or a combination of both, depending on the sources. To ensure comparability, we provide a standardized phonological representation of each source form in IPA. This conversion was carried out with the qlcData package (Cysouw, 2024) in R (R Core Team, 2025) using orthography profiles. In most cases, the conversion to IPA was straightforward. Tones are represented with Chao’s tone numbers (Chao, 1930) instead of diacritics because this facilitates comparison. Unfortunately, consistent tone analyses are not yet available for most tonal languages of the sample. To represent the verb stem, we use an upper case sigma (Σ) following common practice in Trans-Himalayan studies. We use a plus sign to represent any kind of morph(eme) boundary, as a principled study of constituency levels is outside the scope of the current project. Language-specific details regarding standardization can be found in the README files and data sheets included with each set (readme.md and docs/data_sheet.md) and the orthography profiles in the supplementary materials (PMST-Meta/orthography_profiles).
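The general idea behind an orthography profile can be illustrated with a simplified greedy longest-match conversion. This is only a sketch in Python under stated assumptions: it is not the qlcData implementation, and the toy profile below (including the mapping of `ng` to `ŋ` and of the hyphen to the plus sign) is hypothetical rather than taken from the PMST orthography profiles.

```python
# Hypothetical toy profile: grapheme -> standardized symbol (not a PMST profile).
TOY_PROFILE = {
    "ng": "ŋ",
    "k": "k",
    "a": "a",
    "m": "m",
    "u": "u",
    "Σ": "Σ",  # placeholder for the verb stem
    "-": "+",  # any morph(eme) boundary is rendered as a plus sign
}


def to_ipa(source_form, profile):
    """Convert a source form to space-separated symbols by longest match."""
    # Try longer graphemes first so that "ng" wins over single letters.
    graphemes = sorted(profile, key=len, reverse=True)
    output = []
    position = 0
    while position < len(source_form):
        for grapheme in graphemes:
            if source_form.startswith(grapheme, position):
                output.append(profile[grapheme])
                position += len(grapheme)
                break
        else:
            # No grapheme in the profile matches: flag it instead of guessing.
            raise ValueError(f"unmapped character at position {position}: {source_form[position]!r}")
    return " ".join(output)


print(to_ipa("Σ-mak-ung", TOY_PROFILE))
```

Raising an error on unmapped characters is a deliberate choice in this sketch, since silent pass-through would hide gaps in a profile.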

2.3.3 Variation and tags

Within scenarios and paradigms as defined in 2.3.1, we often find multiple forms due to language-internal variation, which we aim to cover as exhaustively as possible. We use tags to annotate forms and categorize the tags by the type of variation they capture. This information is summarized in Table 2.

Table 2

Overview of tags used to annotate variation.

TAG	CATEGORY OF TAG	DESCRIPTION
default	paradigm_tag	unmarked form (most general, most frequent, etc.) or form that has no other tag
pragm_marked	paradigm_tag	pragmatically conditioned variant
hort	paradigm_tag	form is a hortative
emph	paradigm_tag	form is from an emphatic paradigm
unspec_var	paradigm_tag	variant of (yet) unspecified distribution
generic_nf	tense_tag	generic non-future form
non_generic_nf	tense_tag	non-generic non-future form
past	tense_tag	past tense form
optional_plural	overabundance_tag	form that does not contain a marker for plural
optional_third	overabundance_tag	form that does not contain a marker for third person
optional_future	overabundance_tag	form that does not contain a marker for future tense
variable_order	order_tag	form contains morphemes that can variably order
special_stem	morphanalysis_tag	form has a special stem form in particular cells of a paradigm
tone_alt_stem	morphanalysis_tag	form exhibits a tone alternation triggered by the stem
morphophon	morphanalysis_tag	form exhibits morphophonological process(es)
copy_v	phonanalysis_tag	form has a copy vowel in at least one morpheme
dialect_var	variants_tag	form from other dialect
sociolect_var	variants_tag	form from other sociolect

In accordance with Paralex principles, each form must receive at least one tag and cannot receive more than one tag per category (see 3). The categories broadly correspond to the level to which the variant refers, such as the (sub)paradigm, whole forms, and morphemes within forms. This is explained in more detail below; concrete examples can be found in the manual (PMST-Meta/manual_curators.pdf).
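A check of this constraint could look roughly as follows. The tag inventory is copied from Table 2; the assumption that each category corresponds to one column of the forms file named after the category, and that a row is available as a dictionary, is ours and may not match the actual file layout in every detail.

```python
# Tag inventory from Table 2, keyed by tag category (assumed to be a column name).
TAGS_BY_COLUMN = {
    "paradigm_tag": {"default", "pragm_marked", "hort", "emph", "unspec_var"},
    "tense_tag": {"generic_nf", "non_generic_nf", "past"},
    "overabundance_tag": {"optional_plural", "optional_third", "optional_future"},
    "order_tag": {"variable_order"},
    "morphanalysis_tag": {"special_stem", "tone_alt_stem", "morphophon"},
    "phonanalysis_tag": {"copy_v"},
    "variants_tag": {"dialect_var", "sociolect_var"},
}


def check_tags(row):
    """Return a list of problems for one form row (empty list means valid)."""
    problems = []
    filled = {col: (row.get(col) or "").strip() for col in TAGS_BY_COLUMN}
    # Rule 1: every form needs at least one tag.
    if not any(filled.values()):
        problems.append("form has no tag")
    # Rule 2: a column holds exactly one tag, and it must belong to that category.
    # A cell with several tags (e.g. "hort;emph") fails the membership test too.
    for col, value in filled.items():
        if value and value not in TAGS_BY_COLUMN[col]:
            problems.append(f"{col}: unexpected value {value!r}")
    return problems


print(check_tags({"paradigm_tag": "default", "phonanalysis_tag": "copy_v"}))
```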

The ‘default’ tag constitutes an anchor point for the other tags. The forms with this tag are thought to be the ‘default’ for a specific paradigm, hence it is categorized as a ‘paradigm’ tag. Because the documentation status and availability of source materials vary widely across languages, this ‘default’ form is not always commensurate between languages. In order to maximize comparability, we used the following criteria in deciding which forms should be tagged as ‘default’:

1) The ‘default’ is the form used in most contexts and is pragmatically neutral, either according to native speaker linguists or source materials.
2) If 1) cannot be determined, then the ‘default’ is the morphologically ‘expected’ form based on other forms in the paradigm or the language.
3) If neither 1) nor 2) can be determined, the ‘default’ is the form that has no other tags.

Other variation related to paradigmatic structure also classified under the ‘paradigm’ tag includes pragmatically marked forms, emphatic forms, hortatives, and other types of (as yet) unspecified variation.

Some grammatical categories exhibit optional marking – that is, the marker can be present or absent in a specific scenario and tense-polarity configuration, thus producing an overabundance of forms within a paradigm. This is found for plural marking of arguments, third persons, and future markers and is categorized as ‘overabundance’. In SC languages, we expect these categories to be marked, which is why the form with the marker receives the ‘default’ tag and the form without receives the ‘optional’ tag.

In some SC languages, there are various non-future subparadigms along tense-aspect parameters, such as the generic and non-generic forms in Anal Naga. These forms are tagged with the relevant tense label and categorized as ‘tense’ tags.

Another set of tags concerns phonological and morphological details or analytical observations concerning specific forms or morph(eme)s within them. A special case concerns the order of morph(eme)s, which can vary within a scenario. The form that shows the non-default order is tagged as ‘variable order’ and this tag is categorized as relating to ‘order’. Other such alternations are subsumed under the ‘morphanalysis’ and ‘phonanalysis’ categories, respectively, and include morphophonological processes such as special stem forms or tone alternations triggered by the stem, and purely phonological processes such as copy vowels.

Finally, there can be variation due to speakers being from different villages (dialects) or social groups (sociolects); the corresponding tags are categorized as ‘variants’ tags.

3 Dataset description and structure of the database

This section describes the technical implementation of the PMST database. PMST is a collection of individual datasets that can be combined with each other, since they are built according to common design principles. Curating each language as an individual set ensures higher data quality and more extensive cross-checks; and this modular approach allows for the database to be expanded with additional sets on a rolling schedule, which facilitates collaboration with language experts.

The PMST database implements the Paralex standard,² which was established in order to facilitate sharing and comparing paradigm data. It is a flexible standard following FAIR (Wilkinson et al., 2016) and CARE (Carroll et al., 2020) principles that can be adjusted and extended to fit the data and research questions at hand.

The working versions of the datasets are hosted as separate repositories on GitHub as part of the organization of the Department of Linguistics at the University of Bern.³ Once a set is ready for publishing, we release a (new) version on GitHub and publish it to Zenodo. In Zenodo, the PMST sets are part of the ‘PMST-Database’⁴ and ‘Paralex’⁵ communities. The ‘PMST-Database’ community was established specifically for this project, and ensures that all datasets that belong to the database can be retrieved from one place. This workflow is schematically illustrated in Figure 2. The description provided in Table 3 applies to all currently published PMST datasets on Zenodo. Table 4 provides an overview of the datasets included in the first release of PMST.

Figure 2

Schematic overview of workflow and the connection between the working versions of the datasets on GitHub and the published versions on Zenodo. A, B, C represent individual languages.

Table 3

Dataset description.

Repository name	Zenodo
Object name	PMST-Database
Repository location	All PMST datasets can be found at https://zenodo.org/communities/pmst/. For DOIs of individual datasets, please consult Table 4.
Format names	csv, json, md, yml
Creation dates	2023-12-27 to 2025-12-10
Publication date	The datasets pertaining to the first release of PMST were published between 2025-12-01 and 2025-12-10.
License	CC-BY-SA 4.0

Table 4

Languages (=datasets) included in the first release of PMST, with the number of forms, the number of scenarios,⁶ and their DOI.

LANGUAGE	FORMS	SCENARIOS	ZENODO DOI
Anal Naga	311	184	10.5281/zenodo.17881855
Chiru	674	165	10.5281/zenodo.17779437
Hmar	267	163	10.5281/zenodo.17779055
Hyow	917	352	10.5281/zenodo.17788529
Lamkang	298	158	10.5281/zenodo.17780049
Monsang	336	163	10.5281/zenodo.17865713
Pangkhua	149	145	10.5281/zenodo.17866617
Ranglong	255	142	10.5281/zenodo.17778036

3.1 Dataset structure

In the following, we provide an overview of the implementation of the Paralex standard and present the organization of the datasets. The outline follows the Paralex documentation⁷ and focuses on implementations and details that go beyond or deviate from what is described and exemplified in the standard. Figure 3 shows the currently existing modules and their relationships within each dataset.

Figure 3

Overview of database modules and their relations. Two-way arrows indicate direct links between files, e.g., the forms file can be joined with the cells file via the cell/cell identifier which appears in both files. One-way arrows indicate subset relations, e.g., each phoneme in the phon_form columns appears in the sound file separately.

To facilitate comparative analyses across PMST sets, each file in each data set contains an additional column with the language identifier (language_id). This ensures that each row can be associated with the correct language when combining files. An R script for combining all the modules into one dataset is provided in the supplementary materials (cf. section 5). Each dataset contains the following modules:

Documentation: Each dataset is accompanied by a data sheet outlining how and by whom the data were gathered, annotated, and analyzed. It is meant as a guide for the dataset and an aid in the interpretation of the data. More general information can be found in the README file.

Metadata: The metadata is provided in json according to the Paralex standard and contains key specifications of the dataset and the files contained in it.

Forms: The forms file is the focal point of each dataset. Each row documents a single inflected form and includes: a unique identifier (form_id), the scenario (cell), the verb stem or a placeholder (lexeme), the source form exactly as found in the source materials (source_form), the form in orthography (orth_form), the form in IPA (phon_form), the segmented orthographic form (analysed_orth_form), the segmented IPA form (analysed_phon_form), various columns ending in _tag coding variation, and the source as a cite key (source) with page numbers (page) where applicable.

Morphs: This file (which is not part of the standard) lists all morphs that appear in the forms file in IPA (morph_id). For each morph, there is a list of cells (in_cells) and forms (in_forms) the morph appears in.

Tags: The tags file provides an overview of the tags used to index variants in the dataset. This file is derived from the tag columns in the forms file and can be linked to it via the tag identifiers. For each tag, it contains the following information: a unique identifier for each tag (tag_id), the type or category of the tag (tag_column_name), and a brief explanation of what the tag means (comment).

Cells: The cells file lists scenarios that appear in the forms file. It is derived from the cell column of the forms file and can be linked via that same column. It contains a unique identifier for each scenario (cell_id), the part of speech the form belongs to (POS), and a gloss for each scenario (label).

Features: The feature file presents the feature-value combinations found in the dataset. It is derived from the cells file and contains: a unique identifier for each value (value_id) and a label (label), the feature or category it belongs to (feature), and the value as it appears in the UniMorph schema (Sylak-Glassman, 2016).

Lexemes: The lexemes file lists the verb stems present in the source forms and their meanings. It contains: a unique identifier for each lexeme (lexeme_id) with which it can be linked to the forms file, the part of speech (POS), a list of all occurring forms of the lexeme (label), and the meaning (meaning).

Graphemes: The grapheme file provides a list of graphemes appearing in the source materials (grapheme_id). It is used together with the sounds file for generating the phon columns in the forms file.

Sounds: The sounds file contains each phoneme that appears in the forms file with phonological features. It is thus derivable from the phon columns in the forms file. For each sound, the following is given: a unique id for the sound corresponding to the IPA representation used in the forms file (sound_id), the corresponding IPA representation of the sound in CLTS and PHOIBLE (sound_clts, sound_phoible), the sound id as specified in CLTS (List et al., 2024) and PHOIBLE (Moran & McCloy, 2019), followed by the features as specified in PHOIBLE.

Sources: A BibTeX file contains the references of sources we consulted.
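To illustrate how the modules link up, the following sketch stacks the forms tables of two datasets and joins them with their cells tables via the cell identifier and language_id. It is written in Python with the standard library only and is not the R script from the supplementary materials. The column names follow the descriptions above, whereas the language identifiers, cell identifiers, and rows are invented placeholders.

```python
import csv
import io

# Invented placeholder tables standing in for forms.csv / cells.csv of two datasets.
FORMS_A = "form_id,language_id,cell,analysed_phon_form\nf1,lang_a,1sg,ka+Σ\n"
FORMS_B = "form_id,language_id,cell,analysed_phon_form\nf1,lang_b,1sg,Σ+a\n"
CELLS_A = "cell_id,language_id,POS,label\n1sg,lang_a,verb,first person singular S\n"
CELLS_B = "cell_id,language_id,POS,label\n1sg,lang_b,verb,first person singular S\n"


def read_table(handle):
    """Read a CSV file-like object into a list of row dictionaries."""
    return list(csv.DictReader(handle))


def stack(tables):
    """Concatenate tables from several datasets into one list of rows."""
    return [row for table in tables for row in table]


def join_cells(forms, cells):
    """Attach POS and label from the cells table to each form row."""
    # language_id is part of the key because identifiers such as f1 or 1sg
    # may recur across datasets.
    by_key = {(c["language_id"], c["cell_id"]): c for c in cells}
    joined = []
    for form in forms:
        cell = by_key.get((form["language_id"], form["cell"]), {})
        joined.append({**form, "POS": cell.get("POS"), "cell_label": cell.get("label")})
    return joined


forms = stack([read_table(io.StringIO(FORMS_A)), read_table(io.StringIO(FORMS_B))])
cells = stack([read_table(io.StringIO(CELLS_A)), read_table(io.StringIO(CELLS_B))])
for row in join_cells(forms, cells):
    print(row["language_id"], row["form_id"], row["cell"], row["POS"])
```

With real datasets, the `io.StringIO` objects would be replaced by opened files from each repository.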

4 Applications

The PMST database can be used to explore a variety of research questions concerning synchronic and diachronic distributions of verbal person indexes and pronouns. These include, but are not limited to: charting differential developments of person markers, comparing intransitive and transitive indexation patterns, and examining interactions of person markers with other grammatical categories. More broadly, the database contributes to the wider discourse on person marking patterns in the world’s languages by providing high-quality data on languages with “hierarchical” characteristics (DeLancey, 2017). It will allow us to better understand how compositional or idiosyncratic the different person scenarios can be in such languages.

There has been growing interest in the evolution of paradigm complexity (cf. Bank (2017); Herce & Bickel (2025)) and this dataset can also be used to explore such questions in a quantitative way. Due to its modular design, the database is amenable to single-language studies as well as cross-linguistic comparison. In the following, we illustrate two use cases of the database in its initial form. First, we showcase how distributional profiles of forms within a language can serve to improve description and provide avenues for further research. Then we show how the datasets can be used in conjunction to analyze microtypological patterns of person forms.

4.1 Distributional profiles of morphs per language

The PMST sets provide a starting point for an empirically based description of person forms, especially in the absence of annotated corpora for SC languages. We demonstrate this here by illustrating the distribution of morphs in one language using a distributional profile, a visualization of the overall distribution of morphs derived from the dataset. It summarizes patterns into a format that allows for cross-checking of database entries and provides a point of departure for comparison of morphs and their distributions within and across languages, as well as for bottom-up morphemic analysis. Figure 4 shows a distributional plot for the Ranglong dataset, which should be read top to bottom for each morph. Each bar represents a morph appearing either before or after the verb stem, or, in the case of pronouns, appearing as a free form. Each panel shows the distribution of the morph across grammatical categories such as tense-aspect and polarity, person, and number.

Figure 4

Distributional profile of morphs in the Ranglong dataset. The top panel shows the distribution across tense-aspect and polarity values. The middle panel shows the distribution across person configurations. The bottom panel shows the distribution across number categories. Stripes are used for elements appearing before the verb stem and circles for those appearing after.

We illustrate how the distributional profile can be used to pinpoint the properties of individual morphs by comparing three of them, keej-, -me, and -u, across the three panels. From the top panel, we can infer that keej- and -me are restricted to a single paradigm each (future-affirmative and nonfuture-negative, respectively), while -u occurs across all four tense-polarity configurations. From the middle panel, we see that keej- and -me occur only where S/A is 1st person and P is either absent or 2nd/3rd person, meaning both index 1.S/A. -u occurs in every scenario except 1st person intransitive, and so must be a non-person, non-tense-polarity marker. Finally, from the bottom panel, we see that keej- occurs only where the S/A is singular, while -me occurs only where the S/A is plural. -u occurs in all scenarios where at least one argument is plural. Putting it all together, we can state that keej- marks 1sg.sa.fut.aff, -me marks 1pl.sa.nfut.neg, and -u is a plural marker that can pluralize any argument but a 1st person S/A in any tense-polarity configuration. We are thus able to arrive at three important facts about Ranglong verbs: 1) argument role and tense-polarity configuration are cumulatively expressed via person indexes; 2) either or both arguments of verbs with additive plural marking may be semantically plural, and the distinction is context-dependent; and 3) there is an asymmetry in number marking such that 1st person S/As are pluralized using portmanteau indexes, whereas all other arguments are pluralized by a number-neutral index plus a general plural marker.

This information can also be used to identify allomorphs and propose a morphemic analysis. For example, the postverbal morphs iŋ and uŋ are segmentally similar and appear in the same contexts. It is thus very likely that they are allomorphs of one morpheme. Of course, the plot is not meant to replace careful descriptive work, but to complement it and provide a data-driven way for fine-tuning the analysis. Perhaps more importantly, it facilitates comparison of cognate morphemes, whose functions and distributions often vary widely between languages.
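The counts underlying such a profile can be sketched as follows: each segmented form is split at the plus sign, every morph is classified as preverbal, postverbal, or free relative to the stem placeholder Σ, and its occurrences are tallied per value of a grammatical feature. This is a minimal illustration and not the plotting code used for Figure 4. The `tense_polarity` key is a hypothetical convenience column (in the datasets this information would be derived from the cell), and the toy rows are modeled loosely on example (1) rather than drawn from the Ranglong dataset.

```python
from collections import Counter, defaultdict


def morph_positions(analysed_form, stem="Σ", boundary="+"):
    """Split a segmented form and label each non-stem morph by position."""
    morphs = [m.strip() for m in analysed_form.split(boundary)]
    if stem not in morphs:
        # No verb stem: treat the form as a free form (e.g. a pronoun).
        return [(m, "free") for m in morphs]
    stem_index = morphs.index(stem)
    return [
        (m, "pre" if i < stem_index else "post")
        for i, m in enumerate(morphs)
        if i != stem_index
    ]


def profile(forms, feature):
    """Count, per (morph, position), how often it occurs with each feature value."""
    counts = defaultdict(Counter)
    for form in forms:
        for morph, position in morph_positions(form["analysed_phon_form"]):
            counts[(morph, position)][form[feature]] += 1
    return counts


# Toy rows for illustration only.
TOY_FORMS = [
    {"analysed_phon_form": "ka+Σ", "tense_polarity": "nfut.aff"},
    {"analysed_phon_form": "Σ+mak+uŋ", "tense_polarity": "nfut.neg"},
]

for key, counter in profile(TOY_FORMS, "tense_polarity").items():
    print(key, dict(counter))
```

A morph whose counter contains a single feature value is a candidate for being restricted to that paradigm, which mirrors the kind of inference drawn from the top panel of the profile.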

4.2 Comparing the length of forms across scenarios

Studies on person marking patterns often assume a fundamental split between SAPs and third person forms (cf. Benveniste (1946); Siewierska (2010)), which is expected to be amplified in languages characterized as exhibiting ‘hierarchical’ indexation patterns (like SC languages) where SAPs are systematically favored morphologically over third person forms. There are many ways in which this split could manifest, and so it is not immediately clear how to test such claims. One way of operationalizing this is to compare the length of SAP vs. third person forms measured in number of phonemes, the assumption being that SAP forms would be systematically longer than third person forms. This is based on current proposals regarding person forms in Proto-South-Central, where third person forms are reconstructed as shorter or zero (cf. DeLancey (2015)).

Figure 5 summarizes length measured as number of phonemes across scenarios in the current sample languages in transitive affirmative scenarios. Despite being generally described as ‘hierarchical’, the patterns in individual languages differ markedly. For example, forms for local scenarios (SAP→SAP) are not consistently and substantially longer than all other forms, i.e. mixed (SAP→3 and 3→SAP) and non-local (3→3). Although we observe a decline in length from local to mixed to non-local for most languages, there are various individual patterns, and the differences in overall length are smaller than expected. Only Ranglong and Pangkhua exhibit the expected cline. Even within local scenarios, various patterns emerge: in Monsang, Hmar, and Hyow, 1→2 forms are longer, while in Chiru and Pangkhua, 2→1 forms are longer. In other cases, the forms show no substantive length differences (cf. Ranglong, Anal Naga, and Lamkang). The inclusive forms,⁸ which structurally represent a separate person category in SC rather than a type of first person plural (see also Ozerov, 2019, 28), largely pattern with other forms of the same scenario type. An exception to this tendency is found in Lamkang, where the 3→incl form is the longest, and in Hyow, where 3→incl patterns lengthwise with 3→3.

Figure 5: Length of verb forms (minus the lexical stem) in phonemes of transitive affirmative scenarios aggregated per scenario and language. The dot indicates the average; the whiskers show the range. Languages are arranged by subgroup and relatedness (cf. Table 1).

This comparison shows that a simple SAP vs. 3 split is not evident in most languages when considering the length of forms. This invites further, more detailed research on the topic, particularly because even languages that are assumed to be closely related (such as Anal Naga and Lamkang, cf. Ozerov, 2019, 26) exhibit major differences in their patterning.
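The length comparison described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used to produce Figure 5 (that code is in the project repository). It assumes that each form is available as a space-separated string of phonemes with the lexical stem already removed, and that scenarios are labelled as `A>P` with the bare person values `1`, `2`, and `3`. The row layout, the function names, and the placeholder segments are all hypothetical. Inclusive forms are left out, since their treatment would require an additional analytical decision.

```python
from collections import defaultdict
from statistics import mean

# Speech-act participants; inclusive is deliberately omitted in this sketch.
SAP = {"1", "2"}


def scenario_type(scenario: str) -> str:
    """Classify an 'A>P' scenario label as local, mixed, or non-local."""
    a, p = scenario.split(">")
    a_is_sap, p_is_sap = a in SAP, p in SAP
    if a_is_sap and p_is_sap:
        return "local"      # SAP>SAP
    if a_is_sap or p_is_sap:
        return "mixed"      # SAP>3 or 3>SAP
    return "non-local"      # 3>3


def marker_length(segments: str) -> int:
    """Count phonemes in a space-separated string; an empty string is zero marking."""
    return len(segments.split())


def summarize(rows):
    """Return {(language, scenario type): (mean, min, max)} of marker lengths."""
    lengths = defaultdict(list)
    for language, scenario, segments in rows:
        lengths[(language, scenario_type(scenario))].append(marker_length(segments))
    return {key: (mean(vals), min(vals), max(vals)) for key, vals in lengths.items()}


# Hypothetical placeholder rows: (language, scenario, segmented person marking).
# The segments are dummy symbols, not attested forms from any SC language.
ROWS = [
    ("LangA", "1>2", "a b c d"),
    ("LangA", "2>1", "a b c"),
    ("LangA", "1>3", "a b"),
    ("LangA", "3>1", "a b c"),
    ("LangA", "3>3", ""),
]

if __name__ == "__main__":
    for (language, kind), (avg, low, high) in summarize(ROWS).items():
        print(f"{language} {kind}: mean={avg}, range={low}-{high}")
```

The mean and the minimum/maximum correspond to the dot and the whiskers in Figure 5, here grouped by scenario type rather than by individual scenario. A strict SAP vs. 3 split would predict a consistent drop in the mean from local to mixed to non-local scenarios in every language.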

5 Conclusion

To summarize, this paper introduces the design principles of the PMST database and demonstrates their implementation in the first release. The main contributions of the database in its current version lie in its modular organization (one dataset per language) and the explicit documentation of person marking with fine-grained annotation of variation. The data were collected in close collaboration with various language experts, including native speaker collaborators. The first release covers eight languages and represents an initial convenience sample rather than a genealogically or geographically balanced survey. Future work will involve expanding the language sample and improving balance across subgroups. The database further provides a basis for descriptive, microtypological, and diachronic studies, including a systematic assessment of transitive person marking patterns and the reconstruction of person markers within the South-Central branch. More broadly, the PMST workflow illustrates how paradigm-based datasets can be curated to support both language-internal analysis and controlled comparison across closely related varieties. By integrating collaborative, expert-driven data collection with a transparent and modular database design, the approach is particularly well suited to projects on underdocumented languages. The workflow is compatible with existing open standards and can be adapted for similar studies targeting inflectional systems beyond person marking.

Supplementary Files

The code for producing the map and figures, some additional metadata on the languages, and the orthography profiles can be found on GitHub at https://github.com/isw-unibe-ch/PMST-Meta. This repository also hosts the manual for data curators.

Abbreviations and Glosses

1	first person
2	second person
3	third person
A	A (actor) argument of a transitive predicate
INV	inverse
IPFV	imperfective
NEG	negation
NFUT	non-future
P	P (undergoer) argument of a transitive predicate
S	sole argument of an intransitive predicate
SAP	speech-act participant (first and second person)
SC	South-Central (branch of Trans-Himalayan)
SG	singular
TR	transitive

Notes

[1] The ISO codes given in brackets are strictly speaking incorrect, but ISO codes are necessary for Paralex validation (see section 3). The code given for Hyow actually represents Asho Chin, of which Hyow is considered to be a dialect, though the two are not readily mutually intelligible. The code given for Ranglong actually refers to a group of distinct varieties (‘Halam’), many of which exhibit low mutual intelligibility with Ranglong.

[2] https://www.paralex-standard.org/.

[3] https://github.com/isw-unibe-ch.

[4] https://zenodo.org/communities/pmst/.

[5] https://zenodo.org/communities/paralex/.

[6] The number of scenarios varies, as the languages may or may not have clusivity or dual number; and the number of forms varies with the amount of variation recorded.

[7] https://www.paralex-standard.org/standard/.

[8] The terms ‘inclusive’ and ‘exclusive’ are used to distinguish person forms that include both the speaker and the addressee(s) from those that do not include the addressee.

Acknowledgements

We would like to express our thanks to native speaker collaborators/linguists Marina Infimate (Hmar), Mechek Sampar Awan (Chiru), Koninglee Wanglar (Monsang), Thotson Langhu (Anal Naga), and Jessi Tara Ranglong (Ranglong) for assisting us with source form collection and correction. Pavel Ozerov collaborated with us on the curation of the Anal Naga dataset. Many thanks also to Shobhana Chelliah and Rex Rengpu Khullar for their support with the Lamkang data. Zahid Akter’s help with the Pangkhua data is also gratefully acknowledged.

We also thank student assistants Sarah Schmid, Philipp Theiler, and Antonia Pauly at the University of Bern for their assistance with data entry and figure adjustments.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

SA: Conceptualization, Data Curation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing; HLB: Data Curation, Writing – original draft, Writing – review & editing; JR: Data Curation, Writing – original draft, Writing – review & editing; PG: Writing – original draft, Writing – review & editing; MZ: Data Curation, Writing – review & editing; LK: Conceptualization, Data Curation, Funding Acquisition, Supervision, Writing – original draft, Writing – review & editing.