
Figure 1
Liturgy defines positions for chant on each day of the year. Different communities of practitioners might, however, use different chants in some of these positions. Each manuscript therefore documents how a particular ecclesiastical community placed chants in the liturgy. This assignment is communicated via rubrics: instructions in red ink that indicate specifics such as the feast, the service during the day, or the genre. Comparing chants in the same liturgical position across multiple manuscripts reveals some variety. Here, the Invitatory antiphon (rubric parts highlighted by pink rectangles) for the feast Vigilia Nativitatis Domini (rubric parts highlighted in blue) is Hodie scietis in (a), (c) and (d); Levate capita vestra in (b); Prope est jam dominus in (e); and Christus adveniet nobis in (f).
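
The comparison shown in Figure 1 can also be expressed over chant records. The following is a minimal sketch, not part of PyCantus: the function name `variants_by_position` and the choice of `feast` and `genre` as the position key are assumptions made for illustration, and the panel labels (a) to (f) stand in for real sigla. The incipits are those named in the caption.

```python
from collections import defaultdict


def variants_by_position(records, keys=("feast", "genre")):
    """Map each liturgical position to {incipit: [sigla using that chant]}."""
    table = defaultdict(lambda: defaultdict(list))
    for record in records:
        position = tuple(record[k] for k in keys)
        table[position][record["incipit"]].append(record["siglum"])
    # Convert nested defaultdicts to plain dicts for predictable behaviour.
    return {pos: dict(chants) for pos, chants in table.items()}


FEAST = "Vigilia Nativitatis Domini"
GENRE = "Invitatory antiphon"  # plain label, not an official genre code

# One record per panel of Figure 1; panel labels are used in place of sigla.
records = [
    {"siglum": "(a)", "feast": FEAST, "genre": GENRE, "incipit": "Hodie scietis"},
    {"siglum": "(b)", "feast": FEAST, "genre": GENRE, "incipit": "Levate capita vestra"},
    {"siglum": "(c)", "feast": FEAST, "genre": GENRE, "incipit": "Hodie scietis"},
    {"siglum": "(d)", "feast": FEAST, "genre": GENRE, "incipit": "Hodie scietis"},
    {"siglum": "(e)", "feast": FEAST, "genre": GENRE, "incipit": "Prope est jam dominus"},
    {"siglum": "(f)", "feast": FEAST, "genre": GENRE, "incipit": "Christus adveniet nobis"},
]

variants = variants_by_position(records)
for incipit, sigla in variants[(FEAST, GENRE)].items():
    print(f"{incipit}: {', '.join(sigla)}")
```

In practice one would probably group by Cantus ID rather than incipit, since spellings of the same chant text vary between sources.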

Figure 2
A simplified schema of our contribution.

Figure 3
Cataloguing chant in the Cantus ecosystem. An expert identifies chants in a manuscript (left panel) and creates their database records (middle panel, top). The key step in creating a chant record is assigning the Cantus ID: identifying which unit of chant repertoire is on the page (pink process). Besides transcribing the text, this requires liturgical expertise, as one must correctly interpret abbreviated notes in the manuscript (rubrics) to identify the liturgical position and function of the chant (dark green process); together with the text of the chant, this allows one to select the correct Cantus ID among the ‘master records’ in the Cantus Index (right panel, top). Links to the source record (middle panel, bottom) and to the page (folio) within the source are added (light blue process). Once a record with a Cantus ID is added to a database in the Cantus ecosystem, the Cantus Index federated search mechanism (right panel, bottom) will discover the record (dark purple process). Descriptions of individual fields mentioned in the figure can be found in Tables 1 and 2. (Screenshots from the given URLs have been adjusted for readability.)
Table 1
Chants fields overview. The asterisk (*) indicates required fields.
| Field | Description |
|---|---|
| chantlink* | URL link directly to the chant entry in the external database. Unique ID. |
| incipit* | The opening words of the chant. |
| cantus_id* | The Cantus ID associated with the chant (e.g. 007129a). |
| mode | Mode of the chant. |
| siglum* | Abbreviation for the source manuscript or collection (e.g. A‑ABC Fragm. 1), ideally RISM. |
| position | Order of the chant in the office (first, second, etc.). |
| folio* | Folio information for the chant. |
| sequence | The order of the chant on the folio. |
| feast | Feast or liturgical occasion when the chant is used. |
| feast_code | Additional identifier unifying feasts with multiple spellings. The values are meaningful in Cantus Index. |
| genre | Genre of the chant, such as antiphon (A), responsory (R), etc. |
| office | The liturgy in which the chant is used, such as Matins (M) or Lauds (L). |
| srclink* | URL link to the source in the external database. |
| melody_id | The Melody ID associated with the chant (e.g. 001216m1). Rarely used. |
| full_text | Full text of the chant. |
| melody | Melody encoded in Volpiano. |
| db* | Abbreviation of the source database. |
| image | URL link to an image of the manuscript page. |
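
To make the required/optional distinction in Table 1 concrete, the sketch below checks rows of a chants CSV for blank required fields. It uses only the Python standard library and is not the PyCantus API; the helper `missing_required` and the sample rows (including the `example.org` URLs and `X-Demo` sigla) are hypothetical. The Cantus ID `007129a` is the example value from the table.

```python
import csv
import io

# Required chant fields, i.e. those marked with an asterisk in Table 1.
REQUIRED_CHANT_FIELDS = (
    "chantlink", "incipit", "cantus_id", "siglum", "folio", "srclink", "db",
)


def missing_required(row, required=REQUIRED_CHANT_FIELDS):
    """Return the names of required fields that are absent or blank in a row."""
    return [name for name in required if not (row.get(name) or "").strip()]


# Hypothetical two-row sample; the second row lacks a Cantus ID.
SAMPLE = """chantlink,incipit,cantus_id,siglum,folio,srclink,db,mode
https://example.org/chant/1,Example incipit one,007129a,X-Demo 1,001r,https://example.org/source/1,CD,
https://example.org/chant/2,Example incipit two,,X-Demo 2,014v,https://example.org/source/2,CD,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
problems = {row["chantlink"]: missing_required(row) for row in rows}
for link, missing in problems.items():
    print(link, "->", missing or "ok")
```

Optional fields such as `mode` may be left empty without being reported.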

Figure 4
Overview of support for source metadata among Cantus Index database front‑ends. Lightest green indicates support under a differently named field; darkest green indicates fields that were selected to be included in CantusCorpus v1.0.
Table 2
Sources fields overview. The asterisk (*) indicates required fields.
| Field | Description |
|---|---|
| title | Manuscript name (may use siglum). |
| siglum* | Abbreviation for the source manuscript, possibly RISM. |
| century | Text identifying the century of the source. |
| provenance | Place of origin or use of the source. |
| srclink* | URL link to the source in the external database. Unique ID. |
| cursus | Secular or Monastic cursus of the source. |
| num_century | Integer representation of a century. |
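
Since `srclink` appears in both tables and serves as the unique ID of a source, chant records can be linked to their source metadata through it. The following is a minimal sketch of that join with plain dictionaries; `attach_sources` is a hypothetical helper, and all sample values (including the textual form of `century`) are made up for illustration.

```python
def attach_sources(chants, sources):
    """Pair each chant dict with its source dict, or None if no source matches."""
    by_link = {source["srclink"]: source for source in sources}
    return [(chant, by_link.get(chant["srclink"])) for chant in chants]


# Hypothetical records; only the fields needed for the join are shown.
sources = [
    {
        "srclink": "https://example.org/source/1",
        "siglum": "X-Demo 1",
        "century": "14th century",  # assumed textual form
        "num_century": "14",
        "cursus": "Monastic",
    },
]
chants = [
    {"chantlink": "https://example.org/chant/1", "srclink": "https://example.org/source/1"},
    {"chantlink": "https://example.org/chant/2", "srclink": "https://example.org/source/9"},
]

paired = attach_sources(chants, sources)
for chant, source in paired:
    label = source["siglum"] if source else "no source record"
    print(chant["chantlink"], "->", label)
```

Chants whose `srclink` has no counterpart among the sources are kept and paired with `None`, so that missing metadata stays visible rather than silently dropping records.
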
Table 3
Basic quantitative values of the chants part of the CantusCorpus v1.0 dataset.
| Chant records in chants.csv | Number |
|---|---|
| All | 888,010 |
| With Volpiano melody | 60,588 |
| With Volpiano melody of 20+ notes | 44,625 |
Table 4
Basic quantitative values of the sources part of the CantusCorpus v1.0 dataset.
| Source records in sources.csv | Number |
|---|---|
| All | 2,278 |
| All with 100+ chants | 508 |
| Those with provenance value | 1,606 |
| Those with century value | 2,240 |
| Those with cursus value | 345 |
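
Counts of the kind reported in Table 4 can be derived from the two CSV files roughly as follows. This is a sketch under stated assumptions rather than the procedure used to produce the table: `source_summary` is a hypothetical helper, a field counts as having a value when it is non-blank, and ‘100+’ is read as ‘at least `min_chants`’.

```python
from collections import Counter


def source_summary(sources, chants, min_chants=100):
    """Count sources overall, by chant volume, and by filled metadata fields."""
    chants_per_source = Counter(chant["srclink"] for chant in chants)

    def filled(field):
        # A value is considered present if it is non-blank (assumption).
        return sum(1 for s in sources if (s.get(field) or "").strip())

    return {
        "all": len(sources),
        "with_min_chants": sum(
            1 for s in sources if chants_per_source[s["srclink"]] >= min_chants
        ),
        "with_provenance": filled("provenance"),
        "with_century": filled("century"),
        "with_cursus": filled("cursus"),
    }


# Hypothetical toy data; a threshold of 2 replaces 100 to keep it small.
sources = [
    {"srclink": "s1", "provenance": "Somewhere", "century": "12th century", "cursus": "Secular"},
    {"srclink": "s2", "provenance": "", "century": "15th century", "cursus": ""},
    {"srclink": "s3", "provenance": "", "century": "", "cursus": ""},
]
chants = [{"srclink": "s1"}, {"srclink": "s1"}, {"srclink": "s2"}]

summary = source_summary(sources, chants, min_chants=2)
print(summary)
```
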
Table 5
Overview of data distribution among source databases. The symbol # is used as an abbreviation for ‘number of’. Abbreviations of database codes can be found in Subsubsection 4.1.1. The column annotated with # sources (100+) contains the number of sources with more than 100 chant records associated with them.
| Source DB code | # chants | # CIDs | # unique CIDs | # sources | # sources (100+) |
|---|---|---|---|---|---|
| CD | 429,982 | 30,350 | 14,662 | 231 | 166 |
| MMMO | 212,231 | 17,479 | 7,503 | 426 | 151 |
| CSK | 22,539 | 7,201 | 212 | 542 | 12 |
| FCB | 36,103 | 7,889 | 534 | 30 | 29 |
| CPL | 30,433 | 7,666 | 143 | 27 | 17 |
| PEM | 32,738 | 9,184 | 538 | 305 | 25 |
| SEMM | 104,678 | 23,103 | 11,625 | 487 | 81 |
| HCD | 11,278 | 5,374 | 54 | 10 | 9 |
| A4M | 2,738 | 2,006 | 12 | 142 | 3 |
| HYM | 5,290 | 680 | 323 | 83 | 20 |
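
The per-database columns of Table 5 can be sketched as a simple aggregation over chant records. The code below is illustrative only: `per_database_counts` is a hypothetical helper, and it assumes that ‘# CIDs’ counts distinct Cantus IDs within a database and that ‘# unique CIDs’ counts those Cantus IDs that occur in no other database, which the caption does not state explicitly.

```python
from collections import defaultdict


def per_database_counts(chants):
    """Return {db: {"chants", "cids", "unique_cids"}} from chant records."""
    n_chants = defaultdict(int)
    cids_by_db = defaultdict(set)
    for chant in chants:
        n_chants[chant["db"]] += 1
        cids_by_db[chant["db"]].add(chant["cantus_id"])

    # Number of databases in which each Cantus ID occurs at least once.
    dbs_per_cid = defaultdict(int)
    for cids in cids_by_db.values():
        for cid in cids:
            dbs_per_cid[cid] += 1

    return {
        db: {
            "chants": n_chants[db],
            "cids": len(cids),
            "unique_cids": sum(1 for cid in cids if dbs_per_cid[cid] == 1),
        }
        for db, cids in cids_by_db.items()
    }


# Hypothetical toy records; "a", "b", "c" stand in for Cantus IDs.
chants = [
    {"db": "CD", "cantus_id": "a"},
    {"db": "CD", "cantus_id": "a"},
    {"db": "CD", "cantus_id": "b"},
    {"db": "MMMO", "cantus_id": "b"},
    {"db": "MMMO", "cantus_id": "c"},
]

counts = per_database_counts(chants)
for db, row in counts.items():
    print(db, row)
```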

Figure 5
Simplified schema of the PyCantus data model (‘content’ attributes only). The full UML model can be found in Supplementary File S2.
