Content-Based Search Tools for Large Sets of Print and Manuscript Music

Jürgen Diet; Janosch Umbreit

doi:10.5334/tismir.293

1 Introduction

The Bavarian State Library is an important contributor to the landscape of digital musical collections in several regards: It houses an internationally acclaimed collection of music, of which a significant portion is digitally accessible via the Munich Digitization Center (MDZ).^¹ Together with the Saxon State and University Library and supported by funds of the German Research Foundation (DFG), it hosts the Specialized Information Service Musicology, musiconn.^² Owing to these strong ties to physical and digital music collections, combined with the shared stewardship of a service intended to facilitate research for the musicological community, the library has had a long‑standing interest in harnessing content‑based search technologies for its services.

Despite researchers reporting good to very good results in a variety of test cases for individual optical music recognition (OMR) solutions, these projects are often limited to one specific corpus and cannot be applied satisfactorily to the full spectrum of music found in a major research library whose resources span medieval notation systems to Common Western Music Notation to modern or non‑European forms of music representation. Beyond the immediate problem of achieving acceptable large‑scale OMR results, these data must be integrated into library catalogs. A search interface must be built and implemented for users to formulate musical queries with algorithms that match and rank search results. Achieving these goals necessitates not only one very flexible OMR solution or perhaps several interchangeable or interactive programs but will also depend on host institutions that are able and willing to invest in further development and maintenance of indexing technologies and user interfaces.

In this paper, we describe the efforts undertaken at the Bavarian State Library to integrate content‑based music searching into its services. These efforts have developed over several years and are tied to the library’s contributions to two collaborative projects: the Répertoire International des Sources Musicales (RISM) project and the aforementioned Specialized Information Service Musicology. The RISM Catalog integrates the incipits recorded for the project into a general search interface for its records. In the case of musiconn, the newly developed tool musiconn.scoresearch serves as a prototype of musical content searching based on OMR data generated from a selection of the library’s own digital collections.

Our paper covers the process of building the indexed corpora. It reports on decisions concerning OMR technologies, indexing, search strategies, and interface design. It also summarizes observations from our own experiences and information gathered from user feedback over the years. We discuss ways in which simplified OMR data can make musical searches both more resilient and more interoperable, as well as challenges that arise from integrating musical content searching with more traditional library catalogs.

The methodologies we describe are not newly developed but rather the result of the gradual development of the field, as detailed in Section 2. In giving insight into the application of these methods, we hope to offer an educational resource showcasing the solutions we have so far implemented, as well as the lessons we have learned along the way, for the benefit of other institutions who are interested in extending their capabilities for content‑based music search. Furthermore, we hope that the experiences we detail can offer impulses for MIR researchers and students, in particular concerning the application of MIR methods to broader information‑retrieval contexts and their place in the user experience in the context of a catalog system. Finally, we want to share our work and the corpora on which it builds in order to invite further collaboration, as well as independent research based on these freely available assets.

As this contribution targets three groups of potential readers (practitioners, researchers, and students), its sections range from a broad overview of the field to detail‑oriented implementation. Students who are new to MIR may want to trace the development of the field described in Section 2 and consult the fundamental methods of musical query–matching outlined in Section 4. Librarians interested in designing similar systems may find Sections 5 and 6, which deal with considerations of interface design and user experience, most instructive, while the more detailed technical information of Sections 4.1 and 4.2 is geared toward those with a technical background seeking concrete information about the tools we use and develop.

2 Related Work

Since its first introduction by Stephen Downie (Downie, 1999), retrieval of pieces of musical information by means of n‑gram matching (see Section 4) has become the de facto standard method for processing musical search queries. The observation underlying this powerful method is that, while musical information extends over several features, or facets, as Downie calls them, for the purpose of matching a search query to a list of results, it is often sufficient to reduce this complex information to a series of pitch progressions, or intervals. These stepwise progressions can then be compared quite efficiently, analogous to methods first developed for text comparison. In 2012, Hankinson et al. first presented a prototypical application that combined the n‑gram–based search introduced by Downie with an OMR pipeline that, again, took cues from the progress that had by then been made in the field of large‑scale text recognition (Hankinson et al., 2012). In doing so, they formulated the goal of making large‑scale content‑based searches of musical texts possible via automated music recognition in combination with search matching and indexing technology. In large parts, this combination of technologies finds its application in the process described here.

The work of Hankinson et al. took place as part of the Single Interface for Music Score Searching and Analysis (SIMSSA^³) initiative headed by Ichiro Fujinaga (Fujinaga et al., 2014), which, over the course of its 10‑year development, has made many contributions to MIR research, both in the fields of OMR (Fujinaga and Vigliensoni, 2019) and computer‑aided music analysis. See, for example, the ELVIS project, which uses n‑grams to formulate queries over chord progressions (Antila and Cumming, 2014), or the ‘Cantus Ultimus’ project,^⁴ which builds on the Cantus Database and implements an OMR‑powered search functionality for selected plainchant manuscripts (Helsen et al., 2014).

While OMR technology and the hope to facilitate content‑based searching of large music collections have thus been stated goals of MIR researchers for years, with important strides already made in the field, it still proves difficult to implement at a scale that accommodates large varied collections in a sustainable manner. The application of n‑gram matching proposed by Downie and its combination with OMR as first described by Hankinson et al. is still a solid foundation for much work, but even large and accomplished projects such as ‘Cantus Ultimus’ apply them primarily to curated, homogeneous corpora.

Some recent projects have given further important impulses to the discipline in applying large‑scale music recognition to the digitized holdings of libraries beyond well‑established research projects. The F‑Tempo Project (Crawford et al., 2023),^⁵ which is likewise associated with the SIMSSA initiative, offers a robust search tool based on page‑wise comparison of early music prints across several participating libraries. The OmniOMR project (Hajič jr. et al., 2023), a collaboration between the Charles University, Prague, and the Moravian Library, Brno, which seeks to add OMR‑powered search to several collections held at a large research library, provides a highly relevant and recent point of comparison. Furthermore, some long‑standing contributors to the field, such as the Cantus Database^⁶ or RISM Online,^⁷ offer mature content‑based search of vast inventories of full chants and incipits, respectively, based on human data entry.

3 Corpus

A substantial portion of the Bavarian State Library holdings is already digitized and available online. The digital collections of the entire library currently encompass more than three million items (digitized manuscripts, prints, music, maps, photographs, newspapers, and magazines). Ninety‑seven percent of these digitized items have been processed with OCR software and are therefore searchable as full‑text versions.

A content‑based search for the digitized music prints and music manuscripts is also a high priority and goal. To this end, the Bavarian State Library has generated OMR data since 2016, resulting in a substantial corpus of OMR data for its digitized music prints. Over several projects funded by the DFG, the Bavarian State Library has digitized and generated OMR data for the following collections:

A complete edition of Ludwig van Beethoven’s works (Breitkopf & Härtel, 1862–1865)
A complete edition of Georg Friedrich Händel’s works (edited 1858–1902 by Friedrich Chrysander)
A complete edition of Franz Liszt’s works (ed. 1907–1936 by Carl Alexander)
A complete edition of Felix Mendelssohn Bartholdy’s works (ed. 1874–1877 by Julius Rietz)
A complete edition of Franz Schubert’s works (Breitkopf & Härtel, 1884–1897)
A complete edition of Robert Schumann’s works (ed. 1879–1893 by Clara Schumann)
The first and second series of ‘Monuments of German Musical Art’ (‘Denkmäler deutscher Tonkunst’) (Breitkopf & Härtel, 1892–1931)
A printed production archive from the historical archive of the music publisher B. Schott’s Söhne, Mainz

Although the subset of music prints from the Bavarian State Library processed with OMR amounts to approximately 160,000 pages, this represents only a small portion of the Music Department’s holdings. The library owns around 500,000 music prints. Approximately 20,000 (4%) of these have been digitized, and about 3,000 of the digitized music prints have been processed with OMR. This corresponds to a mere 0.6% of all 500,000 music prints.

In a current project that began in 2024 and will continue until 2026, the Bavarian State Library will increase its OMR data by an additional 350,000 pages from approximately 2,000 early music prints dating from the 16th and 17th centuries, made up predominantly of polyphonic works. In addition, 133 handwritten choir books and other manuscript sources from the same period will be processed.

Another large corpus of musical data that is searchable on a content level via a search interface developed by the Bavarian State Library comprises the almost 2.5 million incipits in the RISM database. The RISM database consists of around 1.5 million musical sources (mostly music manuscripts, but also historic music prints) that have been entered by many music librarians and musicologists worldwide.^⁸ Since 2010, the Bavarian State Library has developed and hosted the RISM Catalog,^⁹ which now offers content‑based searching for these RISM incipits.

4 OMR Technologies, Indexing, and Searching

The primary goal of the OMR technologies in use at the Bavarian State Library is to make the library’s vast music collections more accessible to its users. Therefore, they fall under the definition of ‘Search’ in the taxonomy suggested by Calvo‑Zaragoza et al. (2020), the second of four possible levels of complexity ranging from Metadata Extraction to Structured Encoding. As such, these OMR applications can do without certain complexities, which are required for structured encoding of full scores. However, OMR solutions must be highly scalable and allow for very efficient processing of search queries.

Since the goal of each of the applications is to guide users to an entry in the library catalog (or integrated catalogs of other institutions) and, in the case of digitized collections, perhaps point directly to a specific page in a digitized document, all musical data must ultimately be stored in searchable indices, which, in our case, are managed with the Apache Solr platform.^¹⁰ Building on the method proposed by Downie, these indices store the data both as sequences of pitches and as sequences of intervals, against which end users can formulate queries. These queries are processed as sequences of n‑grams: they are broken up into overlapping chunks ranging from three to eight notes each that are then compared to the indexed musical data, which is represented in the same manner as overlapping sequences of notes (see Figure 1).^¹¹

Figure 1: Three overlapping n‑grams of size three capturing a melodic progression.
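The chunking described above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the indexing code used in our services: pitches are represented as MIDI note numbers, and the function names `intervals` and `ngrams` are hypothetical. Only the window sizes of three to eight notes are taken from the description above.

```python
from typing import List, Tuple


def intervals(pitches: List[int]) -> List[int]:
    """Semitone steps between consecutive pitches (given as MIDI note numbers)."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def ngrams(seq: List[int], n_min: int = 3, n_max: int = 8) -> List[Tuple[int, ...]]:
    """All overlapping windows of n_min..n_max consecutive items of seq."""
    windows = []
    for n in range(n_min, n_max + 1):
        # A sequence shorter than n yields an empty range and thus no windows.
        for start in range(len(seq) - n + 1):
            windows.append(tuple(seq[start:start + n]))
    return windows


melody = [60, 62, 64, 65, 67]           # C4 D4 E4 F4 G4
pitch_trigrams = ngrams(melody, 3, 3)   # [(60, 62, 64), (62, 64, 65), (64, 65, 67)]
interval_trigrams = ngrams(intervals(melody), 3, 3)  # [(2, 2, 1), (2, 1, 2)]
```

A five‑note melody thus yields three overlapping pitch n‑grams of size three, as in Figure 1. Because interval n‑grams record only the steps between notes, they stay the same when the melody is transposed.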

The reduction of musical information to pitches and intervals has important consequences for the requirements of the OMR technology stack and the way users interact with the data. Because the goal of the search application is score retrieval, the musical information is greatly simplified. Each staff is processed on its own, and only the highest note in chords is considered. Neither indexing nor search queries take rhythm into consideration. While these reductions in musical information may seem drastic, they ensure efficient query processing and keep the user interface simple. Furthermore, reducing the complexity of the processed data also makes the system more robust for end‑users. As Padilla et al. (2015) observe in their experiments comparing multiple sources of OMR data corresponding to a single work, wrong rhythms and missing notes and rests are among the most frequent sources of errors, an observation we can confirm from our own experience with correcting OMR data.
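The reduction can be sketched in a few lines of Python. The input structure is a hypothetical one chosen for illustration (one list of simultaneous MIDI pitches per event, with an empty list standing for a rest); it is not the data model of our pipeline.

```python
from typing import Dict, List, Sequence

# One event holds all pitches sounding together on a staff (hypothetical structure);
# an empty event stands for a rest. Durations are deliberately absent.
Event = Sequence[int]


def top_line(events: Sequence[Event]) -> List[int]:
    """Keep only the highest note of each chord and drop rests."""
    return [max(event) for event in events if event]


def reduce_score(staves: Dict[str, Sequence[Event]]) -> Dict[str, List[int]]:
    """Process each staff on its own, yielding one pitch sequence per staff."""
    return {name: top_line(events) for name, events in staves.items()}


score = {
    "upper": [[60, 64, 67], [], [62], [59, 65]],
    "lower": [[48], [43, 55]],
}
reduced = reduce_score(score)  # {'upper': [67, 62, 65], 'lower': [48, 55]}
```

Since neither rhythm nor rests survive the reduction, a misread duration or a missing rest in the OMR output leaves the resulting sequence unchanged.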

Given the underlying requirement to integrate OMR data into a uniform search interface, the final output of the musical data is predetermined, but the paths toward this output differ depending on the corpus. At present, three distinct sources of musical data and three corresponding pipelines flow into search indices, maintained by the Bavarian State Library (see Figure 2). Two of them, the RISM Catalog and the first implementation of musiconn.scoresearch, are actively running services, while the third, the repertoire extension of scoresearch, is currently being implemented.

Figure 2: Different document‑processing pipelines leading to the same indexing technology.

4.1 musiconn.scoresearch

4.1.1 SmartScore

Our first implementation of an OMR workflow in 2017 was designed to deal with a selection of printed music editions from the 19th century and to yield searchable strings of pitches and intervals. It relied on the commercially available application SmartScore.^¹² Given the substantial size of the corpus, the OMR process used batch processing on the library’s servers, handling an average of 70 images per hour and using SmartScore to generate MusicXML for each individual image. During the first iteration of the project, further processing of the MusicXML files included a reduction of the polyphony to the first voice per staff (presumed to be the highest‑pitched voice and to carry the melody) and its representation as a pitch sequence, from which an interval sequence could be inferred. The parsing from MusicXML to pitch sequences was accomplished with the help of Dorien Herremans’ musicXMLparserDH.^¹³
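To make this reduction step concrete, the following Python sketch extracts the first voice per staff from a partwise MusicXML fragment. It is a stand‑in built on the standard library only and does not reproduce musicXMLparserDH; the function names and the sample fragment are hypothetical, and "first voice" is assumed here to mean the lowest voice number found on a staff.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

STEP_TO_SEMITONE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

# Hypothetical sample: a G/B-flat chord, a rest, and an E-flat in voice 1; a C in voice 2.
SAMPLE = """<score-partwise version="3.1">
  <part id="P1"><measure number="1">
    <note><pitch><step>G</step><octave>4</octave></pitch><voice>1</voice><staff>1</staff></note>
    <note><chord/><pitch><step>B</step><alter>-1</alter><octave>4</octave></pitch><voice>1</voice><staff>1</staff></note>
    <note><rest/><voice>1</voice><staff>1</staff></note>
    <note><pitch><step>E</step><alter>-1</alter><octave>4</octave></pitch><voice>1</voice><staff>1</staff></note>
    <note><pitch><step>C</step><octave>4</octave></pitch><voice>2</voice><staff>1</staff></note>
  </measure></part>
</score-partwise>"""


def midi_number(pitch_el):
    """Convert a MusicXML <pitch> element to a MIDI note number."""
    step = pitch_el.findtext("step")
    alter = int(float(pitch_el.findtext("alter", default="0")))
    octave = int(pitch_el.findtext("octave"))
    return 12 * (octave + 1) + STEP_TO_SEMITONE[step] + alter


def first_voice_pitches(musicxml):
    """Map (part id, staff) to the pitch sequence of its lowest-numbered voice."""
    per_staff = defaultdict(lambda: defaultdict(list))  # (part, staff) -> voice -> pitches
    root = ET.fromstring(musicxml)
    for part in root.iter("part"):
        for note in part.iter("note"):
            pitch = note.find("pitch")
            if pitch is None:            # rests and unpitched notes are skipped
                continue
            staff = note.findtext("staff", default="1")
            voice = note.findtext("voice", default="1")
            seq = per_staff[(part.get("id"), staff)][voice]
            value = midi_number(pitch)
            if note.find("chord") is not None and seq:
                seq[-1] = max(seq[-1], value)   # chord member: keep the highest note
            else:
                seq.append(value)
    return {key: voices[min(voices, key=int)] for key, voices in per_staff.items()}


result = first_voice_pitches(SAMPLE)  # {('P1', '1'): [70, 63]}
```

The interval sequence then follows from the pitch sequence by taking the difference between neighboring values.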

Since the inclusion of OMR data generated by MuRET necessitated the transition to an MEI‑based workflow, the data initially generated by SmartScore were later batch‑translated to MEI using Verovio and reindexed using a custom‑built parser (compare Section 4.1.2 for more details on the move from MusicXML to MEI).

4.1.2 MuRET

The planned addition of a collection of 16th‑ and 17th‑century sources more than twice the size of the initial musiconn.scoresearch corpus requires a complete reworking of the OMR tool chain. Since SmartScore is not equipped to handle mensural notation, the MuRET platform (which is hosted at the University of Alicante and is under active development at the time of writing) was chosen as the most suitable tool. Initially developed in tandem with the HISPAMUS project, this web‑based application is specifically designed to handle corpora of historic music (Rizo et al., 2018).^¹⁴

Unlike the recognition of Common Western Notation with SmartScore, MuRET allows for a training phase in which recognition is achieved by training on a representative sample of the data. This training can involve several rounds of human corrections. The image recognition and correction of its output are both handled via a web interface. For musiconn.scoresearch, this training was done using 21 sample sources consisting of 5,136 total images and was split into two phases. In the first phase, the initial 20 images were recognized and then manually corrected. In the second phase, 10% (evenly distributed) of the remaining images of each source were processed again, and the output was manually corrected once more. This iterative approach to refining the OMR model builds on experiences the team around MuRET gained through earlier projects that involved batch transcription of mensural notation, such as Spanish PolifonIA. Compare Rizo et al. (2024) for a detailed report on the project and the internal workflow of MuRET that could only be sketched out here.
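The selection of images for the two phases can be sketched as follows. This is an illustration only and not part of MuRET: the function name is hypothetical, and the sketch assumes that the 20 initial images are taken per source and that "evenly distributed" means equally spaced positions among the remaining images.

```python
from typing import List, Tuple


def training_sample(n_images: int, first: int = 20, fraction: float = 0.10) -> Tuple[List[int], List[int]]:
    """Return image indices for phase one (leading images) and phase two (spaced sample)."""
    head = list(range(min(first, n_images)))
    rest = list(range(len(head), n_images))
    k = round(len(rest) * fraction)
    if k == 0:
        return head, []
    step = len(rest) / k
    # Pick k positions spread evenly over the remaining images.
    spaced = [rest[int(i * step)] for i in range(k)]
    return head, spaced


phase_one, phase_two = training_sample(120)
# phase_one covers images 0..19; phase_two is [20, 30, 40, ..., 110]
```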

Because MuRET outputs MEI XML rather than MusicXML, the decision was made to retool the data‑processing pipeline to support MEI and to translate and re‑index the existing OMR files. This translation will ensure better future compatibility and extensibility of the searchable corpus. It also allows for further refinements in the indexing of multiple voices, which can now be indexed independently (chords, however, are still being reduced to their highest note). The MEI files are processed by a custom‑built parser to yield the pitch and interval sequences for the search index.
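Since the parser itself is not yet published, the following Python sketch only illustrates the general idea and should not be read as our implementation. It assumes that pitches are encoded in the `pname`, `oct`, and `accid` (or `accid.ges`) attributes of MEI `note` elements, handles only a small subset of MEI, and uses a hypothetical, heavily abbreviated sample fragment.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

MEI = "{http://www.music-encoding.org/ns/mei}"
PNAME = {"c": 0, "d": 2, "e": 4, "f": 5, "g": 7, "a": 9, "b": 11}
ACCID = {"s": 1, "f": -1, "n": 0, "ss": 2, "x": 2, "ff": -2}

# Hypothetical fragment (no header): one staff with two layers (voices).
SAMPLE = """<mei xmlns="http://www.music-encoding.org/ns/mei"><music><body><mdiv><score><section>
  <measure n="1"><staff n="1">
    <layer n="1">
      <note pname="g" oct="4" dur="8"/>
      <beam><note pname="g" oct="4" dur="8"/><note pname="g" oct="4" dur="8"/></beam>
      <note pname="e" oct="4" accid="f" dur="2"/>
    </layer>
    <layer n="2">
      <chord dur="1"><note pname="c" oct="3"/><note pname="g" oct="3"/></chord>
    </layer>
  </staff></measure>
</section></score></mdiv></body></music></mei>"""


def note_to_midi(note: ET.Element) -> int:
    """Convert an MEI <note> to a MIDI note number (attribute-based accidentals only)."""
    accid = note.get("accid") or note.get("accid.ges") or "n"
    return 12 * (int(note.get("oct")) + 1) + PNAME[note.get("pname")] + ACCID.get(accid, 0)


def collect(element: ET.Element, out: List[int]) -> None:
    """Walk a layer in document order; chords are reduced to their highest note."""
    for child in element:
        if child.tag == MEI + "chord":
            notes = child.findall(MEI + "note")
            if notes:
                out.append(max(note_to_midi(n) for n in notes))
        elif child.tag == MEI + "note":
            out.append(note_to_midi(child))
        else:
            collect(child, out)   # containers such as beams; rests have no children


def voice_sequences(mei_xml: str) -> Dict[Tuple[str, str], List[int]]:
    """Map (staff number, layer number) to one pitch sequence per voice."""
    voices: Dict[Tuple[str, str], List[int]] = {}
    for staff in ET.fromstring(mei_xml).iter(MEI + "staff"):
        for layer in staff.findall(MEI + "layer"):
            key = (staff.get("n", "1"), layer.get("n", "1"))
            collect(layer, voices.setdefault(key, []))
    return voices


voices = voice_sequences(SAMPLE)  # {('1', '1'): [67, 67, 67, 63], ('1', '2'): [55]}
```

Keying the result by staff and layer keeps each voice as a separate sequence, which corresponds to the independent indexing of multiple voices described above.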

The continued work on indexing and displaying the data at the heart of musiconn.scoresearch highlights the significant technical and domain expertise that such a platform requires. Improving MuRET to the point where it could yield satisfactory OMR data required close coordination between the team around its developer, David Rizo, and musiconn over more than 1.5 years of repeated training and correction of the models. Similarly, transitioning the MusicXML files generated by SmartScore to MEI files compatible with the new MuRET‑based process could only be achieved through close cooperation between the musiconn team, tasked with choosing and evaluating examples and test cases, and members of the library’s IT department who specialize in Java, PHP, and JavaScript development.

4.2 RISM

The almost 2.5 million manually entered incipits present in the RISM database form another important source of musical data. Unlike the OMR data in the musiconn.scoresearch project, these short snippets are not based on digitized images but instead were entered by hand using Plaine&Easie Code (see Figure 3).^¹⁵ They are monophonic from the beginning and therefore do not need to be collapsed into a single voice and can be transformed into a pitch sequence as they are. Designed to be one of many search options for the RISM database, these musical snippets are meant to be queried in connection with the rich metadata offered by RISM. As incipits, they offer only a glimpse at the initial intervals of a given composition. While the rhythmic information is recorded and used to display incipits correctly, it does not factor into the search functionality. This interval‑only search dates back to the earliest implementations of the incipit search, which, from the beginning, was meant to be limited because searches would be coupled with sophisticated metadata to filter matches.

Figure 3: Plaine&Easie Code including information on clef, key, and time, as well as musical notation (both melodic and rhythmic).
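How such an incipit can be turned into a pitch sequence is sketched below in Python. The sketch is based on a simplified reading of Plaine&Easie Code and is not the RISM implementation: it assumes that apostrophes and commas set the octave, that `x`, `b`, and `n` before a note letter are accidentals valid until the next barline (`/`), and that a key signature is given as `x` or `b` followed by the affected note names. Durations, rests, and all other symbols are ignored, in line with the interval‑only search. The function names and the sample incipit are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

SEMITONE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def key_alterations(keysig: str) -> Dict[str, int]:
    """'bBEA' -> {'B': -1, 'E': -1, 'A': -1}; 'xFC' -> {'F': 1, 'C': 1}."""
    if not keysig:
        return {}
    shift = 1 if keysig[0] == "x" else -1
    return {name: shift for name in keysig[1:] if name in SEMITONE}


def pae_to_midi(data: str, keysig: str = "") -> List[int]:
    """Extract MIDI note numbers from the notation part of an incipit (simplified)."""
    key = key_alterations(keysig)
    octave = 4
    pending: Optional[int] = None            # accidental waiting for its note
    in_bar: Dict[Tuple[str, int], int] = {}  # accidentals valid until the next barline
    pitches: List[int] = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch in "',":
            j = i
            while j < len(data) and data[j] == ch:
                j += 1
            count = j - i
            octave = 3 + count if ch == "'" else 4 - count
            i = j
            continue
        if ch in "xbn":
            shift = {"x": 1, "b": -1, "n": 0}[ch]
            pending = shift if pending is None else pending + shift
        elif ch in SEMITONE:
            if pending is not None:
                in_bar[(ch, octave)] = pending
                pending = None
            alter = in_bar.get((ch, octave), key.get(ch, 0))
            pitches.append(12 * (octave + 1) + SEMITONE[ch] + alter)
        elif ch == "/":
            in_bar = {}
        i += 1                               # digits, dots, rests etc. are skipped
    return pitches


motif = pae_to_midi("8-'GGG/2E", keysig="bBEA")  # [67, 67, 67, 63]
```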

By far the oldest dataset among those discussed here, the RISM incipits and the ways in which they can be searched reflect the evolution of the RISM project and music content searching in general over decades. The roots of the electronic incipit search date back to the CD‑ROM editions of the RISM Catalog in the 1990s.^¹⁶ This initial search, which took simple pitch letters without regard for their octave as its input, was gradually enriched to incorporate octave information, grace notes, and transposed matches, as well as substring matches from anywhere within the incipit. These later additions were ported over from musiconn.scoresearch, thus demonstrating how the two services can enrich each other, despite their significantly different project objectives, thanks to their shared data model and the significant reduction of musical complexity for the sake of query‑matching. The RISM Catalog purposefully retains this historically grown variety of search options to offer multiple entry points, each best suited for a different approach to searching music.
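The search options that accumulated over time can be summarized in a small Python sketch. It is a simplified model with hypothetical names, operating on plain lists of MIDI note numbers rather than on the Solr index, and it leaves out grace notes.

```python
from typing import List, Sequence


def intervals(pitches: Sequence[int]) -> List[int]:
    """Semitone steps between consecutive pitches."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def contains(haystack: Sequence[int], needle: Sequence[int], anywhere: bool = True) -> bool:
    """True if needle occurs in haystack, at the start only or at any position."""
    n = len(needle)
    if n == 0 or n > len(haystack):
        return False
    last_start = len(haystack) - n if anywhere else 0
    return any(list(haystack[i:i + n]) == list(needle) for i in range(last_start + 1))


def matches(incipit: Sequence[int], query: Sequence[int], transposed: bool = False,
            anywhere: bool = True, ignore_octave: bool = False) -> bool:
    if transposed:
        # Compare interval sequences, so any transposition of the query matches.
        return contains(intervals(incipit), intervals(query), anywhere)
    if ignore_octave:
        # Early pitch-letter search: compare pitch classes only.
        incipit = [p % 12 for p in incipit]
        query = [p % 12 for p in query]
    return contains(incipit, query, anywhere)


incipit = [67, 67, 67, 63, 65, 65, 65, 62]
exact = matches(incipit, [67, 67, 67, 63], anywhere=False)        # True
shifted = matches(incipit, [69, 69, 69, 65], transposed=True)     # True
inside = matches(incipit, [65, 65, 65, 62], anywhere=True)        # True
```

All options operate on the same reduced representation, which is why matching logic written for one service can be reused in the other.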

5 Search Interfaces

As with the encoding pipelines, the user interfaces for the RISM and musiconn.scoresearch services reflect the evolutionary process through which the content‑based music applications at the Bavarian State Library gradually developed. The current search offerings, from a user’s perspective, are in a state of transition, moving away from individually built solutions toward a search framework that places emphasis on modularity and reusability.

Since it is the institutional home of four specialized information services, of which musiconn is only one, the Bavarian State Library is actively involved in efforts to develop search interfaces that can be reused among different services and projects. Therefore, the institution’s content‑based search services are gradually being transitioned to the shared open‑source search interface VuFind.^¹⁷ While this transition is still under way, the plan is to eventually offer search interfaces powered by VuFind for both the RISM Catalog and musiconn.scoresearch, as well as other search options offered by musiconn. At the time of writing, the RISM Catalog has already been moved to its new VuFind home,^¹⁸ whereas the move of musiconn.scoresearch is scheduled once the integration of the new OMR data generated by MuRET is completed.

While the overall search interface is being overhauled to combine musical queries with textual metadata and a powerful faceted search, the interface for formulating music searches is the product of gradual and joint development of both the RISM Catalog and musiconn.scoresearch. It offers a point‑and‑click digital piano keyboard that can also be used via computer keyboard with the key rows mirroring the black and white keys of the piano (i.e., D5 = W, D♯5 = 3, E5 = E, etc.). Furthermore, users can also enter notes as text directly in the search field, such as C5F5G5G#5 (see Figure 4). Audio playback of each note and the real‑time display of the query, both in text and musical notation, give the user an intuitive way to ‘spell‑check’ their query. As explained in Section 4, rhythm is not taken into consideration; therefore, this search interface can capture all relevant information with relatively simple input. In both the RISM Catalog and in musiconn.scoresearch, users have the choice to search for exact or transposed matches and, among the RISM incipits, they can further specify whether to search for matches from the beginning of the incipits or anywhere within them. As described in Section 4.2, this most recent addition to the search functionality in the RISM Catalog could be transferred from the search algorithm developed for musiconn.scoresearch by Pulimootil Achankunju (2018).

Figure 4: Entering a musical query in the RISM Catalog. Note the letter ‘E’ over the E5, indicating the available keyboard entry method.
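The two text‑based entry methods can be sketched in Python as follows. The sketch is illustrative and not the interface code: the function names are hypothetical, only the `#` notation shown in the example query is supported, and the key map contains just the three assignments named in the text (the remaining keys are not spelled out in this paper and are therefore left out).

```python
import re
from typing import List

SEMITONE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
NOTE = re.compile(r"([A-G])(#?)(\d)")

# Only the three assignments mentioned in the text; the full layout is not reproduced here.
KEY_TO_NOTE = {"W": "D5", "3": "D#5", "E": "E5"}


def parse_query(text: str) -> List[int]:
    """Turn a query such as 'C5F5G5G#5' into MIDI note numbers."""
    pitches, pos = [], 0
    while pos < len(text):
        m = NOTE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at position {pos}: {text[pos:]!r}")
        name, sharp, octave = m.groups()
        pitches.append(12 * (int(octave) + 1) + SEMITONE[name] + (1 if sharp else 0))
        pos = m.end()
    return pitches


def keys_to_query(keys: str) -> str:
    """Translate typed computer-keyboard characters into the textual note form."""
    return "".join(KEY_TO_NOTE[k.upper()] for k in keys)


query = parse_query("C5F5G5G#5")             # [72, 77, 79, 80]
typed = parse_query(keys_to_query("w3e"))    # [74, 75, 76]
```

Rejecting malformed input at parse time serves a similar purpose to the visual and audio feedback of the interface: the user learns about a mistyped query before the search is run.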

The development of search interfaces that make large collections of musical data accessible for end‑users thus remains an area of active development, both within the Bavarian State Library and in the community at large. Further improvements that could be implemented in future development cycles are, among others, giving users more control over the query‑matching strategy beyond the already‑existing options ‘strict’ and ‘transposed,’ such as detailed settings for dealing with octave jumps or the degree of fuzziness in query‑matching. As our work on improving the accessibility of musical information continues, we also hope to integrate these search interfaces into future projects. We plan to add an option for musical content search for material for which we have OMR data or manual editions from collaborators to the general search tool of the Specialized Information Service, musiconn.search.^¹⁹ This step will combine the tools and algorithms developed for musiconn.scoresearch with the rich metadata already provided by the library catalog, thus further integrating traditional library searches, full‑text searches (where possible), and emergent musical content searches.

6 Experiences

Feedback from users of the content‑based music search applications provided by the Bavarian State Library yields interesting insights about strengths and weaknesses of the applications in their current state. Some users find it useful to be able to search for part of a melody without having to specify any metadata. Two use cases of this approach are the search for adaptations of musical works by other composers and the identification of musical works that have anonymous composers.

As already mentioned in Section 5, the integration of the technology developed for musiconn.scoresearch with larger and more generic search interfaces is an important goal for us. While the RISM Catalog is already able to combine musical content searching with sophisticated metadata search, musiconn.scoresearch currently does not offer any metadata search options and only very limited filtering based on collection, composer, and date. The combination of musical content searching and a traditional metadata‑based search shows huge potential but also presents institutions with significant obstacles when such an implementation is attempted.

While some items in the catalog of the Bavarian State Library—in particular, medieval manuscripts—have very detailed information on their contents and contributors, many feature only generic information about year, place, and publisher, etc. This is particularly problematic in the case of collected works that might feature many composers from different time periods.

The availability of detailed metadata is crucial, particularly with regard to collected works and where it pertains to clear distinction between a (musical) work and its edition. For example, if users are searching for the melody ‘g′–g′–g′–e♭′’, looking to trace, e.g., the legacy of Beethoven’s famous ‘fate‑motif,’ they might be interested in restricting the search results to, say, the first 50 years after the symphony was first performed. However, in the current implementation of musiconn.scoresearch, the match for the melody in Beethoven itself is given with the date 1862–1865 because it inherits its metadata from the library record of the Breitkopf & Härtel edition that forms the basis for this specific OMR file.

Of course, this functionality is to be expected in the case of a library catalog and is well known in the case of full‑text searches. However, in the context of content‑based music searching, it highlights the need for careful management of user expectations and clear communication in the design of sorting and filtering options. While the act of searching for a musical quote intuitively implies that one is engaging with an abstract ‘work,’ the fact is that large swaths of library catalogs remain firmly rooted in describing a particular representation of the work, as well as its physical manifestation. In the case of looking for a Beethoven quote, this situation is easy to observe and factor in on a case‑by‑case basis. However, it becomes more troublesome in cases where users might attempt to search for, e.g., developments in musical style without focusing on a specific composer. In the case of collections or anthologies, the differences between the period where a work was first composed and performed and the date associated with the match in the catalog can be vast: A match with Samuel Scheidt’s (1587–1654) Tabulatura nova, for example, appears with the date 1892 (the year of publication for the volume of ‘Denkmäler deutscher Tonkunst’ containing it). In cases such as these, factoring in information about the author; the first publication; or, where available, information about the year of creation or first performance would significantly broaden the options for filtering and sorting matches, as well as for further processing exported lists of matches.
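The effect of the two kinds of dates on filtering can be made concrete with a short Python sketch. The record structure and the `work_year` field are hypothetical, and the work years (1808 for the first performance of Beethoven's Fifth Symphony, 1624 for the first publication of the Tabulatura nova) are supplied here for illustration; they are not part of the metadata currently attached to the OMR files.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Match:
    title: str
    edition_year: int                 # inherited from the catalog record of the edition
    work_year: Optional[int] = None   # hypothetical work-level date, if known


def within(matches: List[Match], start: int, end: int, prefer_work_date: bool) -> List[Match]:
    """Filter matches by year, optionally preferring the work-level date."""
    def year(m: Match) -> int:
        if prefer_work_date and m.work_year is not None:
            return m.work_year
        return m.edition_year
    return [m for m in matches if start <= year(m) <= end]


results = [
    Match("Beethoven, Symphony No. 5", edition_year=1862, work_year=1808),
    Match("Scheidt, Tabulatura nova", edition_year=1892, work_year=1624),
]

by_edition = within(results, 1808, 1858, prefer_work_date=False)  # no matches
by_work = within(results, 1808, 1858, prefer_work_date=True)      # Beethoven only
```

With edition dates alone, a filter for the 50 years after the symphony's first performance excludes the symphony itself; falling back to the edition year only where no work‑level date exists would avoid this without discarding records that lack such data.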

This problem is characteristic of the difference between OMR in the context of research projects based on a limited corpus and heritage institutions with vast collections and varying degrees of metadata quality. It also opens intriguing research questions, especially concerning the integration of OMR, OCR, and optical layout recognition. It stands to reason that the recent machine learning–driven advances in document processing will significantly increase the ability to integrate assignment of composers and text incipits with ‘raw’ OMR. This more granular parsing of metadata could, in combination with linked data technology and relevant authority files, make the contents of even complex collected works much more available to sorting, filtering, and other more sophisticated search operations. The further development of these technologies remains of crucial interest for libraries and archives with large and varied music collections, opening possibilities of increasing the accessibility of their holdings that would be out of reach, practically speaking, if they were to rely purely on human data entry.

7 Reproducibility and Open Data

We hope the account of the work we have done so far serves as encouragement for researchers, and we want to facilitate further research by sharing both musical data and code for data processing. The resources for reproducing our setup or building on the data we generate can be grouped into different categories and are available via different channels:

Musical Data: Both the RISM incipits and the OMR files generated for musiconn.scoresearch are available as open data for further processing and research. All RISM data (including the approximately 2.5 million incipits in Plaine&Easie format) can be accessed and downloaded in various ways via the RISM Catalog.^²⁰ OMR files for each entry in a given list of search results can already be downloaded from musiconn.scoresearch, and larger sets of OMR data are available upon request.
OMR model and MEI parser: At the time of writing, the scoresearch platform is still in transition. Once the transition is finished, we will publish the MEI parser, which sits at the heart of indexing the MEI files, via the official GitHub account of the Bavarian State Library.^²¹ Both the set of public domain prints used for training and manually correcting the MuRET model and the model itself are managed by the University of Alicante and are available to interested parties upon request.
User Interface: Once the transition of all services described here to the VuFind platform is finished, we plan on releasing the code used for inputting and rendering the musical data as a contribution to the larger VuFind ecosystem. These additions, too, will be published on the BSB GitHub account.

All pieces of software will be made available under an appropriate open source license. We plan on further improving the availability of musical data as the project matures. Any updates will be published via the project homepage.

The Bavarian State Library and the Specialized Information Service Musicology actively welcome further research using the resources of the digital collections, the musical data, or both to further advance the field of optical music recognition.^²²

8 Conclusions

Our experiences with the implementation of content‑based music search in the context of a large research library can be summed up as follows: First, the presence of rather heterogeneous corpora (Common Western Music Notation and white mensural notation in our case, but further systems such as black mensural notation, neumes, or various tablatures would be natural next steps) means that we must rely on several independent OMR solutions and therefore ensure sufficient modularity in our data‑processing pipelines to accommodate these differences. Of course, it is possible that future advances in OMR software will facilitate the creation of ‘one‑stop shop’ solutions that produce satisfactory results across all of these notation systems. For now, however, training custom OMR modules and integrating them relatively seamlessly into our existing system strikes a good balance between flexibility and ease of integration. This integration is made possible by shared formats such as MEI and by the further reduction of musical complexity at the indexing stage.

Second, in terms of indexing the contents, the significant simplification of the musical data with the goal of optimizing it for search matching rather than attempting to create a high‑fidelity structural encoding has proved to be a major advantage. The simplification smooths over some of the inevitable errors produced by the OMR. As stated in Section 7, the original OMR output is still available for further processing. Thus, steps toward creating a full structural encoding or other further processing of individual pieces or whole collections can still be taken by interested parties without involving the search tools described here.
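The effect of this simplification can be illustrated with a minimal sketch. It is not the parser used for musiconn.scoresearch; it only assumes, following Note 11, that the index keeps n‑grams of pitches and intervals and discards everything else, including durations. The melody, the simulated recognition errors, the note layout, and all function names are hypothetical.

```python
from typing import List, Tuple

Note = Tuple[int, float]  # (MIDI pitch, duration in quarter notes); hypothetical layout


def pitch_ngrams(notes: List[Note], n: int) -> List[Tuple[int, ...]]:
    """Contiguous n-grams over pitches only; durations are discarded."""
    pitches = [p for p, _ in notes]
    return [tuple(pitches[i:i + n]) for i in range(len(pitches) - n + 1)]


def interval_ngrams(notes: List[Note], n: int) -> List[Tuple[int, ...]]:
    """Contiguous n-grams over semitone steps between consecutive pitches."""
    pitches = [p for p, _ in notes]
    steps = [b - a for a, b in zip(pitches, pitches[1:])]
    return [tuple(steps[i:i + n]) for i in range(len(steps) - n + 1)]


def overlap(query, indexed) -> float:
    """Fraction of query n-grams that also occur among the indexed n-grams."""
    if not query:
        return 0.0
    known = set(indexed)
    return sum(g in known for g in query) / len(query)


# Made-up eight-note melody standing in for a correct transcription.
clean = [(60, 1.0), (62, 1.0), (64, 1.0), (65, 1.0),
         (67, 2.0), (65, 1.0), (64, 1.0), (62, 2.0)]

# Simulated OMR output with one misread duration (fifth note).
rhythm_error = list(clean)
rhythm_error[4] = (67, 1.0)

# Simulated OMR output with one misread pitch (fourth note: F read as F sharp).
pitch_error = list(clean)
pitch_error[3] = (66, 1.0)

# The same melody as a user might enter it, a fourth higher.
transposed = [(p + 5, d) for p, d in clean]

N = 3  # n-gram length chosen for illustration only
rhythm_pitch_overlap = overlap(pitch_ngrams(clean, N), pitch_ngrams(rhythm_error, N))
pitch_error_overlap = overlap(pitch_ngrams(clean, N), pitch_ngrams(pitch_error, N))
transposed_pitch_overlap = overlap(pitch_ngrams(transposed, N), pitch_ngrams(clean, N))
transposed_interval_overlap = overlap(interval_ngrams(transposed, N), interval_ngrams(clean, N))
```

In this toy example, the misread duration leaves the pitch n‑grams untouched, and the misread pitch damages only those n‑grams that contain it, so a query can still match on the remainder. The interval representation additionally matches a transposed query that shares no pitch n‑grams with the indexed melody.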

Finally, implementing an interface featuring tools for searching, filtering, and sorting makes evident how vital the combination of strong OMR and high‑quality, granular metadata for both physical sources and abstract works is in practice. The RISM Catalog, building on decades of manual curation, already demonstrates the virtues of this combination with its expressive filtering options. With more institutions looking into options to make full‑text search of their musical collections available to their users, the further exploration of this meeting point between OMR, OCR, layout recognition, and the integration of data from authority files and specialized databases (such as RISM) becomes ever more relevant.

Notes

[1] https://www.digitale-sammlungen.de/en/music

[2] https://www.musiconn.de/en/

[3] https://simssa.ca/

[4] https://cantus.simssa.ca/

[5] https://search.f-tempo.org/

[6] https://cantusdatabase.org/. For a recent description of this project, see Lacoste (2022).

[7] https://rism.online/?mode=incipits. This search tool for the sources recorded by the RISM project is maintained by the RISM Digital Center. It is the sibling application to the RISM Catalog, maintained by the Bavarian State Library, which will be discussed in more detail in this paper.

[8] https://rism.info/

[9] https://opac.rism.info/main-menu-/kachelmenu/about

[10] https://solr.apache.org/

[11] By only storing n‑gram information about two musical features—pitches and intervals—we only implement two of the five features Hankinson et al. (2012) record in their implementation of OMR‑based n‑gram representation. In dealing solely with searching in the Liber Usualis they are, however, working with a much more limited and homogeneous dataset, which invites a more granular analysis of the musical data.

[12] SmartScore x² Professional V10.5.8

[13] https://bitbucket.org/dorienh/musicxmlparserdh/src/master/ The author kindly permitted the use of this parser for the musiconn.scoresearch project.

[14] MuRET employs a YOLO (‘you only look once’) model to detect individual staves on a page, followed by a convolutional recurrent neural network in combination with a connectionist temporal classification model to classify individual notation elements on a given staff.

[15] https://www.iaml.info/plaine-easie-code

[16] The Plaine&Easie Code itself dates back even further to the 1960s; see Brook (1965).

[17] https://vufind.org/vufind/. The modules created for dealing with music search in VuFind in the context of the projects described here will be made publicly available once development is finished.

[18] https://opac.rism.info

[19] https://www.musiconn.de/musiconnsearch/?lng=en. This research portal combines relevant sections of online catalogs, currently from 17 sources, including the British Library, the Library of Congress, and the International Music Score Library Project (IMSLP).

[20] https://opac.rism.info/main-menu-/kachelmenu/data

[21] https://github.com/bsb-muenchen

[22] The project ‘Development of a Comprehensive Cloud‑Based Toolbox for Sheet Music Analysis’ (https://analyse.hfm-weimar.de), situated at the Hochschule für Musik Franz Liszt in Weimar, has already requested part of the OMR data of musiconn.scoresearch and will use these data for the development of its cloud‑based toolbox.

Acknowledgments

musiconn is generously supported by the German Research Foundation (Deutsche Forschungsgemeinschaft) under the project number 249121324. We furthermore want to thank the IT team at the Bavarian State Library for their continuing work on the technical implementation of the content‑based search systems we have described here. In particular, we want to extend our thanks to Mr. Sanu Pulimootil Achankunju, who created the first version of musiconn.scoresearch, and to Ms. Magda Gerritsen, who has worked tirelessly to ensure that the platform will remain extensible and modular in the future and whose technical insight proved most valuable for this article.

User feedback on the applications described here was gathered informally over the course of their design and implementation.

Competing Interests

We are not aware of any competing interests concerning our submission.
