1 Overview
Repository location
Context
The dataset comprises structured data derived from promotional materials for Korean classical vocal performances collected between January 2016 and February 2025. The Korean Performing Arts Box Office Information System (KOPIS) is the primary national platform for collecting and providing performing arts information in Korea. Following the 2019 revision of the Performance Act (Korean Ministry of Government Legislation, 2026), performance organizers, ticket vendors, and production companies must register essential performance data in KOPIS. The promotional materials available through this platform — such as posters and brochures — are one-time records that condense program sequence, participant information, and artistic intention. As such, they are classified as performance ephemera that capture the cultural atmosphere, staging conditions, and audience engagement of a moment (Nurmikko-Fuller et al., 2016).
Scholarly attention to music ephemera began to grow in the 1980s as its cultural and research value gained recognition in library and archival contexts (Bashford, 2008). Early initiatives to document and standardize concert programs, noted by Ridgewell (2010), led to the Concert Programmes Project (CPP), which laid the foundation for systematic music ephemera research. Building on this, Nurmikko-Fuller et al. (2016) and Bangert et al. (2018) advanced semantic archival approaches via the In Concert and JazzCats projects by publishing performance data as Linked Data. More recently, Cowgill et al. (2020) and Bainbridge et al. (2023) developed participatory, community-driven models that broaden access to musical heritage. Similar discussions have emerged in Korean scholarship on performing arts archives, where researchers have examined the design of digital archive systems and data models for performance information (Jang et al., 2019; Jeong, 2021). However, while archival conventions have traditionally centered on physical ephemera, born-digital materials remain largely underexplored within contemporary music performance research. Such digital performance ephemera are highly transient—susceptible to redesign, modification, or deletion—and often present information exclusively as images.
Their mixed typographies and inconsistent language usage further limit machine readability and interoperability. To address this gap, we transformed image-based classical vocal performance ephemera into structured data using OCR and standardized MusicBrainz identifiers. While prior projects such as In Concert and JazzCats adopted Linked Open Data models, KoVox prioritizes verifiable, error-correctable structured data and uses a relational database as an intermediate structure for future conversion into Linked Open Data.
2 Method
Steps
Data Collection. The data collection process involved four primary source types: (1) KOPIS webpages, (2) promotional images (posters, brochures), (3) header metadata, and (4) text embedded within images. Each webpage was archived in MHTML format under the “Golden Copy” principle, encapsulating the HTML and all associated resources to preserve its visual and structural layout (Palme et al., 1999). Images extracted from these MHTML files fall into two categories: posters displayed at the top of webpages (also used as listing thumbnails) and brochure-style images embedded within sections that describe programs or performer profiles. Together, they form the complete set of visual materials, totaling 3,348 files in JPG, GIF, and PNG formats. Header-level metadata, consistently located at the top of each page, provides essential information, including the title, venue, date, and start time. These fields were compiled into a pandas DataFrame and exported as a CSV file, with each record linked to a unique identifier to ensure traceability across subsequent workflows.
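The export step described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' actual pipeline: the field names follow the performance table, while the example row is hypothetical.

```python
# Sketch: compiling header-level metadata into a pandas DataFrame and
# exporting it as CSV, with a unique identifier per record for
# traceability. Field names follow the performance table; the example
# row is hypothetical.
import pandas as pd

records = [
    {
        "performance_id": "perf_0001",          # unique identifier
        "performance_title": "Spring Recital",  # hypothetical title
        "venue_name": "Seoul Arts Center",
        "performance_date": "2021-05-14",
        "start_time": "19:30",
    },
]

df = pd.DataFrame(records)
# utf-8-sig keeps Korean text readable when the CSV is opened in Excel.
df.to_csv("performances.csv", index=False, encoding="utf-8-sig")
```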
OCR Processing. A major challenge in the data processing stage was that posters and brochures contained rich contextual information—such as performance descriptions, performer profiles, and program sequences—but this content existed only within non-machine-readable images. To address this, two OCR workflows were applied depending on the type of text. Long-form texts (e.g., introductions, performer profiles) were extracted using Apple Live Text,1 which reduces semantic distortion common in generative OCR models and minimizes typographical or factual errors. For program lists (e.g., work titles and composers), we employed ChatGPT-4o-based OCR using researcher-defined prompts to generate structured, column-aligned outputs for further processing. A 14-field schema was defined (e.g., performance order, work title, composer, accompanist), with missing values left empty to maintain structural consistency.
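The schema-enforcement step can be illustrated as below. Only the four fields named in the text are shown; the remaining fields of the 14-field schema are elided, and the function name is our own.

```python
# Sketch of enforcing a fixed program-item schema on OCR output:
# unknown keys are dropped and missing values are left empty, as
# described in the text. Only four of the 14 fields are listed here.
PROGRAM_FIELDS = [
    "program_order", "work_title", "composer", "accompanist",
    # ... remaining fields of the 14-field schema
]

def normalize_row(raw: dict) -> dict:
    """Keep only schema fields; fill missing values with an empty string."""
    return {field: raw.get(field, "") for field in PROGRAM_FIELDS}

row = normalize_row({"program_order": 1, "work_title": "Gute Nacht",
                     "composer": "Franz Schubert"})
# row["accompanist"] == ""  (missing value left empty)
```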
Relational Database Schema. We converted the initially collected CSV files into a relational database schema to represent the relational complexity inherent in musical performances. Its structure was essential for two main purposes. First, because homonyms are common in Korean names (Kim & Cho, 2013), each individual was assigned a unique identifier in the person table, while the participation table linked performers to specific program items within a performance. When identical names referred to different individuals, new identifiers were assigned; when biographical information indicated that the names referred to the same individual, their records were consolidated. Second, preserving each performance’s narrative flow required representing the sequential order of program items (Pasler, 1993). The program table records the chronological order of works and their composers, including intermissions, and thereby enables precise representation of each performance’s narrative structure. As shown in Figure 1, the finalized relational schema consists of five interconnected tables—performance (Table 1), work (Table 2), person (Table 3), program (Table 4), and participation (Table 5)—centered on the performance table and linking the remaining entities through program and participation relationships. Together, these tables capture the many-to-many relationships among performers, works, and performances while maintaining the chronological and contextual integrity of each event.

Figure 1
Entity-relationship diagram of the relational database schema.
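A minimal SQLite rendering of the five-table schema in Figure 1 might look as follows. The column subsets follow Tables 1–4; the participation table's columns are inferred from the text and should be read as assumptions, not the published schema.

```python
# Sketch of the five interconnected tables, centered on performance
# and linked through program and participation. Column names follow
# Tables 1-4; the participation columns are assumptions.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE performance (
    performance_id TEXT PRIMARY KEY,
    performance_date TEXT,
    performance_title TEXT,
    venue_name TEXT
);
CREATE TABLE work (
    work_id TEXT PRIMARY KEY,
    mb_title TEXT,
    mb_composer TEXT,
    mbid TEXT
);
CREATE TABLE person (
    person_id TEXT PRIMARY KEY,
    person_name TEXT,
    person_role TEXT
);
CREATE TABLE program (
    program_item_id TEXT PRIMARY KEY,
    performance_id TEXT REFERENCES performance(performance_id),
    work_id TEXT REFERENCES work(work_id),
    program_order INTEGER,
    is_intermission BOOLEAN
);
-- Links performers to program items (columns assumed from the text).
CREATE TABLE participation (
    participation_id TEXT PRIMARY KEY,
    program_item_id TEXT REFERENCES program(program_item_id),
    person_id TEXT REFERENCES person(person_id)
);
""")
```

The program and participation tables together resolve the many-to-many relationships among performances, works, and performers that the prose describes.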
Quality control
To ensure data integrity, we manually proofread and corrected all OCR-extracted texts before loading them into the database. When new records are added, primary and foreign keys across the core tables help to prevent inconsistent references. At the same time, we enhanced interoperability by linking a subset of main performers to International Standard Name Identifiers (ISNIs), where available, using the National Library of Korea’s ISNI service. Moreover, to address coverage gaps in KOPIS for the period 2016–2018—prior to mandatory registration following the 2019 Performance Act revision—we supplemented missing performances by cross-checking official websites of major venues, including the Seoul Arts Center, Youngsan Art Hall, Kumho Art Hall, and the Sejong Center.
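The role of primary and foreign keys in preventing inconsistent references can be demonstrated with a reduced two-table example (table and column names taken from the schema; the enforcement mechanism shown is standard SQLite behavior, not a claim about the authors' tooling):

```python
# Sketch: SQLite enforces referential integrity only when the
# foreign_keys pragma is on; a program item pointing at a
# non-existent performance is then rejected at insert time.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE performance (performance_id TEXT PRIMARY KEY)")
con.execute("""CREATE TABLE program (
    program_item_id TEXT PRIMARY KEY,
    performance_id TEXT NOT NULL REFERENCES performance(performance_id)
)""")

con.execute("INSERT INTO performance VALUES ('perf_0001')")
con.execute("INSERT INTO program VALUES ('item_0001', 'perf_0001')")  # valid

try:
    con.execute("INSERT INTO program VALUES ('item_0002', 'perf_9999')")
except sqlite3.IntegrityError:
    pass  # inconsistent reference prevented by the foreign key
```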
Name inconsistencies in composers and works present an additional challenge due to variant notations and non-standardized entries (Pasler, 1993). For example, the composer Georg Friedrich Handel appeared under several variants, including G. F. Handel, Georg Frideric Handel, Georg Friedrich Handel, and George Frideric Handel. To ensure that references point to the correct entity and enable interoperability and reuse, we adopted MusicBrainz2 IDs (MBIDs). In the work table, the title-variant field stores the original work titles used during the initial API queries, whereas the remaining fields contain normalized metadata retrieved from the corresponding MBIDs. After automated matching, composers’ surnames were cross-checked and manually corrected.
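The initial API query can be sketched as a work search against the public MusicBrainz web service (ws/2). The helper below only constructs the request URL and does not send it; the function name and the exact query shape are illustrative, not the authors' code.

```python
# Sketch: building a MusicBrainz work-search URL for a title variant.
# The ws/2 endpoint, the Lucene-style query syntax, and the fmt=json
# parameter follow the public MusicBrainz web service.
from urllib.parse import urlencode

MB_WORK_SEARCH = "https://musicbrainz.org/ws/2/work"

def work_search_url(title_variant: str, composer: str) -> str:
    # Restrict the search by work title and associated artist name.
    query = f'work:"{title_variant}" AND artist:"{composer}"'
    return f"{MB_WORK_SEARCH}?{urlencode({'query': query, 'fmt': 'json'})}"

url = work_search_url("Exsultate, jubilate", "Mozart")
```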
Data Structure
Table 1
Performance.
| VARIABLE | TYPE | DESCRIPTION |
|---|---|---|
| performance_id | string | Unique identifier of the performance (primary key). |
| performance_date | date | Date of the performance in yyyy-mm-dd format. |
| performance_title | text | Title of the performance, as given in the official ticketing website. |
| venue_name | text | Name of the performance venue. |
| duration_minutes | int | Duration of the performance in minutes (integer values only). |
| performance_abstract | text | Description of the performance program. |
| start_time | time | Starting time of the performance in hh:mm format. |
| host_organization | text | Hosting organization. |
| sponsoring_organization | text | Sponsoring organizations; multiple values separated by commas. |
| mt20id | string | Standard performance identifier assigned by the KOPIS system. |
Table 2
Work.
| VARIABLE | TYPE | DESCRIPTION |
|---|---|---|
| work_id | string | Unique identifier of the work (primary key). |
| title_variant | text | Variant titles of the work, as collected from source materials; multiple values separated by commas. |
| mb_title | text | Official title of the work as registered in MusicBrainz. |
| mb_type | text | Genre or form of the work as defined by MusicBrainz. |
| mb_language | string | Lyric language in ISO 639-3 code as recorded in MusicBrainz. |
| mb_composer | text | Composer’s full name standardized in MusicBrainz. |
| mb_composer_birth_year | int | Composer’s birth year in yyyy format. |
| mb_composer_death_year | int | Composer’s death year in yyyy format. |
| mb_lyricist | text | Lyricist/poet as provided by MusicBrainz; multiple values allowed. |
| mb_arranger | text | Arranger of the work as recorded in MusicBrainz. |
| mbid | string | MusicBrainz work identifier in UUID format. |
| mb_parent_work_title | text | Title of the parent work if this is part of a larger composition. |
| mbid_parent_work | string | MusicBrainz parent work identifier in UUID format. |
Table 3
Person.
| VARIABLE | TYPE | DESCRIPTION |
|---|---|---|
| person_id | string | Unique identifier of the person (primary key). |
| person_name | text | Person’s name as recorded in source materials; original spelling preserved. |
| person_role | text | Role of the person within the performance; controlled vocabulary (e.g., main performer, accompanist, conductor, host). |
| person_medium | text | Voice type or instrument of the performer. |
| person_profile | text | Original biography or profile text from source materials. |
| person_isni | text | International Standard Name Identifier (ISNI) for the performer, when available, assigned via the National Library of Korea’s ISNI service. |
Table 4
Program.
| VARIABLE | TYPE | DESCRIPTION |
|---|---|---|
| program_item_id | string | Unique identifier of the program item (primary key). |
| performance_id | string | Identifier of the performance this item belongs to (foreign key). |
| work_id | string | Identifier of the work performed in this program item (foreign key). |
| program_order | int | Order of the item within the performance, starting from 1 and listed sequentially. |
| is_intermission | boolean | Indicates whether the item represents an intermission (TRUE/FALSE). |
3 Dataset description
Repository location
Zenodo
Repository name
KoVox Dataset
Object name
‘KoVox_RDB.zip’ containing:
KoVox_participation.csv
KoVox_performance.csv
KoVox_person.csv
KoVox_program.csv
KoVox_work.csv
‘KoVox_mhtml.zip’ containing MHTML files.
‘KoVox.db’ containing a pre-built SQLite relational database of the KoVox dataset.
‘KoVox_schema.sql’ containing the SQL schema.
Format names and versions
CSV; MHTML; SQLite; SQL
Creation dates
2025-01-13 to 2025-03-31.
Dataset creators
All of the authors listed in this article contributed to the creation of the dataset.
Language
Variable names in English; data mostly in Korean, with some multilingual content.
License
CC BY 4.0 International
Publication date
2025-10-17
4 Reuse Potential
Drawing on 1,319 recitals collected from KOPIS between 2016 and 2025, the KoVox dataset aggregates 5,177 distinct works by 518 composers, performed by 1,551 participants. To demonstrate the potential of the dataset in real-world use, we developed KoVox Museum,3 an interactive web interface built on the Vikus Viewer visualization system.4 The site enables users to explore vocal performances chronologically, filter events by composer, and search by performer, work, or venue, providing an immediate visual overview of how Korean classical vocal recitals have unfolded over the past decade. This interface may be of interest to artists and curators seeking practical insights for program planning and marketing. At the community level, it broadens public access to performance culture and strengthens local cultural engagement, thereby supporting the archive’s continued growth as new performances are added. Given this open access environment, the KoVox Museum displays publicly available promotional images only in low resolution and strictly for scholarly, critical, and non-commercial purposes, in accordance with fair use principles.
The KoVox Dataset supports a wide range of research and analytical uses. By examining programming trends across regions, venues, and generations, scholars can trace shifts in musical taste and observe the circulation of specific works and artists. The dataset further enables comparative studies of repertoire, performer networks, and visual presentation within Korea’s classical vocal scene and in cross-national comparison. As an example of what such analyses can reveal, the top 10% of works (518 pieces) account for 49.9% of the 18,985 programmed work instances recorded across all performance programs over the past decade. Frequently recurring pieces include Canciones clásicas españolas No.6, Mozart’s Exsultate, jubilate, and Schubert’s Gute Nacht, whereas many other works appear only once. Similarly, correlations between poster design and program content enable visual-cultural inquiry into how color, typography, and imagery signal artistic identity and audience expectations. Taken together, the structured metadata and image corpus support interdisciplinary inquiry into localization, globalization, and the evolving aesthetics of performance.
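The repertoire-concentration figure above can be reproduced in principle from the program table. The sketch below uses toy data rather than the KoVox figures, but the computation (share of all programmed instances taken by the most frequent 10% of distinct works) is the same.

```python
# Illustrative analysis on toy data: how concentrated is programming?
# One row per programmed work instance, as in the program table.
import pandas as pd

program = pd.DataFrame({"work_id": ["w1"] * 6 + ["w2"] * 2 + ["w3", "w4"]})

counts = program["work_id"].value_counts()
top_n = max(1, int(len(counts) * 0.10))           # top 10% of distinct works
share = counts.iloc[:top_n].sum() / counts.sum()  # their share of instances
# With 4 distinct works, the top 10% rounds to 1 work (w1): share == 0.6
```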
Limitations
The original poster and brochure images are not included in the dataset due to copyright concerns. Several key fields also exhibit high missing-value rates, reflecting structural characteristics of performance information provision rather than data processing errors. The performance_abstract field is empty for approximately 95% of records, as providers rarely supply textual descriptions despite the availability of a free-text field (Kim & Lee, 2025). ISNI coverage is limited to a subset of main performers with reliable external authority records, while internal identifiers provide stable references within the dataset. Some performances without a matched mt20id predate 25 June 2019, the enforcement date of the Performance Act, and are therefore expected, whereas cases occurring after that date require further investigation. Finally, approximately one-third of works lack MusicBrainz identifiers (MBIDs), mainly due to limited coverage of contemporary works, Korean art songs, and non-work program elements such as intermissions.
Notes
[1] Apple Live Text is an on-device OCR technology based on Apple’s Vision/VisionKit framework, which enables direct extraction of text from photos and screenshots. Technical documentation is available on the Apple Developer website (https://developer.apple.com/documentation/vision/recognizing-text-in-images). (last accessed: 6 February 2025).
[2] MusicBrainz is an open, community-driven music metadata database providing structured information via a public API. Retrieved from https://musicbrainz.org/ (last accessed: 6 February 2026).
[3] Retrieved from https://happyhillll.github.io/ (last accessed: 6 February 2026).
[4] Retrieved from https://vikusviewer.fh-potsdam.de/ (last accessed: 6 February 2026).
Author Contributions
Minji Kim: conceptualization; data curation; methodology; writing – original draft; software; visualization; writing – review & editing.
Eunsoo Lee: conceptualization; project administration; methodology; data advice; funding acquisition; writing – review & editing.
