1 Overview
Repository location
Context
The dataset comprises structured data derived from promotional materials for Korean classical vocal performances collected between January 2016 and February 2025. The Korean Performing Arts Box Office Information System (KOPIS) is the primary national platform for collecting and providing performing arts information in Korea. Following the 2019 revision of the Performance Act (Korean Ministry of Government Legislation, 2026), performance organizers, ticket vendors, and production companies must register essential performance data in KOPIS. The promotional materials available through this platform — such as posters and brochures — are one-time records that condense program sequence, participant information, and artistic intention. As such, they are classified as performance ephemera that capture the cultural atmosphere, staging conditions, and audience engagement of a moment (Nurmikko-Fuller et al., 2016).
Scholarly attention to music ephemera began to grow in the 1980s as its cultural and research value gained recognition in library and archival contexts (Bashford, 2008). Early initiatives to document and standardize concert programs, noted by Ridgewell (2010), led to the Concert Programmes Project (CPP), which laid the foundation for systematic music ephemera research. Building on this, Nurmikko-Fuller et al. (2016) and Bangert et al. (2018) advanced semantic archival approaches via the In Concert and JazzCats projects by publishing performance data as Linked Data. More recently, Cowgill et al. (2020) and Bainbridge et al. (2023) developed participatory, community-driven models that broaden access to musical heritage. Similar discussions have emerged in Korean scholarship on performing arts archives, where researchers have examined the design of digital archive systems and data models for performance information (Jang et al., 2019; Jeong, 2021). However, while archival conventions have traditionally centered on physical ephemera, born-digital materials remain largely underexplored within contemporary music performance research. Such digital performance ephemera are highly transient—susceptible to redesign, modification, or deletion—and often present information exclusively as images.
Their mixed typographies and inconsistent language usage further limit machine readability and interoperability. To address this gap, we transformed image-based classical vocal performance ephemera into structured data using OCR and standardized MusicBrainz identifiers. While prior projects such as In Concert and JazzCats adopted Linked Open Data models, KoVox prioritizes verifiable, error-correctable structured data and uses a relational database as an intermediate structure for future conversion into Linked Open Data.
2 Method
Steps
Data Collection. The data collection process involved four primary source types: (1) KOPIS webpages, (2) promotional images (posters, brochures), (3) header metadata, and (4) text embedded within images. Each webpage was archived in MHTML format under the “Golden Copy” principle, encapsulating the HTML and all associated resources to preserve its visual and structural layout (Palme et al., 1999). Images extracted from these MHTML files fall into two categories: posters displayed at the top of webpages (also used as listing thumbnails) and brochure-style images embedded within sections that describe programs or performer profiles. Together, they form the complete set of visual materials, totaling 3,348 files in JPG, GIF, and PNG formats. Header-level metadata, consistently located at the top of each page, provides essential information, including the title, venue, date, and start time. These fields were compiled into a pandas DataFrame and exported as a CSV file, with each record linked to a unique identifier to ensure traceability across subsequent workflows.
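The export step described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' actual pipeline: the field names follow the performance table, while the example row is hypothetical.

```python
# Sketch: compiling header-level metadata into a pandas DataFrame and
# exporting it as CSV, with a unique identifier per record for
# traceability. Field names follow the performance table; the example
# row is hypothetical.
import pandas as pd

records = [
    {
        "performance_id": "perf_0001",          # unique identifier
        "performance_title": "Spring Recital",  # hypothetical title
        "venue_name": "Seoul Arts Center",
        "performance_date": "2021-05-14",
        "start_time": "19:30",
    },
]

df = pd.DataFrame(records)
# utf-8-sig keeps Korean text readable when the CSV is opened in Excel.
df.to_csv("performances.csv", index=False, encoding="utf-8-sig")
```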
OCR Processing. A major challenge in the data processing stage was that posters and brochures contained rich contextual information—such as performance descriptions, performer profiles, and program sequences—but this content existed only within non-machine-readable images. To address this, two OCR workflows were applied depending on the type of text. Long-form texts (e.g., introductions, performer profiles) were extracted using Apple Live Text,1 which reduces semantic distortion common in generative OCR models and minimizes typographical or factual errors. For program lists (e.g., work titles and composers), we employed ChatGPT-4o-based OCR using researcher-defined prompts to generate structured, column-aligned outputs for further processing. A 14-field schema was defined (e.g., performance order, work title, composer, accompanist), with missing values left empty to maintain structural consistency.
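The schema-enforcement step can be illustrated as below. Only the four fields named in the text are shown; the remaining fields of the 14-field schema are elided, and the function name is our own.

```python
# Sketch of enforcing a fixed program-item schema on OCR output:
# unknown keys are dropped and missing values are left empty, as
# described in the text. Only four of the 14 fields are listed here.
PROGRAM_FIELDS = [
    "program_order", "work_title", "composer", "accompanist",
    # ... remaining fields of the 14-field schema
]

def normalize_row(raw: dict) -> dict:
    """Keep only schema fields; fill missing values with an empty string."""
    return {field: raw.get(field, "") for field in PROGRAM_FIELDS}

row = normalize_row({"program_order": 1, "work_title": "Gute Nacht",
                     "composer": "Franz Schubert"})
# row["accompanist"] == ""  (missing value left empty)
```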
Relational Database Schema. We converted the initially collected CSV files into a relational database schema to represent the relational complexity inherent in musical performances. Its structure was essential for two main purposes. First, because homonyms are common in Korean names (Kim & Cho, 2013), each individual was assigned a unique identifier in the person table, while the participation table linked performers to specific program items within a performance. When identical names referred to different individuals, new identifiers were assigned; when biographical information indicated that the names referred to the same individual, their records were consolidated. Second, preserving each performance’s narrative flow required representing the sequential order of program items (Pasler, 1993). The program table records the chronological order of works and their composers, including intermissions, and thereby enables precise representation of each performance’s narrative structure. As shown in Figure 1, the finalized relational schema consists of five interconnected tables—performance (Table 1), work (Table 2), person (Table 3), program (Table 4), and participation (Table 5)—centered on the performance table and linking the remaining entities through program and participation relationships. Together, these tables capture the many-to-many relationships among performers, works, and performances while maintaining the chronological and contextual integrity of each event.

Figure 1
Entity-relationship diagram of the relational database schema.
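A minimal SQLite rendering of the five-table schema in Figure 1 might look as follows. The column subsets follow Tables 1–4; the participation table's columns are inferred from the text and should be read as assumptions, not the published schema.

```python
# Sketch of the five interconnected tables, centered on performance
# and linked through program and participation. Column names follow
# Tables 1-4; the participation columns are assumptions.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE performance (
    performance_id TEXT PRIMARY KEY,
    performance_date TEXT,
    performance_title TEXT,
    venue_name TEXT
);
CREATE TABLE work (
    work_id TEXT PRIMARY KEY,
    mb_title TEXT,
    mb_composer TEXT,
    mbid TEXT
);
CREATE TABLE person (
    person_id TEXT PRIMARY KEY,
    person_name TEXT,
    person_role TEXT
);
CREATE TABLE program (
    program_item_id TEXT PRIMARY KEY,
    performance_id TEXT REFERENCES performance(performance_id),
    work_id TEXT REFERENCES work(work_id),
    program_order INTEGER,
    is_intermission BOOLEAN
);
-- Links performers to program items (columns assumed from the text).
CREATE TABLE participation (
    participation_id TEXT PRIMARY KEY,
    program_item_id TEXT REFERENCES program(program_item_id),
    person_id TEXT REFERENCES person(person_id)
);
""")
```

The program and participation tables together resolve the many-to-many relationships among performances, works, and performers that the prose describes.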
Quality control
To ensure data integrity, we manually proofread and corrected all OCR-extracted texts before loading them into the database. When new records are added, primary and foreign keys across the core tables help to prevent inconsistent references. At the same time, we enhanced interoperability by linking a subset of main performers to International Standard Name Identifiers (ISNIs), where available, using the National Library of Korea’s ISNI service. Moreover, to address coverage gaps in KOPIS for the period 2016–2018—prior to mandatory registration following the 2019 Performance Act revision—we supplemented missing performances by cross-checking official websites of major venues, including the Seoul Arts Center, Youngsan Art Hall, Kumho Art Hall, and the Sejong Center.
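The role of primary and foreign keys in preventing inconsistent references can be demonstrated with a reduced two-table example (table and column names taken from the schema; the enforcement mechanism shown is standard SQLite behavior, not a claim about the authors' tooling):

```python
# Sketch: SQLite enforces referential integrity only when the
# foreign_keys pragma is on; a program item pointing at a
# non-existent performance is then rejected at insert time.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE performance (performance_id TEXT PRIMARY KEY)")
con.execute("""CREATE TABLE program (
    program_item_id TEXT PRIMARY KEY,
    performance_id TEXT NOT NULL REFERENCES performance(performance_id)
)""")

con.execute("INSERT INTO performance VALUES ('perf_0001')")
con.execute("INSERT INTO program VALUES ('item_0001', 'perf_0001')")  # valid

try:
    con.execute("INSERT INTO program VALUES ('item_0002', 'perf_9999')")
except sqlite3.IntegrityError:
    pass  # inconsistent reference prevented by the foreign key
```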
Name inconsistencies in composers and works present an additional challenge due to variant notations and non-standardized entries (Pasler, 1993). For example, the composer Georg Friedrich Handel appeared under several variants, including G. F. Handel, Georg Frideric Handel, Georg Friedrich Handel, and George Frideric Handel. To ensure that references point to the correct entity and enable interoperability and reuse, we adopted MusicBrainz2 IDs (MBIDs). In the work table, the title-variant field stores the original work titles used during the initial API queries, whereas the remaining fields contain normalized metadata retrieved from the corresponding MBIDs. After automated matching, composers’ surnames were cross-checked and manually corrected.
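The initial API query can be sketched as a work search against the public MusicBrainz web service (ws/2). The helper below only constructs the request URL and does not send it; the function name and the exact query shape are illustrative, not the authors' code.

```python
# Sketch: building a MusicBrainz work-search URL for a title variant.
# The ws/2 endpoint, the Lucene-style query syntax, and the fmt=json
# parameter follow the public MusicBrainz web service.
from urllib.parse import urlencode

MB_WORK_SEARCH = "https://musicbrainz.org/ws/2/work"

def work_search_url(title_variant: str, composer: str) -> str:
    # Restrict the search by work title and associated artist name.
    query = f'work:"{title_variant}" AND artist:"{composer}"'
    return f"{MB_WORK_SEARCH}?{urlencode({'query': query, 'fmt': 'json'})}"

url = work_search_url("Exsultate, jubilate", "Mozart")
```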
Data Structure
Table 1
Performance.
| VARIABLE | TYPE | DESCRIPTION |
|---|---|---|
| performance_id | string | Unique identifier of the performance (primary key). |
| performance_date | date | Date of the performance in yyyy-mm-dd format. |
| performance_title | text | Title of the performance, as given in the official ticketing website. |
| venue_name | text | Name of the performance venue. |
| duration_minutes | int | Duration of the performance in minutes (integer values only). |
| performance_abstract | text | Description of the performance program. |
| start_time | time | Starting time of the performance in hh:mm format. |
| host_organization | text | Hosting organization. |
| sponsoring_organization | text | Sponsoring organizations; multiple values separated by commas. |
| mt20id | string | Standard performance identifier assigned by the KOPIS system. |
Table 2
Work.
| VARIABLE | TYPE | DESCRIPTION |
|---|---|---|
| work_id | string | Unique identifier of the work (primary key). |
| title_variant | text | Variant titles of the work, as collected from source materials; multiple values separated by commas. |
| mb_title | text | Official title of the work as registered in MusicBrainz. |
| mb_type | text | Genre or form of the work as defined by MusicBrainz. |
| mb_language | string | Lyric language in ISO 639-3 code as recorded in MusicBrainz. |
| mb_composer | text | Composer’s full name standardized in MusicBrainz. |
| mb_composer_birth_year | int | Composer’s birth year in yyyy format. |
| mb_composer_death_year | int | Composer’s death year in yyyy format. |
| mb_lyricist | text | Lyricist/poet as provided by MusicBrainz; multiple values allowed. |
| mb_arranger | text | Arranger of the work as recorded in MusicBrainz. |
| mbid | string | MusicBrainz work identifier in UUID format. |
| mb_parent_work_title | text | Title of the parent work if this is part of a larger composition. |
| mbid_parent_work | string | MusicBrainz parent work identifier in UUID format. |
Table 3
Person.
| VARIABLE | TYPE | DESCRIPTION |
|---|---|---|
| person_id | string | Unique identifier of the person (primary key). |
| person_name | text | Person’s name as recorded in source materials; original spelling preserved. |
| person_role | text | Role of the person within the performance; controlled vocabulary (e.g., main performer, accompanist, conductor, host). |
| person_medium | text | Voice type or instrument of the performer. |
| person_profile | text | Original biography or profile text from source materials. |
| person_isni | text | International Standard Name Identifier (ISNI) for the performer, when available, assigned via the National Library of Korea’s ISNI service. |
Table 4
Program.
| VARIABLE | TYPE | DESCRIPTION |
|---|---|---|
| program_item_id | string | Unique identifier of the program item (primary key). |
| performance_id | string | Identifier of the performance this item belongs to (foreign key). |
| work_id | string | Identifier of the work performed in this program item (foreign key). |
| program_order | int | Order of the item within the performance, starting from 1 and listed sequentially. |
| is_intermission | boolean | Indicates whether the item represents an intermission (TRUE/FALSE). |
3 Dataset description
Repository location
Zenodo
Repository name
KoVox Dataset
Object name
‘KoVox_RDB.zip’ containing:
KoVox_participation.csv
KoVox_performance.csv
KoVox_person.csv
KoVox_program.csv
KoVox_work.csv
‘KoVox_mhtml.zip’ containing MHTML files.
‘KoVox.db’ containing a pre-built SQLite relational database of the KoVox dataset.
‘KoVox_schema.sql’ containing the SQL schema.
Format names and versions
CSV; MHTML; SQLite; SQL
Creation dates
2025-01-13 to 2025-03-31.
Dataset creators
All of the authors listed in this article contributed to the creation of the dataset.
Language
Variable names in English; data mostly in Korean, with some multilingual content.
License
CC BY 4.0 International
Publication date
2025-10-17
4 Reuse Potential
Drawing on 1,319 recitals collected from KOPIS between 2016 and 2025, the KoVox dataset aggregates 5,177 distinct works by 518 composers, performed by 1,551 participants. To demonstrate the potential of the dataset in real-world use, we developed KoVox Museum,3 an interactive web interface built on the Vikus Viewer visualization system.4 The site enables users to explore vocal performances chronologically, filter events by composer, and search by performer, work, or venue, providing an immediate visual overview of how Korean classical vocal recitals have unfolded over the past decade. This interface may be of interest to artists and curators seeking practical insights for program planning and marketing. At the community level, it broadens public access to performance culture and strengthens local cultural engagement, thereby supporting the archive’s continued growth as new performances are added. Given this open access environment, the KoVox Museum displays publicly available promotional images only in low resolution and strictly for scholarly, critical, and non-commercial purposes, in accordance with fair use principles.
The KoVox Dataset supports a wide range of research and analytical uses. By examining programming trends across regions, venues, and generations, scholars can trace shifts in musical taste and observe the circulation of specific works and artists. The dataset further enables comparative studies of repertoire, performer networks, and visual presentation within Korea’s classical vocal scene and in cross-national comparison. As an example of what such analyses can reveal, the top 10% of works (518 pieces) account for 49.9% of the 18,985 programmed work instances recorded across all performance programs over the past decade. Frequently recurring pieces include Canciones clásicas españolas No.6, Mozart’s Exsultate, jubilate, and Schubert’s Gute Nacht, whereas many other works appear only once. Similarly, correlations between poster design and program content enable visual-cultural inquiry into how color, typography, and imagery signal artistic identity and audience expectations. Taken together, the structured metadata and image corpus support interdisciplinary inquiry into localization, globalization, and the evolving aesthetics of performance.
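The repertoire-concentration figure above can be reproduced in principle from the program table. The sketch below uses toy data rather than the KoVox figures, but the computation (share of all programmed instances taken by the most frequent 10% of distinct works) is the same.

```python
# Illustrative analysis on toy data: how concentrated is programming?
# One row per programmed work instance, as in the program table.
import pandas as pd

program = pd.DataFrame({"work_id": ["w1"] * 6 + ["w2"] * 2 + ["w3", "w4"]})

counts = program["work_id"].value_counts()
top_n = max(1, int(len(counts) * 0.10))           # top 10% of distinct works
share = counts.iloc[:top_n].sum() / counts.sum()  # their share of instances
# With 4 distinct works, the top 10% rounds to 1 work (w1): share == 0.6
```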
Limitations
The original poster and brochure images are not included in the dataset due to copyright concerns. Several key fields also exhibit high missing-value rates, reflecting structural characteristics of performance information provision rather than data processing errors. The performance_abstract field is empty for approximately 95% of records, as providers rarely supply textual descriptions despite the availability of a free-text field (Kim & Lee, 2025). ISNI coverage is limited to a subset of main performers with reliable external authority records, while internal identifiers provide stable references within the dataset. Some performances without a matched mt20id predate 25 June 2019, the enforcement date of the Performance Act, and are therefore expected, whereas cases occurring after that date require further investigation. Finally, approximately one-third of works lack MusicBrainz identifiers (MBIDs), mainly due to limited coverage of contemporary works, Korean art songs, and non-work program elements such as intermissions.
Notes
[1] Apple Live Text is an on-device OCR technology based on Apple’s Vision/VisionKit framework, which enables direct extraction of text from photos and screenshots. Technical documentation is available on the Apple Developer website (https://developer.apple.com/documentation/vision/recognizing-text-in-images). (last accessed: 6 February 2025).
[2] MusicBrainz is an open, community-driven music metadata database providing structured information via a public API. Retrieved from https://musicbrainz.org/ (last accessed: 6 February 2026).
[3] Retrieved from https://happyhillll.github.io/ (last accessed: 6 February 2026).
[4] Retrieved from https://vikusviewer.fh-potsdam.de/ (last accessed: 6 February 2026).
Author Contributions
Minji Kim: conceptualization; data curation; methodology; writing – original draft; software; visualization; writing – review & editing.
Eunsoo Lee: conceptualization; project administration; methodology; data advice; funding acquisition; writing – review & editing.
