(1) Overview
Repository location
JOHD Dataverse: https://doi.org/10.7910/DVN/6MXQAP
Context
Until recently, language documentation has largely focused on collecting data on endangered languages in the field of historical linguistics (Campbell, 2016). The linguistic profiles of bilingual speakers in immigration settings are not well documented, although some attempts have been made to document language development in bilingual children (e.g. Yip & Matthews, 2007).
‘Narrative’ covers a wide range of human activities, including fiction, films, folktales, and interviews (Labov & Waletzky, 1967). The use of narratives as a research method extends from sociology to the humanities and arts. Rühlemann and O’Donnell (2012) proposed the concept of a ‘narrative corpus’ and sought to demonstrate that narrative corpora offer great potential for narrative research across linguistic disciplines, including corpus linguistics, pragmatics, conversation analysis, discourse analysis, and sociolinguistics. A key advantage is that the data in such corpora are naturalistic and thus more reliable than examples composed by linguists. The current project aims to collect narrative speech data from late bilingual speakers. Compared with simultaneous or early bilinguals, late bilinguals exhibit distinctive linguistic features during language acquisition because a full-fledged first language (L1) is already in place when they begin to learn their second language (L2).
The data for this corpus were collected in the context of a Ph.D. project on focus prosody in Mandarin-Cantonese bilingual speakers (Yang, 2022) and a research project on lexical tone acquisition in Mandarin-Cantonese bilingual speakers (Yang et al., 2023; Zou et al., 2024).
(2) Method
Material
The wordless picture book Frog, where are you? (Mayer, 1969), commonly used to collect narrative descriptions cross-linguistically (Berman & Slobin, 1994), was used to elicit spontaneous speech data. This book contains 24 pages and narrates the complete story of a boy and a dog looking for a frog.
Participants
Participants were divided into three groups. The target bilingual group comprised 73 Mandarin-speaking immigrants, while the Cantonese and Mandarin groups comprised 59 native speakers of Cantonese and 40 native speakers of Mandarin, respectively. The Mandarin-speaking immigrants were born and raised in northern China and spoke Mandarin as their L1. They all arrived in Hong Kong and began learning Cantonese after puberty (mean age of onset ± standard deviation: 23.2 ± 5.3 years). At the time of recording, they had been living in Hong Kong for an average of 7.1 years.
Bilingual and Cantonese speakers narrated the story in both Cantonese and Mandarin, although Mandarin stories were collected from only 42 of the 59 Cantonese speakers; monolingual Mandarin speakers narrated the story only in Mandarin. A total of 132 Cantonese stories (73 from bilinguals and 59 from native Cantonese speakers) and 155 Mandarin stories (73 from bilinguals, 40 from native Mandarin speakers, and 42 from Cantonese speakers) were collected.
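The story and speaker totals can be verified with a few lines of arithmetic. The sketch below uses only the group sizes reported above; Python and the variable names are chosen purely for illustration and are not part of the project's workflow.

```python
# Group sizes and story counts as reported in the Participants section.
speakers = {"bilingual": 73, "native Cantonese": 59, "native Mandarin": 40}
cantonese_stories = {"bilingual": 73, "native Cantonese": 59}
mandarin_stories = {"bilingual": 73, "native Mandarin": 40, "native Cantonese": 42}

total_speakers = sum(speakers.values())
total_cantonese = sum(cantonese_stories.values())
total_mandarin = sum(mandarin_stories.values())
total_stories = total_cantonese + total_mandarin

print(total_speakers, total_cantonese, total_mandarin, total_stories)
# prints: 172 132 155 287
```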
Steps
Data collection: Participants were first presented with printed and electronic versions of the book and asked to familiarise themselves with the story. Thereafter, they were asked to narrate the entire story at a self-paced speed, during which they were allowed to refer to the book. Moreover, they were encouraged to narrate the story page-by-page to avoid missing key information.
Speech transcription: Speech was first transcribed to text using the ‘Transcribe’ function in Microsoft Word. To ensure accuracy, our team members manually checked all transcriptions against the audio files. Speech differs from written language in that it often includes repetition, self-correction, and fillers. In Cantonese and Mandarin speech, the use of particles is also a notable feature. All of these features were taken into account during transcription: team members were instructed to record all such information in the initial transcriptions, which were then annotated further in the Annotation step.
Word segmentation: The written forms of Chinese languages, including Cantonese and Mandarin, are represented as sequences of characters without spaces. Before further annotation, the Chinese texts were split into words. Two online platforms were used for automatic word segmentation: the Graphical Cantonese Generator [1] for Cantonese and CkipTagger [2] for Mandarin. The team members then manually checked all segmentations for consistency. Ambiguous cases were discussed within the team before final decisions were made.
Dictionary preparation: To allow the computer programme CLAN (Computerized Language Analysis) [3] (MacWhinney, 1996) to add a morphology (MOR) tier to the corpus, a dictionary was prepared for each language. Dictionaries for Mandarin and Cantonese are available in CLAN; however, existing entries were amended and new entries were added to better fit our data.
Annotation: The corpus was annotated following the standard CHAT (Codes for the Human Analysis of Transcripts) format in CLAN (MacWhinney, 2000). Part-of-speech (POS) tags, pronunciation, and an English translation for each word were added automatically to a separate MOR tier using a CLAN command. The MOR tier was then manually checked by team members.
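To illustrate what the annotated output looks like to a reader working outside CLAN, the following minimal sketch pairs each speaker tier with the MOR tier that follows it. It relies on the general CHAT conventions that main tiers begin with `*` and the morphology tier with `%mor:`. The token shape `pos|pronunciation=translation`, the speaker code `PAR`, the tag names, and the sample utterance are all assumptions made for illustration; they are not taken from the corpus files.

```python
from typing import Dict, List, NamedTuple, Optional


class MorToken(NamedTuple):
    pos: str      # part-of-speech tag
    form: str     # pronunciation (romanisation)
    gloss: str    # English translation, empty if absent


def parse_mor_token(token: str) -> Optional[MorToken]:
    """Split an assumed 'pos|form=gloss' token; punctuation yields None."""
    if "|" not in token:
        return None
    pos, _, rest = token.partition("|")
    form, _, gloss = rest.partition("=")
    return MorToken(pos, form, gloss)


def read_tiers(chat_text: str) -> List[Dict]:
    """Pair every main tier with the %mor tier that follows it.

    Header lines (@...) and tab-indented continuation lines are ignored
    in this sketch.
    """
    utterances: List[Dict] = []
    for line in chat_text.splitlines():
        if line.startswith("*"):
            speaker, _, content = line[1:].partition(":")
            utterances.append({"speaker": speaker, "words": content.split(), "mor": []})
        elif line.startswith("%mor:") and utterances:
            for token in line[len("%mor:"):].split():
                parsed = parse_mor_token(token)
                if parsed is not None:
                    utterances[-1]["mor"].append(parsed)
    return utterances


# Invented sample, NOT corpus data.
SAMPLE = (
    "@Begin\n"
    "*PAR:\t有 一 個 男仔 .\n"
    "%mor:\tv|jau5=have num|jat1=one cl|go3=CL n|naam4zai2=boy .\n"
    "@End\n"
)

utterances = read_tiers(SAMPLE)
for token in utterances[0]["mor"]:
    print(token.pos, token.form, token.gloss)
```

For real analyses, CLAN itself or a dedicated CHAT parser is likely to handle edge cases (continuation lines, special codes) more robustly than this sketch.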
Ethical clearance
Data collection was performed in accordance with the Declaration of Helsinki and was approved by the Human Subjects Ethics Subcommittee of Hong Kong Polytechnic University (Reference number: HSEARS20190102001) and the Human Research Ethics Committee of Hong Kong Shue Yan University (Reference number: HREC 22-05 (M12)). All participants provided written informed consent before the recording sessions.
Quality control
All transcription and word segmentation in each file were checked by at least three team members and manually corrected if necessary. The MOR tier was added automatically, and a random selection of the MOR tier was reviewed by team members.
(3) Dataset Description
Repository name
JOHD Dataverse
Object name
HKNSC: Hong Kong Narrative Speech Corpus
Format names and versions
CHAT
Version 1
Creation dates
Start: 2021-01
End: 2025-02
Data structure
In the dataset, three versions of text are provided. The ‘raw’ folder contains transcribed text that has been checked but not processed further. The ‘processed’ folder contains text that has been further segmented into words. The ‘tagged’ folder includes text with the MOR tier. Under each folder, the files are organised according to the language and speaker group: ‘C’ for Cantonese, ‘M’ for Mandarin, ‘I’ for immigrants, and ‘N’ for native speakers. Each file represents one complete story.
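Because the ‘processed’ text is the ‘raw’ text split into words, one simple sanity check a user might run is to confirm that a segmented line matches its raw counterpart once whitespace is ignored. The sketch below assumes that segmentation only inserts spaces and changes nothing else; this is an assumption about the files rather than a documented guarantee, and the example strings are invented.

```python
def strip_whitespace(text: str) -> str:
    """Remove all whitespace so that only the characters themselves are compared."""
    return "".join(text.split())


def segmentation_preserves_text(raw_line: str, segmented_line: str) -> bool:
    """Return True if the two lines differ only in whitespace."""
    return strip_whitespace(raw_line) == strip_whitespace(segmented_line)


# Invented examples, NOT corpus data.
print(segmentation_preserves_text("有一個男仔", "有 一 個 男仔"))  # True
print(segmentation_preserves_text("有一個男仔", "有 一 個 男"))    # False
```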
In the current version, there are 176,967 tokens and 255,744 Chinese characters in total. Table 1 lists the token and character counts of each sub-corpus.
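The distinction between tokens and characters can be made concrete with a short counting sketch. The counting rules used here are assumptions, since the exact procedure behind the reported figures is not specified: a token is taken to be any whitespace-separated item containing at least one Chinese character, and a Chinese character is any code point in the CJK Unified Ideographs block or its Extension A. Under these rules, punctuation and purely Latin-script items are excluded, and rarer characters outside these two blocks would be missed.

```python
import re
from typing import Tuple

# CJK Unified Ideographs Extension A (U+3400-U+4DBF) and the main block (U+4E00-U+9FFF).
HAN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def count_tokens_and_characters(segmented_line: str) -> Tuple[int, int]:
    """Count Chinese tokens and Chinese characters in a space-segmented line."""
    tokens = [t for t in segmented_line.split() if HAN.search(t)]
    characters = sum(len(HAN.findall(t)) for t in tokens)
    return len(tokens), characters


# Invented example, NOT corpus data: four tokens, five characters.
print(count_tokens_and_characters("有 一 個 男仔 ."))
# prints: (4, 5)
```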
Dataset creators
Ginny Kin Ning Chan (transcription and segmentation, Hong Kong Shue Yan University), Lewis Ching Yat Cheung (annotation, Max Planck Institute for Psycholinguistics), Andy Chun Bun Cheung (annotation, Hong Kong Shue Yan University), Tony Chi Ho Fong (transcription, The Hong Kong Polytechnic University), Dong Han (data collection, Hong Kong Shue Yan University), Pak Yin Lee (transcription and segmentation, Hong Kong Shue Yan University), Yutu Li (transcription and segmentation, Hong Kong Shue Yan University), Cheuk Ying Ng (annotation, Hong Kong Shue Yan University), Aeon Ziwen Pan (transcription, segmentation, and annotation, The Chinese University of Hong Kong), Ka Ho Tao (transcription, The Hong Kong Polytechnic University), Yike Yang (data collection, transcription, segmentation, annotation, and supervision, Hong Kong Shue Yan University), Ho Fung Yeung (transcription and segmentation, Hong Kong Shue Yan University), Sichi Zhang (data collection, Hong Kong Shue Yan University), Yue Zou (segmentation and annotation, Hong Kong Shue Yan University).
Language
English, traditional Chinese (Cantonese), simplified Chinese (Mandarin)
License
CC BY-NC 4.0
Publication date
2025-02-01
(4) Reuse Potential
Files in HKNSC are in the CHAT format, which can be read and edited with CLAN or other text-processing software such as Microsoft Word. The dataset can be used to explore topics in linguistics and language acquisition. For example, Cantonese has a rich system of sentence-final particles (SFPs), but it is unclear how Mandarin-speaking learners use SFPs in Cantonese, given the differences between the SFP systems of the two languages. The narrative data in the corpus thus serve as a valuable resource for examining this issue. Since POS information has been added to the MOR tier, all instances of SFPs in the corpus can be searched for and extracted directly for further analysis. Moreover, future studies may examine cross-linguistic influence between the immigrants’ L1 Mandarin and L2 Cantonese at various linguistic levels (e.g. discourse markers, cohesive devices, and code-switching). Instances of code-switching have been marked in the MOR tier, but other topics, such as discourse markers or cohesive devices, require manual annotation before analysis.
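As a rough illustration of the SFP use case, the sketch below counts MOR-tier items that carry a given POS tag. The tag name `sfp`, the token shape `pos|pronunciation=translation`, and the sample lines are assumptions made for this example; users should check the tag set actually used in the ‘tagged’ files before adapting it.

```python
from collections import Counter


def count_forms_by_pos(chat_text: str, pos_tag: str = "sfp") -> Counter:
    """Count the forms on %mor tiers whose POS tag equals pos_tag."""
    counts: Counter = Counter()
    for line in chat_text.splitlines():
        if not line.startswith("%mor:"):
            continue
        for token in line[len("%mor:"):].split():
            pos, separator, rest = token.partition("|")
            if separator and pos == pos_tag:
                form = rest.split("=", 1)[0]  # drop the English gloss, if any
                counts[form] += 1
    return counts


# Invented sample, NOT corpus data.
SAMPLE = (
    "*PAR:\t佢 搵 緊 青蛙 啊 .\n"
    "%mor:\tpro|keoi5=he v|wan2=find asp|gan2=PROG n|cing1waa1=frog sfp|aa3 .\n"
    "*PAR:\t隻 狗 走 咗 啦 .\n"
    "%mor:\tcl|zek3=CL n|gau2=dog v|zau2=leave asp|zo2=PERF sfp|laa1 .\n"
)

sfp_counts = count_forms_by_pos(SAMPLE)
print(sorted(sfp_counts.items()))
# prints: [('aa3', 1), ('laa1', 1)]
```

Running the same function separately over immigrant and native Cantonese files would give per-group SFP inventories that can then be compared.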
However, the current version of the corpus has certain limitations. First, although this version of HKNSC contains 172 speakers and 287 stories, the sample size may still be insufficient for large-scale investigations of language acquisition. Second, the homogeneity of the immigrants’ language backgrounds (all are L1 Mandarin speakers) restricts the generalisability of results obtained from the corpus. Third, owing to limitations in time and manpower, the sentences have not been aligned with the pictures. Further annotation is also required for narrative or discourse analysis, for example of cohesion. In future, we aim to collect data from immigrants with more diverse language backgrounds, increase the sample size of the corpus, and provide more detailed annotations for a wider range of research topics.
Notes
[1] Graphical Cantonese Generator for Cantonese: https://test.hambaanglaang.hk/.
[2] CkipTagger: https://ckip.iis.sinica.edu.tw/service/ckiptagger/.
[3] CLAN is downloadable via: https://dali.talkbank.org/clan/.
Acknowledgements
The author would like to acknowledge Chaak Ming Lau, Hak Kit Lung, Sam Tak Sum Wong, and Dazhen Wu for their helpful suggestions on the construction of the corpus and all the participants for their contribution to this project. Moreover, the author is grateful for the efforts of the team members as specified under the ‘Dataset creators’ section.
Competing Interests
The author has no competing interests to declare.
Author Contribution
Yike Yang: All roles.
