Extracting Publishing Data from the English Catalogue of Books

Anna Preus; Siddharth Bhogra; John Carlyle

doi:10.5334/johd.499

(1) Context and Aims

Bibliographic data can reveal large-scale trends in textual production, circulation, and readership (So, 2020; Walsh & Antoniak, 2021; Saffold & Nishikawa, 2022; McGrath, 2023). Publishing data in particular offers insights into the infrastructure of literature: the institutions and political economies that produce novels and poetry anthologies, as well as the processes that shape readership and generate social and cultural value for literary works. However, data from within the publishing industry often remains invisible or illegible: sales data widely used to make publishing decisions today is concealed in proprietary databases and kept off limits from researchers (Walsh, 2022), while historical publishing data remains obscured in its own way across paper records, which are available only for select presses and are scattered in archives around the world. The challenges of access and legibility for publishing data motivate our work with the English Catalogue of Books (ECB).

The ECB was a yearly record of books issued in England and Ireland running from the mid-19th century to the mid-20th century. During much of this period, London was home to the largest English-language publishing industry in the world (rivalled only by New York), as presses capitalized on expanding domestic and international markets (Eliot, 1994). Across the 19th century, British publishers increasingly began marketing their works in areas under British imperial control (Joshi, 2002; Chatterjee, 2006; Low, 2016). Priya Joshi details Macmillan’s early efforts to access markets in South Asia in particular, characterizing them as “such a resounding success that virtually every major British publisher at some point in their business also started their own Foreign, Colonial, or Imperial library series” (2002, p. 94). The ECB documents this industry-wide growth in granular terms: press by press and title by title.

The aim of the ECB was to give a complete account of the books issued in England and Ireland each year, and, as such, the catalogue is a key resource for researchers interested in modern English and Irish print cultures and Global Anglophone publishing more broadly. The ECB was compiled and issued by the trade journal Publishers’ Circular and was based on self-reported publications by registered presses. The catalogues are thus not comprehensive but include records from the majority of book publishers in England and Ireland at the time, and they remain, as Troy J. Bassett (2008, p. 64) contends, “far and away the best” bibliographic catalogues of the period.

(2) Dataset description

The “English Catalogue of Books (1912–1922) Dataset” represents publishing data for over 99,400 titles issued in England and Ireland between 1912 and 1922 as reported by publishers and indexed by Publishers’ Circular between 1913 and 1923 (catalogues were issued the year following the year they covered). Our data is based on Optical Character Recognition (OCR)-generated text drawn from facsimiles of copies of the ECB held at Princeton University Libraries (1912–1918, 1920, 1922) and the New York Public Library (1919 and 1921) and made available through HathiTrust’s Digital Library.¹ The dataset includes the following fields for each work (where available in the original catalogues):

index: index of entry in the dataset

original_entry: unparsed entry drawn from OCR-generated text

author(s): author(s) of the book

title: title of the book

format: format of the book (folio, 4to., 8vo., etc.)

publisher: publisher of the book

date: date of publication of the book (month, year)

catalogue_year: issue of the ECB from which the entry is drawn

page_num: page from which the entry is drawn, as printed in the ECB

doc_page_num: page from which the entry is drawn, as numbered in the digital facsimile

Repository location

https://doi.org/10.5061/dryad.rr4xgxdn3

Repository name

Dryad

Object name

English Catalogue of Books Dataset, 1912–1922

Format names and versions

english-catalogue-of-books_1912–1922_cleaned.csv (UTF-8 encoded). The file is provided as a comma-delimited CSV and follows standard CSV quoting conventions: fields containing commas are enclosed in double quotation marks ("), and embedded quotation marks within fields are represented by doubled quotation marks (""). No separate escape character is used.
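Because the file follows standard CSV quoting, it should be readable with the default dialect of Python's built-in csv module. The following is a minimal sketch, not part of the published code: the sample row is invented for illustration, and the exact header spellings in the released file are assumed to match the field list above.

```python
import csv
import io

# Hypothetical sample: the row below is invented for illustration and is
# not taken from the dataset. Column names follow the field list above.
SAMPLE = (
    "index,author(s),title,publisher\n"
    '0,"Doe (J.)","A title, with a comma and ""quoted"" words",EXAMPLE PRESS\n'
)


def read_rows(handle):
    """Read rows using csv's default dialect: comma delimiter, double-quote
    quoting, embedded quotes doubled, and no separate escape character."""
    return list(csv.DictReader(handle))


rows = read_rows(io.StringIO(SAMPLE))
print(rows[0]["title"])
```

To read the released file instead of the sample, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded CSV.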

Creation dates

2022-06-01 to 2025-11-03

Dataset creators

Siddharth Bhogra (UW, co-author), John Carlyle (UW, co-author), Anna Preus (UW, co-author), Gali Alony (UW, research intern), Elissa Fong (UW, research intern), Ethan Hu (UW, research intern), Zhiming Huang (UW, research intern), Devin Short (UW, research assistant), Meha Singal (UW, research intern), Angela Statler (UW, research intern), Caesar Tuguinay (UW, research assistant), Camille Zahn (UW, research intern).

Language

English (UK)

License

CC0 1.0 Universal

Publication date

2025-11-03 [review version]

(3) Method

Our approach includes four broad steps: 1) extraction of bibliographic entries, 2) manual correction, 3) parsing of bibliographic entries, and 4) regularization and accuracy testing.

(3.1) Extracting Entries

The ECB includes entries for each publication reported by each publisher each year. In addition to books, the catalogues include some serials (especially academic journals), but generally not ephemeral texts, such as newspapers or monthly magazines. The catalogues also include a significant number of official publications like parliamentary acts, reports on branches of the military, changes to the tax code, etc.

In the editions we focused on, the entries are organized in a two-column format (Figure 1). The entries in each catalogue are listed in alphabetical order, and each book is included at least twice: once beginning with the author’s name (if there is an author), and once beginning with the book title. The author-first entries are longer and contain more information, and they are indicated visually in the catalogues with a bolded first word. We refer to these entries as “main entries” because they contain more detail than title-first entries (most importantly, the publisher) and serve as the main record for each book.

Figure 1: A portion of a page from the 1912 ECB.

To gather information on individual books, we needed to extract only text included in these main entries. Our first step was splitting off the front and back matter in each volume, so we could focus exclusively on pages featuring catalogue entries. We then removed repeating page elements, like running headers and page numbers, which appeared in the plain text associated with each page scan. From there, we could split the text from each page of each catalogue into discrete entries using regular expressions. Each entry ends with a month abbreviation, followed by a year abbreviation, followed by whitespace, so we split the text on these patterns across all pages (Appendix 1). Although these elements are consistent in the printed catalogues, OCR transcription introduced significant variation into strings representing month and year, resulting in a high number of entries that were split incorrectly. To address this issue, we worked with a team to capture repeating variations appearing in month and year strings within each catalogue. These variations include common errors like the substitution of the capital letter “O” for the number “0,” or the substitution of the lowercase “l” for the number “1,” but they also include less predictable constructions, like “iz” as a substitute for “12”. The ECB’s format as an index containing concise information on thousands of books, and its extensive use of industry-specific abbreviations made the application of standard OCR correction packages challenging. Further, even small changes in the formatting of catalogue entries (like the addition of apostrophes before two-digit year strings starting in 1917) would cause cascading OCR variations. Each new issue required a process of writing and fine-tuning unique regular expressions that incorporated unique textual patterns. This represented a significant investment of labor and is one of the factors that narrowed our dataset to a limited number of issues.
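The splitting step described above can be illustrated with a short sketch. This is not the project's code (which is in Appendix 1); the month abbreviations, the set of OCR substitutions, and the sample text are all assumptions made for illustration, and a real issue would need its own tuned pattern.

```python
import re

# Assumed month abbreviations; the catalogues' actual forms may differ.
MONTHS = r"(?:Jan|Feb|Mar|Apr|May|June?|July?|Aug|Sept?|Oct|Nov|Dec)"
# Two-character year, optionally preceded by an apostrophe, allowing a few
# OCR confusions named in the text: O/o for 0, l/I/i for 1, z for 2.
YEAR = r"[‘’']?[0-9OolIiz]{2}"
ENTRY_END = re.compile(rf"\b{MONTHS}\.?\s+{YEAR}(?=\s|$)")

OCR_DIGITS = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "i": "1", "z": "2"}
)


def split_entries(page_text):
    """Cut page text after each month-plus-year pattern."""
    entries, start = [], 0
    for match in ENTRY_END.finditer(page_text):
        entries.append(page_text[start:match.end()].strip())
        start = match.end()
    tail = page_text[start:].strip()
    if tail:
        # Leftover text with no closing date, kept for later review.
        entries.append(tail)
    return entries


def normalize_year(token):
    """Drop a leading apostrophe and map OCR look-alikes back to digits."""
    return token.lstrip("‘’'").translate(OCR_DIGITS)


# Invented sample text, not drawn from the catalogues.
SAMPLE_PAGE = (
    "Alpha (A.)-First book. 8vo. 2s. EXAMPLE PRESS, Nov. 12 "
    "Beta (B.)-Second book. 1s. OTHER HOUSE, Oct. iz "
    "Gamma (G.)-Third book. 6d. THIRD CO., Jan. I2"
)
entries = split_entries(SAMPLE_PAGE)
print(len(entries))
```

A pattern this permissive will also split on a month and two-digit number that happen to appear inside a title, which is one reason the extracted entries still needed manual review.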

Once we successfully split up the entries in each catalogue, we needed to differentiate between main, author-first entries, and non-main, title-first entries. In the catalogues, these entries are differentiated through formatting: specifically, bolded text at the beginning of each entry and indentation following the first line. However, our efforts to extract main entries based only on variations in typeface and layout proved unsuccessful, so we once again turned to textual patterns. Specifically, we looked for strings of capital letters, which in the ECB are used only for publisher names that appear at the end of main entries (Figure 2). In this way, a piece of information prioritized visually in the original catalogues provided a convenient foothold for extracting the most important pieces of information from texts in their digitized afterlives.

Figure 2: Entries from the 1912 ECB with publisher name highlighted in purple.
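One way to picture the capital-letter heuristic is the sketch below, which looks for a run of all-capital tokens immediately before the final comma and date. It is an illustrative simplification rather than the project's extraction code: the function name, the three-letter minimum, and the title-first sample entry are assumptions.

```python
def trailing_capitals(entry):
    """Return the run of all-capital tokens that precedes the final comma,
    or None if the entry has no such run (a likely title-first entry)."""
    head, comma, _date = entry.rpartition(",")
    if not comma:
        return None
    run = []
    for token in reversed(head.split()):
        letters = [ch for ch in token if ch.isalpha()]
        if letters and all(ch.isupper() for ch in letters):
            run.append(token)
        else:
            break
    run.reverse()
    # Assumed safeguard: require one token with three or more letters so
    # that stray initials alone do not count as a publisher name.
    if any(sum(ch.isalpha() for ch in token) >= 3 for token in run):
        return " ".join(run)
    return None


# Main entry quoted later in this article (OCR errors retained).
MAIN = (
    "Education (Bd. of)-Regulations, in force fr. Aug. 1, 1918, for "
    "technical schools, schools of art, and other forms of provision of "
    "further educa. tion in England and Wales. 2d. H.M. STATIONERY OFF, "
    "Oct. ‘18"
)
# Invented title-first entry for contrast.
SECONDARY = "Regulations, technical schools. 2d. Oct. ‘18"

print(trailing_capitals(MAIN))
print(trailing_capitals(SECONDARY))
```

Entries for which the function returns a publisher string would be treated as main entries; the rest would be set aside as title-first entries.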

(3.2) Manual Correction

Once we extracted main entries from each catalogue, we created heuristics for common errors: namely, merged entries (entries that were not correctly split from each other); truncated entries (entries that were cut off at the beginning or end); overlapping entries (entries interspersed with each other due to the OCR model reading across columns); and mislabelled entries (secondary entries categorized as main entries). We then hand-corrected entries with these features (about 5% of the dataset, or ~5,000 titles), cross-referencing with facsimiles of the original editions, and manually transcribing missing or incorrect text where necessary. This too represented a significant investment of time and labor and was another factor that caused us to restrict the number of catalogues in our dataset.
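The article does not spell out the heuristics themselves, so the following sketch only illustrates what flags for merged and truncated entries might look like. The rules, flag names, date pattern, and sample entries are all assumptions, not the project's actual checks.

```python
import re

# Assumed entry-ending pattern: month abbreviation plus two-digit year.
DATE = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|June?|July?|Aug|Sept?|Oct|Nov|Dec)"
    r"\.?\s+[‘’']?\d{2}(?=\s|$)"
)


def flag_entry(entry):
    """Return a list of hypothetical review flags for one extracted entry."""
    text = entry.strip()
    dates = list(DATE.finditer(text))
    flags = []
    if len(dates) > 1:
        # Two closing dates suggest two entries that were not split apart.
        flags.append("possibly_merged")
    if not dates or dates[-1].end() != len(text):
        # No closing date at the very end suggests a cut-off entry.
        flags.append("possibly_truncated_end")
    if text[:1].islower():
        # A lowercase first letter suggests a missing beginning.
        flags.append("possibly_truncated_start")
    return flags


# Invented entries for illustration.
clean = "Doe (J.)-A book. 1s. EXAMPLE PRESS, Nov. 12"
merged = clean + " Roe (R.)-Another book. 2s. OTHER HOUSE, Dec. 12"
cut = "ther forms of provision. 2d. EXAMPLE PRESS"

for sample in (clean, merged, cut):
    print(flag_entry(sample))
```

Flags of this kind only narrow down which entries a person should check against the facsimile; they do not correct anything on their own.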

(3.3) Parsing of Bibliographic Entries

Once we had extracted a stable set of main entries across eleven catalogues, we parsed the information in each entry based on its standard format:

[Author(s)]–[Title]. [Format]. [Dimensions], [Pages], [Price]…[Publisher], [Month] [Year]

While we experimented with regular expression parsing for this task, we had better results with Large Language Model (LLM)-based methods. We used a lightweight Google Gemini model (Gemini 2.5 Flash) with few-shot prompting to parse the entries, structuring the prompt with specific instructions for the idiosyncratic formatting of entries based on knowledge of the catalogue and its organizing conventions (Appendix 1). The model was instructed to retain OCR errors and refrain from any corrections or copy editing, as well as to maintain the order of every element in the entry as it appeared. The idea was to limit LLM intervention and remain as true to the OCR-generated text as possible.

The model was run on a small subset of randomly selected entries and hand-picked entries that we knew to be problematic (~500), and the parsing was manually checked. The results were used to fine-tune the prompt for accuracy and handle special cases such as entries with multiple publishers, multiple months listed as part of the publication date, or additional miscellaneous information, such as the name of a book series, which was not part of a usual entry. We repeated this process of manual review and prompt revision several times before settling on a prompt and parsing the entire dataset.
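As a rough illustration of few-shot prompting for this task, the sketch below assembles a prompt from instructions, worked examples, and a new entry. It does not call any model, and it is not the prompt used in the project (see Appendix 1 for that); the wording of the instructions, the field names, and the example entry are hypothetical.

```python
import json

# Hypothetical field names, loosely following the dataset description.
FIELDS = ["author", "title", "format", "publisher", "date"]

# Hypothetical instructions reflecting the constraints described above.
INSTRUCTIONS = (
    "Split the catalogue entry into the fields listed below. "
    "Copy text exactly as given: keep OCR errors, do not correct spelling, "
    "and keep elements in their original order. "
    "Use an empty string for any field that is absent."
)


def build_prompt(examples, entry):
    """Assemble a few-shot prompt: instructions, worked examples, then the
    new entry with an empty slot for the model's answer."""
    parts = [INSTRUCTIONS, "Fields: " + ", ".join(FIELDS), ""]
    for raw, parsed in examples:
        parts.append("Entry: " + raw)
        parts.append("Parsed: " + json.dumps(parsed, ensure_ascii=False))
        parts.append("")
    parts.append("Entry: " + entry)
    parts.append("Parsed:")
    return "\n".join(parts)


# Invented worked example, not drawn from the catalogues.
EXAMPLES = [
    (
        "Doe (J.)-A book. 8vo. 1s. EXAMPLE PRESS, Nov. 12",
        {
            "author": "Doe (J.)",
            "title": "A book.",
            "format": "8vo.",
            "publisher": "EXAMPLE PRESS",
            "date": "Nov. 12",
        },
    )
]
prompt = build_prompt(EXAMPLES, "Roe (R.)-Another book. 2s. OTHER HOUSE, Dec. 12")
print(prompt)
```

Special cases such as multiple publishers or series names would be handled by adding further worked examples of the same shape to the list.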

(3.4) Regularization and Accuracy Testing

After extracting information from entries across catalogues for 1912 through 1922, we checked for data loss introduced by LLM-based parsing. We compared each parsed entry, stitched back together to resemble the original entry, against its corresponding unparsed entry by measuring the Levenshtein distance between the two, which accounts for both the value and the position of each character. Using a strict threshold, we flagged every entry that differed from its original by one character or more, which gave us 1,505 flagged entries out of a total of 99,448 (~1.5%).² We manually corrected all flagged entries, reran the distance functions, and found the resultant dataset to have no data loss introduced by LLM parsing.
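The check described above can be sketched as follows. The Levenshtein function is the standard dynamic-programming formulation; how the project stitched fields back together is not specified here, so the `stitch` helper (joining fields with single spaces), the field order, and the sample entry are assumptions for illustration.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def stitch(parsed, order):
    """Hypothetical reassembly: join non-empty fields with single spaces."""
    return " ".join(parsed[key] for key in order if parsed.get(key))


def is_flagged(original, parsed, order, threshold=1):
    """Flag an entry whose stitched form differs from the original by at
    least `threshold` characters."""
    return levenshtein(original, stitch(parsed, order)) >= threshold


# Invented entry and parse for illustration.
ORDER = ["author", "title", "format", "publisher", "date"]
original = "Doe (J.) A book. 8vo. EXAMPLE PRESS, Nov. 12"
faithful = {
    "author": "Doe (J.)",
    "title": "A book.",
    "format": "8vo.",
    "publisher": "EXAMPLE PRESS,",
    "date": "Nov. 12",
}
altered = dict(faithful, title="A Book.")  # one character silently changed

print(is_flagged(original, faithful, ORDER))
print(is_flagged(original, altered, ORDER))
```

Because the distance is sensitive to character position, an entry whose elements were merely reordered is also flagged, even when no characters were added or removed.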

To evaluate how accurately the LLM parsed the information within each entry into relevant categories (author, title, publisher, etc.), we manually checked 1,000 parsed entries drawn at random from the dataset (~1% of entries), and we found issues with LLM parsing in less than 1% of the sample. Divergences from our human annotator mainly occurred in ambiguous cases, like cases of corporate authorship, where the line between author and title was unclear.³ For example:

Education (Bd. of)-Regulations, in force fr. Aug. 1, 1918, for technical schools, schools of art, and other forms of provision of further educa. tion in England and Wales. 2d. H.M. STATIONERY OFF, Oct. ‘18

In this entry, “Education (Bd. of)” could be interpreted as the author name or the beginning of a long title reordered to fit indexing conventions: “Board of Education Regulations, in force fr. Aug. 1, 1918…” The LLM interpreted it as the latter, and our human annotator interpreted it as the former. We also noted instances in which the original entry passed to the LLM was missing information, likely due to OCR errors. In these cases (~0.5% of our sample), LLM parsing was correct for the available information and did not introduce new errors.

Errors from the underlying OCR-generated text, however, persist in our dataset, and necessarily influence the amount of data we were able to extract. For example, we encountered instances in which portions of columns of text visible in the facsimiles were missing in the plain text files, and occasional cases in which whole pages were omitted entirely. Because of such inevitable consequences of digitization and data extraction processes, our dataset cannot represent a comprehensive view of the ECB, but it nevertheless captures a high proportion of books included in the catalogues. In each of the catalogues we focused on, the total number of books published during a given year is listed in an “Analysis of Books” section. From 1912–1922, over 113,000 titles were listed as having been published. The ECB dataset captures information on over 99,000 of these titles, or about 88% of the total books ECB cataloguers count as having been published in England and Ireland during this period.

(4) Results and discussion

Given this scale, the ECB dataset enables analysis of trends in early 20th-century publishing that would be difficult to track through other data sources. The relatively recent creation of large digital libraries presents opportunities to conduct large-scale analyses of books issued over time (Jockers & Kirilloff, 2016; White & Zuccala, 2018; Underwood, 2019; Sobchuk & Šeļa, 2024). These repositories, however, offer retroactive views of the publishing industry, and sifting through them to understand print culture in a specific historical time and place is a difficult task, mediated by layers of collection, preservation, and metadata creation processes. The ECB, by contrast, represents a collective effort made in real time by a broad network of publishers to document their own industry, year by year, as well as the meticulous work of cataloguers, indexers, and typesetters who compiled and structured the records they submitted.

(4.1) Top Publishers

The book industry in England in the early 20th century centralized an immense amount of publishing power in London and the surrounding areas, but individual publishing houses were relatively small businesses by today’s standards. In the 1910s, even the most prolific British presses were issuing on the order of 400–600 titles per year, and most were issuing far fewer works than that. This meant publishers were making a bet on each title they took on, with every work representing a significant part of their overall yearly output.

The top five publishers during this period, in terms of titles listed in the ECB, were Hodder and Stoughton, His Majesty’s Stationery Office, Wyman and Sons, Macmillan, and Oxford University Press (Figure 3). All of these presses had grown significantly in the latter decades of the 19th century. Hodder and Stoughton, founded by Matthew Hodder and Thomas Stoughton, was initially known for publishing evangelical texts but later moved into issuing cheap yellowback fiction, a precursor to the modern paperback (Collin, 2004). The Stationery Office was the official publisher of the British Crown and both houses of parliament, issuing a range of government records and practical documents. Wyman and Sons, founded in the mid-19th century by Charles Wyman, was a prominent bookseller as well as publisher, holding contracts with railway companies for bookstalls at stations around England (Taylor & Taylor, 2014). Macmillan was also founded in the mid-19th century by Scottish brothers Daniel and Alexander Macmillan, quickly becoming one of the largest presses not only in England, but in the U.S., Canada, and India (James, 2002; Joshi, 2002). Finally, the centuries-old Oxford University Press, which was led during this period by Humphrey Milford, was increasingly issuing works for broad audiences while also expanding its overseas operations (Chatterjee, 2006; Louis, 2014).

Figure 3: Plot of top publishers in the ECB by number of publications, 1912–22.

These presses were all able to capitalize on a convergence of changes in technologies for book production, favorable (for them) imperial trade networks and economic policies, and growing English-language audiences around the world to increase their outputs and their profits in the early 20th century. Four of these imprints (all but Wyman) continue to exist today, and remarkably, given the current state of conglomerate publishing, three of them remain independent: the Stationery Office, Oxford University Press, and Macmillan.
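A ranking like the one discussed in this section could be reproduced from the dataset with a simple frequency count. The sketch below is an assumption about how one might do this, not the code behind Figure 3; the rows are invented, and real publisher strings would first need regularizing so that spelling variants of one press are counted together.

```python
from collections import Counter


def top_values(rows, field, n=5):
    """Count non-empty values of one field and return the n most common."""
    counts = Counter(
        row[field].strip() for row in rows if row.get(field, "").strip()
    )
    return counts.most_common(n)


# Invented rows standing in for records read from the CSV.
rows = [
    {"publisher": "EXAMPLE PRESS"},
    {"publisher": "OTHER HOUSE"},
    {"publisher": "EXAMPLE PRESS "},
    {"publisher": ""},
]
print(top_values(rows, "publisher", n=2))
```

The same function could be pointed at the author or format field to produce the other rankings discussed below.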

(4.2) Top Authors

The list of most prolific authors from this period, in terms of unique editions listed in the ECB, makes for interesting reading: canonical literary giants Shakespeare and Dickens predictably appear at the top of the list (in second and fourth place, respectively), but the rest of the top authors are relatively unknown today and are largely genre fiction writers (Figure 4).

Figure 4: Plot of top authors in the ECB by number of publications, 1912–22.

The author who issued the most editions between 1912 and 1922 was William Le Queux, a journalist and novelist who wrote romances and mysteries, and the best-selling anti-German pulp thriller The Invasion of 1910 in 1906 (Hughes, 2020). Figure 5 represents the number of editions of Le Queux’s works listed in the ECB compared to those of authors who are more famous today, like James Joyce and Rudyard Kipling.

Figure 5: Plot of publications over time by selected authors, ECB 1912–1922.

Romance novelist Charles Garvice, who also wrote under the pseudonym Caroline Hart, was the third most published author on the list and arguably the best-selling Anglophone writer of the period, selling millions of copies of his works around the world (Matter, 2007). Other authors appearing toward the top were English romance and suspense writer E. Phillips Oppenheim, who authored the bestselling spy thriller The Great Impersonation (1920) (Burton, 2015); American poet, novelist, and spiritualist Ella Wheeler Wilcox (Sorby, 2009); and English author Nat Gould, who wrote popular novels often set in Australia, where he also worked as a reporter (Page, 2004). Gould, like Garvice, was recognized as one of the best-selling authors of the day, and an advertisement included in the ECB notes that over 14 million copies of his work had been sold by 1917 (Figure 6).

Figure 6: Advertisement for books by Nat Gould included in the 1917 issue of the ECB.

Publishers’ continuous interest in issuing these authors’ works points to their ongoing profitability, with many of their titles being issued in new editions year after year. While editions cannot serve as a proxy for book sales, they do indicate continued popularity and ongoing circulation, since publishers generally did not issue new editions of works unless the previous edition was likely to sell out or had already done so. At the same time, even the most popular authors’ publishing careers during this period evidence the impact of the shifting political climate in England. For all of the top five authors, 1919, the year immediately following the First World War, marked their lowest number of publications. By the next year, however, the industry was starting to bounce back, and multiple new editions of each of their works were issued again.
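Year-by-year comparisons of this kind amount to counting an author's entries per catalogue year. The following sketch shows one way to do that; it is illustrative only, the rows and author name are invented, and the header names author(s) and catalogue_year are assumed to match the dataset's field list.

```python
from collections import defaultdict


def editions_by_year(rows, author):
    """Count entries per catalogue year for one exact author string."""
    counts = defaultdict(int)
    for row in rows:
        if row.get("author(s)", "").strip() == author:
            counts[row["catalogue_year"]] += 1
    return dict(sorted(counts.items()))


# Invented rows for illustration; not drawn from the dataset.
rows = [
    {"author(s)": "Doe (J.)", "catalogue_year": "1918"},
    {"author(s)": "Doe (J.)", "catalogue_year": "1918"},
    {"author(s)": "Doe (J.)", "catalogue_year": "1919"},
    {"author(s)": "Doe (J.)", "catalogue_year": "1920"},
    {"author(s)": "Doe (J.)", "catalogue_year": "1920"},
    {"author(s)": "Roe (R.)", "catalogue_year": "1919"},
]
by_year = editions_by_year(rows, "Doe (J.)")
lowest_year = min(by_year, key=by_year.get)
print(by_year, lowest_year)
```

Years in which an author has no entries do not appear in the result at all, so a complete series would need the missing years filled in with zeros before plotting.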

(4.3) Top Formats

The ECB also includes a significant amount of practical information about the forms in which texts circulated. Cataloguers used standard abbreviations for conventional book formats within the publishing industry. In descending order of size, these formats are: folio (fo., usually unabbreviated), quarto (4to.), octavo (8vo.), duodecimo (12mo.), sixteenmo (16mo.), and octodecimo (18mo.). Octavos were overwhelmingly the dominant format during this period, and our data illustrates the rise in particular of the crown octavo, a smaller, cheaper octavo designed to fit the budgets of a growing reading public (Figure 7).

Figure 7: Plot of top book formats by number of publications in ECB, 1912–1922.
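Grouping entries by format requires mapping the abbreviations listed above onto a common label. The sketch below shows one possible mapping; it is not the project's regularization code, and the handling of prefixed forms such as "Cr. 8vo." (here simply reduced to the base format) is an assumption.

```python
import re

# Abbreviations as listed in the text, mapped to full format names.
FORMAT_NAMES = {
    "folio": "folio",
    "fo": "folio",
    "4to": "quarto",
    "8vo": "octavo",
    "12mo": "duodecimo",
    "16mo": "sixteenmo",
    "18mo": "octodecimo",
}
FORMAT_TOKEN = re.compile(r"\b(folio|fo|4to|8vo|12mo|16mo|18mo)\b", re.IGNORECASE)


def base_format(raw):
    """Return the base format named in a raw format string, or None."""
    match = FORMAT_TOKEN.search(raw)
    return FORMAT_NAMES[match.group(1).lower()] if match else None


for value in ("8vo.", "Cr. 8vo.", "4to.", "Folio", "unknown"):
    print(value, "->", base_format(value))
```

Tracking the crown octavo specifically would need an extra step that keeps the size prefix rather than discarding it.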

(5) Implications/Applications

The ECB dataset represents information on a broad swath of books, most of which have long been out of print. Margaret Cohen (1999) famously calls this vast body of unremembered texts “the great unread,” highlighting how quickly the majority of books are forgotten. Due, in part, to the difficulty of engaging this body of historical work, views of the literary past are often filtered through received scholarly narratives that hinge on ideas of literary importance. In British literary studies, modernism has long been considered the dominant literary and aesthetic movement of the period; and in histories of modernism, 1922 (the final year of our dataset) is often positioned as an “annus mirabilis,” or miracle year, because it witnessed the publication of multiple important books, including T.S. Eliot’s The Waste Land, James Joyce’s Ulysses, and Virginia Woolf’s Jacob’s Room (Levenson, 2017). While scholars have long emphasized these works’ enmeshment in broader print cultures (Huyssen, 1986; Rainey, 1998; Bornstein, 2001; Mao & Walkowitz, 2008; Brinkman, 2016), exploring what was actually circulating around them at the time has remained difficult. The ECB dataset thus provides an opportunity to recontextualize and reconsider these important literary texts and this crucial literary and historical moment within a wider field of print and broader reading cultures, and to situate the eternally read within the great unread. While our analysis here has focused on prominent authors, imprints, and formats, we hope the dataset could also be used to trace popular titles being reissued year after year, to identify key topics appearing in British print over time, or to uncover important works by understudied authors.

The dataset can also clue us in to broader colonial contexts of literary production: imperial economic systems and vast networks of colonial circulation undergirded much of the publishing industry in London in the early 20th century, which profited, as well, from issuing a range of imperial content. The Crown’s Stationery Office, the second most prolific publisher in our data, was a state-run operation that produced a wide range of official publications, including policy documents on regions under colonial control, a seemingly endless series of reports from branches of the military, and various documentations of tax statutes around the world. Catalogued in the dataset, as well, is the output of mainstream presses such as Macmillan and Oxford, which directly benefited from imperial trade policies, expanding their operations in South Asia during this period, and increasing their profits and cultural reach. Meanwhile, some of the most popular writers represented in the data, such as Nat Gould and William Le Queux, played on imperial anxieties to appeal to metropolitan audiences and consequently bolster their sales. Such a quantitative look at the state of literary production in Britain, and especially in the imperial center of London, opens up opportunities to investigate the tensions of endogenous and exogenous literary development; trace the prehistories of colonial literary influence in postcolonial reading culture; and explore the evolving nature of global literary and textual production.

Our work on the ECB dataset, we hope, offers an approach that can be extended to other issues of the catalogue and other periods of British publishing. Issues of the ECB from 1835 to 1929 are currently available in OCR-transcribed facsimiles on HathiTrust and Internet Archive. The catalogue, however, ran all the way through to 1968, and later issues will be entering the public domain in the coming decades.⁴ We will continue to work on gathering data from earlier issues of the catalogue, and we are making our code publicly available, should others be interested in extracting information from the ECB using similar methods. We note that the typography, formatting, and indexing practices of the ECB change over time, and each issue is unique, so the approach requires tailoring for the conventions of each individual edition.

Data Accessibility Statement

The data and code from this study are openly available via Dryad: https://doi.org/10.5061/dryad.rr4xgxdn3. The data is derived from texts in the public domain, which are available via HathiTrust: https://catalog.hathitrust.org/Record/000550349 (retrieved 28 February, 2026).

Additional File

Appendix 1: Code for ECB entry extraction and LLM prompting: https://doi.org/10.5061/dryad.rr4xgxdn3.

Notes

[1] Datasets for all editions except for 1919 and 1921 were built from facsimiles of copies held at Princeton University Library. Due to quality issues in the facsimile for 1919 and the unavailability of a copy for 1921, datasets for these two years were built, instead, from copies held at the New York Public Library. The copies of the ECB we used are listed in this catalogue record: https://catalog.hathitrust.org/Record/000550349 (accessed 28 February, 2026). We used OCR-generated text available in July 2024. As the HathiTrust website (https://www.hathitrust.org/member-libraries/resources-for-librarians/data-resources/, accessed 28 February, 2026) notes, “The HathiTrust collection is not static,” and the organization is continuously working with partner libraries “to try to correct errors in bibliographic records and digital content (including poor OCR)”. Thus, the OCR-generated text available on HathiTrust’s website now may not exactly match the text we used to generate the dataset.

[2] Using a combination of a Jaccard similarity measure and manual review, we found no data loss across the 1,505 entries, and discovered that in each flagged entry, the LLM had simply rearranged elements of the original entry, but had not inserted or deleted characters. We found that these elements had been parsed into the correct categories but had been flagged because they had appeared in an irregular position in the original entry (usually due to OCR errors).

[3] We considered additional computational approaches to measuring parsing accuracy, including measures based on average range of position and relative length, but because entries are inconsistent in length and generally short, these methods did not yield useful results. We thus settled on manually checking a subset of the data.

[4] Issues from across this period have been digitized, but as far as we know, there is no single complete, continuous set available on Internet Archive, HathiTrust, or Google Books.

Acknowledgements

This project has benefitted from the work, input, and insight of teams of undergraduate and graduate researchers who have participated in the Humanities Data Science Summer Institute (HDSSI) at the University of Washington between 2023 and 2025. We are grateful to our HDSSI collaborators for their contributions to the creation of this dataset, and we are grateful to program leaders, Melanie Walsh and Naomi Alterman, for their ongoing support and feedback.

Competing Interests

The authors have no competing interests to declare.

Author contributions

Anna Preus: conceptualization, data curation, funding acquisition, methodology, project administration, resources, supervision, validation, writing-original draft, writing-review and editing.

Siddharth Bhogra: conceptualization, data curation, investigation, methodology, supervision, validation, visualization, writing-original draft, writing-review and editing.

John Carlyle: conceptualization, data curation, investigation, methodology, writing-review and editing.