Overview
Introduction
Biomarkers, including genetic alterations and gene and protein expression values, can indicate how a patient might respond to specific treatments. For example, a patient who lacks a certain mutation or who overexpresses a certain gene may be more likely to respond to a particular cancer treatment: certain non-small cell lung cancer patients without EGFR or ALK mutations respond to pembrolizumab as first-line treatment [1], and breast cancer patients with HER2 overexpression respond to trastuzumab [2].
The U.S. Food and Drug Administration (FDA) includes biomarker information in drug labels, which are officially documented in the DailyMed database. DailyMed contains approximately 140,000 FDA-approved and FDA-regulated product labels [3]. The label for a drug or biological product presents essential prescribing information in structured sections, such as “Indications and Usage” and “Dosage and Administration” [3]. Within each section, however, the information is mostly unstructured free text. This text contains a variety of biomedical data, including biomarker information, that is used by a wide range of stakeholders, such as doctors, other healthcare providers, and biomedical researchers.
Despite the availability of these data, extracting relevant biomarker information efficiently at scale remains a challenge. Researchers and healthcare professionals often need to sift manually through multiple labels to curate the necessary data, which is time-consuming. Existing application programming interfaces (APIs), such as the FDA open-source APIs [4] and the DailyMed APIs [3], allow users to download drug information in XML or JSON format, but they do not directly provide biomarker information for the corresponding drug products. Users who want the biomarker information for specific diseases or drugs still need to parse the XML or JSON files and extract the relevant information themselves. To address this gap, we developed biomarker_nlp, a Python package designed to streamline the extraction of biomarker information for targeted cancer therapies from DailyMed and other biomedical texts. The package uses named-entity recognition (NER) techniques and pre-trained language models to identify and extract biomarker-related entities such as genes, proteins, therapies, and diseases. By automating the extraction process, biomarker_nlp is anticipated to significantly reduce the time and effort required for biomarker curation, facilitating more efficient research and, potentially, clinical decision-making.
Implementation and architecture
The biomarker_nlp package is freely available on GitHub (https://github.com/Damonlin11/FDA-approved-Targeted-Therapies-Label-Extraction/tree/main/biomarker_nlp) and through the Python Package Index (PyPI) (https://pypi.org/project/biomarker-nlp/). The core functionalities of the package are 1) HTML parsing, 2) named-entity recognition, and 3) negation detection. They are used to pull information from the following two sources:
DailyMed: the National Library of Medicine (NLM)’s database of drug labels [3].
National Cancer Institute (NCI): the list of FDA-approved targeted cancer therapies [5].
These sources provide HTML-formatted pages, which biomarker_nlp parses using the lxml library developed by Behnel, Faassen, & Bicking (2005) [6]. Users can scrape specific pieces of information, such as the drug label name, targeted therapy name, and disease name, from DailyMed and NCI webpages without having to consider the HTML tree structure. For instance, to extract the drug label name from a DailyMed label, users simply pass the label URL to the relevant function in the package, as follows:
>>> url = "https://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=939b5d1f-9fb2-4499-80ef-0607aa6b114e"
>>> biomarker_extraction.drug_brand_label(dailyMedURL = url)
'AVASTIN- bevacizumab injection, solution'
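To illustrate what "without considering the HTML tree structure" hides from the user, the following is a minimal sketch of pulling the text of one element out of a page. It is not the package's implementation: it swaps in Python's standard-library html.parser instead of lxml so that it runs without extra dependencies, and both the helper name first_tag_text and the sample markup are hypothetical (the real DailyMed page layout may differ).

```python
from html.parser import HTMLParser


class FirstTagText(HTMLParser):
    """Collect the text content of the first element with a given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self._inside = False   # currently inside the target element
        self._done = False     # first matching element already closed
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and not self._done:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == self.tag and self._inside:
            self._inside = False
            self._done = True

    def handle_data(self, data):
        if self._inside:
            self._parts.append(data)

    @property
    def text(self):
        # Collapse the whitespace left over from the markup indentation.
        return " ".join("".join(self._parts).split())


def first_tag_text(html_text, tag):
    """Hypothetical helper: return the text of the first <tag> element."""
    parser = FirstTagText(tag)
    parser.feed(html_text)
    return parser.text


# Hypothetical markup, for illustration only.
SAMPLE_HTML = """
<html><body>
  <h1>
    AVASTIN- bevacizumab injection, solution
  </h1>
  <p>Other label content</p>
</body></html>
"""

label_name = first_tag_text(SAMPLE_HTML, "h1")
print(label_name)
```

A package function such as drug_brand_label wraps this kind of traversal, together with the page download, behind a single call that takes only the URL.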
The package also includes functions that work directly on free text and extract biomarker information from it. It uses various pre-trained NER models from ScispaCy (Neumann et al. 2019) [7] to identify and extract biomarker entities, including gene, protein, and therapy names, from the provided text. Consider the following two examples of extracting gene and chemical names from free text:
>>> txt = "PEMAZYRE is indicated for the treatment of adults with relapsed or refractory myeloid/lymphoid neoplasms (MLNs) with fibroblast growth factor receptor 1 (FGFR1) rearrangement."
>>> biomarker_extraction.gene_protein_chemical(text = txt, gene = 1, protein = 0, chemical = 1)
{'gene': ['FGFR1'], 'chemical': ['PEMAZYRE']}
>>> txt = "ERBITUX is indicated, in combination with encorafenib, for the treatment of adult patients with metastatic colorectal cancer (CRC) with a BRAF V600E mutation, as detected by an FDA-approved test, after prior therapy."
>>> biomarker_extraction.gene_protein_chemical(text = txt, gene = 1, protein = 0, chemical = 1)
{'gene': ['BRAF'], 'chemical': ['encorafenib']}
Users can also extract the National Drug Code (NDC) numbers listed on a DailyMed label by simply passing the label URL. For example, this can be done for pemigatinib (Pemazyre®):
>>> url = "https://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=9e1f2222-1d89-4e63-989c-ccebe2ab1eb4"
>>> biomarker_extraction.ndc_code(dailyMedURL = url)
'50881-026-01, 50881-027-01, 50881-028-01'
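For readers who already have label text in hand, NDC-like codes can also be located with a regular expression. The sketch below is an illustration rather than the package's method: the helper name find_ndc_codes and the sample sentence are hypothetical, and the pattern assumes the three 10-digit NDC segment layouts (4-4-2, 5-3-2, and 5-4-1).

```python
import re

# Assumed layouts for 10-digit NDC codes: 4-4-2, 5-3-2 and 5-4-1 digits.
# The lookarounds reject matches embedded in longer digit/hyphen runs.
NDC_PATTERN = re.compile(
    r"(?<![\d-])(?:\d{4}-\d{4}-\d{2}|\d{5}-\d{3}-\d{2}|\d{5}-\d{4}-\d)(?![\d-])"
)


def find_ndc_codes(text):
    """Hypothetical helper: unique NDC-like codes in order of first appearance."""
    codes = []
    for code in NDC_PATTERN.findall(text):
        if code not in codes:
            codes.append(code)
    return codes


# Hypothetical label text built around the codes shown in the example above.
sample = (
    "Package label: NDC 50881-026-01, NDC 50881-027-01 and NDC 50881-028-01. "
    "Carton repeat: NDC 50881-026-01."
)

ndc_string = ", ".join(find_ndc_codes(sample))
print(ndc_string)
```

Joining the codes into one comma-separated string mirrors the format returned in the example above, which is convenient when the codes are stored in a single table column.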
Additionally, the package enables users to extract the diseases from an NCI page by providing the URL link. For example, we may want to obtain the diseases which can be treated with cetuximab (Erbitux®):
>>> url = "https://www.cancer.gov/about-cancer/treatment/drugs/cetuximab"
>>> biomarker_extraction.therapy_disease(url = url)
['Squamous cell carcinoma of the head and neck', 'Colorectal cancer']
For example, the automatically extracted entities, such as biomarkers, diseases, NDC codes, and drug labels, can be organized into a single data structure, as shown in Table 1. Compared with using the FDA open-source APIs and DailyMed APIs or manually searching through individual labels, this allows users to easily reorganize the information into a desired format for further research.
Table 1
An example data structure in which the extracted biomarkers are organized for downstream research. This is a subset of the output table at https://github.com/Damonlin11/FDA-approved-Targeted-Therapies-Label-Extraction/blob/main/2021-06-27%20FDA-approved%20targeted%20therapy%20labels.csv, generated with the code at https://github.com/Damonlin11/FDA-approved-Targeted-Therapies-Label-Extraction/blob/main/FDA_therapy_biomarker_extraction.ipynb.
| geneProtein_label | therapy_label | disease_label | drug_label | ndc_code |
|---|---|---|---|---|
| PD-L1 | Atezolizumab | Urothelial carcinoma | TECENTRIQ- atezolizumab injection, solution | 50242-917-01, 50242-917-86, 50242-918-01, 50242-918-86 |
| BRAF | Atezolizumab | Melanoma | TECENTRIQ- atezolizumab injection, solution | 50242-917-01, 50242-917-86, 50242-918-01, 50242-918-86 |
| PD-L1 | Atezolizumab | Non-small cell lung cancer | TECENTRIQ- atezolizumab injection, solution | 50242-917-01, 50242-917-86, 50242-918-01, 50242-918-86 |
| EGFR | Atezolizumab | Non-small cell lung cancer | TECENTRIQ- atezolizumab injection, solution | 50242-917-01, 50242-917-86, 50242-918-01, 50242-918-86 |
| ALK genomic | Atezolizumab | Non-small cell lung cancer | TECENTRIQ- atezolizumab injection, solution | 50242-917-01, 50242-917-86, 50242-918-01, 50242-918-86 |
| PD-L1 | Atezolizumab | Breast cancer | TECENTRIQ- atezolizumab injection, solution | 50242-917-01, 50242-917-86, 50242-918-01, 50242-918-86 |
| PD-L1 | Nivolumab | Non-small cell lung cancer | OPDIVO- nivolumab injection | 0003-3734-13, 0003-3772-11, 0003-3774-12 |
| EGFR | Nivolumab | Non-small cell lung cancer | OPDIVO- nivolumab injection | 0003-3734-13, 0003-3772-11, 0003-3774-12 |
| ALK genomic | Nivolumab | Non-small cell lung cancer | OPDIVO- nivolumab injection | 0003-3734-13, 0003-3772-11, 0003-3774-12 |
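A table of this shape can be assembled from per-label extraction results with a few lines of standard-library Python. The sketch below is one possible way to do it, not the code of the linked notebook: the helpers build_rows and rows_to_csv are hypothetical, the column name ndc_code is an assumed label for the NDC column, and the input values are copied from the nivolumab rows of Table 1 rather than extracted live.

```python
import csv
import io

# Column names follow Table 1; "ndc_code" is an assumed name for the NDC column.
FIELDS = ["geneProtein_label", "therapy_label", "disease_label",
          "drug_label", "ndc_code"]


def build_rows(therapy, drug_label, ndc_codes, indications):
    """Hypothetical helper: one row per (disease, biomarker) pair.

    `indications` is a list of (disease, [biomarkers]) tuples; the
    label-level fields are repeated on every row.
    """
    rows = []
    for disease, biomarkers in indications:
        for biomarker in biomarkers:
            rows.append({
                "geneProtein_label": biomarker,
                "therapy_label": therapy,
                "disease_label": disease,
                "drug_label": drug_label,
                "ndc_code": ndc_codes,
            })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Values copied from the nivolumab rows of Table 1.
rows = build_rows(
    therapy="Nivolumab",
    drug_label="OPDIVO- nivolumab injection",
    ndc_codes="0003-3734-13, 0003-3772-11, 0003-3774-12",
    indications=[("Non-small cell lung cancer", ["PD-L1", "EGFR", "ALK genomic"])],
)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Keeping one row per biomarker and disease pair, with the label-level fields repeated, makes the table easy to filter or group by any single column in downstream analysis.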
The logical structure and output for detecting and extracting biomarker entities using URL links or free text are visualized in Figure 1.

Figure 1
Logical structure and output for detecting biomarker entities in URL and free text using the biomarker_nlp package.
Negated biomarkers, which denote the absence of a particular molecular change (for example, a treatment prescribed for patients without a certain mutation), can be extremely important but are challenging to extract. The biomarker_nlp package therefore provides tools to detect negation in sentences using two negation transformer models, a negation cue detection model and a negation scope detection model, based on Khandelwal & Sawant’s (2020) NegBERT program [8], which applies a transfer learning approach. As the NegBERT program does not provide trained models, we performed the training process ourselves and published the two resulting models for free use. By using the negation modules and the pre-trained negation models, users can easily mine negated information from free text. Figure 2 shows the logical structure and output for detecting negations. The following is an example usage of the functions to detect the negation cue and extract the negation scope from a sentence:

Figure 2
Logical structure and output for detecting negation in free text using the biomarker_nlp package.
# detect negation cue
>>> negation_cue_scope.negation_detect(text = txt, modelCue = modelCue)
# extract the negation scope
>>> negation_cue_scope.negation_scope(text = txt, modelCue = modelCue, modelScope = modelScope)
After detecting the negated scope in a sentence, users can apply entity detection to extract the negated biomarkers from that scope. The following is an example of extracting the negated genes:
>>> txt = "TECENTRIQ, in combination with bevacizumab, paclitaxel, and carboplatin, is indicated for the first-line treatment of adult patients with metastatic non-squamous NSCLC with no EGFR or ALK genomic tumor aberrations."
>>> neg_scope = negation_cue_scope.negation_scope(text = txt, modelCue = modelCue, modelScope = modelScope)
>>> neg_text = ' '.join(neg_scope)
>>> biomarker_extraction.gene_protein_chemical(text = neg_text, gene = 1, protein = 0, chemical = 0)
{'gene': ['EGFR', 'ALK genomic']}
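The two-step flow above (find the negation scope, then look for entities inside it) can be sketched without any trained model. The following toy version is only meant to make the data flow concrete: it replaces the NegBERT-based cue and scope models with a naive keyword rule and replaces the ScispaCy NER step with a lookup in a small hypothetical gene list, so it will miss most real-world cases that the package's models are intended to handle.

```python
import re

# Toy stand-ins, both assumptions made for this sketch only.
NEGATION_CUES = {"no", "not", "without"}        # far from exhaustive
KNOWN_GENES = {"EGFR", "ALK", "BRAF", "FGFR1"}  # hypothetical lookup list
CLAUSE_BREAKS = {".", ",", ";"}


def toy_negation_scope(sentence):
    """Return the words after the first cue, up to the next clause break."""
    # Hyphenated words stay whole, so "non-squamous" is not read as "no".
    tokens = re.findall(r"[\w/-]+|[.,;]", sentence)
    for index, token in enumerate(tokens):
        if token.lower() in NEGATION_CUES:
            scope = []
            for following in tokens[index + 1:]:
                if following in CLAUSE_BREAKS:
                    break
                scope.append(following)
            return scope
    return []


def toy_negated_genes(sentence):
    """Keep only the scope tokens that appear in the gene lookup list."""
    return [token for token in toy_negation_scope(sentence) if token in KNOWN_GENES]


txt = ("TECENTRIQ, in combination with bevacizumab, paclitaxel, and carboplatin, "
       "is indicated for the first-line treatment of adult patients with "
       "metastatic non-squamous NSCLC with no EGFR or ALK genomic tumor aberrations.")

scope = toy_negation_scope(txt)
negated = toy_negated_genes(txt)
print(scope)
print(negated)
```

The keyword rule fails as soon as the cue is implicit or the scope crosses a comma, which is the motivation for using learned cue and scope models instead.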
Quality control
The biomarker_nlp package has been tested with several DailyMed URLs, the NCI targeted therapies website, and additional free-text passages. It functions as expected when the package and pre-trained models are loaded correctly. The following shows the package detecting genes and chemicals in free text:
>>> txt = "PEMAZYRE is indicated for the treatment of adults with relapsed or refractory myeloid/lymphoid neoplasms (MLNs) with fibroblast growth factor receptor 1 (FGFR1) rearrangement."
>>> biomarker_extraction.gene_protein_chemical(text = txt, gene = 1, protein = 0, chemical = 1)
{'gene': ['FGFR1'], 'chemical': ['PEMAZYRE']}
As another example, the following shows that the package is working when detecting if negation is present in a text, by returning the value “True”:
>>> txt = "TECENTRIQ, in combination with bevacizumab, paclitaxel, and carboplatin, is indicated for the first-line treatment of adult patients with metastatic non-squamous NSCLC with no EGFR or ALK genomic tumor aberrations."
>>> negation_cue_scope.negation_detect(text = txt, modelCue = modelCue)
True
Availability
Operating system
Works in all operating systems supporting Python.
Programming language
Python 3.6 or higher
Dependencies
numpy, pandas, requests, re, scispacy, spacy, lxml, torch, keras, sklearn, transformers, html
List of contributors
Junxia Lin, Yuezheng He, Subha Madhavan, Chul Kim, Simina M. Boca.
Software location
Archive
Name: PyPI
Persistent identifier: https://pypi.org/project/biomarker-nlp/
Licence: MIT
Publisher: Junxia Lin
Version published: 0.0.4
Date published: 09/04/21
Code repository
Name: GitHub
Identifier: https://github.com/Damonlin11/FDA-approved-Targeted-Therapies-Label-Extraction/tree/main/biomarker_nlp
Licence: MIT
Date published: 09/05/21
Language
English
Reuse potential
biomarker_nlp is useful for a variety of users who want to detect biomarker information in free text, including but not restricted to the list of FDA-approved targeted cancer therapies available on the NCI website or the DailyMed database of drug labels on the NLM website. In addition, it is capable of detecting negation in any free text. biomarker_nlp could also be integrated into an “AI-augmented curation” workflow, building on similar work by Mahmood et al. (2017), which developed a system to extract associations between genomic anomalies and drug responses from the biomedical literature [9]. The package is written in Python, a widely used programming language. It is documented and available on both the Python Package Index (PyPI) and GitHub. Users can submit questions or comments through GitHub issues or by email, and the authors of the package will do their best to support them.
Funding Information
This work was completed as part of a project funded by a pilot award (P30CA051008, PI of pilot: Kim).
Competing Interests
SM is currently an employee at Pfizer. SMB is currently an employee and minor shareholder at AstraZeneca. This work was performed while SM and SMB were at Georgetown University, with no funding coming from either Pfizer or AstraZeneca.
Author Contribution
JL developed the underlying code for the software and led the writing of the manuscript, with input from YH, SM, CK, and SMB. SMB oversaw the project, provided suggestions for improving the software development, and assisted with writing the manuscript. CK and SMB conceived the research project, with input from SM.
