Biomarker_nlp: A Python Package for Mining Biomarker Information for Targeted Cancer Therapies

Open Access | Dec 2024

Overview

Introduction

Biomarkers, including genetic alterations and gene and protein expression values, can indicate how a patient might respond to specific treatments. For example, a patient who lacks a certain mutation or who overexpresses a certain gene may be more likely to respond to a particular cancer treatment: certain non-small cell lung cancer patients without EGFR or ALK mutations respond to pembrolizumab as first-line treatment [1], and breast cancer patients with HER2 overexpression respond to trastuzumab [2].

The U.S. Food and Drug Administration (FDA) includes biomarker information in drug labels, which are officially documented in the DailyMed database. DailyMed contains approximately 140,000 FDA-approved and FDA-regulated product labels [3]. The label for a drug or biological product contains essential prescribing information in structured sections, such as indications and usage, and dosage and administration [3]. Within each section, however, the information is mostly unstructured free text. This text contains a variety of biomedical data, including biomarker information, that is used by a wide range of stakeholders, such as doctors, other healthcare providers, and biomedical researchers.

Despite the availability of these data, extracting relevant biomarker information efficiently at scale remains a challenge. Researchers and healthcare professionals often need to sift manually through multiple labels to curate the necessary data, which is time-consuming. Existing application programming interfaces (APIs), such as the FDA open-source APIs [4] and the DailyMed APIs [3], allow users to download drug information in XML or JSON format, but they do not directly provide biomarker information for the corresponding drug products. Users who want the biomarker information for specific diseases or drugs still need to parse the XML or JSON files and extract the relevant information themselves.

To address this gap, we developed biomarker_nlp, a Python package designed to streamline the extraction of biomarker information for targeted cancer therapies from DailyMed and other biomedical texts. The tool uses named-entity recognition (NER) techniques and pre-trained language models to identify and extract biomarker-related entities such as genes, proteins, therapies, and diseases. By automating the extraction process, biomarker_nlp is anticipated to significantly reduce the time and effort required for biomarker curation, facilitating more efficient research and, potentially, clinical decision-making.

Implementation and architecture

The biomarker_nlp package is freely available on GitHub (https://github.com/Damonlin11/FDA-approved-Targeted-Therapies-Label-Extraction/tree/main/biomarker_nlp) and through the Python Package Index (PyPI) (https://pypi.org/project/biomarker-nlp/). The core functionalities of the package include 1) HTML parsing, 2) Named-entity recognition, and 3) Negation detection. They are used to pull information from the following two sources:

  1. DailyMed: the National Library of Medicine (NLM)’s database of drug labels [3].

  2. National Cancer Institute (NCI): the list of FDA-approved targeted cancer therapies [5].

These sources provide HTML-formatted pages, which biomarker_nlp parses using tools from the lxml library developed by Behnel, Faassen, & Bicking (2005) [6]. biomarker_nlp enables users to scrape specific pieces of information, such as the drug label, targeted therapy name, and disease name, from DailyMed and NCI webpages by passing the page URL to the relevant function in the package, without having to consider the HTML tree structure. For instance, to extract the drug label name from a DailyMed label, users simply pass the URL to the relevant function, as follows:

>>> url = "https://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=939b5d1f-9fb2-4499-80ef-0607aa6b114e"

>>> biomarker_extraction.drug_brand_label(dailyMedURL = url)

'AVASTIN- bevacizumab injection, solution'
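To illustrate the kind of tree handling that such a function hides from the user, the sketch below pulls a label title out of a small HTML fragment. It is not the package's implementation: the markup, the `label-title` class name, and the `LabelTitleParser` and `extract_label_title` helpers are hypothetical, and the standard library's `html.parser` stands in for lxml so that the example runs without extra dependencies.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified markup; real DailyMed pages are more complex.
SAMPLE_HTML = """
<html><body>
  <h1 class="label-title">AVASTIN- bevacizumab injection, solution</h1>
  <p>Other page content</p>
</body></html>
"""


class LabelTitleParser(HTMLParser):
    """Capture the text of the first element whose class is 'label-title'."""

    def __init__(self):
        super().__init__()
        self._target_tag = None  # tag name of the element currently being captured
        self._chunks = []
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Start capturing only once, at the first matching element.
        if self.title is None and self._target_tag is None:
            if dict(attrs).get("class") == "label-title":
                self._target_tag = tag

    def handle_data(self, data):
        if self._target_tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Stop capturing when the matching element closes.
        if self._target_tag is not None and tag == self._target_tag:
            self.title = "".join(self._chunks).strip()
            self._target_tag = None


def extract_label_title(html_text):
    """Return the label title found in html_text, or None if there is none."""
    parser = LabelTitleParser()
    parser.feed(html_text)
    return parser.title


print(extract_label_title(SAMPLE_HTML))
```

A package function that accepts a URL would additionally download the page before parsing it; that step is omitted here so the sketch needs no network access.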

The package also includes the functions to work directly with free text and extract biomarker information from it. The package utilizes various pre-trained NER models from ScispaCy (Neumann et al. 2019) [7] to help identify and extract biomarker entities from provided free text, including gene, protein, and therapy names. Consider the following two examples of extracting gene and chemical names from free text:

>>> txt = "PEMAZYRE is indicated for the treatment of adults with relapsed or refractory myeloid/lymphoid neoplasms (MLNs) with fibroblast growth factor receptor 1 (FGFR1) rearrangement."

>>> biomarker_extraction.gene_protein_chemical(text = txt, gene = 1, protein = 0, chemical = 1)

{'gene': ['FGFR1'], 'chemical': ['PEMAZYRE']}

>>> txt = "ERBITUX is indicated, in combination with encorafenib, for the treatment of adult patients with metastatic colorectal cancer (CRC) with a BRAF V600E mutation, as detected by an FDA-approved test, after prior therapy."

>>> biomarker_extraction.gene_protein_chemical(text = txt, gene = 1, protein = 0, chemical = 1)

{'gene': ['BRAF'], 'chemical': ['encorafenib']}
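The way the `gene`, `protein`, and `chemical` flags shape the returned dictionary can be sketched as a grouping step over (text, label) entity pairs. This is only an illustration under stated assumptions: the `group_entities` helper and the `LABELS_BY_CATEGORY` mapping are hypothetical, the label strings are modelled on those used by publicly available ScispaCy NER models, and the source does not state which model or labels the package uses for each category.

```python
# Hypothetical mapping from an output key to the NER labels that feed it.
LABELS_BY_CATEGORY = {
    "gene": {"GENE_OR_GENE_PRODUCT"},
    "protein": {"PROTEIN"},
    "chemical": {"CHEMICAL"},
}


def group_entities(entities, gene=1, protein=0, chemical=0):
    """Group (text, label) pairs into a dict, keeping only the requested categories."""
    requested = {"gene": gene, "protein": protein, "chemical": chemical}
    result = {}
    for category, wanted in requested.items():
        if not wanted:
            continue  # categories switched off do not appear in the output
        labels = LABELS_BY_CATEGORY[category]
        unique = []
        for text, label in entities:
            if label in labels and text not in unique:
                unique.append(text)  # keep first-seen order, drop duplicates
        result[category] = unique
    return result


# Entity pairs written by hand for illustration; a real run would obtain them from an NER model.
entities = [
    ("PEMAZYRE", "CHEMICAL"),
    ("fibroblast growth factor receptor 1", "PROTEIN"),
    ("FGFR1", "GENE_OR_GENE_PRODUCT"),
]
print(group_entities(entities, gene=1, protein=0, chemical=1))
```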

Users can also extract the National Drug Code (NDC) codes listed on a DailyMed label by passing its URL. For example, for pemigatinib (Pemazyre®):

>>> url = "https://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=9e1f2222-1d89-4e63-989c-ccebe2ab1eb4"

>>> biomarker_extraction.ndc_code(dailyMedURL = url)

'50881-026-01, 50881-027-01, 50881-028-01'
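Because the codes come back as one comma-separated string, downstream code will often want them as a list. The following sketch splits such a string and checks each code against the three 10-digit NDC segment layouts (4-4-2, 5-3-2, and 5-4-1). The `split_ndc_codes` helper and `NDC_PATTERN` are hypothetical and not part of biomarker_nlp.

```python
import re

# Labeler-product-package segments of 4-4-2, 5-3-2, or 5-4-1 digits.
NDC_PATTERN = re.compile(r"^(?:\d{4}-\d{4}-\d{2}|\d{5}-\d{3}-\d{2}|\d{5}-\d{4}-\d{1})$")


def split_ndc_codes(raw):
    """Split a comma-separated string into (valid_codes, invalid_codes)."""
    codes = [part.strip() for part in raw.split(",") if part.strip()]
    valid = [code for code in codes if NDC_PATTERN.match(code)]
    invalid = [code for code in codes if not NDC_PATTERN.match(code)]
    return valid, invalid


valid, invalid = split_ndc_codes("50881-026-01, 50881-027-01, 50881-028-01")
print(valid)
print(invalid)
```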

Additionally, the package enables users to extract the diseases from an NCI page by providing the URL link. For example, we may want to obtain the diseases which can be treated with cetuximab (Erbitux®):

>>> url = "https://www.cancer.gov/about-cancer/treatment/drugs/cetuximab"

>>> biomarker_extraction.therapy_disease(url = url)

['Squamous cell carcinoma of the head and neck', 'Colorectal cancer']

The automatically extracted entities, such as biomarkers, diseases, NDC codes, and drug labels, can be organized into a single data structure, as shown in Table 1. Compared with using the FDA open-source APIs and DailyMed APIs or manually searching through individual labels, this allows users to easily reorganize the information into a desired format for further research.

Table 1

| geneProtein_label | therapy_label | disease_label | drug_label | NDC codes |
| --- | --- | --- | --- | --- |
| PD-L1 | Atezolizumab | Urothelial carcinoma | TECENTRIQ- atezolizumab injection, solution | 50242-917-01, 50242-917-86, 50242-918-01, 50242-918-86 |
| BRAF | Atezolizumab | Melanoma | TECENTRIQ- atezolizumab injection, solution | 50242-917-01, 50242-917-86, 50242-918-01, 50242-918-86 |
| PD-L1 | Atezolizumab | Non-small cell lung cancer | TECENTRIQ- atezolizumab injection, solution | 50242-917-01, 50242-917-86, 50242-918-01, 50242-918-86 |
| EGFR | Atezolizumab | Non-small cell lung cancer | TECENTRIQ- atezolizumab injection, solution | 50242-917-01, 50242-917-86, 50242-918-01, 50242-918-86 |
| ALK genomic | Atezolizumab | Non-small cell lung cancer | TECENTRIQ- atezolizumab injection, solution | 50242-917-01, 50242-917-86, 50242-918-01, 50242-918-86 |
| PD-L1 | Atezolizumab | Breast cancer | TECENTRIQ- atezolizumab injection, solution | 50242-917-01, 50242-917-86, 50242-918-01, 50242-918-86 |
| PD-L1 | Nivolumab | Non-small cell lung cancer | OPDIVO- nivolumab injection | 0003-3734-13, 0003-3772-11, 0003-3774-12 |
| EGFR | Nivolumab | Non-small cell lung cancer | OPDIVO- nivolumab injection | 0003-3734-13, 0003-3772-11, 0003-3774-12 |
| ALK genomic | Nivolumab | Non-small cell lung cancer | OPDIVO- nivolumab injection | 0003-3734-13, 0003-3772-11, 0003-3774-12 |
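One way to assemble rows like those in Table 1 is sketched below using only the standard library. The values are copied from the nivolumab rows of the table; the `build_rows` and `rows_to_csv` helpers are hypothetical, and `ndc_code` is an assumed name for the column holding the NDC codes. The package itself does not necessarily build the table this way.

```python
import csv
import io

# Column names follow Table 1; "ndc_code" is an assumed name for the NDC column.
FIELDS = ["geneProtein_label", "therapy_label", "disease_label", "drug_label", "ndc_code"]


def build_rows(therapy, drug_label, ndc_codes, genes_by_disease):
    """Create one row per (disease, gene) pair for a single therapy."""
    rows = []
    for disease, genes in genes_by_disease.items():
        for gene in genes:
            rows.append({
                "geneProtein_label": gene,
                "therapy_label": therapy,
                "disease_label": disease,
                "drug_label": drug_label,
                "ndc_code": ndc_codes,
            })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_rows(
    therapy="Nivolumab",
    drug_label="OPDIVO- nivolumab injection",
    ndc_codes="0003-3734-13, 0003-3772-11, 0003-3774-12",
    genes_by_disease={"Non-small cell lung cancer": ["PD-L1", "EGFR", "ALK genomic"]},
)
print(rows_to_csv(rows))
```

Keeping the rows as plain dictionaries means they can be passed to other tabular tools without further conversion.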

The logical structure and output for detecting and extracting biomarker entities using URL links or free text are visualized in Figure 1.

Figure 1. Logical structure and output for detecting biomarker entities in URL and free text using the biomarker_nlp package.

Negated biomarkers refer to the absence of a particular molecular change, for example when a treatment is prescribed for patients without a certain mutation. Such information can be extremely important but is challenging to extract. The biomarker_nlp package therefore provides tools to detect negation in sentences using two pre-trained transformer models, a negation cue detection model and a negation scope detection model, based on Khandelwal & Sawant’s (2020) NegBERT program [8], which applies a transfer learning approach. As the NegBERT program does not provide the trained models, we performed the training process and published the two resulting models for free use. By using the negation modules and these pre-trained models, users can mine negated information from free text. Figure 2 shows the logical structure and output for detecting negations. The following is an example of using the functions to detect the negation cue and extract the negation scope from a sentence:

Figure 2. Logical structure and output for detecting negation in free text using the biomarker_nlp package.

# detect negation cue

>>> negation_cue_scope.negation_detect(text = txt, modelCue = modelCue)

# extract the negation scope

>>> negation_cue_scope.negation_scope(text = txt, modelCue = modelCue, modelScope = modelScope)
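To make the terms concrete, a negation cue is the word that signals negation and the scope is the stretch of text it applies to. The sketch below is a deliberately naive rule-based stand-in, not NegBERT and not the package's models: the `find_negation` helper and its three-word cue list are hypothetical, and the scope is simply taken to run from the cue to the next punctuation mark.

```python
import re

NEGATION_CUES = ("no", "not", "without")  # small illustrative cue list


def find_negation(sentence):
    """Return (cue, scope_tokens) for the first cue found, or (None, []) if none."""
    # Words (hyphenated words stay whole) and single punctuation marks.
    tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)
    for index, token in enumerate(tokens):
        if token.lower() in NEGATION_CUES:
            scope = []
            for following in tokens[index + 1:]:
                if following in {".", ",", ";"}:
                    break  # naive rule: the scope ends at the next punctuation mark
                scope.append(following)
            return token, scope
    return None, []


sentence = (
    "TECENTRIQ, in combination with bevacizumab, paclitaxel, and carboplatin, "
    "is indicated for the first-line treatment of adult patients with metastatic "
    "non-squamous NSCLC with no EGFR or ALK genomic tumor aberrations."
)
cue, scope = find_negation(sentence)
print(cue)
print(scope)
```

Rules like these break down quickly on real label text, which is one motivation for using trained cue and scope models instead.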

After having detected the negated scope from a sentence, users will want to apply entity detection to extract the negated biomarkers from the scope. The following is an example of extracting the negated genes:

>>> txt = "TECENTRIQ, in combination with bevacizumab, paclitaxel, and carboplatin, is indicated for the first-line treatment of adult patients with metastatic non-squamous NSCLC with no EGFR or ALK genomic tumor aberrations."

>>> neg_scope = negation_cue_scope.negation_scope(text = txt, modelCue = modelCue, modelScope = modelScope)

>>> neg_text = ' '.join(neg_scope)

>>> biomarker_extraction.gene_protein_chemical(text = neg_text, gene = 1, protein = 0, chemical = 0)

{'gene': ['EGFR', 'ALK genomic']}
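Once the negated genes are known, they can be combined with the genes found in the surrounding text so that each biomarker carries a negation flag. The sketch below assumes two lists are already available: `all_genes` is a hypothetical result for a whole label section, and `negated_genes` is the output shown above. The `annotate_negation` helper is illustrative and not part of the package.

```python
def annotate_negation(all_genes, negated_genes):
    """Attach a boolean 'negated' flag to every gene, preserving input order."""
    negated = set(negated_genes)
    return [{"gene": gene, "negated": gene in negated} for gene in all_genes]


all_genes = ["PD-L1", "EGFR", "ALK genomic"]  # hypothetical: genes from the whole section
negated_genes = ["EGFR", "ALK genomic"]       # genes extracted from the negation scope

for record in annotate_negation(all_genes, negated_genes):
    print(record)
```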

Quality control

The biomarker_nlp package has been tested with several DailyMed URLs, the NCI targeted therapies website, and additional free texts. It functions well when the package and pre-trained models are loaded appropriately. The following shows that the package is working when detecting genes and chemicals from a free text:

>>> txt = "PEMAZYRE is indicated for the treatment of adults with relapsed or refractory myeloid/lymphoid neoplasms (MLNs) with fibroblast growth factor receptor 1 (FGFR1) rearrangement."

>>> biomarker_extraction.gene_protein_chemical(text = txt, gene = 1, protein = 0, chemical = 1)

{'gene': ['FGFR1'], 'chemical': ['PEMAZYRE']}

As another example, the following shows that the package is working when detecting if negation is present in a text, by returning the value “True”:

>>> txt = "TECENTRIQ, in combination with bevacizumab, paclitaxel, and carboplatin, is indicated for the first-line treatment of adult patients with metastatic non-squamous NSCLC with no EGFR or ALK genomic tumor aberrations."

>>> negation_cue_scope.negation_detect(text = txt, modelCue = modelCue)

True
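The two checks above can be wrapped in a small, repeatable smoke test. In the sketch below, `run_smoke_checks` is a hypothetical helper that takes the extraction and detection functions as arguments; the two stub functions only stand in for the package so that the example runs without biomarker_nlp or its models installed.

```python
def run_smoke_checks(extract_fn, detect_fn):
    """Run the two checks from this section and return a list of failure messages."""
    failures = []

    gene_text = (
        "PEMAZYRE is indicated for the treatment of adults with relapsed or "
        "refractory myeloid/lymphoid neoplasms (MLNs) with fibroblast growth "
        "factor receptor 1 (FGFR1) rearrangement."
    )
    expected = {"gene": ["FGFR1"], "chemical": ["PEMAZYRE"]}
    got = extract_fn(gene_text)
    if got != expected:
        failures.append(f"entity extraction: expected {expected}, got {got}")

    negated_text = (
        "TECENTRIQ, in combination with bevacizumab, paclitaxel, and carboplatin, "
        "is indicated for the first-line treatment of adult patients with metastatic "
        "non-squamous NSCLC with no EGFR or ALK genomic tumor aberrations."
    )
    if not detect_fn(negated_text):
        failures.append("negation detection: expected a true value")

    return failures


# Stand-ins (not the real package functions) that return the documented outputs.
def stub_extract(text):
    return {"gene": ["FGFR1"], "chemical": ["PEMAZYRE"]}


def stub_detect(text):
    return "no" in text.lower().split()


print(run_smoke_checks(stub_extract, stub_detect))
```

With the package and models loaded, the stubs would be replaced by thin wrappers such as `lambda t: biomarker_extraction.gene_protein_chemical(text=t, gene=1, protein=0, chemical=1)` and `lambda t: negation_cue_scope.negation_detect(text=t, modelCue=modelCue)`.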

Availability

Operating system

Works in all operating systems supporting Python.

Programming language

Python 3.6 or higher

Dependencies

Numpy, pandas, requests, re, scispacy, spacy, lxml, torch, keras, sklearn, transformers, html
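Before using the package, it can help to confirm that the interpreter and the listed dependencies are available. The sketch below checks this without importing the modules themselves. The import names are assumptions based on the list above (for example, scikit-learn is imported as `sklearn`), `re` and `html` ship with Python's standard library, and the `missing_modules` and `python_version_ok` helpers are hypothetical.

```python
import importlib.util
import sys

# Import names assumed from the dependency list; "re" and "html" are standard library modules.
REQUIRED_MODULES = [
    "numpy", "pandas", "requests", "re", "scispacy", "spacy",
    "lxml", "torch", "keras", "sklearn", "transformers", "html",
]


def missing_modules(names):
    """Return the names for which no installed module can be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_version_ok(minimum=(3, 6)):
    """Check that the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


print("Python version OK:", python_version_ok())
print("Missing modules:", missing_modules(REQUIRED_MODULES))
```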

List of contributors

Junxia Lin, Yuezheng He, Subha Madhavan, Chul Kim, Simina M. Boca.

Software location

Archive

Name: PyPI

Persistent identifier: https://pypi.org/project/biomarker-nlp/

Licence: MIT

Publisher: Junxia Lin

Version published: 0.0.4

Date published: 09/04/21

Code repository

Language

English

Reuse potential

biomarker_nlp is useful for a variety of users who want to detect biomarker information from free text, including but not restricted to the list of FDA-approved targeted cancer therapies available on the NCI website or the DailyMed database of drug labels on the NLM website. In addition, it is capable of detecting negation in any free text. biomarker_nlp could also be integrated into an “AI-augmented curation” workflow, building on similar work by Mahmood et al. (2017), which developed a system to extract associations between genomic anomalies and drug responses from the biomedical literature [9]. The package is written in Python, a widely used programming language. It is well documented and archived on both Python Package Index (PyPI) and GitHub repositories. Users can provide their questions or comments on GitHub issues or through email, and the authors of the package will do their best to support them.

Funding Information

This work was completed as part of a project funded by a pilot award (P30CA051008, PI of pilot: Kim).

Competing Interests

SM is currently an employee at Pfizer. SMB is currently an employee and minor shareholder at AstraZeneca. This work was performed while SM and SMB were at Georgetown University, with no funding coming from either Pfizer or AstraZeneca.

Author Contribution

JL developed the underlying code for the software and led the writing of the manuscript, with input from YH, SM, CK, and SMB. SMB oversaw the project, provided suggestions for improving the software development, and assisted with writing the manuscript. CK and SMB conceived the research project, with input from SM.

DOI: https://doi.org/10.5334/jors.502 | Journal eISSN: 2049-9647
Language: English
Submitted on: Dec 24, 2023 | Accepted on: Jun 28, 2024 | Published on: Dec 17, 2024
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2024 Junxia Lin, Yuezheng He, Subha Madhavan, Chul Kim, Simina M. Boca, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.