Natural Language Processing with PubMed

Text-mining to identify key molecular targets associated with diabetes.
NLP
Disease-gene association
Python
Author

Manish Datt

Published

September 5, 2024

Diabetes is a synergistic manifestation of a multitude of metabolic dysfunctions, which makes it difficult to pinpoint a specific molecular target for therapeutic intervention. Over the years, several drugs targeting different proteins have been approved. The best way to get information about the major proteins involved in the pathophysiology of diabetes is to review the scientific literature published in this field. Here, we’ll mine the research articles in PubMed to identify the major proteins for which inhibitors have been (or are being) developed to treat diabetes. Natural Language Processing (NLP) models trained on a corpus of biomedical text will be used to parse the literature.

Searching PubMed

We’ll search for all the research articles with “diabetes” in the title and “inhibitor” (and its variants) in the title or abstract. The functions available in the Biopython library will be used to do the search. Detailed steps for searching PubMed are outlined in this post. The full PubMed search query is given below.

((((diabetes[Title]) AND (inhibit[Title/Abstract] OR inhibitor[Title/Abstract] OR inhibitors[Title/Abstract])) NOT (review[Publication Type])) NOT (Clinical Trial[Publication Type])) NOT (Systematic Review[Publication Type])

Show Python code
from Bio import Entrez

Entrez.email = 'example@test.com'  # NCBI requires an email address; change this to your own
search_query = ("((((diabetes[Title]) AND (inhibit[Title/Abstract] OR inhibitor[Title/Abstract] "
                "OR inhibitors[Title/Abstract])) NOT (review[Publication Type])) "
                "NOT (Clinical Trial[Publication Type])) NOT (Systematic Review[Publication Type])")
pubs_diab = Entrez.esearch(db="pubmed", term=search_query, retmode="xml", retmax=9999)
record = Entrez.read(pubs_diab)
pubs_diab.close()
idlist = record["IdList"]

with open('pubmed_ids.txt', 'w') as f:
    for ids in idlist:
        f.write(ids + '\n')

# get the abstracts
with open('pubmed_ids.txt', 'r') as f:
    idlist = [line.strip() for line in f]

def fetch_abstracts(pmid):
    """Return the abstract(s) for a PubMed ID from its MEDLINE record."""
    abstracts = []
    pub = Entrez.efetch(db="pubmed", id=pmid, rettype="medline", retmode="text")
    record = pub.read()
    pub.close()
    lines = record.splitlines()
    if 'AB  - ' in record:
        for i, line in enumerate(lines):
            if line.startswith('AB  - '):
                # the abstract starts after the 'AB  - ' tag ...
                abstract = line[6:].strip()
                j = i + 1
                # ... and continues on the following indented lines
                while j < len(lines) and lines[j].startswith((' ', '\t')):
                    abstract += ' ' + lines[j].strip()
                    j += 1
                abstracts.append(abstract)
    else:
        abstracts.append('No abstract available')
    return abstracts

# fetch the abstract for each PubMed ID (one request per ID)
results = []
for ids in idlist:
    res = fetch_abstracts(ids)
    results.append(res[0])  # the abstract is a single space-joined line

# save the abstracts to a file
with open('abstracts_diabetes.txt', 'w',encoding='utf-8') as f:
    for abstract in results:
        f.write(abstract + '\n')

As of 31 August 2024, this search returned 9,966 hits, of which 9,335 entries had abstracts available in PubMed. We’ll now perform the NLP analysis using these abstracts.
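These numbers can be re-derived from the files written above. Below is a minimal sanity-check sketch, assuming the filenames used in the code above:

# count the retrieved PubMed IDs and the abstracts actually available
with open('pubmed_ids.txt') as f:
    n_ids = sum(1 for _ in f)

with open('abstracts_diabetes.txt', encoding='utf-8') as f:
    n_abstracts = sum(1 for line in f
                      if line.strip() and line.strip() != 'No abstract available')

print(f"PubMed hits: {n_ids}, abstracts available: {n_abstracts}")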

NLP-based Named-Entity Recognition

Natural Language Processing falls under the umbrella of AI and aims to enable computers to understand human language. Several techniques contribute to an efficient NLP analysis. One of these is Named-Entity Recognition (NER), wherein a pre-trained model detects entities in a sentence and assigns a label to each. For example, the table below shows the NER output for the following sentence.

India’s largest listed biopharmaceutical firm Biocon Ltd forged a strategic deal with Pfizer Inc. for worldwide commercialization of four insulin products, seeking to address a market worth a combined $14 billion.

Entity Label Explanation
India GPE Geopolitical entities, including countries, cities, and states.
Biocon Ltd ORG Organizations, including companies, agencies, and institutions.
Pfizer Inc. ORG Organizations, including companies, agencies, and institutions.
four CARDINAL Cardinal numbers, such as “one,” “two,” etc.
$14 billion MONEY Monetary values, including amounts and currencies.

The above NER was done using en_core_web_sm, a model pre-trained on general English text. Custom models can also be trained on domain-specific text; for instance, models trained on biomedical text can identify entities such as genes, diseases, and species. There are now quite a few pre-trained NLP models for biomedical text, such as BioBERT, BioGPT, and scispaCy. These models, together with an NLP framework like spaCy, make it easy to set up a biomedical text-mining workflow.
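For reference, here is a minimal sketch of how the table above can be reproduced with spaCy (assuming en_core_web_sm has been downloaded, e.g. with python -m spacy download en_core_web_sm):

import spacy

# general-purpose English model (not biomedical)
nlp = spacy.load("en_core_web_sm")

text = ("India's largest listed biopharmaceutical firm Biocon Ltd forged a strategic "
        "deal with Pfizer Inc. for worldwide commercialization of four insulin products, "
        "seeking to address a market worth a combined $14 billion.")

doc = nlp(text)
for ent in doc.ents:
    # ent.label_ is the entity type; spacy.explain() gives a short description of it
    print(ent.text, ent.label_, spacy.explain(ent.label_))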

The example below shows the named entities identified in two sentences using the en_ner_bionlp13cg_md model, available through scispaCy; a sketch of the corresponding code follows the tables.

With regards to medication management, for patients with clinical cardiovascular disease, a sodium-glucose cotransporter 2 (SGLT2) inhibitor or a glucagon-like peptide 1 (GLP-1) receptor agonist with proven cardiovascular benefit is recommended.

Entity Label
patients ORGANISM
cardiovascular ANATOMICAL_SYSTEM
sodium-glucose cotransporter 2 GENE_OR_GENE_PRODUCT
SGLT2 GENE_OR_GENE_PRODUCT
glucagon-like peptide 1 GENE_OR_GENE_PRODUCT


We also quantified PAGs before and after glucose control with a sodium-glucose cotransporter 2 (SGLT2) inhibitor, dapagliflozin.

Entity Label
glucose SIMPLE_CHEMICAL
sodium-glucose cotransporter 2 GENE_OR_GENE_PRODUCT
SGLT2 GENE_OR_GENE_PRODUCT
dapagliflozin SIMPLE_CHEMICAL
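
A minimal sketch of how these annotations can be generated (the en_ner_bionlp13cg_md model is distributed by scispaCy and must be installed separately; see the scispaCy documentation for the model download link):

import spacy

# biomedical NER model from scispaCy (installed separately)
nlp = spacy.load("en_ner_bionlp13cg_md")

sentences = [
    ("With regards to medication management, for patients with clinical cardiovascular "
     "disease, a sodium-glucose cotransporter 2 (SGLT2) inhibitor or a glucagon-like "
     "peptide 1 (GLP-1) receptor agonist with proven cardiovascular benefit is recommended."),
    ("We also quantified PAGs before and after glucose control with a sodium-glucose "
     "cotransporter 2 (SGLT2) inhibitor, dapagliflozin."),
]

for sent in sentences:
    doc = nlp(sent)
    for ent in doc.ents:
        print(ent.text, ent.label_)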

Next, we’ll run this NER model on all the abstracts collected above to identify GENE_OR_GENE_PRODUCT entities, and then count how often each gene appears. Below are the top five genes with the highest counts.

Show Python code
import spacy
import pandas as pd
from collections import Counter

# biomedical NER model from scispaCy
nlp = spacy.load("en_ner_bionlp13cg_md")

with open('abstracts_diabetes.txt', 'r', encoding='utf-8') as f:
    abstracts = [line.strip() for line in f]

abstracts = [x for x in abstracts if x != "No abstract available"]
#print("Number of abstracts:",len(abstracts))

# collect GENE_OR_GENE_PRODUCT entities; set() ensures a gene is counted
# at most once per abstract
all_genes = []
for abs1 in abstracts:
    doc = nlp(abs1)
    genes_list = [ent.text for ent in doc.ents if "GENE" in ent.label_]
    all_genes.extend(set(genes_list))

# save the gene counts to a file
with open('gene_counts.txt', 'w',encoding='utf-8') as f:
    for gene, count in Counter(all_genes).items():
        f.write(f"{gene}: {count}\n")

# display top 5
df = pd.DataFrame.from_dict(Counter(all_genes), orient="index")
df = df.reset_index()
df.columns=["Gene", "Count"]
df = df.sort_values('Count', ascending=False, ignore_index=True)
display(df.head())
Gene Count
insulin 2525
SGLT2 790
hemoglobin 621
T2D 560
DPP-4 435

As you would have noticed, there are genes in this list that are irrelevant to our analysis, such as insulin, hemoglobin, and T2D. Insulin is certainly associated with diabetes but cannot itself be a drug target; the same applies to hemoglobin. So, we need to filter this list to exclude such entries. After manually editing the list, the frequencies of the different genes with counts >100 are shown below.

Show Python code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

data = []
# read back the gene counts after manual curation of gene_counts.txt
with open('gene_counts - copy.txt', 'r', encoding='utf-8') as f:
    for line in f:
        if line.count(':') == 1:
            gene, count = line.strip().split(':')
            data.append((gene.strip(), int(count.strip())))
# create a DataFrame
df = pd.DataFrame(data, columns=['Gene', 'Count'])
# merge duplicate gene names by summing their counts
df = df.groupby('Gene').agg({'Count': 'sum'}).reset_index()
df = df.sort_values('Count', ascending=False, ignore_index=True)

exclude_list = ["insulin", "hemoglobin", "t2d"]
df = df[~df['Gene'].str.lower().isin(exclude_list)]

# express counts as a percentage of the 9,335 abstracts analysed
df['percent'] = (df['Count'] / 9335) * 100

df_subset = df[df['Count'] > 100]

plt.figure(figsize=(10, 6))
colors = sns.color_palette('flare_r', len(df_subset))
a_values = np.linspace(0.4, 0.6, len(df_subset)).tolist()

sns.barplot(x='Count', y='Gene', data=df_subset)
for i, bar in enumerate(plt.gca().patches):
    bar.set_color(colors[i])
    bar.set_alpha(a_values[i])

all_counts = {}
for i, v in enumerate(df_subset['Count']):
    # print the count value just past the end of each bar
    plt.text(v + 5, i, str(v), color=colors[i], ha='left', va='center', fontsize=12)
    all_counts[i] = v

# write the gene names inside the bars; truncate long names on short bars
for i, v in enumerate(df_subset['Gene']):
    if len(v) > 15 and all_counts[i] < 300:
        v = v[:15] + '...'
    plt.text(0, i, " " + v, color="black", ha='left', va='center')

plt.axis('off')
plt.box(False)
plt.tight_layout()
plt.show()

There are still some outstanding issues with these gene frequencies. As you can see, there are separate bars for different synonyms of the same gene. For instance, SGLT-2, sodium-glucose cotransporter 2, and sodium-glucose cotransporter-2 have separate counts. Subtle differences in the gene name (such as the presence or absence of a dash) result in their segregation. The table below lists the synonymous entries for SGLT-2 and for GLP-1, not to mention several other “typo” variants in this list.

Gene Count
SGLT2 790
DPP-4 435
sodium-glucose cotransporter 2 415
GLP-1 353
SGLT2i 334
ACE 323
sodium-glucose cotransporter-2 322
alpha-glucosidase 309
SGLT-2 217
glucagon-like peptide-1 receptor agonists 208
PAI-1 204
glucagon-like peptide-1 200
TNF-alpha 180
aldose reductase 151
DPP4 150
peptidase-4 150
IL-6 142
plasminogen activator inhibitor-1 135
sodium-glucose co-transporter 2 125
nitric oxide synthase 122
glucagon 121
albumin 119
dipeptidyl peptidase 4 118
sodium-glucose co-transporter-2 118
ARBs 112
IL-1beta 112
SU 110
NF-kappaB 108
DPP4i 102
alpha-amylase 102

Now, we need to manually curate this list by mapping the synonymous gene names to a single canonical name so that their counts can be added together. After all the hard work, this yields the final set of top genes identified by text-mining; a sketch of the renaming-and-merging step is given below.
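While the curation itself is manual, the renaming and merging can be expressed in a few lines of pandas. A minimal sketch, assuming the df from the previous code block and an illustrative (deliberately incomplete) synonym map built from the variants in the table above:

# illustrative synonym map; a real curation pass would cover many more aliases
synonym_map = {
    "SGLT-2": "SGLT2",
    "SGLT2i": "SGLT2",
    "sodium-glucose cotransporter 2": "SGLT2",
    "sodium-glucose cotransporter-2": "SGLT2",
    "sodium-glucose co-transporter 2": "SGLT2",
    "sodium-glucose co-transporter-2": "SGLT2",
    "glucagon-like peptide-1": "GLP-1",
    "glucagon-like peptide-1 receptor agonists": "GLP-1",
    "dipeptidyl peptidase 4": "DPP-4",
    "DPP4": "DPP-4",
    "DPP4i": "DPP-4",
    "plasminogen activator inhibitor-1": "PAI-1",
}

# replace each variant with its canonical name and re-aggregate the counts
df['Gene'] = df['Gene'].replace(synonym_map)
df = df.groupby('Gene', as_index=False)['Count'].sum()
df = df.sort_values('Count', ascending=False, ignore_index=True)
display(df.head())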

NLP opens up exciting opportunities to programmatically parse the ever-expanding repertoire of biomedical text. Meticulous use of NLP techniques can indeed provide new insights based on the knowledge stored in literature databases. There are, however, some caveats to keep in mind when working with such AI capabilities. First and foremost is model training: named-entity recognition in biomedical text is still limited to a few entity types, which limits the kinds of questions that can be addressed with these models. Also, some terms, such as T2D, are labeled as a gene when they should be labeled as a disease. Another issue is the varied nomenclature for a given entity in biomedical text; for example, GLP1 and GLP-1 are identified as two different genes. In fact, this point holds a lesson for biologists: they need to be mindful of the technical terms used in manuscripts and should adhere to canonical names to avoid littering the databases with undesired aliases.

This post walked through an NLP-based Text and Data Mining (TDM) workflow and flagged important issues along the way. The dataset used here is rather small, at ~10K abstracts; richer data, particularly full-text articles, would certainly add depth to the analysis. With improving NLP models and the growing availability of open-access literature, TDM is poised to propel biomedical discoveries in the future.
