Natural Language Processing with PubMed

Text-mining to identify key molecular targets associated with diabetes.
NLP
Disease-gene association
Python
Author

Manish Datt

Published

September 5, 2024

Diabetes is a synergistic manifestation of a multitude of metabolic dysfunctions, which makes it difficult to pinpoint a specific molecular target for therapeutic intervention. Over the years, several drugs targeting different proteins have been approved. The best way to get information about the major proteins involved in the pathophysiology of diabetes is to review the scientific literature published in this field. Here, we’ll mine the research articles in PubMed to identify the major proteins for which inhibitors have been (or are being) developed to treat diabetes. Natural Language Processing (NLP) models trained on a corpus of biomedical text will be used to parse the literature.

Searching PubMed

We’ll search for all the research articles with “diabetes” in the title and “inhibitor” (and its variants) in the title or abstract. The functions available in the Biopython library will be used to do the search. Detailed steps for searching PubMed are outlined in this post. The full PubMed search query is given below.

((((diabetes[Title]) AND (inhibit[Title/Abstract] OR inhibitor[Title/Abstract] OR inhibitors[Title/Abstract])) NOT (review[Publication Type])) NOT (Clinical Trial[Publication Type])) NOT (Systematic Review[Publication Type])

Show Python code
from Bio import Entrez

Entrez.email = 'example@test.com'  # NCBI requires an email address; change this to your own
search_query = ("((((diabetes[Title]) AND (inhibit[Title/Abstract] OR inhibitor[Title/Abstract] "
                "OR inhibitors[Title/Abstract])) NOT (review[Publication Type])) "
                "NOT (Clinical Trial[Publication Type])) NOT (Systematic Review[Publication Type])")
pubs_diab = Entrez.esearch(db="pubmed", term=search_query, retmode="xml", retmax=9999)
record = Entrez.read(pubs_diab)
pubs_diab.close()
idlist = record["IdList"]

with open('pubmed_ids.txt', 'w') as f:
    for ids in idlist:
        f.write(ids + '\n')

# get the abstracts
with open('pubmed_ids.txt', 'r') as f:
    idlist = [line.strip() for line in f]

def fetch_abstracts(pmid):
    """Return the abstract(s) for a PubMed ID from its MEDLINE record."""
    abstracts = []
    pub = Entrez.efetch(db="pubmed", id=pmid, rettype="medline", retmode="text")
    record = pub.read()
    pub.close()
    lines = record.splitlines()
    if 'AB  - ' in record:
        for i, line in enumerate(lines):
            if line.startswith('AB  - '):
                # the abstract starts after the 'AB  - ' tag ...
                abstract = line[6:].strip()
                j = i + 1
                # ... and continues on the following indented lines
                while j < len(lines) and lines[j].startswith((' ', '\t')):
                    abstract += ' ' + lines[j].strip()
                    j += 1
                abstracts.append(abstract)
    else:
        abstracts.append('No abstract available')
    return abstracts

# fetch the abstract for each PubMed ID (one request per ID)
results = []
for ids in idlist:
    res = fetch_abstracts(ids)
    results.append(res[0])  # the abstract is a single space-joined line

# save the abstracts to a file
with open('abstracts_diabetes.txt', 'w',encoding='utf-8') as f:
    for abstract in results:
        f.write(abstract + '\n')

As of 31 August 2024, this search returned 9,966 hits, of which 9,335 entries had abstracts available in PubMed. We’ll now perform the NLP analysis using these abstracts.
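These numbers can be re-derived from the files written above. Below is a minimal sanity-check sketch, assuming the filenames used in the code above:

# count the retrieved PubMed IDs and the abstracts actually available
with open('pubmed_ids.txt') as f:
    n_ids = sum(1 for _ in f)

with open('abstracts_diabetes.txt', encoding='utf-8') as f:
    n_abstracts = sum(1 for line in f
                      if line.strip() and line.strip() != 'No abstract available')

print(f"PubMed hits: {n_ids}, abstracts available: {n_abstracts}")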

NLP-based Named-Entity Recognition

Natural Language Processing falls under the umbrella of AI and aims to enable computers to understand human language. Several techniques contribute to an efficient NLP analysis. One of these is Named-Entity Recognition (NER), wherein a pre-trained model detects entities in a sentence and assigns a label to each. For example, the table below shows the NER output for the following sentence.

India’s largest listed biopharmaceutical firm Biocon Ltd forged a strategic deal with Pfizer Inc. for worldwide commercialization of four insulin products, seeking to address a market worth a combined $14 billion.

Entity Label Explanation
India GPE Geopolitical entities, including countries, cities, and states.
Biocon Ltd ORG Organizations, including companies, agencies, and institutions.
Pfizer Inc. ORG Organizations, including companies, agencies, and institutions.
four CARDINAL Cardinal numbers, such as “one,” “two,” etc.
$14 billion MONEY Monetary values, including amounts and currencies.

The above NER was done using en_core_web_sm, a model pre-trained on general English text. Custom models can also be trained on domain-specific text; for instance, models trained on biomedical text can identify entities such as genes, diseases, and species. There are now quite a few pre-trained NLP models for biomedical text, such as BioBERT, BioGPT, and scispaCy. These models, together with an NLP framework like spaCy, make it easy to set up a biomedical text-mining workflow.
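For reference, here is a minimal sketch of how the table above can be reproduced with spaCy (assuming en_core_web_sm has been downloaded, e.g. with python -m spacy download en_core_web_sm):

import spacy

# general-purpose English model (not biomedical)
nlp = spacy.load("en_core_web_sm")

text = ("India's largest listed biopharmaceutical firm Biocon Ltd forged a strategic "
        "deal with Pfizer Inc. for worldwide commercialization of four insulin products, "
        "seeking to address a market worth a combined $14 billion.")

doc = nlp(text)
for ent in doc.ents:
    # ent.label_ is the entity type; spacy.explain() gives a short description of it
    print(ent.text, ent.label_, spacy.explain(ent.label_))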

The example below shows the named entities identified in two sentences using the en_ner_bionlp13cg_md model, available through scispaCy; a sketch of the corresponding code follows the tables.

With regards to medication management, for patients with clinical cardiovascular disease, a sodium-glucose cotransporter 2 (SGLT2) inhibitor or a glucagon-like peptide 1 (GLP-1) receptor agonist with proven cardiovascular benefit is recommended.

Entity Label
patients ORGANISM
cardiovascular ANATOMICAL_SYSTEM
sodium-glucose cotransporter 2 GENE_OR_GENE_PRODUCT
SGLT2 GENE_OR_GENE_PRODUCT
glucagon-like peptide 1 GENE_OR_GENE_PRODUCT


We also quantified PAGs before and after glucose control with a sodium-glucose cotransporter 2 (SGLT2) inhibitor, dapagliflozin.

Entity Label
glucose SIMPLE_CHEMICAL
sodium-glucose cotransporter 2 GENE_OR_GENE_PRODUCT
SGLT2 GENE_OR_GENE_PRODUCT
dapagliflozin SIMPLE_CHEMICAL
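
A minimal sketch of how these annotations can be generated (the en_ner_bionlp13cg_md model is distributed by scispaCy and must be installed separately; see the scispaCy documentation for the model download link):

import spacy

# biomedical NER model from scispaCy (installed separately)
nlp = spacy.load("en_ner_bionlp13cg_md")

sentences = [
    ("With regards to medication management, for patients with clinical cardiovascular "
     "disease, a sodium-glucose cotransporter 2 (SGLT2) inhibitor or a glucagon-like "
     "peptide 1 (GLP-1) receptor agonist with proven cardiovascular benefit is recommended."),
    ("We also quantified PAGs before and after glucose control with a sodium-glucose "
     "cotransporter 2 (SGLT2) inhibitor, dapagliflozin."),
]

for sent in sentences:
    doc = nlp(sent)
    for ent in doc.ents:
        print(ent.text, ent.label_)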

Next, we’ll run this NER model on all the abstracts collected above to identify GENE_OR_GENE_PRODUCT entities, and then count how often each gene appears. Below are the top five genes with the highest counts.

Show Python code
import spacy
import pandas as pd
from collections import Counter

# biomedical NER model from scispaCy
nlp = spacy.load("en_ner_bionlp13cg_md")

with open('abstracts_diabetes.txt', 'r', encoding='utf-8') as f:
    abstracts = [line.strip() for line in f]

abstracts = [x for x in abstracts if x != "No abstract available"]
#print("Number of abstracts:",len(abstracts))

# collect GENE_OR_GENE_PRODUCT entities; set() ensures a gene is counted
# at most once per abstract
all_genes = []
for abs1 in abstracts:
    doc = nlp(abs1)
    genes_list = [ent.text for ent in doc.ents if "GENE" in ent.label_]
    all_genes.extend(set(genes_list))

# save the gene counts to a file
with open('gene_counts.txt', 'w',encoding='utf-8') as f:
    for gene, count in Counter(all_genes).items():
        f.write(f"{gene}: {count}\n")

# display top 5
df = pd.DataFrame.from_dict(Counter(all_genes), orient="index")
df = df.reset_index()
df.columns=["Gene", "Count"]
df = df.sort_values('Count', ascending=False, ignore_index=True)
display(df.head())
Gene Count
insulin 2525
SGLT2 790
hemoglobin 621
T2D 560
DPP-4 435

As you would have noticed, there are genes in this list that are irrelevant to our analysis, such as insulin, hemoglobin, and T2D. Insulin is certainly associated with diabetes but cannot itself be a drug target; the same applies to hemoglobin. So, we need to filter this list to exclude such entries. After manually editing the list, the frequencies of the different genes with counts >100 are shown below.

Show Python code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

data = []
# read back the gene counts after manual curation of gene_counts.txt
with open('gene_counts - copy.txt', 'r', encoding='utf-8') as f:
    for line in f:
        if line.count(':') == 1:
            gene, count = line.strip().split(':')
            data.append((gene.strip(), int(count.strip())))
# create a DataFrame
df = pd.DataFrame(data, columns=['Gene', 'Count'])
# merge duplicate gene names by summing their counts
df = df.groupby('Gene').agg({'Count': 'sum'}).reset_index()
df = df.sort_values('Count', ascending=False, ignore_index=True)

exclude_list = ["insulin", "hemoglobin", "t2d"]
df = df[~df['Gene'].str.lower().isin(exclude_list)]

# express counts as a percentage of the 9,335 abstracts analysed
df['percent'] = (df['Count'] / 9335) * 100

df_subset = df[df['Count'] > 100]

plt.figure(figsize=(10, 6))
colors = sns.color_palette('flare_r', len(df_subset))
a_values = np.linspace(0.4, 0.6, len(df_subset)).tolist()

sns.barplot(x='Count', y='Gene', data=df_subset)
for i, bar in enumerate(plt.gca().patches):
    bar.set_color(colors[i])
    bar.set_alpha(a_values[i])

all_counts = {}
for i, v in enumerate(df_subset['Count']):
    # print the count value just past the end of each bar
    plt.text(v + 5, i, str(v), color=colors[i], ha='left', va='center', fontsize=12)
    all_counts[i] = v

# write the gene names inside the bars; truncate long names on short bars
for i, v in enumerate(df_subset['Gene']):
    if len(v) > 15 and all_counts[i] < 300:
        v = v[:15] + '...'
    plt.text(0, i, " " + v, color="black", ha='left', va='center')

plt.axis('off')
plt.box(False)
plt.tight_layout()
plt.show()

There are still some outstanding issues with these gene frequencies. As you can see, there are separate bars for different synonyms of the same gene. For instance, SGLT-2, sodium-glucose cotransporter 2, and sodium-glucose cotransporter-2 have separate counts. Subtle differences in the gene name (such as the presence or absence of a dash) result in their segregation. The table below lists the synonymous entries for SGLT-2 and for GLP-1, not to mention several other “typo” variants in this list.

Gene Count
SGLT2 790
DPP-4 435
sodium-glucose cotransporter 2 415
GLP-1 353
SGLT2i 334
ACE 323
sodium-glucose cotransporter-2 322
alpha-glucosidase 309
SGLT-2 217
glucagon-like peptide-1 receptor agonists 208
PAI-1 204
glucagon-like peptide-1 200
TNF-alpha 180
aldose reductase 151
DPP4 150
peptidase-4 150
IL-6 142
plasminogen activator inhibitor-1 135
sodium-glucose co-transporter 2 125
nitric oxide synthase 122
glucagon 121
albumin 119
dipeptidyl peptidase 4 118
sodium-glucose co-transporter-2 118
ARBs 112
IL-1beta 112
SU 110
NF-kappaB 108
DPP4i 102
alpha-amylase 102

Now, we need to manually curate this list by mapping the synonymous gene names to a single canonical name so that their counts can be added together. After all the hard work, this yields the final set of top genes identified by text-mining; a sketch of the renaming-and-merging step is given below.
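While the curation itself is manual, the renaming and merging can be expressed in a few lines of pandas. A minimal sketch, assuming the df from the previous code block and an illustrative (deliberately incomplete) synonym map built from the variants in the table above:

# illustrative synonym map; a real curation pass would cover many more aliases
synonym_map = {
    "SGLT-2": "SGLT2",
    "SGLT2i": "SGLT2",
    "sodium-glucose cotransporter 2": "SGLT2",
    "sodium-glucose cotransporter-2": "SGLT2",
    "sodium-glucose co-transporter 2": "SGLT2",
    "sodium-glucose co-transporter-2": "SGLT2",
    "glucagon-like peptide-1": "GLP-1",
    "glucagon-like peptide-1 receptor agonists": "GLP-1",
    "dipeptidyl peptidase 4": "DPP-4",
    "DPP4": "DPP-4",
    "DPP4i": "DPP-4",
    "plasminogen activator inhibitor-1": "PAI-1",
}

# replace each variant with its canonical name and re-aggregate the counts
df['Gene'] = df['Gene'].replace(synonym_map)
df = df.groupby('Gene', as_index=False)['Count'].sum()
df = df.sort_values('Count', ascending=False, ignore_index=True)
display(df.head())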

NLP opens up exciting opportunities to programmatically parse the ever-expanding repertoire of biomedical text. Meticulous use of NLP techniques can indeed provide new insights based on the knowledge stored in literature databases. There are, however, some caveats to keep in mind when working with such AI capabilities. First and foremost is model training: named-entity recognition in biomedical text is still limited to a few entity types, which limits the kinds of questions that can be addressed with these models. Also, some terms, such as T2D, are labeled as a gene when they should be labeled as a disease. Another issue is the varied nomenclature for a given entity in biomedical text; for example, GLP1 and GLP-1 are identified as two different genes. In fact, this point holds a lesson for biologists: they need to be mindful of the technical terms used in manuscripts and should adhere to canonical names to avoid littering the databases with undesired aliases.

This post walked through an NLP-based Text and Data Mining (TDM) workflow and flagged important issues along the way. The dataset used here is rather small, at ~10K abstracts; richer data, particularly full-text articles, would certainly add depth to the analysis. With improving NLP models and the growing availability of open-access literature, TDM is poised to propel biomedical discoveries in the future.
