my.bioinfo.guru – Word cloud using Python

Learn to make fancy word clouds using Python.

Introduction

Parsing textual data to extract the required information can be a daunting task. In scientific research, hundreds of articles get published each day and it is impractical to manually go through this repertoire of scientific knowledge. To stay up-to-date with the cutting-edge research one needs to learn how to programmatically parse the text data. A word cloud is a pictorial representation of the frequencies for different words in a particular text. Such a representation makes it intuitive to study patterns in a textual data.

Here, we’ll use abstracts from the PubMed database as the data source to generate word clouds. We’ll use the data from multiple keyword searches and then do a comparative analysis of the different word clouds.

To generate word clouds using Python, we have a couple of options in term of libraries that we can use. The wordcloud library provides function to generate word clouds. However, this library has limited option to customize the appearance of the word diagram. Another library, stylecloud, provides option to decorate the word cloud with the fontawesome icons. It can be installed by running pip install stylecloud. For this tutorial, we’ll use the stylecloud library. The gen_stylecloud function takes a text file as a keyword argument to read the data. The icon_name argument can be used to specify the icon (from fontawesome) which is to be used a shape for the word cloud. To save an image of the word cloud, specify the file name to the output_name argument.

import stylecloud

stylecloud.gen_stylecloud(file_path='abstract-prediabete-set.txt',
                          icon_name= "fas fa-heart",
                         output_name='preDia_cloud.png')

stylecloud.gen_stylecloud(file_path='abstract-diabetesTi-set.txt',
                          icon_name= "fas fa-heart",
                         output_name='Dia_cloud.png')

The data for the word clouds include abstracts of articles having 1) diabetes and 2) prediabetes in the title. Articles having both the words in the title have been excluded and only review articles were considered. Top twenty articles based on date of publication, for each search, were finally selected.

We can filter some of the common word from the data to make the word cloud more meaningful. For instance, words such as “review”, “studies”, etc. can be removed from the text. Let’s re-draw the cloud after excluding these common words.

Looking at the word cloud we get some interesting information about the two cases – diabetes and prediabetes. In the prediabetes cloud, predominant words include “early”, “intervention”, “lifestyle modification”, etc. On the other hand, in the diabetes word cloud, the words appearing predominantly include “patient”, “disease”, “GDM”, etc.

We can certainly further optimize the cloud by removing a few more common words or the words that are obvious (like diabetes). An important point to note here is that we have used a rather small dataset (20 articles). To get a better representation of the words associated with the two terms, a larger dataset would be more appropriate. This approach of drawing word cloud can of course be applied to other keyword searches. These word cloud are a great way to enhance the presentability of posters or infographics.