Key Phrase Extraction and Visualization: Python and Microsoft Power BI

Discover insights in unstructured text

 

Key-Phrase Extraction, Photo by Rabie Madaci on Unsplash

We live in an age where data is the new currency! It is a big part of why the big tech giants are among the richest companies in the world, and data may well be one of the best investments of the next few decades. So, what do these companies do with this data? How can anyone handle pieces of unstructured text from Facebook posts, Twitter or LinkedIn? To a layman, scanning or sampling might sound like a good idea. However, data scientists know the risks of sampling and the pain of scanning text by text, row by row, and word by word 😬. This is where data experts use “Key-phrase Extraction”.

Key-phrase extraction evaluates unstructured text and returns a list of key phrases. For example, given the input text “The food was delicious and there were wonderful staff”, a key-phrase extraction service returns the main talking points: “food” and “wonderful staff”.

What will we Discuss?

Here is the link for the sample data that we will use: Sample Data

What is RAKE?
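
RAKE stands for Rapid Automatic Keyword Extraction. In short, it splits the text at stop words and punctuation to get candidate phrases, scores each word as degree divided by frequency, and scores each phrase as the sum of its word scores. The block below is a minimal sketch of that idea in plain Python, not the python-rake source. The tiny stop list and the helper names `candidate_phrases` and `rake_scores` are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stop list (an assumption); a real run would use the full stop-word file.
STOP_WORDS = {"the", "was", "and", "there", "were"}


def candidate_phrases(text, stop_words):
    """Split text into runs of consecutive non-stop words."""
    phrases = []
    # Punctuation always ends a phrase, so each fragment is handled separately.
    for fragment in re.split(r'[.,;:!?()"\n]+', text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in stop_words:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_scores(text, stop_words=STOP_WORDS):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text, stop_words)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # total length of the phrases each word occurs in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    scored = {}
    for phrase in phrases:
        scored[" ".join(phrase)] = sum(degree[w] / freq[w] for w in phrase)
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


print(rake_scores("The food was delicious and there were wonderful staff"))
```

With this stop list the top-ranked phrase is “wonderful staff”. RAKE returns every candidate phrase with a score, so “food” and “delicious” also appear further down the list.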

Resources Required

  • Microsoft Power BI Desktop (Pro License)
  • (OPTIONAL) Microsoft Azure Subscription (Free Trial or Paid) to correlate key-phrases together with sentiments.

Are you ready?? Here we go 🏄

Step 1: Install RAKE package and store stop wordlist

!pip install python-rake==1.4.4
 
Installing RAKE algorithm package in Spyder Python instance
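
Power BI runs the script with whichever Python installation it is configured to use, so the package has to be present in that environment. As a rough check, the sketch below looks up the installed version of a package by its distribution name. The helper name `installed_version` is hypothetical, and the distribution name `python-rake` is taken from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version("python-rake")
if found is None:
    print("python-rake is not installed in this Python environment")
else:
    print("python-rake version:", found)
```

Running this with the same interpreter that Power BI points to shows whether the install step reached the right environment.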

1.2 Create stop wordlist: Stop words are words that generally do not help in text analysis (for example “the”, “is” and “and”). Most information systems drop them because they carry little meaning on their own. Words that do carry meaning related to the text are described as content bearing and are called content words. You can download the stopwords list here and customize it as per your requirements. Save it at the desired location and copy the path for configuring the Python script.

Example of stopwords
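
To make the idea concrete, here is a minimal sketch of reading a stop-word file and separating content words from stop words. It assumes a file with one stop word per line and optional comment lines starting with “#”; the helper names `load_stop_words` and `content_words` are hypothetical, and the demo writes a small temporary file instead of using the real list.

```python
import os
import tempfile


def load_stop_words(path):
    """Read one stop word per line, skipping blank lines and '#' comments."""
    words = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word = line.strip().lower()
            if word and not word.startswith("#"):
                words.add(word)
    return words


def content_words(text, stop_words):
    """Keep only the words that are not in the stop list."""
    return [w for w in text.lower().split() if w not in stop_words]


# Demo: write a tiny stop list to a temporary file, load it, then clean up.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("# demo stop words\nthe\nwas\nand\nthere\nwere\n")
    demo_path = tmp.name

demo_stop_words = load_stop_words(demo_path)
os.remove(demo_path)

print(content_words("The food was delicious and there were wonderful staff", demo_stop_words))
# prints ['food', 'delicious', 'wonderful', 'staff']
```

Adding a domain-specific word to the file (a product name that appears in every answer, say) removes it from the extracted phrases in the same way.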

Step 2: Open Power BI, Import Data & Configure Python script

 
Calling “Run Python script” in Power Query Editor

2.2 Prepare your Python Script: You can use the Python script below and customize it by replacing the path to the stopwords list (the stop_dir variable).

You can also restrict the number of key-phrases extracted per text by modifying the slice on the line that calls rake_object.run (i.e. replace [-1:] with [-5:] to get up to 5 key-phrases from 1 text input).

"""
@author: Jayant Kumar Kodwani
"""
# 'dataset' holds the input data for this script (Power BI supplies it)
import RAKE
import pandas as pd

# Path to the stopwords list, REPLACE as required
stop_dir = r"C:\Users\Jayant\.spyder-py3\stopwords.txt"
rake_object = RAKE.Rake(stop_dir)

# Assign your dataset to a variable
df = dataset

def Sort_Tuple(tup):
    # Sort (phrase, score) tuples by score, lowest first
    tup.sort(key=lambda x: x[1])
    return tup

# Collect one small DataFrame per input row and combine them at the end
frames = []

# Loop through all the rows and apply RAKE to the Answer column
for x in range(len(df)):
    subtitles = df.Answer[x]
    print(subtitles)
    # Run the RAKE algorithm; change [-1:] to get more than 1 keyphrase from the text
    keywords = Sort_Tuple(rake_object.run(subtitles))[-1:]
    # Create a DataFrame using the RAKE output data
    Output = pd.DataFrame(keywords, columns=['Word', 'KeywordScore'])
    Output['Keywords'] = keywords
    Output['KeywordScore'] = Output['KeywordScore'].astype('float')
    Output['Date'] = df.Date[x]
    Output['Question'] = df.Question[x]
    Output['Answer'] = df.Answer[x]
    Output['Index'] = df.Index[x]
    frames.append(Output)

# DataFrame.append was removed in pandas 2.0, so the result is built with pd.concat
Rake_Final_Output = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
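
The [-1:] slice in the script works because Sort_Tuple orders the phrases from lowest to highest score, so the last items are the best ones. The sketch below shows the same selection in isolation. The helper name `top_phrases` and the sample scores are made up for illustration.

```python
def top_phrases(scored, n=1):
    """Return the n highest-scoring (phrase, score) pairs, lowest score first."""
    if n <= 0:
        # Guard: a slice of [-0:] would return the whole list, not an empty one.
        return []
    ranked = sorted(scored, key=lambda pair: pair[1])  # ascending, like Sort_Tuple
    return ranked[-n:]


# Illustrative scores only
sample = [("food", 1.0), ("wonderful staff", 4.0), ("delicious", 1.0)]

print(top_phrases(sample, 1))  # same effect as [-1:]
print(top_phrases(sample, 5))  # same effect as [-5:]
```

Asking for 5 phrases from a text that only has 3 candidates simply returns all 3, which is why the article says “up to 5”.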
 

Once done with the customization, you can apply the script and expand the “Rake_Final_Output” dataset. You can Save and Close the Power Query Editor to apply the script. This is what your dataset looks like after the new fields for key-phrases and their scores are added.

Power BI Dataset with Key-phrases and Scores

Step 3: Power BI Integration and Visualization

To visualize the key-phrases, I recommend using a Word Cloud ☁️ together with tables, preferably with sentiment analysis 😃, so you can relate the key-phrases to positive, neutral and negative sentiments.

You can download a sample Power BI template that integrates sentiment analysis and key-phrase extraction, all packed together in one Power BI file.

As you can see in the below example, we have “Top 10 Key phrases with negative sentiments” where phrases like “Slower Connections” and “restart 10 times” are directly correlated to negative sentiments 😢

Word cloud with correlation of Negative Sentiments

Similarly, we have “Top 10 Key phrases with positive sentiments” where phrases like “explained neatly” and “great in depth knowledge” are directly correlated to positive sentiments 😃.

 
Word Cloud with correlation of Positive Sentiments
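
Outside Power BI, a “Top 10 key phrases per sentiment” ranking like the ones above could be approximated by counting phrases within each sentiment label. The sketch below assumes every row already carries a sentiment label (for example from a separate sentiment analysis step). The rows reuse phrases quoted in this article but are otherwise made up, and the helper name `top_phrases_by_sentiment` is hypothetical.

```python
from collections import Counter


def top_phrases_by_sentiment(rows, sentiment, limit=10):
    """Count key phrases for one sentiment label and return the most frequent."""
    counts = Counter(
        phrase.lower() for phrase, label in rows if label == sentiment
    )
    return counts.most_common(limit)


# Hypothetical (key phrase, sentiment) rows
rows = [
    ("Slower Connections", "negative"),
    ("restart 10 times", "negative"),
    ("slower connections", "negative"),
    ("explained neatly", "positive"),
    ("great in depth knowledge", "positive"),
]

print(top_phrases_by_sentiment(rows, "negative"))
# prints [('slower connections', 2), ('restart 10 times', 1)]
```

The counts play the same role as the word sizes in the word cloud: the more often a phrase shows up under a sentiment, the more prominent it becomes.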

Conclusion

You could use other datasets and customize the code to see what suits your use case best! 👍

Came across a different approach for key-phrase extraction? Please drop it in the comments!

References

[2] https://towardsdatascience.com/analyzing-and-visualizing-sentiments-from-unstructured-raw-data-c263ba96cc2c

[3] Data source: prepared manually by the Author

 
