Key Phrases and Entities Detection with AWS Comprehend and Python

I have already written a little about AWS Comprehend and how I'm using it to detect sentiment in text. Today I would like to show you a different example of AWS Comprehend usage: detection of key phrases and entities.

Key Phrase Extraction and Entity Recognition

It would be great to group various text entries into sets. We could try to categorize them manually, but given that we have a lot of data to process, this may take a while. With entity recognition, we obtain the set of words that are considered entities in the text: dates, locations, names, organizations, and so on. This is exactly the kind of data you can use to categorize your text entries.

Key phrase extraction is a little trickier: you receive a list of phrases that are considered important in the given text. For each phrase, the API returns its text, a confidence score, and its location in the input (begin and end offsets). This can be a good starting point for categorizing your texts by the words used in these phrases, since the full phrases extracted from different pieces of text tend to vary, which makes exact matching on whole phrases unreliable.
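
To give you an idea of what both calls return, here are the shapes of the two responses. The values are made up, but the field names match the boto3 documentation:

#detect_key_phrases response (values are made up)
{'KeyPhrases': [
    {'Text': 'the new office', 'Score': 0.98, 'BeginOffset': 10, 'EndOffset': 24}
]}

#detect_entities response (values are made up)
{'Entities': [
    {'Text': 'Seattle', 'Type': 'LOCATION', 'Score': 0.99, 'BeginOffset': 31, 'EndOffset': 38}
]}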

The sample code

Again, I used Python to implement a basic script that retrieves the data from the database. The methods provided by Comprehend return objects containing all the information we need. I decided to store both key phrases and entities in the same database table (a possible schema is sketched below), but you can use separate tables instead. I simply wanted to run it on live data and see what I can achieve.
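
The exact table layout is up to you; the sketch below creates a keywordsvalues table whose columns match the INSERT query used later in the script. The column types and sizes are my assumptions, so adjust them to your data:

import mysql.connector
import dbconfig

#connect using the same dbconfig module as the main script
cnx = mysql.connector.connect(user=dbconfig.DATABASE['user'],
                              password=dbconfig.DATABASE['password'],
                              host=dbconfig.DATABASE['host'],
                              database=dbconfig.DATABASE['dbname'])
cursor = cnx.cursor()
#column types below are assumptions, not the exact schema from this article
cursor.execute("CREATE TABLE IF NOT EXISTS keywordsvalues ("
               "commentid INT NOT NULL, "
               "text TEXT, "
               "score DOUBLE, "
               "beginoffset INT, "
               "endoffset INT, "
               "type VARCHAR(64))")
cnx.commit()
cursor.close()
cnx.close()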

Please note that you should execute the script from an EC2 instance that is allowed to use the Comprehend service (configured by a proper IAM role attached to the instance), or adjust my script to provide IAM credentials when connecting to AWS (one option is sketched below). To work with the AWS API you also have to install and import boto3, the AWS SDK for Python.
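
If you run the script outside of EC2, one option is to pass credentials explicitly when creating the client. This is only a sketch with placeholder keys; a shared credentials file or environment variables are a better choice than keys in code:

import boto3

#placeholder credentials - prefer ~/.aws/credentials or environment variables
comprehend = boto3.client(service_name='comprehend',
                          region_name='us-east-1',
                          aws_access_key_id='YOUR_ACCESS_KEY_ID',
                          aws_secret_access_key='YOUR_SECRET_ACCESS_KEY')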

In my script below, I'm connecting to a MySQL database, but you can use any source of text for the analysis.

import boto3
import mysql.connector
import dbconfig
from pprint import pprint

#initialize the Comprehend client
comprehend = boto3.client(service_name='comprehend', region_name='us-east-1')

#database connection
cnx = mysql.connector.connect(user=dbconfig.DATABASE['user'], 
                              password=dbconfig.DATABASE['password'],
                              host=dbconfig.DATABASE['host'], 
                              database=dbconfig.DATABASE['dbname'])

cursor = cnx.cursor()
insCursor = cnx.cursor()

#retrieve the data
query = ("SELECT id, comments FROM commentsTable "
         "WHERE comments != '' AND comments IS NOT NULL ")

cursor.execute(query)
receivedData = list(cursor)
cursor.close()

#prepare insert query
insQueryValues = ("INSERT INTO keywordsvalues(commentid, text, score, beginoffset, endoffset, type) "
                  "VALUES(%(commentid)s, %(Text)s, %(Score)s, %(BeginOffset)s, %(EndOffset)s, %(Type)s)")

#actual analysis in the loop
for (comment_id, comments) in receivedData:
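  #note: both detect calls accept up to 5,000 bytes of UTF-8 text per request (limit at the time of writing)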
  kpData = comprehend.detect_key_phrases(Text=comments, LanguageCode='en')
  enData = comprehend.detect_entities(Text=comments, LanguageCode='en')
  
  qdata = {
    'commentid': comment_id,
    'Text': "",
    'Score': 0,
    'BeginOffset': 0,
    'EndOffset': 0,
    'Type': "",
  }
  #keyphrases data preparation and handling
  for keyphrase in kpData['KeyPhrases']:
    qdata['Type'] = "KEYPHRASE"
    qdata['Text'] = keyphrase['Text']
    qdata['Score'] = keyphrase['Score']
    qdata['BeginOffset'] = keyphrase['BeginOffset']
    qdata['EndOffset'] = keyphrase['EndOffset']
    pprint(qdata)
    insCursor.execute(insQueryValues, qdata)
  #entity data preparation and handling
  for entity in enData['Entities']:
    qdata['Type'] = entity['Type']
    qdata['Text'] = entity['Text']
    qdata['Score'] = entity['Score']
    qdata['BeginOffset'] = entity['BeginOffset']
    qdata['EndOffset'] = entity['EndOffset']
    pprint(qdata)
    insCursor.execute(insQueryValues, qdata)

#cleanup
cnx.commit()
insCursor.close()
cnx.close()

Please note that you have to call two separate methods to perform both operations. This also means that you can run only one of them if you need to extract just the entities or review just the key phrases. The script is not efficient for large amounts of data, since the entries are processed one by one. If you want to handle more texts per request, you can use the batch methods, batch_detect_entities and batch_detect_key_phrases, as sketched below.
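
A minimal sketch of such batching is shown below. It reuses the comprehend client, the pprint import, and the receivedData list from the script above. The batch operations accept at most 25 documents per request, so the data has to be chunked; batch_detect_entities works the same way, just with an Entities list in each result:

BATCH_SIZE = 25  #Comprehend batch operations accept up to 25 documents per request

for start in range(0, len(receivedData), BATCH_SIZE):
  chunk = receivedData[start:start + BATCH_SIZE]
  texts = [comments for (comment_id, comments) in chunk]

  kpBatch = comprehend.batch_detect_key_phrases(TextList=texts, LanguageCode='en')

  #each result carries an Index pointing back into the submitted TextList
  for result in kpBatch['ResultList']:
    comment_id = chunk[result['Index']][0]
    for keyphrase in result['KeyPhrases']:
      pprint((comment_id, keyphrase['Text'], keyphrase['Score']))

  #documents that could not be processed are reported separately
  for error in kpBatch['ErrorList']:
    print(error['Index'], error['ErrorCode'], error['ErrorMessage'])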