Clinical text mining using the Amazon Comprehend Medical new SNOMED CT API

Mining medical concepts from written clinical text, such as patient encounters, plays an important role in clinical analytics and decision-making applications, such as population analytics for providers, pre-authorization for payers, and adverse-event detection for pharma companies. Medical concepts include medical conditions, medications, procedures, and other clinical events. Extracting medical concepts is a complicated process due to the specialist knowledge required and the broad use of synonyms in the medical field. Furthermore, to make detected concepts useful for large-scale analytics and decision-making applications, they have to be codified: a specialist looks up matching codes from a medical ontology, which often contains tens to hundreds of thousands of concepts.

To solve these problems, Amazon Comprehend Medical provides a fast and accurate way to automatically extract medical concepts from the written text found in clinical documents. You can now also use a new feature to automatically standardize and link detected concepts to the SNOMED CT (Systematized Nomenclature of Medicine—Clinical Terms) ontology. SNOMED CT provides a comprehensive clinical healthcare terminology and accompanying clinical hierarchy, and is used to encode medical conditions, procedures, and other medical concepts to enable big data applications.

This post details how to use the new SNOMED CT API to link SNOMED CT codes to medical concepts (or entities) in natural written text that can then be used to accelerate research and clinical application building. After reading this post, you will be able to detect and extract medical terms from unstructured clinical text, map them to the SNOMED CT ontology (US edition), retrieve and manipulate information from a clinical database, including electronic health record (EHR) systems, and map SNOMED CT concepts to other ontologies using the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) if your EHR system uses an ontology other than SNOMED CT.

Solution overview

Amazon Comprehend Medical is a HIPAA-eligible natural language processing (NLP) service that uses machine learning (ML) to extract clinical data from unstructured medical text, with no ML experience required, and automatically map it to the SNOMED CT, ICD10, or RxNorm ontologies with a simple API call. You can then add the ontology codes to your EHR database to augment patient data or link to other ontologies as desired through OMOP CDM. For this post, we demonstrate the solution workflow as shown in the following diagram with code based on the example sentence “Patient X was diagnosed with insomnia.”

To use clinical concept codes based on a text input, we detect and extract clinical terms, connect to the clinical database, transform the SNOMED CT code to an OMOP CDM concept ID, and use it within our records.

For this post, we use the OMOP CDM as an example database schema. Historically, healthcare institutions in different regions and countries have used their own terminologies and classifications for their own purposes, which prevents interoperability between systems. While SNOMED CT standardizes medical concepts with a clinical hierarchy, the OMOP CDM provides a standardization mechanism to move from one ontology to another, with an accompanying data model. The OMOP CDM standardizes the format and content of observational data so that standardized applications, tools, and methods can be applied across different datasets. In addition, the OMOP CDM makes it easier to convert codes from one vocabulary to another by providing maps between medical concepts in different hierarchical ontologies and vocabularies. The ontology hierarchy is set such that descendants are more specific than ancestors. For example, non-small cell lung cancer is a descendant of malignant neoplastic disease. This allows querying and retrieving concepts and all their hierarchical descendants, and also enables interoperability between ontologies.

We demonstrate implementing this solution with the following steps:

  1. Extract concepts with Amazon Comprehend Medical SNOMED CT and link them to the SNOMED CT (US edition) ontology.
  2. Connect to the OMOP CDM.
  3. Map the SNOMED CT code to OMOP CDM concept IDs.
  4. Use the structured information to perform the following actions:
    1. Retrieve the number of patients with the disease.
    2. Traverse the ontology.
    3. Map to other ontologies.

Prerequisites

Before you get started, make sure you have the following:

  • Access to an AWS account.
  • Permissions to create an AWS CloudFormation stack.
  • Permissions to call Amazon Comprehend Medical from Amazon SageMaker.
  • Permissions to query Amazon Redshift from SageMaker.
  • The SNOMED CT license. SNOMED International is a strong member-owned and driven organization with free use of SNOMED CT within the member’s territory. Members manage the release, distribution, and sub-licensing of SNOMED CT and other products of the association within their territory.

This post assumes that you have an OMOP CDM database set up in Amazon Redshift. See Create data science environments on AWS for health analysis using OHDSI to set up a sample OMOP CDM in your AWS account using CloudFormation templates.

Extract concepts with Amazon Comprehend Medical SNOMED CT

You can extract SNOMED CT codes using Amazon Comprehend Medical with two lines of code. Assume you have a document, paragraph, or sentence:

clinical_note = "Patient X was diagnosed with insomnia."

First, we instantiate the Amazon Comprehend Medical client in boto3. Then, we simply call Amazon Comprehend Medical’s SNOMED CT API:

import boto3
cm_client = boto3.client("comprehendmedical")
response = cm_client.infer_snomedct(Text=clinical_note)

Done! In our example, the response is as follows:

{'Characters': {'OriginalTextCharacters': 38},
 'Entities': [{'Attributes': [],
               'BeginOffset': 29,
               'Category': 'MEDICAL_CONDITION',
               'EndOffset': 37,
               'Id': 0,
               'SNOMEDCTConcepts': [{'Code': '193462001',
                                     'Description': 'Insomnia (disorder)',
                                     'Score': 0.7997841238975525},
                                    {'Code': '191997003',
                                     'Description': 'Persistent insomnia '
                                                    '(disorder)',
                                     'Score': 0.6464713215827942},
                                    {'Code': '762348004',
                                     'Description': 'Acute insomnia (disorder)',
                                     'Score': 0.6253700256347656},
                                    {'Code': '59050008',
                                     'Description': 'Initial insomnia '
                                                    '(disorder)',
                                     'Score': 0.6112624406814575},
                                    {'Code': '24121004',
                                     'Description': 'Insomnia disorder related '
                                                    'to another mental '
                                                    'disorder (disorder)',
                                     'Score': 0.6014388203620911}],
               'Score': 0.9989109039306641,
               'Text': 'insomnia',
               'Traits': [{'Name': 'DIAGNOSIS', 'Score': 0.7624053359031677}],
               'Type': 'DX_NAME'}],
 'ModelVersion': '0.0.1',
 'ResponseMetadata': {'HTTPHeaders': {'content-length': '873',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Mon, 20 Sep 2021 18:32:04 GMT',
                                      'x-amzn-requestid': 'e9188a79-3884-4d3e-b73e-4f63ed831b0b'},
                      'HTTPStatusCode': 200,
                      'RequestId': 'e9188a79-3884-4d3e-b73e-4f63ed831b0b',
                      'RetryAttempts': 0},
 'SNOMEDCTDetails': {'Edition': 'US',
                     'Language': 'en',
                     'VersionDate': '20200901'}}

The response contains the following:

  • Characters – Total number of characters in the input text. In this case, we have 38 characters.
  • Entities – List of detected medical concepts, or entities, from Amazon Comprehend Medical. The main elements in each entity are:

    • Text – Original text from the input data.
    • BeginOffset and EndOffset – The beginning and ending location of the text in the input note, respectively.
    • Category – Category of the detected entity. For example, MEDICAL_CONDITION for medical condition.
    • SNOMEDCTConcepts – Top five predicted SNOMED CT concept codes with the model’s confidence scores (in descending order). Each linked concept code has the following:

      • Code – SNOMED CT concept code.
      • Description – SNOMED CT concept description.
      • Score – Confidence score of the linked SNOMED CT concept.
  • ModelVersion – Version of the model used for the inference.
  • ResponseMetadata – API call metadata.
  • SNOMEDCTDetails – Edition, language, and date of the SNOMED CT version used.

For more information, refer to the Amazon Comprehend Medical Developer Guide. By default, the API links detected entities to the SNOMED CT US edition. To request support for your edition, for example the UK edition, contact us via AWS Support or the Amazon Comprehend Medical forum.
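
As a minimal sketch of how the fields described above might be consumed, the following helper keeps the highest-scoring SNOMED CT concept for each detected entity and drops links below a confidence cutoff. The function name top_snomed_concepts and the min_score threshold are illustrative assumptions, not part of the Amazon Comprehend Medical API; the code only reads keys that appear in the example response.

```python
def top_snomed_concepts(response, min_score=0.5):
    """Return the best SNOMED CT link for each detected entity.

    `response` is a dict shaped like the example response shown earlier.
    `min_score` is an illustrative cutoff (an assumption), not an API default.
    """
    results = []
    for entity in response.get("Entities", []):
        concepts = entity.get("SNOMEDCTConcepts", [])
        if not concepts:
            continue  # entity detected but not linked to any concept
        # Pick the highest-scoring candidate explicitly instead of relying on order.
        best = max(concepts, key=lambda c: c["Score"])
        if best["Score"] >= min_score:
            results.append(
                {
                    "text": entity["Text"],
                    "category": entity["Category"],
                    "code": best["Code"],
                    "description": best["Description"],
                    "score": best["Score"],
                }
            )
    return results


# Trimmed copy of the example response, kept small for illustration.
sample_response = {
    "Entities": [
        {
            "Text": "insomnia",
            "Category": "MEDICAL_CONDITION",
            "SNOMEDCTConcepts": [
                {"Code": "193462001", "Description": "Insomnia (disorder)", "Score": 0.7997841238975525},
                {"Code": "191997003", "Description": "Persistent insomnia (disorder)", "Score": 0.6464713215827942},
            ],
        }
    ]
}

for link in top_snomed_concepts(sample_response):
    print(link["text"], link["code"], link["description"])
```

Because the helper skips entities without linked concepts and returns an empty list when nothing is detected, it avoids the IndexError that direct indexing into an empty Entities list would raise.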

In our example, Amazon Comprehend Medical identifies “insomnia” as a clinical term and provides five SNOMED CT concepts and codes, ordered by confidence, that we might be referring to in the sentence. In this example, Amazon Comprehend Medical ranks the correct concept, Insomnia (disorder), as the most likely option. Therefore, the next step is to extract the top prediction from the response. See the following code:

#Get top predicted SNOMED CT Concept
pred_snomed = response['Entities'][0]['SNOMEDCTConcepts'][0]

The content of pred_snomed is as follows, with its predicted SNOMED concept code, concept description, and prediction score (probability):

{
 'Description': 'Insomnia (disorder)',
 'Code': '193462001',
 'Score': 0.7997841238975525
}

We have identified clinical terms in our text and linked them to SNOMED CT concepts. We can now use SNOMED CT’s hierarchical structure and relations to other ontologies to accelerate clinical analytics and decision-making application development.

Before we access the database, let’s define some utility functions that are helpful in our operations. First, we must import the necessary Python packages:

import pandas as pd
import psycopg2

The following code is a function to connect to the Amazon Redshift database:

def connect_to_db(redshift_parameters, user, password):
    """Connect to the database and return the connection.
    Args:
        redshift_parameters (dict): Redshift connection parameters.
        user (str): Redshift user required to connect.
        password (str): Password associated with the user.
    Returns:
        psycopg2 connection: Open connection to the Redshift database.
    """

    try:
        conn = psycopg2.connect(
            host=redshift_parameters["url"],
            port=redshift_parameters["port"],
            user=user,
            password=password,
            database=redshift_parameters["database"],
        )

        return conn

    except psycopg2.Error as err:
        # Chain the driver error so the root cause stays visible.
        raise ValueError("Failed to open database connection.") from err

The following code is a function to run a given query on the Amazon Redshift database:

def execute_query(cursor, query, limit=None):
    """Execute a query and return the results as a data frame.
    Args:
        cursor (psycopg2 cursor): Cursor created from an open Redshift connection.
        query (str): SQL query.
        limit (int): Maximum number of rows to return. Defaults to None for no limit.
    Returns:
        pd.DataFrame: Data frame with the query results, or None if the query fails.
    """
    try:
        cursor.execute(query)
    except psycopg2.Error as err:
        # Report the failure and reset the aborted transaction so later queries can run.
        print(f"Query failed: {err}")
        cursor.connection.rollback()
        return None

    columns = [c.name for c in cursor.description]
    results = cursor.fetchall()
    if limit:
        results = results[:limit]

    out = pd.DataFrame(results, columns=columns)

    return out

In the next sections, we connect to the database and run our queries.
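
The queries in the following sections interpolate values directly into the SQL text with f-strings. As a hedged alternative, values can be passed separately so that the database driver handles quoting. The sketch below assumes a DB-API 2.0 cursor; run_parameterized is a hypothetical helper name, and the demo uses Python’s built-in sqlite3 module (placeholder ?) with a toy table only so that it runs without a database. With psycopg2, the placeholder would be %s instead.

```python
import sqlite3


def run_parameterized(cursor, query, params=()):
    """Run a query with bound parameters and return rows as a list of dicts."""
    cursor.execute(query, params)
    # In the DB-API, the first item of each description entry is the column name.
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Toy in-memory table standing in for the OMOP concept table (illustration only).
demo_conn = sqlite3.connect(":memory:")
demo_cursor = demo_conn.cursor()
demo_cursor.execute(
    "CREATE TABLE concept (concept_id INTEGER, vocabulary_id TEXT, concept_code TEXT)"
)
demo_cursor.execute("INSERT INTO concept VALUES (436962, 'SNOMED', '193462001')")

rows = run_parameterized(
    demo_cursor,
    "SELECT DISTINCT concept_id FROM concept WHERE vocabulary_id = ? AND concept_code = ?",
    ("SNOMED", "193462001"),
)
print(rows)
demo_conn.close()
```

Binding parameters this way matters most when the code or concept ID comes from user input or from free text rather than from a trusted API response.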

Connect to the OMOP CDM

EHRs are often stored in databases using a specific ontology. In our case, we use the OMOP CDM, which contains a large number of ontologies (SNOMED, ICD10, RxNorm, and more), but you can extend the solution to other data models by modifying the queries. The first step is to connect to Amazon Redshift where the EHR data is stored.

Let’s define the variables used to connect to the database. You must substitute the placeholder values in the following code with the actual values for your Amazon Redshift database:

#Connect to Amazon Redshift Database
REDSHIFT_PARAMS = {
                    "url": "<database-url>", 
                    "port": "<database-port>",
                    "database": "<database-name>",
                  }
REDSHIFT_USER = "<user-name>"
REDSHIFT_PASSWORD = "<user-password>"

conn = connect_to_db(REDSHIFT_PARAMS, REDSHIFT_USER, REDSHIFT_PASSWORD)
cursor = conn.cursor()
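
The preceding snippet keeps the user name and password in the notebook itself. As one possible alternative, the settings could be read from environment variables. This is a minimal sketch: load_redshift_settings and the variable names (REDSHIFT_HOST and so on) are hypothetical, and the demo values are placeholders rather than a real cluster.

```python
import os


def load_redshift_settings(env=None):
    """Build connection settings from environment variables.

    The variable names below are hypothetical; use whatever naming
    your deployment defines.
    """
    env = os.environ if env is None else env
    required = [
        "REDSHIFT_HOST",
        "REDSHIFT_PORT",
        "REDSHIFT_DB",
        "REDSHIFT_USER",
        "REDSHIFT_PASSWORD",
    ]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    # Same keys that connect_to_db expects in its redshift_parameters argument.
    params = {
        "url": env["REDSHIFT_HOST"],
        "port": int(env["REDSHIFT_PORT"]),
        "database": env["REDSHIFT_DB"],
    }
    return params, env["REDSHIFT_USER"], env["REDSHIFT_PASSWORD"]


# Demo with a plain dict standing in for os.environ (placeholder values).
fake_env = {
    "REDSHIFT_HOST": "example-host",
    "REDSHIFT_PORT": "5439",
    "REDSHIFT_DB": "example-db",
    "REDSHIFT_USER": "example-user",
    "REDSHIFT_PASSWORD": "example-password",
}
params, user, password = load_redshift_settings(fake_env)
print(params)
```

The returned tuple can be passed straight to connect_to_db(params, user, password).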

Map the SNOMED CT code to OMOP CDM concept IDs

The OMOP CDM uses its own concept IDs as data model identifiers across ontologies. Those differ from specific ontology codes such as SNOMED CT’s codes, but you can retrieve them from SNOMED CT codes using pre-built OMOP CDM maps. To retrieve the concept_id of SNOMED CT code 193462001, we use the following query:

query1 = f"""
SELECT DISTINCT concept_id 
FROM cmsdesynpuf23m.concept 
WHERE vocabulary_id='SNOMED' AND concept_code='{pred_snomed['Code']}';
"""

out_df = execute_query(cursor, query1)
concept_id = out_df['concept_id'][0]
print(concept_id)

The output OMOP CDM concept_id is 436962. The concept ID uniquely identifies a given medical concept in the OMOP CDM database and is used as a primary key in the concept table. This enables linking of each code with patient information in other tables.

Use the structured information mapped from the SNOMED CT code to the OMOP CDM concept ID

Now that we have OMOP’s concept_id, we can run many queries against the database. Once we have found the concept, we can use it for different use cases. For example, we can query population statistics for a given condition, traverse ontologies to bridge interoperability gaps, and use the hierarchical structure of concepts to scope queries correctly. In this section, we walk you through a few examples.

Retrieve the number of patients with a disease

The first example is retrieving the total number of patients with the insomnia condition that we linked to its appropriate ontology concept using Amazon Comprehend Medical. The following code formulates and runs the corresponding SQL query:

query2 = f"""
SELECT COUNT(DISTINCT person_id) 
FROM cmsdesynpuf23m.condition_occurrence 
WHERE condition_concept_id='{concept_id}';
"""
out_df = execute_query(cursor, query2)
print(out_df)

In our sample records described in the prerequisites section, the total number of patients in the database who have been diagnosed with insomnia is 26,528.

Traverse the ontology

One of the advantages of using SNOMED CT is that we can exploit its hierarchical taxonomy. Let’s illustrate how via some examples.

Ancestors: Going up the hierarchy

First, let’s find the immediate ancestors and descendants of the concept insomnia. We use concept_ancestor and concept tables to get the parent (ancestor) and children (descendants) of the given concept code. The following code is the SQL statement to output the parent information:

query3 = f"""
SELECT DISTINCT concept_code, concept_name 
FROM cmsdesynpuf23m.concept 
WHERE concept_id IN (SELECT ancestor_concept_id 
FROM cmsdesynpuf23m.concept_ancestor 
WHERE descendant_concept_id='{concept_id}' AND max_levels_of_separation=1);
"""
out_df = execute_query(cursor, query3)
print(out_df)

In the preceding example, we used max_levels_of_separation=1 to limit the results to immediate ancestors. You can increase the number to go further up the hierarchy. The following table summarizes our results.

concept_code    concept_name
44186003        Dyssomnia
194437008       Disorders of initiating and maintaining sleep

SNOMED CT offers a polyhierarchical classification, which means a concept can have more than one parent. This hierarchy is also called a directed acyclic graph (DAG).
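
To make the polyhierarchy concrete, here is a small sketch that walks up a toy parent map and records how many levels separate each ancestor from the starting concept. The two parents of Insomnia come from the preceding table; the node labeled "Sleep disorder (toy root)" is made up purely to show two branches converging, and ancestors_by_level is a hypothetical helper, not an OMOP or SNOMED CT function.

```python
from collections import deque

# Toy parent map. The two parents of "Insomnia" come from the query results;
# "Sleep disorder (toy root)" is an invented node used only for illustration.
PARENTS = {
    "Insomnia": ["Dyssomnia", "Disorders of initiating and maintaining sleep"],
    "Dyssomnia": ["Sleep disorder (toy root)"],
    "Disorders of initiating and maintaining sleep": ["Sleep disorder (toy root)"],
    "Sleep disorder (toy root)": [],
}


def ancestors_by_level(concept, parents):
    """Breadth-first walk up a polyhierarchy.

    Returns {ancestor: minimum number of levels above `concept`}.
    """
    levels = {}
    queue = deque([(concept, 0)])
    while queue:
        node, depth = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in levels:  # first visit in BFS is the shortest path
                levels[parent] = depth + 1
                queue.append((parent, depth + 1))
    return levels


print(ancestors_by_level("Insomnia", PARENTS))
```

Because a concept can be reached through more than one parent, the walk tracks visited nodes so that a shared ancestor is reported once.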

Descendants: Going down the hierarchy

We can use a similar logic to retrieve the children of the code insomnia:

query4 = f"""
SELECT DISTINCT concept_code, concept_name 
FROM cmsdesynpuf23m.concept 
WHERE concept_id IN (SELECT descendant_concept_id 
FROM cmsdesynpuf23m.concept_ancestor 
WHERE ancestor_concept_id='{concept_id}' AND max_levels_of_separation=1);
"""
out_df = execute_query(cursor, query4)
print(out_df)

As a result, we get 26 descendant codes; the following table shows the first 10 rows.

concept_code      concept_name
24121004          Insomnia disorder related to another mental disorder
191997003         Persistent insomnia
198437004         Menopausal sleeplessness
88982005          Rebound insomnia
90361000119105    Behavioral insomnia of childhood
41975002          Insomnia with sleep apnea
268652009         Transient insomnia
81608000          Insomnia disorder related to known organic factor
162204000         Late insomnia
248256006         Not getting enough sleep

We can then use these codes to query a broader set of patients (parent concept) or a more specific one (child concept).

Finding the concept at the appropriate hierarchy level is important, because querying at the wrong level can give you misleading statistics. For example, in the preceding use case, let’s say that you want to find the number of patients with insomnia that is related only to not getting enough sleep. Using the parent concept for general insomnia gives you a different answer than specifying only the descendant concept code for not getting enough sleep.
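
The difference between the two scopes can be sketched as a small query builder. This is an illustration under assumptions, not the post’s original code: count_patients_query is a hypothetical helper, and it uses only the condition_occurrence and concept_ancestor tables and columns that appear in the queries above. The broad variant counts patients recorded with the concept itself or with any of its descendants at any depth.

```python
def count_patients_query(concept_id, include_descendants=True, schema="cmsdesynpuf23m"):
    """Build SQL that counts distinct patients for a condition concept.

    With include_descendants=True, the concept_ancestor table expands the
    concept to every more specific concept beneath it.
    `schema` should come from trusted configuration, not user input.
    """
    concept_id = int(concept_id)  # reject anything that is not an integer ID
    narrow = (
        f"SELECT COUNT(DISTINCT person_id) AS n_patients "
        f"FROM {schema}.condition_occurrence "
        f"WHERE condition_concept_id = {concept_id}"
    )
    if not include_descendants:
        return narrow + ";"
    # Keep the concept itself and add all of its descendants.
    return (
        narrow
        + f" OR condition_concept_id IN ("
        f"SELECT descendant_concept_id FROM {schema}.concept_ancestor "
        f"WHERE ancestor_concept_id = {concept_id});"
    )


print(count_patients_query(436962, include_descendants=False))
print(count_patients_query(436962))
```

Either string can be passed to execute_query(cursor, ...). Comparing the two counts for the same concept ID shows how much of the population is recorded under more specific codes.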

Map to other ontologies

We can also map the SNOMED concept code to other ontologies such as ICD10CM for conditions and RxNorm for medications. Because insomnia is a condition, let’s find the ICD10 concept codes that correspond to the SNOMED concept code for insomnia. The following code is the SQL statement to find the ICD10 concept codes:

query5 = f"""
SELECT DISTINCT concept_code, concept_name, vocabulary_id 
FROM cmsdesynpuf23m.concept 
WHERE vocabulary_id='ICD10CM' AND 
concept_id IN (SELECT concept_id_2 
FROM cmsdesynpuf23m.concept_relationship 
WHERE concept_id_1='{concept_id}' AND relationship_id='Mapped from');
"""
out_df = execute_query(cursor, query5)
print(out_df)

The following table lists the corresponding ICD10 concept codes with their descriptions.

concept_code    concept_name             vocabulary_id
G47.0           Insomnia                 ICD10CM
G47.00          Insomnia, unspecified    ICD10CM
G47.09          Other insomnia           ICD10CM

When we’re done running SQL queries, let’s close the connection to the database:

conn.close()
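
If a query raises an exception before this call runs, the connection stays open. One hedged option is to wrap the connection in contextlib.closing so that it is closed on exit in either case. The sketch below uses Python’s built-in sqlite3 module as a stand-in so that it runs anywhere; run_with_connection is a hypothetical helper, and in this post’s setup the connect argument would be a function that calls connect_to_db.

```python
import sqlite3
from contextlib import closing


def run_with_connection(connect, work):
    """Open a connection, run `work(cursor)`, and always close the connection."""
    with closing(connect()) as conn:
        with closing(conn.cursor()) as cursor:
            return work(cursor)


def demo_work(cursor):
    """Trivial query used only to exercise the helper."""
    cursor.execute("SELECT 1")
    return cursor.fetchone()[0]


result = run_with_connection(lambda: sqlite3.connect(":memory:"), demo_work)
print(result)
```

The explicit closing wrapper is used because, in psycopg2, a plain with block on a connection manages the transaction and does not close the connection.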

Conclusion

Now that you have reviewed this example, you’re ready to apply Amazon Comprehend Medical to your clinical text to extract and link SNOMED CT concepts. We also provided concrete examples of how to use this information with your medical records, using an OMOP CDM database to run SQL queries and get patient information related to the medical concepts. Finally, we showed how to traverse the hierarchy of medical concepts and convert SNOMED CT concepts to other standardized vocabularies such as ICD10CM.

The Amazon ML Solutions Lab pairs your team with ML experts to help you identify and implement your organization’s highest value ML opportunities. If you’d like help accelerating your use of ML in your products and processes, please contact the Amazon ML Solutions Lab.


About the Authors

Tesfagabir Meharizghi is a Data Scientist at the Amazon ML Solutions Lab where he helps customers across different industries accelerate their use of machine learning and AWS Cloud services to solve their business challenges.

Miguel Romero Calvo is an Applied Scientist at the Amazon ML Solutions Lab where he partners with AWS internal teams and strategic customers to accelerate their business through ML and cloud adoption.

Lin Lee Cheong is a Senior Scientist and Manager with the Amazon ML Solutions Lab team at Amazon Web Services. She works with strategic AWS customers to explore and apply artificial intelligence and machine learning to discover new insights and solve complex problems.
