Amazon Textract is a machine learning (ML) service that automatically extracts printed text, handwriting, and other data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify and extract data from forms and tables.
Currently, thousands of customers are using Amazon Textract to process different types of documents. Many of these documents, such as bank statements and financial reports, include tables that span one or multiple pages.
Many developers expressed interest in merging Amazon Textract responses where tables exist across multiple pages. This post demonstrates how you can use the amazon-textract-response-parser utility to accomplish this and highlights a few tricks to optimize the process.
Solution overview
When tables span multiple pages, a series of steps and validations are required to determine the linkage across pages correctly.
These include analyzing the table structure similarities across pages (columns, headers, margins) and determining if any additional contents like headers or footers exist that may logically break the tables. These logical steps are separated into two major groups (page context and table structure), and you can adjust and optimize each logical step according to your use case.
This solution runs these tasks in series and only merges the results when all checks are completed and passed. The following diagram shows the solution workflow.
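As a rough illustration of running the checks in series and merging only when every check passes, consider the following sketch. It works on plain dictionaries rather than on Amazon Textract responses, and the field names (columns, lines_before) and function names are hypothetical; they aren't part of the response parser library.

```python
from typing import Callable, Dict, List

# A check takes two table descriptions and returns True when they
# look like parts of the same logical table.
Check = Callable[[Dict, Dict], bool]


def no_text_between(first: Dict, second: Dict) -> bool:
    # Hypothetical field: number of text lines found between the two
    # tables, outside the header and footer areas.
    return second["lines_before"] == 0


def same_column_count(first: Dict, second: Dict) -> bool:
    return first["columns"] == second["columns"]


def should_merge(first: Dict, second: Dict, checks: List[Check]) -> bool:
    # all() short-circuits, so checks run in order and stop at the first failure.
    return all(check(first, second) for check in checks)


page_1_table = {"columns": 4, "lines_before": 0}
page_2_table = {"columns": 4, "lines_before": 0}
page_3_table = {"columns": 3, "lines_before": 2}

checks = [no_text_between, same_column_count]
print(should_merge(page_1_table, page_2_table, checks))  # True
print(should_merge(page_2_table, page_3_table, checks))  # False
```

Keeping each check as a separate function makes it straightforward to adjust, reorder, or replace a single step for your use case.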
Implement the solution
To get started, you must install the amazon-textract-response-parser and amazon-textract-helper libraries. The Amazon Textract response parser library enables us to easily parse the Amazon Textract JSON response and provides constructs to work with different parts of the document effectively. This post focuses on the merge/link tables feature. The amazon-textract-helper library is another useful library that provides a collection of ready-to-use functions and sample implementations to speed up the evaluation and development of any project using Amazon Textract.
- Install the libraries with the following code:
!pip install amazon-textract-response-parser
!pip install amazon-textract-helper
- The postprocessing step to identify related tables and merge them is part of the trp.trp2 library, which you must import into your notebook:
import trp.trp2 as t2
from trp.t_pipeline import pipeline_merge_tables
from textractcaller.t_call import call_textract, Textract_Features
from trp.trp2 import TDocument, TDocumentSchema
from trp.t_tables import MergeOptions, HeaderFooterType
- Next, call Amazon Textract to process the document:
textract_json = call_textract(input_document=s3_uri_of_documents, features=[Textract_Features.TABLES], boto3_textract_client = textract_client)
- Finally, load the response JSON into a document and run the pipeline. The footer and header heights are configurable by the user. Three default values can be used for HeaderFooterType: NONE, NARROW, and NORMAL.
t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)
t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, None, HeaderFooterType.NONE)
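To make the idea of header and footer heights more concrete, the following sketch filters out text lines that fall inside a header or footer zone. It relies on the fact that Amazon Textract bounding boxes are expressed as ratios of the page dimensions. The zone heights and the helper name lines_in_body are assumptions for illustration only; the values the library uses for each HeaderFooterType may differ.

```python
from typing import Dict, List

# Hypothetical zone heights, as fractions of the page height.
# The library's actual values for each HeaderFooterType may differ.
ZONE_HEIGHT = {"NONE": 0.0, "NARROW": 0.05, "NORMAL": 0.10}


def lines_in_body(lines: List[Dict], header_footer_type: str) -> List[Dict]:
    """Keep only the lines that sit outside the header and footer zones."""
    zone = ZONE_HEIGHT[header_footer_type]
    body = []
    for line in lines:
        top = line["top"]
        bottom = line["top"] + line["height"]
        in_header = bottom <= zone       # line ends inside the top zone
        in_footer = top >= 1.0 - zone    # line starts inside the bottom zone
        if not (in_header or in_footer):
            body.append(line)
    return body


lines = [
    {"text": "Statement header", "top": 0.02, "height": 0.02},
    {"text": "Opening balance", "top": 0.30, "height": 0.02},
    {"text": "Page 1 of 2", "top": 0.95, "height": 0.02},
]
print([line["text"] for line in lines_in_body(lines, "NORMAL")])  # ['Opening balance']
print(len(lines_in_body(lines, "NONE")))  # 3
```

With a zone height of zero, no line is treated as a header or footer, so any text between two tables counts as content that separates them.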
pipeline_merge_tables takes a merge option parameter that can be either MergeOptions.MERGE or MergeOptions.LINK.
MergeOptions.MERGE combines the tables and makes them appear as one for postprocessing, with the drawback that the geometry information is no longer in the correct location, because cells and tables from subsequent pages are moved to the page that holds the first part of the table.
MergeOptions.LINK maintains the geometric structure and enriches the table information with links between the table elements. The custom['previous_table'] and custom['next_table'] attributes are added to the TABLE blocks in the Amazon Textract JSON schema.
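The following sketch illustrates the linking idea on plain dictionaries. It is not the library's implementation: the function name is hypothetical, and the dictionary key custom simply mirrors the attribute name used in the text, so the key in the serialized JSON may be spelled differently.

```python
from typing import Dict, List


def link_table_blocks(blocks: List[Dict], merge_groups: List[List[str]]) -> None:
    """Add previous/next pointers to TABLE blocks without moving any cells."""
    tables_by_id = {b["Id"]: b for b in blocks if b["BlockType"] == "TABLE"}
    for group in merge_groups:
        # Walk through each pair of neighboring tables in the group.
        for previous_id, next_id in zip(group, group[1:]):
            tables_by_id[previous_id].setdefault("custom", {})["next_table"] = next_id
            tables_by_id[next_id].setdefault("custom", {})["previous_table"] = previous_id


blocks = [
    {"Id": "table-page-1", "BlockType": "TABLE", "Page": 1},
    {"Id": "table-page-2", "BlockType": "TABLE", "Page": 2},
    {"Id": "line-1", "BlockType": "LINE", "Page": 1},
]
link_table_blocks(blocks, [["table-page-1", "table-page-2"]])
print(blocks[0]["custom"])  # {'next_table': 'table-page-2'}
print(blocks[1]["custom"])  # {'previous_table': 'table-page-1'}
print(blocks[1]["Page"])    # 2
```

Each table stays on its original page, so the geometry remains valid, and a consumer can follow the pointers to read the logical table from start to end.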
The following image represents a sample PDF file with a table that spans over two pages.
The following shows the Amazon Textract response without table merge postprocessing (left) and the response with table merge postprocessing (right).
Define a custom table merge validation function
The provided postprocessing API works for the majority of use cases; however, based on your specific use case, you can define a custom merge function to improve its accuracy.
This custom function is passed to the CustomTableDetectionFunction parameter of the pipeline_merge_tables function to override the existing logic of identifying the tables to merge. The following steps represent the existing logic.
- Validate context between tables. Check if there are any line items between the first and second table, except in the footer and header area. If there are any line items, the tables are considered separate.
- Compare the column numbers. If the two tables don’t have the same number of columns, this is an indicator of separate logical tables.
- Compare the headers. If the two tables have the exact same columns (the same number of cells and the same cell labels), this is a very strong indication of the same logical table.
- Compare table dimensions. Verify that the two tables have the same left and right margin. An accuracy percentage parameter can be passed to allow for some degree of error (for example, if the pages are scanned from paper, consecutive tables on different pages may have different widths).
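The table structure checks above can be sketched as follows. This is a simplified illustration, not the library's code: the TableSummary class and the function names are hypothetical, and treating the accuracy percentage as an absolute tolerance on the page-relative margins is an assumption about how such a parameter could be applied.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TableSummary:
    column_count: int
    first_row: List[str]  # text of the cells in the first row
    left: float           # left edge, as a fraction of the page width
    right: float          # right edge, as a fraction of the page width


def column_counts_match(a: TableSummary, b: TableSummary) -> bool:
    return a.column_count == b.column_count


def headers_match(a: TableSummary, b: TableSummary) -> bool:
    # Compare labels case-insensitively and ignore surrounding whitespace.
    def normalize(cells: List[str]) -> List[str]:
        return [cell.strip().lower() for cell in cells]
    return normalize(a.first_row) == normalize(b.first_row)


def margins_match(a: TableSummary, b: TableSummary,
                  accuracy_percentage: float = 99.0) -> bool:
    # Assumption: 99% accuracy allows margins to differ by 1% of the page width.
    tolerance = 1.0 - accuracy_percentage / 100.0
    return (abs(a.left - b.left) <= tolerance
            and abs(a.right - b.right) <= tolerance)


headers = ["Date", "Description", "Debit", "Credit"]
first = TableSummary(4, headers, left=0.10, right=0.90)
second = TableSummary(4, [h.lower() for h in headers], left=0.104, right=0.897)
shifted = TableSummary(4, headers, left=0.20, right=0.90)

print(column_counts_match(first, second))  # True
print(headers_match(first, second))        # True
print(margins_match(first, second))        # True
print(margins_match(first, shifted))       # False
print(margins_match(first, shifted, accuracy_percentage=85.0))  # True
```

Lowering the accuracy percentage makes the margin check more forgiving, which can help with scanned pages that are slightly shifted or skewed.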
If you have a different requirement, you can pass your own custom table detection function to the pipeline_merge_tables API as follows:
from typing import List
from trp import Document
from trp.t_pipeline import order_blocks_by_geo

def CustomTableDetectionFunction(t_document: TDocument) -> List[List[str]]:
    # Each inner list holds the IDs of the tables that form one logical table
    table_ids_merge_list: List[List[str]] = []
    ordered_doc = order_blocks_by_geo(t_document)
    trp_doc = Document(TDocumentSchema().dump(ordered_doc))
    for current_page in trp_doc.pages:
        for table in current_page.tables:
            # Provide your custom logic here to determine which table IDs should merge into one table
            # if (custom logic):
            #     table_ids_merge_list.append([tableid1, tableid2, tableid3, ...])
            pass
    return table_ids_merge_list

t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, CustomTableDetectionFunction, HeaderFooterType.NORMAL)
Our current implementation of the table detection function and the pipeline_merge_tables function in our Amazon Textract response parser library is available on GitHub. The CustomTableDetectionFunction function returns a list of lists (of strings), which is required by the merge_table or link_table functions (based on the MergeOptions parameter) called internally by the pipeline_merge_tables API.
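To illustrate the list-of-lists shape, the following sketch chains pairs of table IDs that a custom check has decided to merge into one list per logical table. The helper name group_table_ids and the sample IDs are hypothetical, and the sketch assumes the pairs arrive in document order.

```python
from typing import List, Tuple


def group_table_ids(pairs: List[Tuple[str, str]]) -> List[List[str]]:
    """Chain (previous, next) table ID pairs into one list per logical table."""
    groups: List[List[str]] = []
    for previous_id, next_id in pairs:
        if groups and groups[-1][-1] == previous_id:
            # The pair continues the most recent group.
            groups[-1].append(next_id)
        else:
            # The pair starts a new logical table.
            groups.append([previous_id, next_id])
    return groups


pairs = [("t1", "t2"), ("t2", "t3"), ("t5", "t6")]
print(group_table_ids(pairs))  # [['t1', 't2', 't3'], ['t5', 't6']]
```

Here the first inner list describes one table that spans three pages, and the second describes a separate table that spans two.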
Run sample code
The Amazon Textract multi-page tables processing repository provides sample code on how to use the merge tables feature and covers common scenarios that you may encounter in your documents. To try the sample code, first launch an Amazon SageMaker notebook instance with the code repository, and then access the notebook to review the code samples.
Launch a SageMaker notebook instance with the code repository
To launch a SageMaker notebook instance, complete the following steps:
- Choose the following link to launch an AWS CloudFormation template that deploys a SageMaker notebook instance along with the sample code repository:
- Sign in to the AWS Management Console with your AWS Identity and Access Management (IAM) user name and password.
You arrive at the Create Stack page on the Specify Template step.
- Choose Next.
- For Specify Stack Name, enter a stack name.
- Choose Next.
- Choose Next.
- On the review page, acknowledge the IAM resource creation and choose Create stack.
Access the SageMaker notebook and review the code samples
When the stack creation is complete, you can access the notebook and review the code samples.
- On the Outputs tab of the stack, choose the link corresponding to the value of the NotebookInstanceName key.
- Choose Open Jupyter.
- Go to the home page of your Jupyter notebook and browse to the amazon-textract-multipage-tables-processing directory.
- Open the Jupyter notebook inside this directory and review the sample code provided.
Conclusion
This post demonstrated how to use the Amazon Textract response parser component to identify and merge tables that span multiple pages. You walked through generic checks that you can use to identify a multi-page table, learned how to build your own custom function, and reviewed the two options to merge tables in the Amazon Textract response JSON.
If this post helps you or inspires you to solve a problem, we would love to hear about it! The code for this solution is available on the GitHub repo for you to use and extend. Contributions are always welcome!
About the Authors
Mehran Najafi, PhD, is a Senior Solutions Architect for AWS focused on AI/ML solutions and architectures at scale.
Keith Mascarenhas is a Solutions Architect and works with our small and medium-sized customers in central Canada to help them grow and achieve outcomes faster with AWS. He is also passionate about machine learning and is a member of the Amazon Computer Vision Hero program.
Yuan Jiang is a Sr Solutions Architect with a focus on machine learning. He’s a member of the Amazon Computer Vision Hero program and the Amazon Machine Learning Technical Field Community.
Martin Schade is a Senior ML Product SA with the Amazon Textract team. He has over 20 years of experience with internet-related technologies, engineering, and architecting solutions, and joined AWS in 2014. He has guided some of the largest AWS customers on the most efficient and scalable use of AWS services, and later focused on AI/ML with a focus on computer vision. He is currently obsessed with extracting information from documents.