In 1993, the Portable Document Format (PDF) was released to the world. Since then, companies across various industries have been creating, scanning, and storing large volumes of documents in this digital format. These documents and the content within them are vital to supporting your business. Yet in many cases, the content is text-heavy and often written in a different language. This limits the flow of information and can directly influence your organization’s business productivity and global expansion strategy. To address this, you need an automated solution to extract the contents of these PDFs and translate them quickly and cost-efficiently.
In this post, we show you how to create an automated and serverless content-processing pipeline for analyzing text in PDF documents using Amazon Textract and translating them with Amazon Translate.
Amazon Textract automatically extracts text and data from scanned documents. Amazon Textract goes beyond simple OCR to also identify the contents of fields in forms and information stored in tables. This allows Amazon Textract to read virtually any type of document and accurately extract text and data without needing any manual effort or custom code.
Once the text and data are extracted, you can use Amazon Translate to translate them. Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. Neural machine translation is a form of language translation automation that uses deep learning models to deliver more accurate and natural-sounding translation than traditional statistical and rule-based translation algorithms. The translation service is trained on a wide variety of content across different use cases and domains to perform well on many kinds of content. Its asynchronous batch processing capability enables you to translate a large collection of text or HTML documents with a single API call.
Solution overview
To be scalable and cost-effective, this solution uses serverless technologies and managed services. In addition to Amazon Textract and Amazon Translate, the solution uses the following services:
- Amazon Simple Storage Service (Amazon S3) – Stores your documents and allows for central management with fine-tuned access controls.
- Amazon Simple Notification Service (Amazon SNS) – Enables you to decouple microservices, distributed systems, and serverless applications with a highly available, durable, secure, fully managed pub/sub messaging service.
- AWS Lambda – Runs code in response to triggers such as changes in data, changes in application state, or user actions. Because services like Amazon S3 and Amazon SNS can directly trigger a Lambda function, you can build a variety of real-time serverless data-processing systems.
- AWS Step Functions – Coordinates multiple AWS services into serverless workflows.
Solution architecture
The architecture workflow contains the following steps:
- Users upload a PDF for analysis to Amazon S3.
- The Amazon S3 upload triggers a Lambda function.
- The function invokes Amazon Textract to extract text from the PDF in batch mode.
- Amazon Textract sends an SNS notification when the job is complete.
- A Lambda function reads the Amazon Textract response and stores the extracted text in Amazon S3.
- The Lambda function from the previous step invokes Amazon Translate in batch mode to translate the extracted texts into the target language.
- The Step Functions-based job poller polls for the translation job to complete.
- Step Functions sends an SNS notification when the translation is complete.
- A Lambda function reads the translated texts in Amazon S3 and generates a translated document in Amazon S3.
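The job-polling step in the preceding list follows a wait, check, and branch pattern. The following is a minimal Java sketch of that logic written as plain code, not the Step Functions state machine that the solution deploys. The class JobPoller, the method waitForCompletion, and the statusCheck supplier are hypothetical names used for illustration; in a real function, the supplier would wrap a call that describes the translation job. The status strings are the job statuses that Amazon Translate reports for batch jobs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.function.Supplier;

final class JobPoller {

    // Statuses after which an Amazon Translate batch job no longer changes.
    private static final Set<String> TERMINAL = new HashSet<>(Arrays.asList(
            "COMPLETED", "COMPLETED_WITH_ERROR", "FAILED", "STOPPED"));

    // Checks the job status up to maxAttempts times, sleeping waitMillis
    // between checks, and returns the first terminal status it sees.
    // statusCheck is a hypothetical hook that returns the current job status.
    static String waitForCompletion(Supplier<String> statusCheck, long waitMillis, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String status = statusCheck.get();
            if (TERMINAL.contains(status)) {
                return status;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for the job", e);
                }
            }
        }
        throw new IllegalStateException("Job still running after " + maxAttempts + " checks");
    }

    public static void main(String[] args) {
        // Scripted statuses stand in for real status lookups.
        Iterator<String> statuses = Arrays.asList("SUBMITTED", "IN_PROGRESS", "COMPLETED").iterator();
        System.out.println(waitForCompletion(statuses::next, 10L, 5));
    }
}
```

Bounding the number of checks keeps a stuck job from polling forever. A state machine would typically express the same idea with a wait state, a status check, and a choice state.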
The following diagram illustrates this architecture.
For processing documents at scale, you can expand this solution to include Amazon Simple Queue Service (Amazon SQS) to queue the jobs and handle any potential failure related to throttling and default service concurrency limits. For more information about the limits in Amazon Translate and Amazon Textract, see Guidelines and Limits and Limits in Amazon Textract, respectively.
Deploying the solution with AWS CloudFormation
The first step is to use an AWS CloudFormation template to provision the resources needed for the solution, including the AWS Identity and Access Management (IAM) roles, IAM policies, and SNS topics.
- Launch the AWS CloudFormation template by choosing the following (this creates the stack in the us-east-1 Region):
- For Stack name, enter a unique stack name for this account; for example, document-translate.
- For TargetLanguageCode, enter the language code that you want your translated documents in; for example, es for Spanish.
For more information about supported languages, see Supported Languages and Language Codes.
- In the Capabilities and transforms section, select the check boxes to acknowledge that AWS CloudFormation will create IAM resources and transform the AWS Serverless Application Model (AWS SAM) template.
AWS SAM templates simplify the definition of resources needed for serverless applications. When deploying AWS SAM templates in AWS CloudFormation, AWS CloudFormation performs a transform to convert the AWS SAM template into a CloudFormation template. For more information, see Transform.
- Choose Create stack.
The stack creation may take up to 20 minutes, after which the status changes to CREATE_COMPLETE. You can see the name of the newly created S3 bucket on the Outputs tab.
Translating the document
To translate your document, upload a document in English to the input folder of the S3 bucket you created in the previous step.
For this post, we scanned the “Universal Declaration of Human Rights,” created by the United Nations.
This upload event triggers the Lambda function <Stack name>-S3EventProcessor-<Random string>, which invokes the Amazon Textract startDocumentTextDetection API to extract the text from the scanned document.
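Object keys in Amazon S3 event notifications arrive URL-encoded, so a key with spaces or parentheses needs decoding before it is passed to another service. The following is a minimal sketch of that step, assuming the function only processes PDF files under an input/ prefix. The class InputKeys and both method names are hypothetical and are not taken from the solution's source code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Locale;

final class InputKeys {

    // Decodes an object key as it appears in an S3 event notification,
    // where spaces arrive as '+' and other characters as %XX escapes.
    static String decodeKey(String rawKey) {
        try {
            return URLDecoder.decode(rawKey, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException("UTF-8 is not available", e);
        }
    }

    // Assumption: only PDF files uploaded under the "input/" prefix are processed.
    static boolean isInputPdf(String rawKey) {
        String key = decodeKey(rawKey);
        return key.startsWith("input/") && key.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    public static void main(String[] args) {
        String rawKey = "input/Human+Rights%281948%29.pdf";
        System.out.println(decodeKey(rawKey));
        System.out.println(isInputPdf(rawKey));
    }
}
```

Checking the prefix also guards against a function reacting to objects that the pipeline itself writes elsewhere in the same bucket.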
When Amazon Textract completes the batch job, it sends an SNS notification. The notification triggers the Lambda function <Stack name>-TextractSNSEventProcessor-<Random string>, which processes the Amazon Textract response page by page, extracts the LINE block elements, and stores them in the S3 bucket.
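To illustrate the page-by-page handling, the following sketch keeps only LINE blocks and groups their text by page number. TextBlock is a hypothetical stand-in for the block objects in an Amazon Textract text detection response and carries only the three fields used here; LineExtractor and linesByPage are likewise illustrative names, not the solution's own code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class LineExtractor {

    // Hypothetical stand-in for one block of a text detection response.
    static final class TextBlock {
        final String blockType; // "PAGE", "LINE", or "WORD"
        final int page;         // 1-based page number
        final String text;      // detected text, null for PAGE blocks

        TextBlock(String blockType, int page, String text) {
            this.blockType = blockType;
            this.page = page;
            this.text = text;
        }
    }

    // Returns the text of every LINE block, grouped by page in ascending
    // page order and kept in the order the blocks were received.
    static Map<Integer, List<String>> linesByPage(List<TextBlock> blocks) {
        Map<Integer, List<String>> byPage = new TreeMap<>();
        for (TextBlock block : blocks) {
            if (!"LINE".equals(block.blockType)) {
                continue; // skip PAGE and WORD blocks
            }
            byPage.computeIfAbsent(block.page, p -> new ArrayList<>()).add(block.text);
        }
        return byPage;
    }

    public static void main(String[] args) {
        List<TextBlock> blocks = Arrays.asList(
                new TextBlock("PAGE", 1, null),
                new TextBlock("LINE", 1, "Article 1"),
                new TextBlock("WORD", 1, "Article"),
                new TextBlock("LINE", 2, "Article 2"));
        System.out.println(linesByPage(blocks));
    }
}
```

WORD blocks repeat the text already contained in their parent LINE block, which is why the sketch drops them instead of concatenating everything.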
Amazon Textract extracts LINE block elements with a BoundingBox. A sentence that wraps across lines in the scanned document results in multiple LINE block elements. To make sure that Amazon Translate has the entire sentence in scope for translation, the solution combines multiple LINE block elements to recreate the sentence boundary in the source document. This is done by using the BreakIterator class available in Java. For more information, see Class BreakIterator.
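The following is a minimal sketch of how LINE text can be recombined into sentences with java.text.BreakIterator. It assumes the lines are already in reading order and joins them with single spaces before splitting; the class SentenceAssembler and the method toSentences are hypothetical names, and the solution's actual implementation may differ.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class SentenceAssembler {

    // Joins the LINE texts with single spaces, then splits the joined text
    // on the sentence boundaries that BreakIterator finds for the locale.
    static List<String> toSentences(List<String> lines, Locale locale) {
        StringBuilder joined = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (joined.length() > 0) {
                joined.append(' ');
            }
            joined.append(trimmed);
        }
        String text = joined.toString();

        BreakIterator boundaries = BreakIterator.getSentenceInstance(locale);
        boundaries.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Three LINE blocks that together hold two sentences.
        List<String> lines = Arrays.asList(
                "All human beings are born free and equal in",
                "dignity and rights. They are endowed with reason",
                "and conscience.");
        for (String sentence : toSentences(lines, Locale.ENGLISH)) {
            System.out.println(sentence);
        }
    }
}
```

In this example, three LINE blocks become two sentences, one ending in "dignity and rights." and one ending in "and conscience.", so each object sent for translation carries a complete sentence instead of a fragment cut at a line wrap.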
The sentences are then stored in the S3 bucket as individual objects. Finally, the function calls the Amazon Translate startTextTranslationJob API with the S3 location of the text to be translated.
The Amazon Translate job completion SNS notification from the job poller triggers the Lambda function <Stack name>-TranslateJobSNSEventProcessor-<Random string>. The function creates the editable document by combining the translated texts created by the Amazon Translate batch job in the output folder of the S3 bucket with the following naming convention: inputFileName-TargetLanguageCode.docx.
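As a small illustration of the naming convention, the following sketch derives the output document name from an input object key and a target language code. It assumes that inputFileName means the uploaded file name without its folder prefix and without its original extension; the post doesn't spell this out, and OutputNames and translatedDocName are hypothetical names.

```java
final class OutputNames {

    // Builds "<inputFileName>-<TargetLanguageCode>.docx" from an object key.
    static String translatedDocName(String inputKey, String targetLanguageCode) {
        // Keep only the file name, dropping any folder prefix such as "input/".
        String fileName = inputKey.substring(inputKey.lastIndexOf('/') + 1);
        // Assumption: the source extension (for example ".pdf") is dropped.
        int dot = fileName.lastIndexOf('.');
        String baseName = dot > 0 ? fileName.substring(0, dot) : fileName;
        return baseName + "-" + targetLanguageCode + ".docx";
    }

    public static void main(String[] args) {
        System.out.println(translatedDocName("input/udhr.pdf", "es")); // prints udhr-es.docx
    }
}
```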
The following screenshot shows the document translated in Spanish.
The solution also supports translating documents for right-to-left (RTL) script languages such as Arabic and Hebrew. The following screenshot shows the translated document in Arabic (language code: ar).
For any pipeline failure, check the Amazon CloudWatch logs for the corresponding Lambda function and look for potential errors that caused the failure.
To translate into a different language, you can update the LANG_CODE environment variable for the <Stack name>-TextractSNSEventProcessor-<Random string> function and trigger the solution pipeline by uploading a new document into the input folder of the S3 bucket.
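The following sketch shows one way a function might read the LANG_CODE environment variable and fail fast when it is missing. Passing the environment in as a map keeps the logic easy to test; TargetLanguage and fromEnvironment are hypothetical names, and the solution's own handling of this variable may differ.

```java
import java.util.Collections;
import java.util.Map;

final class TargetLanguage {

    // Reads the target language code from the given environment map and
    // rejects a missing or blank value instead of silently using a default.
    static String fromEnvironment(Map<String, String> environment) {
        String code = environment.get("LANG_CODE");
        if (code == null || code.trim().isEmpty()) {
            throw new IllegalStateException("LANG_CODE is not set");
        }
        return code.trim();
    }

    public static void main(String[] args) {
        // A fixed map is used here; inside a function you could pass System.getenv().
        Map<String, String> environment = Collections.singletonMap("LANG_CODE", "ar");
        System.out.println(fromEnvironment(environment)); // prints ar
    }
}
```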
Conclusion
In this post, we demonstrated how to extract text from PDF documents and translate them into an editable document in a different language using Amazon Translate asynchronous batch processing. For a low-latency, low-throughput solution translating smaller PDF documents, you can perform the translation through the real-time Amazon Translate API.
The ability to process data at scale is becoming important to organizations across all industries. Managed machine learning services like Amazon Textract and Amazon Translate can simplify your document processing and translation needs, helping you focus on addressing core business needs while keeping overall IT costs manageable.
For further reading, we recommend the following:
- Asynchronous Batch Processing
- Detecting and Analyzing Text in Multipage Documents
- Translating documents with Amazon Translate, AWS Lambda, and the new Batch Translate API
- Automatically extract text and structured data from documents with Amazon Textract
- Getting a batch job completion message from Amazon Translate
About the Authors
Siva Rajamani is a Boston-based Enterprise Solutions Architect for AWS. He enjoys working closely with customers and supporting their digital transformation and AWS adoption journey. His core areas of focus are Serverless, Application Integration, and Security. Outside of work, he enjoys outdoors activities and watching documentaries.
Sudhanshu Malhotra is a Boston-based Enterprise Solutions Architect for AWS. He’s a technology enthusiast who enjoys helping customers find innovative solutions to complex business challenges. His core areas of focus are DevOps, Machine Learning, and Security. When he’s not working with customers on their journey to the cloud, he enjoys reading, hiking, and exploring new cuisines.