New expanded data format support in Amazon Kendra

Enterprises across the globe want to combine multiple data sources into a unified search experience for their employees and end customers. Because a large volume of data must be examined and indexed, retrieval speed, solution scalability, and search performance become key factors when choosing an enterprise intelligent search solution. Additionally, these data sources comprise structured and unstructured content repositories, including various file types, which may cause compatibility issues.

Amazon Kendra is a highly accurate, intelligent search service that lets users find answers to their questions in your structured and unstructured data using natural language processing and advanced search algorithms. It returns specific answers to questions, giving users an experience that’s close to interacting with a human expert.

Today, Amazon Kendra launched support for seven additional data formats. This allows you to integrate your existing data sources as is and perform intelligent search across multiple content repositories.

In this post, we discuss the new supported data formats and how to use them.

New supported data formats

Previously, Amazon Kendra supported documents that included structured text in the form of frequently asked questions and answers, as well as unstructured text in the form of HTML files, Microsoft PowerPoint presentations, Microsoft Word documents, plain text documents, and PDFs.

With this launch, Amazon Kendra now offers support for seven additional data formats:

Rich Text Format (RTF)
JavaScript Object Notation (JSON)
Markdown (MD)
Comma-separated values (CSV)
Microsoft Excel (MS Excel)
Extensible Markup Language (XML)
Extensible Stylesheet Language Transformations (XSLT)

Amazon Kendra users can ingest documents in these formats into their index in the following two ways:

Using the BatchPutDocument API, in one of two forms:
- Pass the document as an Amazon Simple Storage Service (Amazon S3) file.
- Pass the document as binary data (blob).
Using a data source. For more information, see Creating a data source.
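
The two BatchPutDocument forms can be sketched with the AWS SDK for Python (boto3). This is a minimal sketch rather than a definitive implementation: the helper names, the extension-to-ContentType mapping, and the idea of passing the role ARN as an optional argument are illustrative assumptions, and the index ID, bucket, and role are values you supply.

```python
import os

# Assumption: this extension-to-ContentType mapping is our own convenience
# table for the seven formats in this post, not something the SDK provides.
CONTENT_TYPES = {
    ".rtf": "RTF",
    ".json": "JSON",
    ".md": "MD",
    ".csv": "CSV",
    ".xlsx": "MS_EXCEL",
    ".xml": "XML",
    ".xslt": "XSLT",
}


def content_type_for(name):
    """Return the ContentType string for a file name or S3 key."""
    ext = os.path.splitext(name)[1].lower()
    if ext not in CONTENT_TYPES:
        raise ValueError(f"no ContentType mapping for {name!r}")
    return CONTENT_TYPES[ext]


def s3_document(doc_id, bucket, key):
    """Describe a document that Amazon Kendra reads from Amazon S3."""
    return {
        "Id": doc_id,
        "S3Path": {"Bucket": bucket, "Key": key},
        "ContentType": content_type_for(key),
    }


def blob_document(doc_id, filename, data):
    """Describe a document passed inline as binary data (blob)."""
    return {
        "Id": doc_id,
        "Title": filename,
        "Blob": data,
        "ContentType": content_type_for(filename),
    }


def ingest(index_id, documents, role_arn=None):
    """Send the documents to the index and return any reported failures."""
    import boto3  # imported here so the helpers above work without the SDK

    kendra = boto3.client("kendra")
    kwargs = {"IndexId": index_id, "Documents": documents}
    if role_arn:
        # A role with read access to the bucket is needed for S3Path documents.
        kwargs["RoleArn"] = role_arn
    response = kendra.batch_put_document(**kwargs)
    return response.get("FailedDocuments", [])
```

For example, `ingest(index_id, [s3_document("doc-1", "my-bucket", "sample-data/notes.md")], role_arn=my_role_arn)` would submit one Markdown file, where `my-bucket` and `my_role_arn` are placeholders.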

Solution overview

In the following sections, we walk through the steps for adding documents from a data source and performing a search on those documents.

The following diagram shows our solution architecture.

To test this solution with any of the supported formats, you need to use your own data. You can upload documents of the same or different formats to the S3 bucket.

Create an Amazon Kendra index

For instructions on creating your Amazon Kendra index, refer to Creating an index.

You can skip this step if you already have an index to use for this demo.

Upload documents to an S3 bucket and ingest to the index using the S3 connector

Complete the following steps to connect an S3 bucket to your index:

Create an S3 bucket to store your documents.
Create a folder named sample-data.
Upload the documents that you want to test to the folder.
On the Amazon Kendra console, go to your index and choose Data sources.
Choose Add data source.
Under Available data sources, select S3 and choose Add Connector.
Enter a name for your connector (such as Demo_S3_connector) and choose Next.
Choose Browse S3 and choose the S3 bucket where you uploaded the documents.
For IAM Role, create a new role.
For Set sync run schedule, select Run on demand.
Choose Next.
On the Review and create page, choose Add data source.
After the creation process is complete, choose Sync Now.
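
If you prefer to script the console steps above, the following boto3 sketch shows one possible equivalent. It assumes the bucket and IAM role already exist; the function names, the polling interval, and the default `sample-data/` prefix are illustrative choices. Omitting a schedule from the request leaves the data source to sync on demand.

```python
import time


def s3_data_source_request(index_id, name, bucket, role_arn, prefix="sample-data/"):
    """Build the CreateDataSource arguments for an S3 connector.

    No Schedule is set, so the data source syncs only when asked to.
    """
    return {
        "IndexId": index_id,
        "Name": name,
        "Type": "S3",
        "RoleArn": role_arn,
        "Configuration": {
            "S3Configuration": {
                "BucketName": bucket,
                # Only index objects under the folder created earlier.
                "InclusionPrefixes": [prefix],
            }
        },
    }


def create_and_sync(request, poll_seconds=5, max_polls=60):
    """Create the data source, wait until it is active, then start a sync."""
    import boto3  # imported here so the request builder works without the SDK

    kendra = boto3.client("kendra")
    index_id = request["IndexId"]
    data_source_id = kendra.create_data_source(**request)["Id"]

    # Assumption: polling until ACTIVE before syncing is a cautious choice.
    for _ in range(max_polls):
        status = kendra.describe_data_source(
            Id=data_source_id, IndexId=index_id
        )["Status"]
        if status == "ACTIVE":
            break
        if status == "FAILED":
            raise RuntimeError("data source creation failed")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("data source did not become active in time")

    job = kendra.start_data_source_sync_job(Id=data_source_id, IndexId=index_id)
    return data_source_id, job["ExecutionId"]
```

Calling `create_and_sync(s3_data_source_request(index_id, "Demo_S3_connector", bucket, role_arn))` mirrors the Add data source and Sync Now steps, with `index_id`, `bucket`, and `role_arn` supplied by you.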

Now that you have ingested some documents, you can navigate to the built-in search console to test queries.

Search your documents with the Amazon Kendra search console

On the Amazon Kendra console, choose Search indexed content in the navigation pane.
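
The same kind of search can also be run programmatically through the Query API. The sketch below is an illustration under assumptions: `summarize_results` and `search` are hypothetical helper names, and the fields they read are the ones a Query response normally carries for each result item.

```python
def summarize_results(response, limit=5):
    """Reduce a Query response to the fields most useful for a quick check."""
    rows = []
    for item in response.get("ResultItems", [])[:limit]:
        rows.append(
            {
                "type": item.get("Type"),
                "title": item.get("DocumentTitle", {}).get("Text", ""),
                "excerpt": item.get("DocumentExcerpt", {}).get("Text", ""),
                "uri": item.get("DocumentURI", ""),
            }
        )
    return rows


def search(index_id, question):
    """Run a natural language query against the index and summarize the hits."""
    import boto3  # imported here so summarize_results works without the SDK

    kendra = boto3.client("kendra")
    response = kendra.query(IndexId=index_id, QueryText=question)
    return summarize_results(response)
```

Each returned row includes the source document URI, which makes it easy to confirm that results come from files in the newly supported formats.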

The following are examples of the results from the search for different document types:

RTF – Upload input data in RTF format to the S3 bucket and sync the data source: