Index your Confluence content using the new Confluence connector V2 for Amazon Kendra

Amazon Kendra is a highly accurate and simple-to-use intelligent search service powered by machine learning (ML). Amazon Kendra offers a suite of data source connectors to simplify the process of ingesting and indexing your content, wherever it resides.

Valuable data in organizations is stored in both structured and unstructured repositories. An enterprise search solution should be able to pull together data across several structured and unstructured repositories to index and search on.

One such unstructured data repository is Confluence. Confluence is a team workspace that gives knowledge worker teams a place to create, capture, and collaborate on any project or idea. Team spaces help teams structure, organize, and share work, so every team member has visibility into institutional knowledge and access to the information they need.

There are two Confluence offerings:

Cloud – This is offered as a software as a service (SaaS) product. It’s always on, continuously updated, and highly secure.
Data Center (self-managed) – Here, you host Confluence on your own infrastructure, which could be on premises or in the cloud. This allows you to keep data within your network and manage it yourself.

We’re excited to announce that you can now use the new Amazon Kendra connector V2 for Confluence to search information stored in your Confluence account, whether it’s in the cloud or in your data center. In this post, we show how to index information stored in Confluence and use the Amazon Kendra intelligent search function. Because the search is powered by ML, it can accurately find information in unstructured documents with natural language narrative content, for which keyword search is not very effective.

What’s new for this version

This version supports OAuth 2.0 authentication in addition to basic authentication for the Cloud edition. For the Data Center (on-premises) edition, we have added OAuth2 in addition to basic authentication and personal access tokens for showing search results based on user access rights. You can benefit from the following features:

You can now crawl comments in addition to spaces, pages, blogs, and attachments
You now have fine-grained choices for your sync scope—you can specify pages, blogs, comments, and attachments
You can choose to import identities (or not)
This version offers regex support for choosing entity titles as well as file types
You have the choice of multiple Sync modes

Solution overview

With Amazon Kendra, you can configure multiple data sources to provide a central place to search across your document repository. For our solution, we demonstrate how to index a Confluence repository using the Amazon Kendra connector for Confluence. The solution consists of the following steps:

Choose an authentication mechanism.
Configure an app on Confluence and get the connection details.
Store the details in AWS Secrets Manager.
Create a Confluence data source V2 via the Amazon Kendra console.
Index the data in the Confluence repository.
Run a sample query to test the solution.

Prerequisites

To try out the Amazon Kendra connector for Confluence, you need the following:

A Confluence account (Cloud or Data Center edition).
An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
Basic knowledge of AWS.

Choose an authentication mechanism

Choose your preferred authentication method:

Basic – This works on both the Cloud and Data Center editions. You need a user ID and a password to configure this method.
Personal access token – This option only works for the Data Center edition.
OAuth2 – This is more involved and works for both Cloud and Data Center editions.

Gather authentication details

In this section, we show the steps to gather your authentication details depending on your authentication method.

Basic authentication

For basic authentication with the Data Center edition, all you need is your login and password. Make sure your login has privileges to gather all content.

For Cloud edition, your user ID serves as your user login. For your password, you need to get a token. Complete the following steps:

For Label, enter a name for the token.
Choose Create.

Copy the value and save it to use as your password.

Personal access token

This authentication method works for on premises (Data Center) only. Complete the following steps to acquire authentication details:

Log in to your Confluence URL using the user ID and password that you want Amazon Kendra to use while retrieving content.
Choose the profile icon and choose Settings.

Choose Personal Access Tokens in the navigation pane, then choose Create token.

For Token name, enter a name.
For Expiry date, deselect Automatic expiry.
Choose Create.

Copy the token and save it in a safe place.

To configure Secrets Manager, we use the login URL and this value.

OAuth2 authentication for Confluence Cloud edition

This authentication method follows the full OAuth 2.0 (3LO) documentation from Confluence. We first create and configure an app on Confluence and enable it for OAuth2. The process is slightly different for the Cloud and Data Center editions. We then get an authorization code and exchange it for an access token. Finally, we get the client ID, client secret, and client code. Complete the following steps:

Log in to the Confluence app.
Navigate to https://developer.atlassian.com/.
Next to My apps, choose Create and choose OAuth2 Integration.

For Name, enter a name.
Choose Create.

Choose Authorization in the navigation pane.
Choose Add next to your authorization type.

For Callback URL, enter the URL you use to log in to Confluence.
Choose Save changes.

Under Authorization URL generator, choose Add APIs.

Next to User identity API, choose Add, then choose Configure.

Choose Edit Scopes to configure read scopes for the app.
Select View active user profile and View user profiles.

Choose Permissions in the navigation pane.
Next to Confluence API, choose Add, then choose Configure.
On the Classic scopes tab, choose Edit Scopes.
Select all read, search, and download scopes.
Choose Save.

On the Granular scopes tab, choose Edit Scopes.
Search for read and select all the scopes found.
Choose Save.

Choose Authorization in the navigation pane.
Next to your authorization type, choose Configure.

You should see three URLs listed.

Copy the code for Granular Confluence API authorization URL.

The following is example code:

https://auth.atlassian.com/authorize?
audience=api.atlassian.com
&client_id=YOUR_CLIENT_ID
&scope=REQUESTED_SCOPE%20REQUESTED_SCOPE_TWO
&redirect_uri=https://YOUR_APP_CALLBACK_URL
&state=YOUR_USER_BOUND_VALUE
&response_type=code
&prompt=consent

If you want to generate a refresh token so that you don’t have to repeat this process, add offline_access (or %20offline_access) to the end of all the scopes in the URL (for example, &scope=REQUESTED_SCOPE%20REQUESTED_SCOPE_TWO%20offline_access).
If you’re okay generating a new token each time, just enter the URL in your browser.
Choose Accept.

You’re redirected to your Confluence home page.

Inspect the browser URL and locate code=xxxxx.
Copy this code and save it.

This is the authorization code that we exchange for the access token.
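The URL handling in the preceding steps can be sketched in Python as follows. This is a minimal illustration, not part of the connector: the function names `build_authorize_url` and `extract_authorization_code` are hypothetical, and the query parameters are taken from the example URL shown earlier.

```python
from urllib.parse import parse_qs, quote, urlencode, urlparse

AUTHORIZE_ENDPOINT = "https://auth.atlassian.com/authorize"


def build_authorize_url(client_id, scopes, redirect_uri, state, offline_access=False):
    """Assemble the authorization URL from the parameters in the example URL."""
    scope_list = list(scopes)
    # Adding offline_access asks for a refresh token in addition to the access token.
    if offline_access and "offline_access" not in scope_list:
        scope_list.append("offline_access")
    params = {
        "audience": "api.atlassian.com",
        "client_id": client_id,
        "scope": " ".join(scope_list),
        "redirect_uri": redirect_uri,
        "state": state,
        "response_type": "code",
        "prompt": "consent",
    }
    # quote_via=quote encodes spaces as %20, as in the example URL.
    return AUTHORIZE_ENDPOINT + "?" + urlencode(params, quote_via=quote)


def extract_authorization_code(redirected_url):
    """Return the value of the code= query parameter, or None if it is absent."""
    values = parse_qs(urlparse(redirected_url).query).get("code")
    return values[0] if values else None
```

You would open the generated URL in a browser, choose Accept, and then pass the URL you are redirected to into `extract_authorization_code`.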

Return to the Atlassian developer console and choose Settings in the navigation pane.
Copy the values of the client ID and client secret and save them.

We need these values to make a call to exchange the authorization code for the access token.

Next, we use the Postman utility to post the authorization code to get the access token. You can use alternate tools like curl to do this as well.

The URL to post the authorization code is https://auth.atlassian.com/oauth/token.

The JSON body to post is as follows:

{"grant_type": "authorization_code",
"client_id": "YOUR_CLIENT_ID",
"client_secret": "YOUR_CLIENT_SECRET",
"code": "YOUR_AUTHORIZATION_CODE",
"redirect_uri": "https://YOUR_APP_CALLBACK_URL"}

The grant_type parameter is hard-coded. We collected the values for client_id and client_secret in a previous step. The value for code is the authorization code we collected earlier.

A successful response will return the access token. If you added offline access to the URL earlier, you also get a refresh token.

Save the access token to use when setting up Secrets Manager.
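If you prefer a script to Postman or curl, the same exchange can be sketched with the Python standard library. The function names here are hypothetical, and the sketch assumes the endpoint accepts the JSON body shown above with a `Content-Type: application/json` header. Building the request is kept separate from sending it so you can inspect the payload first.

```python
import json
import urllib.request

TOKEN_ENDPOINT = "https://auth.atlassian.com/oauth/token"


def build_code_exchange_request(client_id, client_secret, code, redirect_uri):
    """Build (but do not send) the POST request that trades the code for tokens."""
    body = {
        "grant_type": "authorization_code",
        "client_id": client_id,
        "client_secret": client_secret,
        "code": code,
        "redirect_uri": redirect_uri,
    }
    return urllib.request.Request(
        TOKEN_ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_token_request(request):
    """Send the request and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send_token_request(build_code_exchange_request(...))` with your real values performs the exchange; the returned dictionary is where you would look for the access token and, if you requested offline access, the refresh token.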

The access token is valid for only 1 hour. When it expires, you can start all over again to get a new one. However, if you have a refresh token, you can instead use Postman to post to the same URL as before: https://auth.atlassian.com/oauth/token. Use the following JSON format for the body of the request:

{"grant_type": "refresh_token",
"client_id": "YOUR_CLIENT_ID",
"client_secret": "YOUR_CLIENT_SECRET",
"refresh_token": "YOUR_REFRESH_TOKEN"}

The call returns a new access token.
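Because the token lasts only 1 hour, a script that talks to Confluence on your behalf needs to know when to refresh. The following sketch shows one way to track that. The names `build_refresh_body` and `needs_refresh` and the 5-minute safety margin are assumptions made for illustration.

```python
import time

# The access token is described above as valid for 1 hour.
ACCESS_TOKEN_LIFETIME_SECONDS = 3600


def build_refresh_body(client_id, client_secret, refresh_token):
    """Return the JSON body for the refresh call, as a Python dict."""
    return {
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }


def needs_refresh(issued_at, now=None, safety_margin=300):
    """Return True once the token is within safety_margin seconds of expiring."""
    now = time.time() if now is None else now
    return now - issued_at >= ACCESS_TOKEN_LIFETIME_SECONDS - safety_margin
```

Record `time.time()` when you receive a token, and check `needs_refresh` before each use. After a refresh, remember to update the stored secret with the new token value.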

OAuth2 authentication for Confluence Data Center edition

If using the Data Center edition with OAuth2 authentication, complete the following steps:

Log in to Confluence Data Center edition.
Choose the gear icon, then choose General configuration.
In the navigation pane, choose Application links, then choose Create link.
In the Create link pop-up window, select External application and Incoming, then choose Continue.
For Name, enter a name.
For Redirect URL, enter https://httpbin.org/.
Choose Save.
Copy and save the values for the client ID and client secret.
On a separate browser tab, open the URL https://example-app.com/pkce.
Choose Generate Random String and Calculate Hash.
Copy the value under Code Challenge.
Return to your original tab.

Use the following URL to get the authorization code:

https://<confluence url>/rest/oauth2/latest/authorize
?client_id=CLIENT_ID
&redirect_uri=REDIRECT_URI
&response_type=code
&scope=SCOPE
&code_challenge=CODE_CHALLENGE
&code_challenge_method=S256

Use the client ID you copied earlier, and https://httpbin.org for the redirect URI. For CODE_CHALLENGE, enter the code you copied earlier.
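As an alternative to generating the code challenge on a website, you can compute it locally. The following sketch follows the S256 method defined in RFC 7636: the challenge is the base64url-encoded SHA-256 hash of the verifier, without padding. The function names are hypothetical, and the authorize URL is built from the parameters listed above.

```python
import base64
import hashlib
import secrets
from urllib.parse import urlencode


def generate_code_verifier(num_bytes=32):
    """Return a random URL-safe verifier (43 characters for 32 random bytes)."""
    raw = secrets.token_bytes(num_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def code_challenge_s256(verifier):
    """Return the S256 code challenge for the given verifier."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def build_dc_authorize_url(confluence_url, client_id, redirect_uri, scope, challenge):
    """Assemble the Data Center authorization URL shown above."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",
        "scope": scope,
        "code_challenge": challenge,
        "code_challenge_method": "S256",
    }
    return confluence_url.rstrip("/") + "/rest/oauth2/latest/authorize?" + urlencode(params)
```

Keep the verifier: you need it again as CODE_VERIFIER when you exchange the authorization code for tokens.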

Choose Allow.

You’re redirected to httpbin.org.

Save the code to use in the next step.

To get the access token and refresh token, use a tool such as curl or Postman to post the following values to https://<your confluence URL>/rest/oauth2/latest/token:

grant_type: authorization_code
client_id: YOUR_CLIENT_ID
client_secret: YOUR_CLIENT_SECRET
code: YOUR_AUTHORIZATION_CODE
code_verifier: CODE_VERIFIER
redirect_uri: YOUR_REDIRECT_URL

Use the client ID, client secret, and authorization code you saved earlier. For CODE_VERIFIER, enter the value from when you generated the code challenge.

Copy the access token and refresh token to use later.
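The post to the token endpoint can also be scripted. The following sketch only builds the request. It assumes the endpoint accepts the values as a form-encoded body, which you should verify against your Confluence version, and the function name is hypothetical.

```python
import urllib.request
from urllib.parse import urlencode


def build_dc_token_request(confluence_url, client_id, client_secret,
                           code, code_verifier, redirect_uri):
    """Build (but do not send) the POST request for the Data Center token endpoint."""
    fields = {
        "grant_type": "authorization_code",
        "client_id": client_id,
        "client_secret": client_secret,
        "code": code,
        "code_verifier": code_verifier,
        "redirect_uri": redirect_uri,
    }
    return urllib.request.Request(
        confluence_url.rstrip("/") + "/rest/oauth2/latest/token",
        # Assumption: the values are sent as application/x-www-form-urlencoded.
        data=urlencode(fields).encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Pass the returned request to `urllib.request.urlopen` to send it, and read the tokens from the response.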

The access token and refresh token are valid only for 1 hour. To refresh the token, post the following values to the same URL to get new values:

grant_type: refresh_token
client_id: YOUR_CLIENT_ID
client_secret: YOUR_CLIENT_SECRET
refresh_token: REFRESH_TOKEN
redirect_uri: YOUR_REDIRECT_URL

The new tokens are valid for 1 hour.

Store Confluence credentials in Secrets Manager

To store your Confluence credentials in Secrets Manager, complete the following steps:

On the Secrets Manager console, choose Store a new secret.
Select Other type of secret.

Depending on the type of secret, enter the key-values as follows:
- For Confluence Cloud basic authentication, enter the following key-value pairs (note that the password is not the login password, but the token you created earlier):
```
"username" : "<your login username>"
"password" : "<your token value>"
```
- For Confluence Cloud OAuth authentication, enter the following key-value pairs:
```
"confluenceAppKey" : "<your client ID>"
"confluenceAppSecret" : "<your client secret>"
"confluenceAccessToken" : "<your access token>"
"confluenceRefreshToken" : "<your refresh token>"
```
- For Confluence Data Center basic authentication, enter the following key-value pairs:
```
"username" : "<login username>"
"password" : "<login password>"
```
- For Confluence Data Center personal access token authentication, enter the following key-value pairs:
```
"patToken" : "<your personal access token>"
```
- For Confluence Data Center OAuth authentication, enter the following key-value pairs:
```
"confluenceAppKey" : "<your client ID>"
"confluenceAppSecret" : "<your client secret>"
"confluenceAccessToken" : "<your access token>"
"confluenceRefreshToken" : "<your refresh token>"
```

Choose Next.

For Secret name, enter a name (for example, AmazonKendra-my-confluence-secret).
Enter an optional description.
Choose Next.

In the Configure rotation section, keep all settings at their defaults and choose Next.

On the Review page, choose Store.
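The console steps above can also be scripted. The following sketch validates the key names listed earlier for each authentication type and then stores the secret. The helper names are hypothetical, and `client` is assumed to be a Secrets Manager client such as the one returned by `boto3.client("secretsmanager")`.

```python
import json

# Key names per authentication type, as listed in the preceding steps.
REQUIRED_KEYS = {
    "basic": {"username", "password"},
    "pat": {"patToken"},
    "oauth2": {
        "confluenceAppKey",
        "confluenceAppSecret",
        "confluenceAccessToken",
        "confluenceRefreshToken",
    },
}


def build_secret_string(auth_type, values):
    """Return the secret as a JSON string, checking that every required key is present."""
    required = REQUIRED_KEYS[auth_type]
    missing = required - set(values)
    if missing:
        raise ValueError("missing keys: " + ", ".join(sorted(missing)))
    return json.dumps({key: values[key] for key in sorted(required)})


def store_secret(client, name, secret_string, description=""):
    """Create the secret and return its ARN.

    Assumption: client is a boto3 Secrets Manager client.
    """
    kwargs = {"Name": name, "SecretString": secret_string}
    if description:
        kwargs["Description"] = description
    return client.create_secret(**kwargs)["ARN"]
```

For example, `store_secret(client, "AmazonKendra-my-confluence-secret", build_secret_string("pat", {"patToken": "<your personal access token>"}))` would store a Data Center personal access token.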

Configure the Amazon Kendra connector for Confluence

To configure the Amazon Kendra connector, complete the following steps:

On the Amazon Kendra console, choose Create an Index.

For Index name, enter a name for the index (for example, my-confluence-index).
Enter an optional description.
For Role name, enter an IAM role name.
Configure optional encryption settings and tags.
Choose Next.

In the Configure user access control section, leave the settings at their defaults and choose Next.

In the Specify provisioning section, select Developer edition and choose Next.

On the review page, choose Create.

This creates and propagates the IAM role and then creates the Amazon Kendra index, which can take up to 30 minutes.
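If you would rather create the index programmatically, the following sketch shows the equivalent calls. It assumes `client` is an Amazon Kendra client such as `boto3.client("kendra")` and, unlike the console flow, that the IAM role already exists. The function name and polling interval are illustrative.

```python
import time


def create_index_and_wait(client, name, role_arn, description="",
                          poll_seconds=30, sleep=time.sleep):
    """Create a Developer edition index and block until it becomes ACTIVE."""
    kwargs = {"Name": name, "Edition": "DEVELOPER_EDITION", "RoleArn": role_arn}
    if description:
        kwargs["Description"] = description
    index_id = client.create_index(**kwargs)["Id"]
    while True:
        status = client.describe_index(Id=index_id)["Status"]
        if status == "ACTIVE":
            return index_id
        if status == "FAILED":
            raise RuntimeError("index creation failed for " + index_id)
        # Creation can take up to 30 minutes, so poll at a relaxed interval.
        sleep(poll_seconds)
```

The `sleep` parameter exists only so the wait can be replaced in tests; in normal use, leave it at its default.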

Create a Confluence data source

Complete the following steps to create your data source:

On the Amazon Kendra console, choose Data sources in the navigation pane.
Under Confluence connector V2.0, choose Add connector.

For Data source name, enter a name (for example, my-Confluence-data-source).
Enter an optional description.
Choose Next.

Choose either Confluence Cloud or Confluence Server depending on your data source.
For Authentication, choose your authentication option.
Select Identity crawler is on.
For IAM role, choose Create a new role.
For Role name, enter a name (for example, AmazonKendra-my-confluence-datasource-role).
Choose Next.

For Confluence Data Center and Cloud editions, we can add additional optional information (not shown) like the VPC. For Data Center edition only, we can add additional information for the web proxy. There is also an additional authentication option if using a personal access token that is valid only for Data Center and not Cloud edition.

For Sync scope, select all the content to sync.
For Sync mode, select Full sync.
For Frequency, choose Run on demand.
Choose Next.

Optionally, you can set mapping fields.

Mapping fields is a useful exercise in which you replace field names with values that are user-friendly and fit your organization’s vocabulary.

For this post, keep all defaults and choose Next.

Review the settings and choose Add data source.
To sync the data source, choose Sync now.

A banner message appears when the sync is complete.
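The on-demand sync can also be started and monitored from a script. The following sketch assumes `client` is an Amazon Kendra client such as `boto3.client("kendra")` and that the new job appears on the first page of the sync history. The function name and polling interval are illustrative.

```python
import time

# Sync job statuses after which no further progress is expected.
FINAL_SYNC_STATUSES = {"SUCCEEDED", "FAILED", "INCOMPLETE", "ABORTED"}


def start_sync_and_wait(client, index_id, data_source_id,
                        poll_seconds=60, sleep=time.sleep):
    """Start a sync job and return its final status."""
    execution_id = client.start_data_source_sync_job(
        Id=data_source_id, IndexId=index_id
    )["ExecutionId"]
    while True:
        history = client.list_data_source_sync_jobs(
            Id=data_source_id, IndexId=index_id
        )["History"]
        # Find the job we started; treat it as still syncing until it is listed.
        job = next((j for j in history if j.get("ExecutionId") == execution_id), None)
        status = job["Status"] if job else "SYNCING"
        if status in FINAL_SYNC_STATUSES:
            return status
        sleep(poll_seconds)
```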

Test the solution

Now that you have ingested the content from your Confluence account into your Amazon Kendra index, you can test some queries. For the purposes of our test, we have created a Confluence website with two teams: team1 with the member Analyst1 and team2 with the member Analyst2.

On the Amazon Kendra console, navigate to your index and choose Search indexed content.
Enter a sample search query and review your search results (your results will vary based on the contents of your account).

The Confluence connector also crawls local identity information from Confluence. You can use this feature to narrow down your query by user. Confluence offers comprehensive visibility options. Users can choose whether their content is visible to other users, at a space level, or to groups. When you filter your searches by user, the query returns only those documents that the user had access to at the time of ingestion.

To use this feature, expand Test query with user name or groups and choose Apply user name or groups.
Enter the user name of your user and choose Apply.

Note that for Confluence Data Center edition, the user name is the email ID.

Rerun your search query.

This brings you a filtered set of results. Notice we bring back just 62 results.

We now go back and restrict Bob Straham so that he can access only his own workspace, and run the search again.

Notice that we get just a subset of the results because the search is restricted to just Bob’s content.

When fronting Amazon Kendra with an application, such as one built using Experience Builder, you can pass the user identity (in the form of the email ID for Cloud edition or user name for Data Center edition) to Amazon Kendra to ensure that each user sees only the content they have access to. Alternatively, you can use AWS IAM Identity Center (successor to AWS Single Sign-On) to control the user context being passed to Amazon Kendra to limit queries by user.
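As a rough illustration of passing the user identity from your own application, the following sketch runs a query with a user context. It assumes `client` is an Amazon Kendra client such as `boto3.client("kendra")`; the function name and the shape of the returned list are choices made for this example.

```python
def search_as_user(client, index_id, query_text, user_id, groups=None):
    """Run a query filtered by user context and return (title, URI) pairs."""
    user_context = {"UserId": user_id}
    if groups:
        user_context["Groups"] = list(groups)
    response = client.query(
        IndexId=index_id,
        QueryText=query_text,
        UserContext=user_context,
    )
    return [
        (item.get("DocumentTitle", {}).get("Text", ""), item.get("DocumentURI", ""))
        for item in response.get("ResultItems", [])
    ]
```

The value you pass as `user_id` should match the identity format that the connector crawled for your Confluence edition.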

Congratulations! You have successfully used Amazon Kendra to surface answers and insights based on the content indexed from your Confluence account.

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra connector for Confluence V2, delete that data source.
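The cleanup decision described above can be expressed as a short sketch. It assumes `client` is an Amazon Kendra client such as `boto3.client("kendra")`; the function name and its return values are hypothetical.

```python
def clean_up(client, index_id, data_source_id=None, delete_whole_index=False):
    """Delete the index if it was created for this test, else only the data source."""
    if delete_whole_index:
        # Deleting the index also removes the data sources attached to it.
        client.delete_index(Id=index_id)
        return "index"
    if data_source_id is not None:
        client.delete_data_source(Id=data_source_id, IndexId=index_id)
        return "data_source"
    return "nothing"
```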

Conclusion

With the new Confluence connector V2 for Amazon Kendra, organizations can tap into the repository of information stored in their account securely using intelligent search powered by Amazon Kendra.

To learn about these possibilities and more, refer to the Amazon Kendra Developer Guide. For more information on how you can create, modify, or delete metadata and content when ingesting your data from Confluence, refer to Enriching your documents during ingestion and Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra.

About the author

Ashish Lagwankar is a Senior Enterprise Solutions Architect at AWS. His core interests include AI/ML, serverless, and container technologies. Ashish is based in the Boston, MA, area and enjoys reading, outdoors, and spending time with his family.
