Applications Now Open for $60,000 NVIDIA Graduate Fellowship Awards

Bringing together the world’s brightest minds and the latest accelerated computing technology leads to powerful breakthroughs that help tackle some of the biggest research problems.

To foster such innovation, the NVIDIA Graduate Fellowship Program provides grants, mentors and technical support to doctoral students doing outstanding research relevant to NVIDIA technologies. The program, in its 24th year running, is now accepting applications worldwide.

It focuses on supporting students working in AI, machine learning, autonomous vehicles, computer graphics, robotics, healthcare, high-performance computing and related fields. Awards are up to $60,000 per student.

Since its start in 2002, the Graduate Fellowship Program has awarded 200 grants worth more than $6.5 million.

Students must have completed at least their first year of Ph.D.-level studies at the time of application.

The application deadline for the 2025-2026 academic year is Friday, Sept. 13, 2024. An in-person internship at an NVIDIA research office preceding the fellowship year is mandatory; eligible candidates must be available for the internship in summer 2025.

For more on eligibility and how to apply, visit the program website.

Harness the power of AI and ML using Splunk and Amazon SageMaker Canvas

As the scale and complexity of data handled by organizations increase, traditional rules-based approaches to analyzing the data alone are no longer viable. Instead, organizations are increasingly looking to take advantage of transformative technologies like machine learning (ML) and artificial intelligence (AI) to deliver innovative products, improve outcomes, and gain operational efficiencies at scale. Furthermore, the democratization of AI and ML through AWS and AWS Partner solutions is accelerating its adoption across all industries.

For example, a health-tech company may be looking to improve patient care by predicting the probability that an elderly patient may become hospitalized by analyzing both clinical and non-clinical data. This will allow them to intervene early, personalize the delivery of care, and make the most efficient use of existing resources, such as hospital bed capacity and nursing staff.

AWS offers the broadest and deepest set of AI and ML services and supporting infrastructure, such as Amazon SageMaker and Amazon Bedrock, to help you at every stage of your AI/ML adoption journey, including adoption of generative AI. Splunk, an AWS Partner, offers a unified security and observability platform built for speed and scale.

As the diversity and volume of data increase, it is vital to understand how that data can be harnessed at scale by using the complementary capabilities of the two platforms. For organizations looking beyond the use of out-of-the-box Splunk AI/ML features, this post explores how Amazon SageMaker Canvas, a no-code ML development service, can be used in conjunction with data collected in Splunk to drive actionable insights. We also demonstrate how to use the generative AI capabilities of SageMaker Canvas to speed up your data exploration and help you build better ML models.

Use case overview

In this example, a health-tech company offering remote patient monitoring is collecting operational data from wearables using Splunk. These device metrics and logs are ingested into and stored in a Splunk index, a repository of incoming data. Within Splunk, this data is used to fulfill context-specific security and observability use cases by Splunk users, such as monitoring the security posture and uptime of devices and performing proactive maintenance of the fleet.

Separately, the company uses AWS data services, such as Amazon Simple Storage Service (Amazon S3), to store data related to patients, such as patient information, device ownership details, and clinical telemetry data obtained from the wearables. These could include exports from customer relationship management (CRM), configuration management database (CMDB), and electronic health record (EHR) systems. In this example, they have access to an extract of patient information and hospital admission records that reside in an S3 bucket.

The following table illustrates the different data explored in this example use case.

Description | Feature Name | Storage | Example Source
----------- | ------------ | ------- | --------------
Age of patient | age | AWS | EHR
Units of alcohol consumed by patient every week | alcohol_consumption | AWS | EHR
Tobacco usage by patient per week | tabacco_use | AWS | EHR
Average systolic blood pressure of patient | avg_systolic | AWS | Wearables
Average diastolic blood pressure of patient | avg_diastolic | AWS | Wearables
Average resting heart rate of patient | avg_resting_heartrate | AWS | Wearables
Patient admission record | admitted | AWS | EHR
Number of days the device has been active over a period | num_days_device_active | Splunk | Wearables
Average end of the day battery level over a period | avg_eod_device_battery_level | Splunk | Wearables

This post describes an approach with two key components:

  • The two data sources are stored alongside each other using a common AWS data engineering pipeline. Data is presented to the personas that need access using a unified interface.
  • An ML model to predict hospital admissions (admitted) is developed using the combined dataset and SageMaker Canvas. Professionals without a background in ML are empowered to analyze the data using no-code tooling.

The solution allows custom ML models to be developed from a broader variety of clinical and non-clinical data sources to cater for different real-life scenarios. For example, it can be used to answer questions such as “If patients have a propensity to have their wearables turned off and there is no clinical telemetry data available, can the likelihood that they are hospitalized still be accurately predicted?”

AWS data engineering pipeline

The adaptable approach detailed in this post starts with an automated data engineering pipeline to make data stored in Splunk available to a wide range of personas, including business intelligence (BI) analysts, data scientists, and ML practitioners, through a SQL interface. This is achieved by using the pipeline to transfer data from a Splunk index into an S3 bucket, where it will be cataloged.
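As a rough sketch of the retrieval half of that transfer, the following builds (but does not send) an authenticated search request to Splunk. The post itself uses the Splunk Enterprise SDK for Python; this sketch swaps in the Python standard library and the `/services/search/jobs/export` REST endpoint to stay dependency-free. The host name, token value, index name, and field names are all hypothetical placeholders.

```python
import urllib.parse
import urllib.request


def build_splunk_export_request(
    base_url: str, bearer_token: str, spl_query: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request to Splunk's search export endpoint."""
    # The REST API expects a full search string, so prefix "search" unless the
    # query already starts with it or with a generating command ("|").
    query = spl_query.strip()
    if not query.startswith(("search ", "|")):
        query = f"search {query}"

    body = urllib.parse.urlencode(
        {"search": query, "output_mode": "json"}
    ).encode("utf-8")

    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/services/search/jobs/export",
        data=body,
        # Bearer token authentication; in the pipeline this value would be
        # read from AWS Secrets Manager rather than hard-coded.
        headers={"Authorization": f"Bearer {bearer_token}"},
        method="POST",
    )


request = build_splunk_export_request(
    base_url="https://splunk.example.com:8089",  # hypothetical management endpoint
    bearer_token="EXAMPLE-TOKEN",  # placeholder value only
    spl_query="index=wearables earliest=-24h | fields user_id, battery_level",
)
print(request.get_method(), request.full_url)
```

Scoping the SPL query to a time window and a short field list, as assumed here, keeps the export limited to the data of interest.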

The approach is shown in the following diagram.

The diagram shows an architecture overview of data engineering pipeline. The components marked in the diagram are listed below.

Figure 1: Architecture overview of data engineering pipeline

The automated AWS data pipeline consists of the following steps:

  1. Data from wearables is stored in a Splunk index where it can be queried by users, such as security operations center (SOC) analysts, using the Splunk search processing language (SPL). Splunk’s out-of-the-box AI/ML capabilities, such as the Splunk Machine Learning Toolkit (Splunk MLTK) and purpose-built models for security and observability use cases (for example, for anomaly detection and forecasting), can be utilized inside the Splunk Platform. Using these Splunk ML features allows you to derive contextualized insights quickly without the need for additional AWS infrastructure or skills.
  2. Some organizations may look to develop custom, differentiated ML models, or want to build AI-enabled applications using AWS services for their specific use cases. To facilitate this, an automated data engineering pipeline is built using AWS Step Functions. The Step Functions state machine is configured with an AWS Lambda function to retrieve data from the Splunk index using the Splunk Enterprise SDK for Python. The SPL query requested through this REST API call is scoped to only retrieve the data of interest.
      1. Lambda supports container images. This solution uses a Lambda function that runs a Docker container image. This allows larger data manipulation libraries, such as pandas and PyArrow, to be included in the deployment package.
      2. If a large volume of data is being exported, the code may need to run for longer than the maximum possible duration, or require more memory than supported by Lambda functions. If this is the case, Step Functions can be configured to directly run a container task on Amazon Elastic Container Service (Amazon ECS).
  3. For authentication and authorization, the Splunk bearer token is securely retrieved from AWS Secrets Manager by the Lambda function before calling the Splunk /search REST API endpoint. This bearer authentication token lets users access the REST endpoint using an authenticated identity.
  4. Data retrieved by the Lambda function is transformed (if required) and uploaded to the designated S3 bucket alongside other datasets. This data is partitioned and compressed, and stored in storage and performance-optimized Apache Parquet file format.
  5. As its last step, the Step Functions state machine runs an AWS Glue crawler to infer the schema of the Splunk data residing in the S3 bucket, and catalogs it for wider consumption as tables using the AWS Glue Data Catalog.
  6. Wearable data exported from Splunk is now available to users and applications through the Data Catalog as a table. Analytics tooling such as Amazon Athena can now be used to query the data using SQL.
  7. As data stored in your AWS environment grows, it is essential to have centralized governance in place. AWS Lake Formation allows you to simplify permissions management and data sharing to maintain security and compliance.
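The orchestration in the steps above can be sketched as an Amazon States Language definition with two tasks: one that invokes the export Lambda function and one that starts the AWS Glue crawler. This is a minimal illustration, not the definition shipped in the accompanying SAM template; the state names, retry settings, function ARN, and crawler name are assumptions.

```python
import json


def build_state_machine_definition(lambda_arn: str, crawler_name: str) -> dict:
    """Return an illustrative two-step Step Functions definition."""
    return {
        "Comment": "Export Splunk data to S3, then catalog it (illustrative sketch)",
        "StartAt": "ExportSplunkData",
        "States": {
            "ExportSplunkData": {
                "Type": "Task",
                # Optimized Lambda integration; passes the execution input through.
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": lambda_arn, "Payload.$": "$"},
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 30,
                        "MaxAttempts": 2,
                        "BackoffRate": 2.0,
                    }
                ],
                "Next": "StartGlueCrawler",
            },
            "StartGlueCrawler": {
                "Type": "Task",
                # AWS SDK service integration that calls Glue StartCrawler.
                "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
                "Parameters": {"Name": crawler_name},
                "End": True,
            },
        },
    }


definition = build_state_machine_definition(
    lambda_arn="arn:aws:lambda:us-east-1:111122223333:function:splunk-export",  # hypothetical
    crawler_name="splunk-ops-data-crawler",  # hypothetical
)
print(json.dumps(definition, indent=2))
```

If the export outgrows Lambda limits, the first task could instead target an Amazon ECS task, as noted in step 2.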

An AWS Serverless Application Model (AWS SAM) template is available to deploy all AWS resources required by this solution. This template can be found in the accompanying GitHub repository.

Refer to the README file for required prerequisites, deployment steps, and the process to test the data engineering pipeline solution.

AWS AI/ML analytics workflow

After the data engineering pipeline’s Step Functions state machine successfully completes and wearables data from Splunk is accessible alongside patient healthcare data using Athena, we use an example approach based on SageMaker Canvas to drive actionable insights.

SageMaker Canvas is a no-code visual interface that empowers you to prepare data, build, and deploy highly accurate ML models, streamlining the end-to-end ML lifecycle in a unified environment. You can prepare and transform data through point-and-click interactions and natural language, powered by Amazon SageMaker Data Wrangler. You can also tap into the power of automated machine learning (AutoML) and automatically build custom ML models for regression, classification, time series forecasting, natural language processing, and computer vision, supported by Amazon SageMaker Autopilot.

In this example, we use the service to classify whether a patient is likely to be admitted to a hospital over the next 30 days based on the combined dataset.

The approach is shown in the following diagram.

The diagram shows an architecture overview of ML development. Important components of the solution are listed below.

Figure 2: Architecture overview of ML development

The solution consists of the following steps:

  1. An AWS Glue crawler crawls the data stored in the S3 bucket. The Data Catalog exposes the data found in the folder structure as tables.
  2. Athena provides a query engine to allow people and applications to interact with the tables using SQL.
  3. SageMaker Canvas uses Athena as a data source to allow the data stored in the tables to be used for ML model development.
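To make the SQL access in step 2 concrete, here is a small sketch that prepares an Athena query against one of the cataloged tables. The database name and results location are hypothetical; `splunk_ops_data` is the table name used later in this post. The boto3 call is wrapped in a function and imported lazily so that the request-building part can be read and run without AWS credentials.

```python
def build_athena_query_request(
    database: str, table: str, output_location: str, limit: int = 10
) -> dict:
    """Build keyword arguments for Athena's StartQueryExecution API."""
    return {
        "QueryString": f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_query(request: dict) -> str:
    """Submit the query and return its execution ID (requires AWS credentials)."""
    import boto3  # imported here so the builder above has no AWS dependency

    athena = boto3.client("athena")
    return athena.start_query_execution(**request)["QueryExecutionId"]


athena_request = build_athena_query_request(
    database="healthcare_db",  # hypothetical Data Catalog database
    table="splunk_ops_data",
    output_location="s3://example-athena-results/",  # hypothetical bucket
)
print(athena_request["QueryString"])
```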

Solution overview

SageMaker Canvas allows you to build a custom ML model using a dataset that you have imported. In the following sections, we demonstrate how to create, explore, and transform a sample dataset, use natural language to query the data, check for data quality, create additional steps for the data flow, and build, test, and deploy an ML model.

Prerequisites

Before proceeding, refer to Getting started with using Amazon SageMaker Canvas to make sure you have the required prerequisites in place. Specifically, validate that the AWS Identity and Access Management (IAM) role your SageMaker domain is using has a policy attached with sufficient permissions to access Athena, AWS Glue, and Amazon S3 resources.

Create the dataset

SageMaker Canvas supports Athena as a data source. Data from wearables and patient healthcare data residing across your S3 bucket is accessed using Athena and the Data Catalog. This allows this tabular data to be directly imported into SageMaker Canvas to start your ML development.

To create your dataset, complete the following steps:

  1. On the SageMaker Canvas console, choose Data Wrangler in the navigation pane.
  2. On the Import and prepare dropdown menu, choose Tabular as the dataset type to denote that the imported data consists of rows and columns.

The screenshot shows how tabular data is imported using SageMaker Data Wrangler. Tabular from the import and prepare option is highlighted.

Figure 3: Importing tabular data using SageMaker Data Wrangler

  3. For Select a data source, choose Athena.

On this page, you will see your Data Catalog database and tables listed, named patient_data and splunk_ops_data.

  4. Join (inner join) the tables together using the user_id and id fields to create one overarching dataset that can be used during ML model development.
  5. Under Import settings, enter unprocessed_data for Dataset name.
  6. Choose Import to complete the process.

The screenshot shows how tabular data is joined using SageMaker Data Wrangler. The two tables discovered from Athena are highlighted, alongside the user ID fields that are used to join the two tables together.

Figure 4: Joining data using SageMaker Data Wrangler

The combined dataset is now available to explore and transform using SageMaker Data Wrangler.
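For readers who think in SQL, the inner join performed in the console is roughly equivalent to the query below, run here against an in-memory SQLite database with toy rows. Which table carries user_id and which carries id is an assumption made for this sketch (patient_data.user_id and splunk_ops_data.id), and the column values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patient_data (user_id TEXT, age INTEGER, admitted INTEGER);
    CREATE TABLE splunk_ops_data (id TEXT, battery_level REAL);

    -- Toy rows: u3 has no device events, and u9 has no patient record.
    INSERT INTO patient_data VALUES ('u1', 71, 1), ('u2', 64, 0), ('u3', 80, 0);
    INSERT INTO splunk_ops_data VALUES
        ('u1', 42.0), ('u1', 38.5), ('u2', 90.0), ('u9', 55.0);
    """
)

# An inner join keeps only users present in BOTH tables.
joined_rows = conn.execute(
    """
    SELECT p.user_id, p.age, p.admitted, s.battery_level
    FROM patient_data AS p
    INNER JOIN splunk_ops_data AS s ON p.user_id = s.id
    ORDER BY p.user_id, s.battery_level
    """
).fetchall()

for row in joined_rows:
    print(row)
```

Because the join is an inner join, patients without any device events (and device events without a matching patient) do not appear in the combined dataset.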

Explore and transform the dataset

SageMaker Data Wrangler enables you to transform and analyze the source dataset through data flows while still maintaining a no-code approach.

The previous step automatically created a data flow in the SageMaker Canvas console which we have renamed to data_prep_data_flow.flow. Additionally, two steps are automatically generated, as listed in the following table.

Step | Name | Description
---- | ---- | -----------
1 | Athena Source | Sets the unprocessed_data dataset as source of the data flow
2 | Data types | Sets column types of unprocessed_data dataset

Before we create additional transform steps, let’s explore two SageMaker Canvas features that can help us focus on the right actions.

Use natural language to query the data

SageMaker Data Wrangler also provides generative AI capabilities called Chat for data prep powered by a large language model (LLM). This feature allows you to explore your data using natural language without any background in ML or SQL. Furthermore, any contextualized recommendations returned by the generative AI model can be introduced directly back into the data flow without writing any code.

In this section, we present some example prompts to demonstrate this in action. These examples have been selected to illustrate the art of the possible. We recommend that you experiment with different prompts to gain the best results for your particular use cases.

Example 1: Identify Splunk default fields

In this first example, we want to know whether there are Splunk default fields that we could potentially exclude from our dataset prior to ML model development.

  1. In SageMaker Data Wrangler, open your data flow.
  2. Choose Step 2 Data types, and choose Chat for data prep.
  3. In the Chat for data prep pane, you can enter prompts in natural language to explore and transform the data. For example:

Prompt: This data has been exported from a Splunk index. Which columns are likely to be Splunk default fields?
Response: _bkt, _cd, _indextime, _raw, _si, _sourcetype, _time are likely Splunk default fields based on the Summary section showing they are mostly metadata/index fields.
Prompt: OK. Let's drop _bkt, _cd, _indextime, _raw, _si, _sourcetype, and _time.
Response: The code drops the columns '_bkt', '_cd', '_indextime', '_raw', '_si', '_sourcetype' and '_time' from the dataframe df using the drop() method.

In this example, the LLM has correctly identified Splunk default fields that could be safely dropped from the dataset.

  4. Choose Add to steps to add this identified transformation to the data flow.

The screenshot shows how SageMaker Data Wrangler’s chat for data prep is used to identify Splunk’s default fields. The conversation has been captured above the screenshot. The Add to steps button is highlighted in the screenshot.

Figure 5: Using SageMaker Data Wrangler’s chat for data prep to identify Splunk’s default fields
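The transformation that Chat for data prep generates is not reproduced in this post, but its effect can be sketched without any dependencies. The following assumes the exported events are available as a list of dictionaries; the sample row is invented. With pandas, the rough equivalent would be a single `df.drop(columns=[...])` call.

```python
# Fields named in the chat response above.
SPLUNK_DEFAULT_FIELDS = {
    "_bkt", "_cd", "_indextime", "_raw", "_si", "_sourcetype", "_time",
}


def drop_columns(rows: list[dict], columns: set[str]) -> list[dict]:
    """Return copies of the rows without the named columns."""
    return [
        {name: value for name, value in row.items() if name not in columns}
        for row in rows
    ]


# Invented sample event for illustration only.
events = [
    {
        "user_id": "u1",
        "battery_level": 42.0,
        "_time": "2024-01-01T00:00:00",
        "_raw": "battery=42",
        "_si": "idx1",
    }
]

cleaned = drop_columns(events, SPLUNK_DEFAULT_FIELDS)
print(cleaned)
```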

Example 2: Identify additional columns that could be dropped

We now want to identify any further columns that could be dropped without being too specific about what we’re looking for. We want the LLM to make the suggestions based on the data, and provide us with the rationale. For example:

Prompt: List out the names of other columns that we should drop before ML development. Give me a list of columns, along with the reasons why. 
Response: The code drops the timestamp, punct, id, index and linecount columns from the dataset. These columns do not contain useful information for machine learning and would add noise to the models.

In addition to the Splunk default fields identified earlier, the generative AI model is now proposing the removal of columns such as timestamp, punct, id, index, and linecount that don’t appear to be conducive to ML model development.

The screenshot shows how SageMaker Data Wrangler’s chat for data prep is used to identify additional fields that can be dropped. Conversation has been captured above the screenshot.

Figure 6: Using SageMaker Data Wrangler’s chat for data prep to identify additional fields that can be dropped

Example 3: Calculate average age column in dataset

You can also use the generative AI model to perform Text2SQL tasks in which you can simply ask questions of the data using natural language. This is useful if you want to validate the content of the dataset.

In this example, we want to know what the average patient age value is within the dataset:

Prompt: What is the average age column of the dataset? 
Response: The average age in the dataset is 67 years.

By expanding View code, you can see what SQL statements the LLM has constructed using its Text2SQL capabilities. This gives you full visibility into how the results are being returned.

The screenshot shows how SageMaker Data Wrangler’s chat for data prep is used to run SQL statements. Under view code, the screenshot shows SELECT AVG(age) FROM df; which is the SQL statement chat for data prep has returned.

Figure 7: Using SageMaker Data Wrangler’s chat for data prep to run SQL statements
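The generated statement is ordinary SQL, so it can be checked by hand on a small sample. The sketch below runs the same `SELECT AVG(age) FROM df;` statement against an in-memory SQLite table named df; the three ages are toy values, not the dataset used in this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (age INTEGER)")
# Toy values for illustration only.
conn.executemany("INSERT INTO df VALUES (?)", [(58,), (70,), (79,)])

# Same statement that Chat for data prep shows under View code.
(average_age,) = conn.execute("SELECT AVG(age) FROM df;").fetchone()
print(average_age)
```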

Check for data quality

SageMaker Canvas also provides exploratory data analysis (EDA) capabilities that allow you to gain deeper insights into the data prior to the ML model build step. With EDA, you can generate visualizations and analyses to validate whether you have the right data, and whether your ML model build is likely to yield results that are aligned to your organization’s expectations.

Example 1: Create a Data Quality and Insights Report

Complete the following steps to create a Data Quality and Insights Report:

  1. While in the data flow step, choose the Analyses tab.
  2. For Analysis type, choose Data Quality and Insights Report.
  3. For Target column, choose admitted.
  4. For Problem type, choose Classification.

This performs an analysis of the data that you have and provides information such as the number of missing values and outliers.

The screenshot shows how SageMaker Data Wrangler’s data quality and insights report is used to perform analysis of the data. It shows a summary of dataset characteristics, such as number of features, number of rows, missing values, duplicated rows and data validity.

Figure 8: Running SageMaker Data Wrangler’s data quality and insights report

Refer to Get Insights On Data and Data Quality for details on how to interpret the results of this report.

Example 2: Create a Quick Model

In this second example, choose Quick Model for Analysis type and for Target column, choose admitted. The Quick Model estimates the expected predicted quality of the model.

By running the analysis, the estimated F1 score (a measure of predictive performance) of the model and feature importance scores are displayed.

The screenshot shows how SageMaker Data Wrangler’s quick model feature is used to assess the potential accuracy of the model. It has determined that the model achieved an F1 score of 0.76, and that systolic blood pressure, average end of day device battery level, average number of days device is active and age values all have an impact on the hospital admission prediction.

Figure 9: Running SageMaker Data Wrangler’s quick model feature to assess the potential accuracy of the model
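For reference, the F1 score is the harmonic mean of precision and recall. The short sketch below computes it from confusion-matrix counts; the counts are hypothetical and are not taken from the Quick Model run shown above.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Compute F1 as the harmonic mean of precision and recall."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives

    precision = true_positives / predicted_positive if predicted_positive else 0.0
    recall = true_positives / actual_positive if actual_positive else 0.0

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 30 admissions correctly flagged, 10 false alarms,
# and 20 admissions that the model missed.
score = f1_score(true_positives=30, false_positives=10, false_negatives=20)
print(round(score, 3))
```

Because F1 ignores true negatives, it can sit well below plain accuracy when most patients are not admitted, which is why both metrics are worth reviewing.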

SageMaker Canvas supports many other analysis types. By reviewing these analyses in advance of your ML model build, you can continue to engineer the data and features to gain sufficient confidence that the ML model will meet your business objectives.

Create additional steps in the data flow

In this example, we have decided to update our data_prep_data_flow.flow data flow to implement additional transformations. The following table summarizes these steps.

Step | Transform | Description
---- | --------- | -----------
3 | Chat for data prep | Removes Splunk default fields identified.
4 | Chat for data prep | Removes additional fields identified as being unhelpful to ML model development.
5 | Group by | Groups together the rows by user_id and calculates an average of time-ordered numerical fields from Splunk. This is performed to convert the ML problem type from time series forecasting into a simple two-category prediction of target feature (admitted) using averages of the input values over a given time period. Alternatively, SageMaker Canvas also supports time series forecasting.
6 | Drop column (manage columns) | Drops remaining columns that are unnecessary for our ML development, such as columns with high cardinality (for example, user_id).
7 | Parse column as type | Converts numerical value types, for example from Float to Long. This is performed to make sure values, such as those in unit of days, remain integers after calculations.
8 | Parse column as type | Converts additional columns that need to be parsed (each column requires a separate step).
9 | Drop duplicates (manage rows) | Drops duplicate rows to avoid overfitting.
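The Group by transform in step 5 is the one that reshapes the data most, so here is a dependency-free sketch of what it does: time-ordered device events are collapsed into one row per user_id, with each numeric field replaced by its average. The field name battery_level, the avg_ prefix, and the event values are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_by_user(rows: list[dict], key: str, numeric_fields: list[str]) -> list[dict]:
    """Collapse many rows per key into one row of per-field averages."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)

    return [
        {
            key: group_key,
            **{
                f"avg_{field}": mean(row[field] for row in group_rows)
                for field in numeric_fields
            },
        }
        for group_key, group_rows in grouped.items()
    ]


# Invented time-ordered events: two readings for u1, one for u2.
device_events = [
    {"user_id": "u1", "battery_level": 40.0},
    {"user_id": "u1", "battery_level": 60.0},
    {"user_id": "u2", "battery_level": 90.0},
]

per_user = average_by_user(device_events, key="user_id", numeric_fields=["battery_level"])
print(per_user)
```

After this step each patient contributes a single row, which is what allows the problem to be treated as a two-category prediction rather than a time series forecast.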

To create a new transform, view the data flow, then choose Add transform on the last step.

The screenshot shows how a transform can be added to a data flow in SageMaker Data Wrangler. The add transform option on the final step is highlighted.

Figure 10: Using SageMaker Data Wrangler to add a transform to a data flow

Choose Add transform, and proceed to choose a transform type and its configuration.

The screenshot shows how a transform can be added to a data flow in SageMaker Data Wrangler. The add transform option on the final step is highlighted.

Figure 11: Using SageMaker Data Wrangler to add a transform to a data flow

The following screenshot shows our newly updated end-to-end data flow featuring multiple steps. In this example, we ran the analyses at the end of the data flow.

The screenshot shows the end-to-end data flow in SageMaker Data Wrangler. The steps shown in the data flow are described in the table above.

Figure 12: Showing the end-to-end SageMaker Canvas Data Wrangler data flow

If you want to incorporate this data flow into a productionized ML workflow, SageMaker Canvas can create a Jupyter notebook that exports your data flow to Amazon SageMaker Pipelines.

Develop the ML model

To get started with ML model development, complete the following steps:

  1. Choose Create model directly from the last step of the data flow.

The screenshot shows how a model is created from the data flow in SageMaker Data Wrangler. The Create model option is highlighted on the final data flow step.

Figure 13: Creating a model from the SageMaker Data Wrangler data flow

  2. For Dataset name, enter a name for your transformed dataset (for example, processed_data).
  3. Choose Export.

The screenshot shows how the exported dataset is named in SageMaker Data Wrangler. A name, processed_data, is being entered into the dataset name field.

Figure 14: Naming the exported dataset to be used by the model in SageMaker Data Wrangler

This step will automatically create a new dataset.

  4. After the dataset has been created successfully, choose Create model to begin the ML model creation.

The screenshot shows how the model is then created from the exported dataset using SageMaker Data Wrangler. The create model link at the bottom of the screen is highlighted.

Figure 15: Creating the model in SageMaker Data Wrangler

  5. For Model name, enter a name for the model (for example, my_healthcare_model).
  6. For Problem type, select Predictive analysis.
  7. Choose Create.

The screenshot shows how the model is named and the predictive analysis type is selected in SageMaker Canvas. The model name my_healthcare_model is being entered, and the predictive analysis option is being selected.

Figure 16: Naming the model in SageMaker Canvas and selecting the predictive analysis type

You are now ready to progress through the Build, Analyze, Predict, and Deploy stages to develop and operationalize the ML model using SageMaker Canvas.

  1. On the Build tab, for Target column, choose the column you want to predict (admitted).
  2. Choose Quick build to build the model.

The Quick build option has a shorter build time, but the Standard build option generally produces a more accurate model.

The screenshot shows how the target column to predict for the model is selected in SageMaker Canvas. Field admitted has been chosen in the target column drop-down. The quick build button is highlighted.

Figure 17: Selecting the target column to predict in SageMaker Canvas

After a few minutes, on the Analyze tab, you will be able to view the accuracy of the model, along with column impact, scoring, and other advanced metrics. For example, we can see that a feature from the wearables data captured in Splunk—average_num_days_device_active—has a strong impact on whether the patient is likely to be admitted or not, along with their age. As such, the health-tech company may proactively reach out to elderly patients who tend to keep their wearables off to minimize the risk of their hospitalization.

The screenshot shows how the results from the model quick build are displayed in SageMaker Canvas. For the specific column impact selected, it shows that there is a strong correlation between the average number of days a device has been active and the probability of the patient’s admission. Model accuracy is 82% with an F1 score of 0.609.

Figure 18: Displaying the results from the model quick build in SageMaker Canvas

When you’re happy with the results from the Quick build, repeat the process with a Standard build to make sure you have an ML model with higher accuracy that can be deployed.

Test the ML model

Our ML model has now been built. If you’re satisfied with its accuracy, you can use it to make predictions on new data on the Predict tab. Predictions can be made either in batch (a list of patients) or for a single entry (one patient).

Experiment with different values and choose Update prediction. The ML model will respond with a prediction for the new values that you have entered.

In this example, the ML model has identified a 64.5% probability that this particular patient will be admitted to hospital in the next 30 days. The health-tech company will likely want to prioritize the care of this patient.

The screenshot shows how the results from a single prediction using the developed model are displayed in SageMaker Canvas. A prediction has been made for an 88-year-old patient. The model has returned a 64.487% probability that they will be admitted to hospital.

Figure 19: Displaying the results from a single prediction using the model in SageMaker Canvas

Deploy the ML model

It is now possible for the health-tech company to build applications that can use this ML model to make predictions. ML models developed in SageMaker Canvas can be operationalized using a broader set of SageMaker services.

To deploy the ML model, complete the following steps:

  1. On the Deploy tab, choose Create Deployment.
  2. Specify Deployment name, Instance type, and Instance count.
  3. Choose Deploy to make the ML model available as a SageMaker endpoint.

In this example, we reduced the instance type to ml.m5.4xlarge and instance count to 1 before deployment.

The screenshot shows how the developed model is deployed using SageMaker Canvas. The ml.m5.4xlarge instance type with an instance count of 1 has been selected.

Figure 20: Deploying the model using SageMaker Canvas

At any time, you can directly test the endpoint from SageMaker Canvas on the Test deployment tab of the deployed endpoint listed under Operations on the SageMaker Canvas console.
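An application could call the deployed endpoint along the following lines. This is a hedged sketch: it assumes the endpoint accepts a single CSV row of feature values in the same column order as the training dataset, and the endpoint name and feature values are hypothetical. The boto3 call is kept inside a function and imported lazily so that the payload helper can be run without AWS credentials.

```python
import csv
import io


def build_csv_payload(feature_values: list) -> str:
    """Serialize one record as a single CSV line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(feature_values)
    return buffer.getvalue()


def predict(endpoint_name: str, feature_values: list) -> str:
    """Send one record to a SageMaker endpoint (requires AWS credentials)."""
    import boto3  # imported here so the helper above has no AWS dependency

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=build_csv_payload(feature_values),
    )
    return response["Body"].read().decode("utf-8")


# Hypothetical feature values; the real order must match the model's input columns.
payload = build_csv_payload([88, 3, 0, 142.5, 85.0])
print(payload, end="")
```

A call such as `predict("my-healthcare-endpoint", [88, 3, 0, 142.5, 85.0])` would then return the endpoint's raw response for that patient, where the endpoint name is again a placeholder.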

Refer to the Amazon SageMaker Canvas Developer Guide for detailed steps to take your ML model development through its full development lifecycle and build applications that can consume the ML model to make predictions.

Clean up

Refer to the instructions in the README file to clean up the resources provisioned for the AWS data engineering pipeline solution.

SageMaker Canvas bills you for the duration of the session, and we recommend logging out of SageMaker Canvas when you are not using it. Refer to Logging out of Amazon SageMaker Canvas for more details. Furthermore, if you deployed a SageMaker endpoint, make sure you have deleted it.

Conclusion

This post explored a no-code approach involving SageMaker Canvas that can drive actionable insights from data stored across both Splunk and AWS platforms using AI/ML techniques. We also demonstrated how you can use the generative AI capabilities of SageMaker Canvas to speed up your data exploration and build ML models that are aligned to your business’s expectations.

Learn more about AI on Splunk and ML on AWS.


About the Authors

Alan Peaty

Alan Peaty is a Senior Partner Solutions Architect, helping Global Systems Integrators (GSIs), Global Independent Software Vendors (GISVs), and their customers adopt AWS services. Prior to joining AWS, Alan worked as an architect at systems integrators such as IBM, Capita, and CGI. Outside of work, Alan is a keen runner who loves to hit the muddy trails of the English countryside, and is an IoT enthusiast.

Brett Roberts

Brett Roberts is the Global Partner Technical Manager for AWS at Splunk, leading the technical strategy to help customers better secure and monitor their critical AWS environments and applications using Splunk. Brett was a member of the Splunk Trust and holds several Splunk and AWS certifications. Additionally, he co-hosts a community podcast and blog called Big Data Beard, exploring trends and technologies in the analytics and AI space.

Arnaud Lauer

Arnaud Lauer is a Principal Partner Solutions Architect in the Public Sector team at AWS. He enables partners and customers to understand how to best use AWS technologies to translate business needs into solutions. He brings more than 18 years of experience in delivering and architecting digital transformation projects across a range of industries, including public sector, energy, and consumer goods.

AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition

*Work done during internship at Apple
Audio-visual speech contains synchronized audio and visual information that provides cross-modal supervision to learn representations for both automatic speech recognition (ASR) and visual speech recognition (VSR). We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL), a semi-supervised method to train an audio-visual speech recognition (AVSR) model on a combination of labeled and unlabeled videos with continuously regenerated pseudo-labels. Our models are trained for speech recognition from audio-visual inputs and can… (Apple Machine Learning Research)

APE: Active Prompt Engineering – Identifying Informative Few-Shot Examples for LLMs

Prompt engineering is an iterative procedure that often requires extensive manual efforts to formulate suitable instructions for effectively directing large language models (LLMs) in specific tasks. Incorporating few-shot examples is a vital and efficacious approach to provide LLMs with precise and tangible instructions, leading to improved LLM performance. Nonetheless, identifying the most informative demonstrations for LLMs is labor-intensive, frequently entailing sifting through an extensive search space. In this demonstration, we showcase an interactive tool called APE (Active Prompt…Apple Machine Learning Research

ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

Recent large language models (LLMs) advancements sparked a growing research interest in tool assisted LLMs solving real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused on either evaluating over stateless web services (RESTful API), based on a single turn user prompt, or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation and a dynamic evaluation strategy for intermediate and final…Apple Machine Learning Research

How Deltek uses Amazon Bedrock for question and answering on government solicitation documents

This post is co-written by Kevin Plexico and Shakun Vohra from Deltek.

Question and answering (Q&A) using documents is a commonly used application in various use cases like customer support chatbots, legal research assistants, and healthcare advisors. Retrieval Augmented Generation (RAG) has emerged as a leading method for using the power of large language models (LLMs) to interact with documents in natural language.

This post provides an overview of a custom solution developed by the AWS Generative AI Innovation Center (GenAIIC) for Deltek, a globally recognized standard for project-based businesses in both government contracting and professional services. Deltek serves over 30,000 clients with industry-specific software and information solutions.

In this collaboration, the AWS GenAIIC team created a RAG-based solution for Deltek to enable Q&A on single and multiple government solicitation documents. The solution uses AWS services including Amazon Textract, Amazon OpenSearch Service, and Amazon Bedrock. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) and LLMs from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Deltek is continuously working on enhancing this solution to better align it with their specific requirements, such as supporting file formats beyond PDF and implementing more cost-effective approaches for their data ingestion pipeline.

What is RAG?

RAG is a process that optimizes the output of LLMs by allowing them to reference authoritative knowledge bases outside of their training data sources before generating a response. This approach addresses some of the challenges associated with LLMs, such as presenting false, outdated, or generic information, or creating inaccurate responses due to terminology confusion. RAG enables LLMs to generate more relevant, accurate, and contextual responses by cross-referencing an organization’s internal knowledge base or specific domains, without the need to retrain the model. It provides organizations with greater control over the generated text output and offers users insights into how the LLM generates the response, making it a cost-effective approach to improve the capabilities of LLMs in various contexts.

The main challenge

Applying RAG for Q&A on a single document is straightforward, but applying the same across multiple related documents poses some unique challenges. For example, when using question answering on documents that evolve over time, it is essential to consider the chronological sequence of the documents if the question is about a concept that has transformed over time. Not considering the order could result in providing an answer that was accurate at a past point but is now outdated based on more recent information across the collection of temporally aligned documents. Properly handling temporal aspects is a key challenge when extending question answering from single documents to sets of interlinked documents that progress over the course of time.

Solution overview

As an example use case, we describe Q&A on two temporally related documents: a long draft request-for-proposal (RFP) document, and a related subsequent government response to a request-for-information (RFI response), providing additional and revised information.

The solution develops a RAG approach in two steps.

The first step is data ingestion, as shown in the following diagram. This includes a one-time processing of PDF documents. The application component here is a user interface with minor processing such as splitting text and calling the services in the background. The steps are as follows:

  1. The user uploads documents to the application.
  2. The application uses Amazon Textract to get the text and tables from the input documents.
  3. The text embedding model processes the text chunks and generates embedding vectors for each text chunk.
  4. The embedding representations of text chunks along with related metadata are indexed in OpenSearch Service.

The second step is Q&A, as shown in the following diagram. In this step, the user asks a question about the ingested documents and expects a response in natural language. The application component here is a user interface with minor processing such as calling different services in the background. The steps are as follows:

  1. The user asks a question about the documents.
  2. The application retrieves an embedding representation of the input question. The embedding vector maps the question from text to a space of numeric representations.
  3. The application uses the question embedding to perform a semantic search on OpenSearch Service and find relevant text chunks from the documents (also called context).
  4. The question and context are combined and fed as a prompt to the LLM. The language model generates a natural language response to the user’s question.

We used Amazon Textract in our solution, which can convert PDFs, PNGs, JPEGs, and TIFFs into machine-readable text. It also formats complex structures like tables for easier analysis. In the following sections, we provide an example to demonstrate Amazon Textract’s capabilities.

OpenSearch is an open source and distributed search and analytics suite derived from Elasticsearch. It uses a vector database structure to efficiently store and query large volumes of data. OpenSearch Service currently has tens of thousands of active customers with hundreds of thousands of clusters under management processing hundreds of trillions of requests per month. We used OpenSearch Service and its underlying vector database to do the following:

  • Index documents into the vector space, allowing related items to be located in proximity for improved relevancy
  • Quickly retrieve related document chunks at the question answering step using approximate nearest neighbor search across vectors

The vector database inside OpenSearch Service enabled efficient storage and fast retrieval of related data chunks to power our question answering system. By modeling documents as vectors, we could find relevant passages even without explicit keyword matches.
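To make the idea of vector-based retrieval concrete, the following is a minimal, self-contained sketch of an exact cosine-similarity search over a tiny in-memory index. It is not the OpenSearch Service implementation; the three-dimensional vectors and chunk texts are made-up toy data, and top_k_chunks is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(question_vec, indexed_chunks, k=2):
    """Return the k (score, text_chunk) pairs most similar to the question."""
    scored = [
        (cosine_similarity(question_vec, chunk["embedding_vector"]), chunk["text_chunk"])
        for chunk in indexed_chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index: made-up 3-dimensional vectors stand in for real embeddings.
toy_index = [
    {"text_chunk": "Chunk about project sizes (draft RFP)", "embedding_vector": [0.9, 0.1, 0.0]},
    {"text_chunk": "Chunk about submission deadlines", "embedding_vector": [0.0, 0.2, 0.9]},
    {"text_chunk": "Chunk about project sizes (RFI response)", "embedding_vector": [0.8, 0.3, 0.1]},
]

question_vec = [1.0, 0.2, 0.0]  # made-up embedding of a question about project sizes
results = top_k_chunks(question_vec, toy_index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Both chunks about project sizes rank above the unrelated chunk even though the ranking never compares keywords, only vector directions.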

Text embedding models are machine learning (ML) models that map words or phrases from text to dense vector representations. Text embeddings are commonly used in information retrieval systems like RAG for the following purposes:

  • Document embedding – Embedding models are used to encode the document content and map them to an embedding space. It is common to first split a document into smaller chunks such as paragraphs, sections, or fixed size chunks.
  • Query embedding – User queries are embedded into vectors so they can be matched against document chunks by performing semantic search.

For this post, we used the Amazon Titan model, Amazon Titan Embeddings G1 – Text v1.2, which accepts up to 8,000 tokens and outputs a numerical vector of 1,536 dimensions. The model is available through Amazon Bedrock.
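As a rough sketch of how an embedding could be requested, the following helper calls invoke_model on a Bedrock runtime client that is passed in as an argument. The model ID amazon.titan-embed-text-v1 and the inputText / embedding field names are our assumptions about the Titan text embeddings interface and should be checked against the current Amazon Bedrock documentation. FakeBedrockRuntime is a hypothetical offline stand-in so the example runs without AWS credentials.

```python
import io
import json

# Assumed model ID for Amazon Titan Embeddings G1 - Text; verify before use.
TITAN_EMBED_MODEL_ID = "amazon.titan-embed-text-v1"


def embed_text(bedrock_runtime, text):
    """Return the embedding vector for `text` using the given runtime client."""
    response = bedrock_runtime.invoke_model(
        modelId=TITAN_EMBED_MODEL_ID,
        body=json.dumps({"inputText": text}),
        accept="application/json",
        contentType="application/json",
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]


class FakeBedrockRuntime:
    """Hypothetical stand-in that mimics the response shape with a 3-value vector."""

    def invoke_model(self, modelId, body, accept, contentType):
        text = json.loads(body)["inputText"]
        fake = {"embedding": [float(len(text)), 0.0, 1.0]}
        return {"body": io.BytesIO(json.dumps(fake).encode("utf-8"))}


# With real credentials, the client would instead come from
# boto3.client("bedrock-runtime") and the vector would have 1,536 dimensions.
vector = embed_text(FakeBedrockRuntime(), "What are the new project sizes?")
print(len(vector))
```

Passing the client in as a parameter keeps the helper easy to test and lets the same function serve both document chunks and user questions.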

Amazon Bedrock provides ready-to-use FMs from top AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon. It offers a single interface to access these models and build generative AI applications while maintaining privacy and security. We used Anthropic Claude v2 on Amazon Bedrock to generate natural language answers given a question and a context.

In the following sections, we look at the two stages of the solution in more detail.

Data ingestion

First, the draft RFP and RFI response documents are processed to be used at the Q&A time. Data ingestion includes the following steps:

  1. Documents are passed to Amazon Textract to be converted into text.
  2. To better enable our language model to answer questions about tables, we created a parser that converts tables from the Amazon Textract output into CSV format. Transforming tables into CSV improves the model’s comprehension. For instance, the following figures show part of an RFI response document in PDF format, followed by its corresponding extracted text. In the extracted text, the table has been converted to CSV format and sits among the rest of the text.
  3. For long documents, the extracted text may exceed the LLM’s input size limitation. In these cases, we can divide the text into smaller, overlapping chunks. The chunk sizes and overlap proportions may vary depending on the use case. We apply section-aware chunking (that is, we perform chunking independently on each document section), which we discuss in our example use case later in this post.
  4. Some classes of documents may follow a standard layout or format. This structure can be used to optimize data ingestion. For example, RFP documents tend to have a certain layout with defined sections. Using the layout, each document section can be processed independently. Also, if a table of contents exists but is not relevant, it can potentially be removed. We provide a demonstration of detecting and using document structure later in this post.
  5. The embedding vector for each text chunk is retrieved from an embedding model.
  6. At the last step, the embedding vectors are indexed into an OpenSearch Service database. In addition to the embedding vector, the text chunk and document metadata such as document name, document section name, or document release date are also added to the index as text fields. The document release date is useful metadata when documents are related chronologically, so that the LLM can identify the most updated information. The following code snippet shows the index body:
index_body = {
    "embedding_vector": <embedding vector of a text chunk>,
    "text_chunk": <text chunk>,
    "document_name": <document name>,
    "section_name": <document section name>,
    "release_date": <document release date>,
    # more metadata can be added
}
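The table-to-CSV conversion described in step 2 of the ingestion list can be sketched as follows. This is not the parser built for Deltek; it is a minimal illustration that assumes a Textract-style list of blocks in which TABLE blocks reference CELL blocks and CELL blocks reference WORD blocks through CHILD relationships. The sample blocks are made-up toy data, and merged cells and selection elements are ignored.

```python
import csv
import io


def table_to_csv(table_block, blocks_by_id):
    """Convert one TABLE block into CSV text using its CELL and WORD children."""
    cells = {}
    for rel in table_block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cell_id in rel["Ids"]:
            cell = blocks_by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = []
            for cell_rel in cell.get("Relationships", []):
                if cell_rel["Type"] == "CHILD":
                    words += [
                        blocks_by_id[word_id]["Text"]
                        for word_id in cell_rel["Ids"]
                        if blocks_by_id[word_id]["BlockType"] == "WORD"
                    ]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
    if not cells:
        return ""
    n_rows = max(row for row, _ in cells)
    n_cols = max(col for _, col in cells)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in range(1, n_rows + 1):
        writer.writerow([cells.get((row, col), "") for col in range(1, n_cols + 1)])
    return buffer.getvalue()


def child(ids):
    """Build a CHILD relationship list (helper for the toy data only)."""
    return [{"Type": "CHILD", "Ids": ids}]


# Toy blocks (made-up data) in a simplified Textract-like shape.
blocks = [
    {"Id": "t1", "BlockType": "TABLE", "Relationships": child(["c1", "c2", "c3", "c4"])},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1, "Relationships": child(["w1", "w2"])},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2, "Relationships": child(["w3"])},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1, "Relationships": child(["w4"])},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2, "Relationships": child(["w5"])},
    {"Id": "w1", "BlockType": "WORD", "Text": "Project"},
    {"Id": "w2", "BlockType": "WORD", "Text": "size"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Points"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Large"},
    {"Id": "w5", "BlockType": "WORD", "Text": "10"},
]
blocks_by_id = {block["Id"]: block for block in blocks}
csv_text = table_to_csv(blocks_by_id["t1"], blocks_by_id)
print(csv_text)
```

The resulting CSV text can be placed inline among the rest of the extracted text before chunking.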

Q&A

In the Q&A phase, users can submit a natural language question about the draft RFP and RFI response documents ingested in the previous step. First, semantic search is used to retrieve text chunks relevant to the user’s question. Then, the question is augmented with the retrieved context to create a prompt. Finally, the prompt is sent to Amazon Bedrock for an LLM to generate a natural language response. The detailed steps are as follows:

  1. An embedding representation of the input question is retrieved from the Amazon Titan embedding model on Amazon Bedrock.
  2. The question’s embedding vector is used to perform semantic search on OpenSearch Service and find the top K relevant text chunks. The following is an example of a search body passed to OpenSearch Service. For more details see the OpenSearch documentation on structuring a search query.
search_body = {
    "size": top_K,
    "query": {
        "script_score": {
            "query": {
                "match_all": {}, # skip full text search
            },
            "script": {
                "lang": "knn",
                "source": "knn_score",
                "params": {
                    "field": "embedding_vector",
                    "query_value": question_embedding,
                    "space_type": "cosinesimil"
                }
            }
        }
    }
}

  3. Any retrieved metadata, such as section name or document release date, is used to enrich the text chunks and provide more information to the LLM, such as the following:
    def opensearch_result_to_context(os_res: dict) -> str:
        """
        Convert OpenSearch result to context
        Args:
        os_res (dict): Amazon OpenSearch results
        Returns:
        context (str): Context to be included in LLM's prompt
        """
        data = os_res["hits"]["hits"]
        context = []
        for item in data:
            text = item["_source"]["text_chunk"]
            doc_name = item["_source"]["document_name"]
            section_name = item["_source"]["section_name"]
            release_date = item["_source"]["release_date"]
            context.append(
                f"<<Context>>: [Document name: {doc_name}, Section name: {section_name}, Release date: {release_date}] {text}"
            )
        context = "\n \n ------ \n \n".join(context)
        return context

  4. The input question is combined with retrieved context to create a prompt. In some cases, depending on the complexity or specificity of the question, an additional chain-of-thought (CoT) prompt may need to be added to the initial prompt in order to provide further clarification and guidance to the LLM. The CoT prompt is designed to walk the LLM through the logical steps of reasoning and thinking that are required to properly understand the question and formulate a response. It lays out a type of internal monologue or cognitive path for the LLM to follow in order to comprehend the key information within the question, determine what kind of response is needed, and construct that response in an appropriate and accurate way. We use the following CoT prompt for this use case:
"""
Context below includes a few paragraphs from draft RFP and RFI response documents:

Context: {context}

Question: {question}

Think step by step:

1- Find all the paragraphs in the context that are relevant to the question.
2- Sort the paragraphs by release date.
3- Use the paragraphs to answer the question.

Note: Pay attention to the updated information based on the release dates.
"""
  5. The prompt is passed to an LLM on Amazon Bedrock to generate a response in natural language. We use the following inference configuration for the Anthropic Claude v2 model on Amazon Bedrock. The temperature parameter is usually set to zero for reproducibility and to reduce the risk of LLM hallucination. For regular RAG applications, top_k and top_p are usually set to 250 and 1, respectively. Set max_tokens_to_sample to the maximum number of tokens expected to be generated (1 token is approximately 3/4 of a word). See Inference parameters for more details.
{
    "temperature": 0,
    "top_k": 250,
    "top_p": 1,
    "max_tokens_to_sample": 300,
    "stop_sequences": ["\n\nHuman:\n\n"]
}
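The prompt template and inference configuration above can be combined into a single request body. The following is a sketch under stated assumptions rather than the production code: build_claude_v2_request is a hypothetical helper, and the Human/Assistant prompt wrapping and the anthropic.claude-v2 model ID reflect our understanding of the Claude v2 text completion format on Amazon Bedrock, which should be verified against the current documentation.

```python
import json

# CoT prompt template from this post, with placeholders for str.format.
PROMPT_TEMPLATE = """Context below includes a few paragraphs from draft RFP and RFI response documents:

Context: {context}

Question: {question}

Think step by step:

1- Find all the paragraphs in the context that are relevant to the question.
2- Sort the paragraphs by release date.
3- Use the paragraphs to answer the question.

Note: Pay attention to the updated information based on the release dates.
"""


def build_claude_v2_request(context, question, max_tokens=300):
    """Return a JSON request body combining the CoT prompt and inference settings."""
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    body = {
        # Assumed Claude v2 text completion framing: Human turn, then Assistant.
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "temperature": 0,
        "top_k": 250,
        "top_p": 1,
        "max_tokens_to_sample": max_tokens,
        "stop_sequences": ["\n\nHuman:\n\n"],
    }
    return json.dumps(body)


request_body = build_claude_v2_request(
    context="<<Context>>: [Document name: ..., Release date: ...] ...",
    question="Have the original scoring evaluations changed?",
)
print(request_body[:80])

# With AWS credentials, the body could then be sent with something like:
#   client = boto3.client("bedrock-runtime")
#   response = client.invoke_model(modelId="anthropic.claude-v2", body=request_body)
#   answer = json.loads(response["body"].read())["completion"]
```

Keeping prompt construction separate from the network call makes it straightforward to inspect or log the exact prompt sent to the model.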

Example use case

As a demonstration, we describe an example of Q&A on two related documents: a draft RFP document in PDF format with 167 pages, and an RFI response document in PDF format with 6 pages released later, which includes additional information and updates to the draft RFP.

The following is an example question asking if the project size requirements have changed, given the draft RFP and RFI response documents:

Have the original scoring evaluations changed? If yes, what are the new project sizes?

The following figure shows the relevant sections of the draft RFP document that contain the answers.

The following figure shows the relevant sections of the RFI response document that contain the answers.

For the LLM to generate the correct response, the retrieved context from OpenSearch Service should contain the tables shown in the preceding figures, and the LLM should be able to infer the order of the retrieved contents from metadata, such as release dates, and generate a readable response in natural language.

The following are the data ingestion steps:

  1. The draft RFP and RFI response documents are uploaded to Amazon Textract to extract text and tables as the content. Additionally, we used regular expressions to identify document sections and the table of contents (see the following figures, respectively). The table of contents can be removed for this use case because it doesn’t have any relevant information.

  2. We split each document section independently into smaller chunks with some overlaps. For this use case, we used a chunk size of 500 tokens with an overlap size of 100 tokens (1 token is approximately 3/4 of a word). We used a BPE tokenizer, where each token corresponds to about 4 bytes.
  3. An embedding representation of each text chunk is obtained using the Amazon Titan Embeddings G1 – Text v1.2 model on Amazon Bedrock.
  4. Each text chunk is stored into an OpenSearch Service index along with metadata such as section name and document release date.
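The section detection and overlapping chunking in steps 1 and 2 can be sketched as follows. This is an illustrative approximation, not the code used in the solution: the SECTION_HEADING pattern is a hypothetical regular expression (real RFP headings will differ), and whitespace-separated words stand in for BPE tokens so the example has no external dependencies.

```python
import re

# Hypothetical heading pattern; real documents need their own expression.
SECTION_HEADING = re.compile(r"^(SECTION[ \t]+[A-Z0-9.]+.*)$", re.MULTILINE)


def split_sections(text):
    """Return (section_name, section_text) pairs based on heading matches."""
    matches = list(SECTION_HEADING.finditer(text))
    if not matches:
        return [("", text)]
    sections = []
    preamble = text[: matches[0].start()].strip()
    if preamble:
        sections.append(("preamble", preamble))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((match.group(1).strip(), text[match.end():end].strip()))
    return sections


def chunk_tokens(tokens, chunk_size=500, overlap=100):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the section
    return chunks


def section_aware_chunks(text, chunk_size=500, overlap=100):
    """Chunk each section independently so no chunk spans two sections."""
    records = []
    for name, body in split_sections(text):
        tokens = body.split()  # stand-in for BPE tokenization
        for chunk in chunk_tokens(tokens, chunk_size, overlap):
            records.append({"section_name": name, "text_chunk": " ".join(chunk)})
    return records


# Toy document with two sections; small sizes make the overlap visible.
toy_text = (
    "SECTION A Scope\n" + " ".join(f"a{i}" for i in range(12)) + "\n"
    "SECTION B Evaluation\n" + " ".join(f"b{i}" for i in range(5))
)
records = section_aware_chunks(toy_text, chunk_size=5, overlap=2)
for record in records:
    print(record["section_name"], "|", record["text_chunk"])
```

Because each section is chunked on its own, every chunk carries exactly one section name, which is the metadata stored alongside it in the index.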

The Q&A steps are as follows:

  1. The input question is first transformed to a numeric vector using the embedding model. The vector representation is used for semantic search and retrieval of relevant context in the next step.
  2. The top K relevant text chunks and their metadata are retrieved from OpenSearch Service.
  3. The opensearch_result_to_context function and the prompt template (defined earlier) are used to create the prompt given the input question and retrieved context.
  4. The prompt is sent to the LLM on Amazon Bedrock to generate a response in natural language. The following is the response generated by Anthropic Claude v2, which matched the information presented in the draft RFP and RFI response documents. The question was “Have the original scoring evaluations changed? If yes, what are the new project sizes?” Using CoT prompting, the model can correctly answer the question.

Key features

The solution contains the following key features:

  • Section-aware chunking – Identify document sections and split each section independently into smaller chunks with some overlaps to optimize data ingestion.
  • Table to CSV transformation – Convert tables extracted by Amazon Textract into CSV format to improve the language model’s ability to comprehend and answer questions about tables.
  • Adding metadata to index – Store metadata such as section name and document release date along with text chunks in the OpenSearch Service index. This allows the language model to identify the most up-to-date or relevant information.
  • CoT prompt – Design a chain-of-thought prompt to provide further clarification and guidance to the language model on the logical steps needed to properly understand the question and formulate an accurate response.

These contributions helped improve the accuracy and capabilities of the solution for answering questions about documents. In fact, based on Deltek’s subject matter experts’ evaluations of LLM-generated responses, the solution achieved a 96% overall accuracy rate.

Conclusion

This post outlined an application of generative AI for question answering across multiple government solicitation documents. The solution discussed was a simplified presentation of a pipeline developed by the AWS GenAIIC team in collaboration with Deltek. We described an approach to enable Q&A on lengthy documents published separately over time. Using Amazon Bedrock and OpenSearch Service, this RAG architecture can scale for enterprise-level document volumes. Additionally, a prompt template was shared that uses CoT logic to guide the LLM in producing accurate responses to user questions. Although this solution is simplified, this post aimed to provide a high-level overview of a real-world generative AI solution for streamlining review of complex proposal documents and their iterations.

Deltek is actively refining and optimizing this solution to ensure it meets their unique needs. This includes expanding support for file formats other than PDF, as well as adopting more cost-efficient strategies for their data ingestion pipeline.

Learn more about prompt engineering and generative AI-powered Q&A in the Amazon Bedrock Workshop. For technical support or to contact AWS generative AI specialists, visit the GenAIIC webpage.


About the Authors

Kevin Plexico is Senior Vice President of Information Solutions at Deltek, where he oversees research, analysis, and specification creation for clients in the Government Contracting and AEC industries. He leads the delivery of GovWin IQ, providing essential government market intelligence to over 5,000 clients, and manages the industry’s largest team of analysts in this sector. Kevin also heads Deltek’s Specification Solutions products, producing premier construction specification content including MasterSpec® for the AIA and SpecText.

Shakun Vohra is a distinguished technology leader with over 20 years of expertise in Software Engineering, AI/ML, Business Transformation, and Data Optimization. At Deltek, he has driven significant growth, leading diverse, high-performing teams across multiple continents. Shakun excels in aligning technology strategies with corporate goals, collaborating with executives to shape organizational direction. Renowned for his strategic vision and mentorship, he has consistently fostered the development of next-generation leaders and transformative technological solutions.

Amin Tajgardoon is an Applied Scientist at the AWS Generative AI Innovation Center. He has an extensive background in computer science and machine learning. In particular, Amin’s focus has been on deep learning and forecasting, prediction explanation methods, model drift detection, probabilistic generative models, and applications of AI in the healthcare domain.

Anila Joshi has more than a decade of experience building AI solutions. As an Applied Science Manager at AWS Generative AI Innovation Center, Anila pioneers innovative applications of AI that push the boundaries of possibility and accelerate the adoption of AWS services with customers by helping customers ideate, identify, and implement secure generative AI solutions.

Yash Shah and his team of scientists, specialists, and engineers at the AWS Generative AI Innovation Center work with some of AWS’s most strategic customers, helping them realize the art of the possible with generative AI by driving business value. Yash has been with Amazon for more than 7.5 years and has worked with customers across healthcare, sports, manufacturing, and software in multiple geographic regions.

Jordan Cook is an accomplished AWS Sr. Account Manager with nearly two decades of experience in the technology industry, specializing in sales and data center strategy. Jordan leverages his extensive knowledge of Amazon Web Services and deep understanding of cloud computing to provide tailored solutions that enable businesses to optimize their cloud infrastructure, enhance operational efficiency, and drive innovation.

Golden Opportunities: California to Train Students, Educators in AI

The State of California today announced a first-of-its-kind AI education initiative with NVIDIA.

The public-private collaboration supports the state’s goals in workforce training and economic development by giving universities, community colleges and adult education programs in California the resources to gain skills in generative AI.

“AI will continue to become more advanced and more prominent in all sectors, and California has the responsibility to support and prepare our students and faculties,” said Amy Tong, secretary of the California Government Operations Agency. “As a world leader in AI computing, NVIDIA is a natural partner to prepare the future of California’s workforce.”

Working With California Colleges and Universities

Through this initiative, California educators can gain certification through the NVIDIA Deep Learning Institute University Ambassador Program, which connects instructors with high-quality teaching kits, workshop content and NVIDIA GPU-accelerated workstations in the cloud.

“It’s always good to equip our professors and teachers because, as mentors to our youth, they are in the best position to help shape students’ career paths,” Tong said.

By empowering educators across the state with the skills to harness the latest AI technologies and NVIDIA GPUs, the initiative can prepare full-time students about to enter the workforce and it can train working professionals who are expanding their skills through community college or adult education courses.

“We want to train a workforce of the future, and also excite students and adults who are out of the workforce about opportunities for the future,” said Stewart Knox, secretary of the California Labor and Workforce Development Agency.

The state agencies are also exploring how internship and apprenticeship programs can offer students hands-on experience with AI skills.

Bolstering Efforts to Bridge the Digital Divide

NVIDIA is already working on multiple projects across California to make AI more accessible and understandable for students from a variety of backgrounds. The company’s educational initiatives and industry-spanning collaborations are helping students and professionals in biotechnology and life sciences, advanced manufacturing, media and entertainment, and other fields to gain proficiency in harnessing AI to support their work, enhance their productivity and drive innovation.

San José State University is evaluating how the NVIDIA Omniverse development platform could support the creation of digital twins — 3D virtual representations of real-world systems — for the city of San José. During the university’s annual Black Engineer Week in June, NVIDIA hosted dozens of students for a daylong program featuring tech demos and career advice discussions.

NVIDIA is embarking on several workforce, climate and community-based projects with schools in the University of California and California State University systems. One plans to train students on underwater data center technology, while another is working with California Black Media to train a large language model on nearly a century of journalism by Black journalists in the state.

The NVIDIA GTC AI conference, held earlier this year in San José, featured several sessions for educators to explore how to integrate generative AI and NVIDIA technologies into their curricula — as well as a panel discussion about the need for equitable access to AI education and resources.

Learn more about NVIDIA’s AI education resources.

ACL Conference 2024

Apple is sponsoring the annual meeting of the Association for Computational Linguistics (ACL), which takes place in person from August 11 to 16, in Bangkok, Thailand. ACL is a conference in the field of computational linguistics, covering a broad spectrum of diverse research areas that are concerned with computational approaches to natural language. Below is the schedule of Apple-sponsored workshops and events at ACL 2024.

Schedule
Stop by the Apple booth in Centara Grand and Bangkok Convention Center, Floor 22, Booth #1, from 9:00 – 17:30 (UTC+7) on August 12, 13 and 14.
Monday…Apple Machine Learning Research