BISCUIT: Scaffolding LLM-Generated Code with Ephemeral UIs in Computational Notebooks

This paper was accepted at IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) 2024
Programmers frequently engage with machine learning tutorials in computational notebooks and have been adopting code generation technologies based on large language models (LLMs). However, they encounter difficulties in understanding and working with code produced by LLMs. To mitigate these challenges, we introduce a novel workflow into computational notebooks that augments LLM-based code generation with an additional ephemeral UI step, offering users UI scaffolds as an intermediate stage…
Apple Machine Learning Research

Cepsa Química improves the efficiency and accuracy of product stewardship using Amazon Bedrock

This is a guest post co-written with Vicente Cruz Mínguez, Head of Data and Advanced Analytics at Cepsa Química, and Marcos Fernández Díaz, Senior Data Scientist at Keepler.

Generative artificial intelligence (AI) is rapidly emerging as a transformative force, poised to disrupt and reshape businesses of all sizes and across industries. Generative AI empowers organizations to combine their data with the power of machine learning (ML) algorithms to generate human-like content, streamline processes, and unlock innovation. As with all other industries, the energy sector is impacted by the generative AI paradigm shift, unlocking opportunities for innovation and efficiency. One of the areas where generative AI is rapidly showing its value is the streamlining of operational processes, reducing costs, and enhancing overall productivity.

In this post, we explain how Cepsa Química and partner Keepler have implemented a generative AI assistant to increase the efficiency of the product stewardship team when answering compliance queries related to the chemical products they market. To accelerate development, they used Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy and safety.

Cepsa Química, a world leader in the manufacturing of linear alkylbenzene (LAB) and ranking second in the production of phenol, is a company aligned with Cepsa’s Positive Motion strategy for 2030, contributing to the decarbonization and sustainability of its processes through the use of renewable raw materials, development of products with less carbon, and use of waste as raw materials.

At Cepsa’s Digital, IT, Transformation & Operational Excellence (DITEX) department, we work on democratizing the use of AI within our business areas so that it becomes another lever for generating value. Within this context, we identified product stewardship as one of the areas with more potential for value creation through generative AI. We partnered with Keepler, a cloud-centered data services consulting company specialized in the design, construction, deployment, and operation of advanced public cloud analytics custom-made solutions for large organizations, in the creation of the first generative AI solution for one of our corporate teams.

The Safety, Sustainability & Energy Transition team

The Safety, Sustainability & Energy Transition area of Cepsa Química is responsible for all human health, safety, and environmental aspects related to the products manufactured by the company and the associated raw materials, among others. In this field, its areas of action are product safety, regulatory compliance, sustainability, and customer service around safety and compliance.

One of the responsibilities of the Safety, Sustainability & Energy Transition team is product stewardship, which takes care of regulatory compliance of the marketed products. The Product Stewardship department is responsible for managing a large collection of regulatory compliance documents. Their duty involves determining which regulations apply to each specific product in the company’s portfolio, compiling a list of all the applicable regulations for a given product, and supporting other internal teams that might have questions related to these products and regulations. Example questions might be “What are the restrictions for CMR substances?”, “How long do I need to keep the documents related to a toluene sale?”, or “What is the reach characterization ratio and how do I calculate it?” The regulatory content required to answer these questions varies over time, introducing new clauses and repealing others. This work used to consume a significant percentage of the team’s time, so they identified an opportunity to generate value by reducing the search time for regulatory consultations.

The DITEX department engaged with the Safety, Sustainability & Energy Transition team for a preliminary analysis of their pain points and deemed it feasible to use generative AI techniques to speed up the resolution of compliance queries. The analysis was conducted for queries based on both unstructured (regulatory documents and product spec sheets) and structured (product catalog) data.

An approach to product stewardship with generative AI

Large language models (LLMs) are trained with vast amounts of information crawled from the internet, capturing considerable knowledge from multiple domains. However, their knowledge is static and tied to the data used during the pre-training phase.

To overcome this limitation and provide dynamism and adaptability to knowledge base changes, we decided to follow a Retrieval Augmented Generation (RAG) approach, in which the LLMs are presented with relevant information extracted from external data sources to provide up-to-date data without the need to retrain the models. This approach is a great fit for a scenario where regulatory information is updated at a fast pace, with frequent derogations, amendments, and new regulations being published.

Additionally, the RAG-based approach enables rapid prototyping of document search use cases, allowing us to craft a solution based on regulatory information about chemical substances in a few weeks.

The solution we built is based on four main functional blocks:

  • Input processing – Input regulatory PDF documents are preprocessed to extract the relevant information. Each document is divided into chunks to ease the indexing and retrieval processes based on semantic meaning.
  • Embeddings generation – An embeddings model is used to encode the semantic information of each chunk into an embeddings vector, which is stored in a vector database, enabling similarity search of user queries.
  • LLM chain service – This service orchestrates the solution by invoking the LLM models with a fitting prompt and creating the response that is returned to the user.
  • User interface – A conversational chatbot enables interaction with users.

We divided the solution into two independent modules: one to batch process input documents and another one to answer user queries by running inference.

Batch ingestion module

The batch ingestion module performs the initial processing of the raw compliance documents and product catalog and generates the embeddings that will be later used to answer user queries. The following diagram illustrates this architecture.

Architecture diagram for the batch ingestion module

The batch ingestion module performs the following tasks:

  1. AWS Glue, a serverless data integration service, is used to run periodical extract, transform, and load (ETL) jobs that read input raw documents and the product catalog from Amazon Simple Storage Service (Amazon S3), an object storage service that offers industry-leading scalability, data availability, security, and performance.
  2. The AWS Glue job calls Amazon Textract, an ML service that automatically extracts text, handwriting, layout elements, and data from scanned documents, to process the input PDF documents. After data is extracted, the job performs document chunking, data cleanup, and postprocessing.
  3. The AWS Glue job uses Amazon Bedrock to generate vector embeddings for each document chunk using the Amazon Titan Text Embeddings model.
  4. The extracted embeddings are stored in Amazon Aurora PostgreSQL-Compatible Edition, a fully managed, PostgreSQL-compatible, and ACID-compliant relational database engine, with the pgvector extension enabled for efficient similarity searches.
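
To make the chunking in step 2 more concrete, the following is a minimal sketch of fixed-size chunking with overlap. It is an illustration under assumptions, not the code used in the project: the function name chunk_text, the character-based window, and the sizes are all hypothetical, and a production job would more likely split on tokens or document structure.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`.

    Hypothetical helper for illustration; the sizes are not the project's values.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy input standing in for text extracted from a PDF.
sample = "abcdefghij" * 3
chunks = chunk_text(sample, chunk_size=12, overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk repeats the last characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.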

Inference module

The inference module transforms user queries into embeddings, retrieves relevant document chunks from the knowledge base using similarity search, and prompts an LLM with the query and retrieved chunks to generate a contextual response. The following diagram illustrates this architecture.

Architecture diagram for the inference module

The inference module implements the following steps:

  1. Users interact through a web portal, which consists of a static website stored in Amazon S3, served through Amazon CloudFront, a content delivery network (CDN), and secured with Amazon Cognito, a customer identity and access management platform.
  2. Queries are sent to the backend using a REST API defined in Amazon API Gateway, a fully managed service that makes it straightforward for developers to create, publish, maintain, monitor, and secure APIs at any scale, and implemented through an API Gateway private integration. The backend is implemented by an LLM chain service running on AWS Fargate, a serverless, pay-as-you-go compute engine that lets you focus on building applications without managing servers. This service orchestrates the interaction with the different LLMs using the LangChain library.
  3. The LLM chain service invokes Amazon Titan Text Embeddings on Amazon Bedrock to generate the embeddings for the user query.
  4. Based on the query embeddings, the relevant documents are retrieved from the embeddings database using similarity search.
  5. The service composes a prompt that includes the user query and the documents extracted from the knowledge base. The prompt is sent to Anthropic Claude 2.0 on Amazon Bedrock, and the model answer is sent back to the user.
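
Steps 3 to 5 can be sketched in plain Python as follows. This is a simplified illustration, not the project's implementation: the similarity search runs in memory instead of in pgvector, the three-dimensional vectors stand in for real Titan embeddings, the chunk texts are placeholders, and the names retrieve and compose_prompt are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k chunk texts whose embeddings are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def compose_prompt(question: str, chunks: list[str]) -> str:
    """Build a prompt that carries the retrieved chunks as numbered context."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "Cite the bracketed source numbers you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Placeholder chunks with made-up embeddings (assumption: real ones come from an embeddings model).
index = [
    ("Placeholder chunk about toluene sale records.", [0.9, 0.1, 0.0]),
    ("Placeholder chunk about CMR substances.", [0.1, 0.9, 0.0]),
    ("Placeholder chunk about phenol storage.", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # stands in for the embedding of the user query

top_chunks = retrieve(query_vec, index, k=2)
prompt = compose_prompt("How long do I need to keep the documents related to a toluene sale?", top_chunks)
print(prompt)
```

Numbering the chunks in the prompt gives the model something to cite, which is one way to obtain answers that reference their source documents.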

Note on the RAG implementation

The product stewardship chatbot was built before Knowledge Bases for Amazon Bedrock was generally available. Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you implement the entire RAG workflow from ingestion to retrieval and prompt augmentation without having to build custom integrations to data sources and manage data flows. Knowledge Bases manages the initial vector store set up, handles the embedding and querying, and provides source attribution and short-term memory needed for production RAG applications.

With Knowledge Bases for Amazon Bedrock, the implementation of steps 3–4 of the Batch Ingestion and Inference modules can be significantly simplified.

Challenges and solutions

In this section, we discuss the challenges we encountered during the development of the system and the decisions we made to overcome those challenges.

Data preprocessing and chunking strategy

We discovered that the input documents contained a variety of structural complexities, which posed a challenge in the processing stage. For instance, some tables contain large amounts of information with minimal context except for the header, which is displayed at the top of the table. This can make it complex to obtain the right answers to user queries, because the retrieval process might lack context.

Additionally, some document annexes are linked to other sections of the document or even other documents, leading to incomplete data retrieval and generation of inaccurate answers.

To address these challenges, we implemented three mitigation strategies:

  • Data chunking – We decided to use larger chunk sizes with significant overlaps to provide maximum context for each chunk during ingestion. However, we set an upper limit to avoid losing the semantic meaning of the chunk.
  • Model selection – We selected a model with a large context window to generate responses that take a larger context into account. Anthropic Claude 2.0 on Amazon Bedrock, with a 100K-token context window, provided the most accurate results. (The system was built before Anthropic Claude 2.1 or the Anthropic Claude 3 model family were available on Amazon Bedrock.)
  • Query variants – Prior to retrieving documents from the database, multiple variants of the user query are generated using an LLM. Documents for all variants are retrieved and deduplicated before being provided as context for the LLM query.

These three strategies significantly enhanced the retrieval and response accuracy of the RAG system.
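
The query variants strategy can be illustrated with a short sketch. It assumes the variants have already been generated by an LLM and that each retrieved chunk carries a stable identifier; the retriever here is a stub with canned results, and all names (retrieve_for_variants, fake_retriever, the chunk ids) are hypothetical.

```python
from typing import Callable

Chunk = tuple[str, str]  # (chunk_id, chunk_text)


def retrieve_for_variants(
    variants: list[str],
    retriever: Callable[[str, int], list[Chunk]],
    k: int = 3,
) -> list[Chunk]:
    """Retrieve chunks for every query variant and drop duplicates by chunk id.

    First-seen order is preserved, so chunks found by earlier variants come first.
    """
    seen: set[str] = set()
    merged: list[Chunk] = []
    for variant in variants:
        for chunk_id, text in retriever(variant, k):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append((chunk_id, text))
    return merged


# Stub retriever with canned results; a real one would query the vector database.
FAKE_RESULTS = {
    "restrictions for CMR substances": [("doc1#3", "chunk A"), ("doc2#7", "chunk B")],
    "which rules limit CMR substances": [("doc2#7", "chunk B"), ("doc4#1", "chunk C")],
}


def fake_retriever(query: str, k: int) -> list[Chunk]:
    return FAKE_RESULTS.get(query, [])[:k]


merged = retrieve_for_variants(list(FAKE_RESULTS), fake_retriever)
print([chunk_id for chunk_id, _ in merged])
```

Deduplicating before prompting keeps the context window for distinct evidence instead of spending it on the same chunk several times.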

Evaluation of results and process refinement

Evaluating the responses from the LLM models is another challenge that is not found in traditional AI use cases. Because of the free text nature of the output, it’s difficult to assess and compare different responses in terms of a metric or KPI, leading to a manual review in most cases. However, a manual process is time-consuming and not scalable.

To minimize the drawbacks, we created a benchmarking dataset with the help of seasoned users, containing the following information:

  • Representative questions that require data combined from different documents
  • Ground truth answers for each question
  • References to the source documents, pages, and line numbers where the right answers are found

Then we implemented an automatic evaluation system with Anthropic Claude 2.0 on Amazon Bedrock, with different prompting strategies to evaluate document retrieval and response formation. This approach allowed for adjustment of different parameters in a fast and automated manner:

  • Preprocessing – Tried different values for chunk size and overlap size
  • Retrieval – Tested several retrieval techniques of incremental complexity
  • Querying – Ran the tests with different LLMs hosted on Amazon Bedrock:
    • Amazon Titan Text Premier
    • Cohere Command v1.4
    • Anthropic Claude Instant
    • Anthropic Claude 2.0
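
As a rough illustration of how a benchmark dataset like this supports automated comparison, the sketch below scores document retrieval against the reference sources. Note that it swaps in a simple hit-rate metric rather than the LLM-based evaluation described above, and that the BenchmarkItem fields, document ids, and toy retriever are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BenchmarkItem:
    question: str
    ground_truth: str
    source_docs: frozenset  # ids of the documents where the right answer is found


def retrieval_hit_rate(
    benchmark: list[BenchmarkItem],
    retrieve_ids: Callable[[str], list[str]],
) -> float:
    """Fraction of questions for which at least one reference document is retrieved."""
    if not benchmark:
        return 0.0
    hits = sum(1 for item in benchmark if item.source_docs & set(retrieve_ids(item.question)))
    return hits / len(benchmark)


benchmark = [
    BenchmarkItem("Q1", "A1", frozenset({"doc1"})),
    BenchmarkItem("Q2", "A2", frozenset({"doc2", "doc3"})),
    BenchmarkItem("Q3", "A3", frozenset({"doc4"})),
]


def toy_retriever(question: str) -> list[str]:
    # Canned results standing in for one retrieval configuration under test.
    return {"Q1": ["doc1", "doc9"], "Q2": ["doc3"], "Q3": ["doc5"]}[question]


score = retrieval_hit_rate(benchmark, toy_retriever)
print(f"retrieval hit rate: {score:.2f}")
```

Running the same function over retrievers built with different chunk sizes, overlaps, or retrieval techniques yields one comparable number per configuration.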

The final solution consists of three chains: one for translating the user query into English, one for generating variations of the input question, and one for composing the final response.

Achieved improvements and next steps

We built a conversational interface for the Safety, Sustainability & Energy Transition team that helps the product stewardship team be more efficient and obtain answers to compliance queries faster. Furthermore, the answers contain references to the input documents used by the LLM to generate the reply, so the team can double-check the response and find additional context if it’s needed. The following screenshot shows an example of the conversational interface.

Example screenshot of a user query and an answer from the chatbot

Some of the qualitative and quantitative improvements identified by the product stewardship team through the use of the solution are:

  • Query times – The following table summarizes the search time saved by query complexity and user seniority (considering all search times have been reduced to less than 1 minute).

    Time saved (minutes)
    Complexity   Junior user   Senior user
    Low          3.3           2
    Medium       9.25          4
    High         28            10

  • Answer quality – The implemented system offers additional context and document references that are used by the users to improve the quality of the answer.
  • Operational efficiency – The implemented system has accelerated the regulatory query process, directly enhancing the department's operational efficiency.

From the DITEX department, we’re currently working with other business areas at Cepsa Química to identify similar use cases to help create a corporate-wide tool that reuses components from this first initiative and generalizes the use of generative AI across business functions.

Conclusion

In this post, we shared how Cepsa Química and partner Keepler have implemented a generative AI assistant that uses Amazon Bedrock and RAG techniques to process, store, and query the corpus of knowledge related to product stewardship. As a result, users save up to 25 percent of their time when they use the assistant to solve compliance queries.

If you want your business to get started with generative AI, visit Generative AI on AWS and connect with a specialist, or quickly build a generative AI application in PartyRock.


About the authors

Vicente Cruz Mínguez is the Head of Data & Advanced Analytics at Cepsa Química. He has more than 8 years of experience with big data and machine learning projects in financial, retail, energy, and chemical industries. He is currently leading the Data, Advanced Analytics & Cloud Development team in the Digital, IT, Transformation & Operational Excellence department at Cepsa Química, with a focus on feeding the corporate data lake and democratizing data for analysis, machine learning projects, and business analytics. Since 2023, he has also been working on scaling the use of generative AI in all departments.

Marcos Fernández Díaz is a Senior Data Scientist at Keepler, with 10 years of experience developing end-to-end machine learning solutions for different clients and domains, including predictive maintenance, time series forecasting, image classification, object detection, industrial process optimization, and federated machine learning. His main interests include natural language processing and generative AI. Outside of work, he is a travel enthusiast.

Guillermo Menéndez Corral is a Sr. Manager, Solutions Architecture at AWS for Energy and Utilities. He has over 18 years of experience designing and building software products and currently helps AWS customers in the energy industry harness the power of the cloud through innovation and modernization.

GraphStorm 0.3: Scalable, multi-task learning on graphs with user-friendly APIs

GraphStorm is a low-code enterprise graph machine learning (GML) framework to build, train, and deploy graph ML solutions on complex enterprise-scale graphs in days instead of months. With GraphStorm, you can build solutions that directly take into account the structure of relationships or interactions between billions of entities, which are inherently embedded in most real-world data, including fraud detection scenarios, recommendations, community detection, and search/retrieval problems.

Today, we are launching GraphStorm 0.3, adding native support for multi-task learning on graphs. Specifically, GraphStorm 0.3 allows you to define multiple training targets on different nodes and edges within a single training loop. In addition, GraphStorm 0.3 adds new APIs to customize GraphStorm pipelines: you now only need 12 lines of code to implement a custom node classification training loop. To help you get started with the new API, we have published two Jupyter notebook examples: one for node classification, and one for a link prediction task. We also released a comprehensive study of co-training language models (LM) and graph neural networks (GNN) for large graphs with rich text features using the Microsoft Academic Graph (MAG) dataset from our KDD 2024 paper. The study showcases the performance and scalability of GraphStorm on text rich graphs and the best practices of configuring GML training loops for better performance and efficiency.

Native support for multi-task learning on graphs

Many enterprise applications have graph data associated with multiple tasks on different nodes and edges. For example, retail organizations want to conduct fraud detection on both sellers and buyers. Scientific publishers want to find more related works to cite in their papers and need to select the right subject for their publication to be discoverable. To better model such applications, customers have asked us to support multi-task learning on graphs.

GraphStorm 0.3 supports multi-task learning on graphs with the six most common tasks: node classification, node regression, edge classification, edge regression, link prediction, and node feature reconstruction. You can specify the training targets through a YAML configuration file. For example, a scientific publisher can use the following YAML configuration to simultaneously define a paper subject classification task on paper nodes and a link prediction task on paper-citing-paper edges:

version: 1.0
gsf:
    basic: # basic settings of the backbone GNN model
        ...
    ...
    multi_task_learning:
        - node_classification:         # define a node classification task for paper subject prediction.
            target_ntype: "paper"      # the paper nodes are the training targets.
            label_field: "label_class" # the node feature "label_class" contains the training labels.
            mask_fields:
                - "train_mask_class"   # train mask is named train_mask_class.
                - "val_mask_class"     # validation mask is named val_mask_class.
                - "test_mask_class"    # test mask is named test_mask_class.
            num_classes: 10            # there are 10 different classes (subjects) to predict.
            task_weight: 1.0           # the task weight is 1.0.

        - link_prediction:                # define a link prediction task for paper citation recommendation.
            num_negative_edges: 4         # sample 4 negative edges for each positive edge during training.
            num_negative_edges_eval: 100  # sample 100 negative edges for each positive edge during evaluation.
            train_negative_sampler: joint # share the negative edges between positive edges (to speed up training).
            train_etype:
                - "paper,citing,paper"    # the target edge type for link prediction training is "paper, citing, paper".
            mask_fields:
                - "train_mask_lp"         # train mask is named train_mask_lp.
                - "val_mask_lp"           # validation mask is named val_mask_lp.
                - "test_mask_lp"          # test mask is named test_mask_lp.
            task_weight: 0.5              # the task weight is 0.5.

For more details about how to run graph multi-task learning with GraphStorm, refer to Multi-task Learning in GraphStorm in our documentation.
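
To give an intuition for the task_weight settings, the following sketch combines per-task losses into a single training objective as a weighted sum. This is an assumption about how such weights are typically applied in multi-task learning, not a description of GraphStorm internals, and the function name and loss values are hypothetical.

```python
def combine_task_losses(losses: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-task losses (a common multi-task objective)."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight configured for tasks: {sorted(missing)}")
    return sum(weights[name] * loss for name, loss in losses.items())


# Weights mirror the task_weight values in the configuration above.
weights = {"node_classification": 1.0, "link_prediction": 0.5}

# Made-up loss values for one training step.
losses = {"node_classification": 0.8, "link_prediction": 1.2}

total = combine_task_losses(losses, weights)
print(f"combined loss: {total:.2f}")
```

Under this reading, a weight of 0.5 means the link prediction loss contributes half as much per unit of loss as the node classification loss.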

New APIs to customize GraphStorm pipelines and components

Since GraphStorm’s release in early 2023, customers have mainly used its command line interface (CLI), which abstracts away the complexity of the graph ML pipeline for you to quickly build, train, and deploy models using common recipes. However, customers are telling us that they want an interface that allows them to customize the training and inference pipeline of GraphStorm to their specific requirements more easily. Based on customer feedback for the experimental APIs we released in GraphStorm 0.2, GraphStorm 0.3 introduces refactored graph ML pipeline APIs. With the new APIs, you only need 12 lines of code to define a custom node classification training pipeline, as illustrated by the following example:

import graphstorm as gs
gs.initialize()

acm_data = gs.dataloading.GSgnnData(part_config='./acm_gs_1p/acm.json')

train_dataloader = gs.dataloading.GSgnnNodeDataLoader(dataset=acm_data, target_idx=acm_data.get_node_train_set(ntypes=['paper']), fanout=[20, 20], batch_size=64)
val_dataloader = gs.dataloading.GSgnnNodeDataLoader(dataset=acm_data, target_idx=acm_data.get_node_val_set(ntypes=['paper']), fanout=[100, 100], batch_size=256, train_task=False)
test_dataloader = gs.dataloading.GSgnnNodeDataLoader(dataset=acm_data, target_idx=acm_data.get_node_test_set(ntypes=['paper']), fanout=[100, 100], batch_size=256, train_task=False)

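# RgcnNCModel is not defined in this snippet; it is assumed to be a custom model class defined or imported elsewhere.
model = RgcnNCModel(g=acm_data.g, num_hid_layers=2, hid_size=128, num_classes=14)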
evaluator = gs.eval.GSgnnClassificationEvaluator(eval_frequency=100)

trainer = gs.trainer.GSgnnNodePredictionTrainer(model)
trainer.setup_evaluator(evaluator)

trainer.fit(train_dataloader, val_dataloader, test_dataloader, num_epochs=5)

To help you get started with the new APIs, we have also released new Jupyter notebook examples in our Documentation and Tutorials page.

Comprehensive study of LM+GNN for large graphs with rich text features

Many enterprise applications have graphs with text features. In retail search applications, for example, shopping log data provides insights on how text-rich product descriptions, search queries, and customer behavior are related. Foundational large language models (LLMs) alone are not suitable to model such data because the underlying data distributions and relationships don’t correspond to what LLMs learn from their pre-training data corpuses. GML, on the other hand, is great for modeling related data (graphs) but until now, GML practitioners had to manually combine their GML models with LLMs to model text features and get the best performance for their use cases. Especially when the underlying graph dataset was large, this manual work was challenging and time-consuming.

In GraphStorm 0.2, GraphStorm introduced built-in techniques to train language models (LMs) and GNN models together efficiently at scale on massive text-rich graphs. Since then, customers have been asking us for guidance on how GraphStorm’s LM+GNN techniques should be employed to optimize performance. To address this, with GraphStorm 0.3, we released an LM+GNN benchmark using the large graph dataset, Microsoft Academic Graph (MAG), on two standard graph ML tasks: node classification and link prediction. The dataset is a heterogeneous graph that contains hundreds of millions of nodes and billions of edges, and the majority of its nodes are attributed with rich text features. The detailed statistics of the dataset are shown in the following table.

Statistic                            MAG
Num. of nodes                        484,511,504
Num. of edges                        7,520,311,838
Num. of node/edge types              4/4
Num. of nodes in NC training set     28,679,392
Num. of edges in LP training set     1,313,781,772
Num. of nodes with text features     240,955,156

We benchmark two main LM+GNN methods in GraphStorm: pre-trained BERT+GNN, a baseline method that is widely adopted, and fine-tuned BERT+GNN, introduced by GraphStorm developers in 2022. With the pre-trained BERT+GNN method, we first use a pre-trained BERT model to compute embeddings for node text features and then train a GNN model for prediction. With the fine-tuned BERT+GNN method, we first fine-tune the BERT model on the graph data and then use the resulting fine-tuned BERT model to compute embeddings that are used to train a GNN model for prediction. GraphStorm provides different ways to fine-tune the BERT model, depending on the task type. For node classification, we fine-tune the BERT model on the training set with the node classification task; for link prediction, we fine-tune the BERT model with the link prediction task. In the experiment, we use 8 r5.24xlarge instances for data processing and 4 g5.48xlarge instances for model training and inference. The fine-tuned BERT+GNN approach has up to 40% better performance (link prediction on MAG) compared to pre-trained BERT+GNN.

The following table shows the model performance of the two methods and the overall computation time of the whole pipeline starting from data processing and graph construction. NC means node classification and LP means link prediction. LM Time Cost means the time spent on computing BERT embeddings and the time spent on fine-tuning the BERT models for pre-trained BERT+GNN and fine-tuned BERT+GNN, respectively.

                                                    Pre-trained BERT+GNN                         Fine-tuned BERT+GNN
Dataset  Task  Data processing time  Target         LM time cost  One epoch time  Metric         LM time cost  One epoch time  Metric
MAG      NC    553 min               paper subject  206 min       135 min         Acc: 0.572     1423 min      137 min         Acc: 0.633
         LP                          cite           198 min       2195 min        MRR: 0.487     4508 min      2172 min        MRR: 0.684
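
The "up to 40%" figure can be reproduced from the metrics in the table. The short sketch below computes the relative improvement of fine-tuned over pre-trained BERT+GNN for both tasks; the helper name relative_improvement is ours, and the inputs are the accuracy and MRR values reported above.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, as a fraction of the baseline."""
    return (improved - baseline) / baseline


# Metrics taken from the table: pre-trained vs. fine-tuned BERT+GNN on MAG.
nc_gain = relative_improvement(0.572, 0.633)  # node classification accuracy
lp_gain = relative_improvement(0.487, 0.684)  # link prediction MRR

print(f"node classification: {nc_gain:.1%}")
print(f"link prediction:     {lp_gain:.1%}")
```

Link prediction gains roughly 40% in MRR and node classification roughly 11% in accuracy, at the price of a much higher LM time cost for fine-tuning.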

We also benchmark GraphStorm on large synthetic graphs to showcase its scalability. We generate three synthetic graphs with 1 billion, 10 billion, and 100 billion edges. The corresponding training set sizes are 8 million, 80 million, and 800 million, respectively. The following table shows the computation time of graph preprocessing, graph partition, and model training. Overall, GraphStorm enables graph construction and model training on 100 billion scale graphs within hours!

              Data pre-process       Graph partition        Model training
Graph size    # instances  Time      # instances  Time      # instances  Time
1B            4            19 min    4            8 min     4            1.5 min
10B           8            31 min    8            41 min    8            8 min
100B          16           61 min    16           416 min   16           50 min
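
To put "within hours" into numbers, the following sketch adds up the three stage times from the table for each graph size. It assumes the stages run one after another, so the sum approximates end-to-end wall-clock time; it does not account for the differing instance counts.

```python
# Stage times in minutes from the table: (data pre-process, graph partition, model training).
timings = {
    "1B": (19, 8, 1.5),
    "10B": (31, 41, 8),
    "100B": (61, 416, 50),
}

# Assumption: stages run sequentially, so total time is the sum of the stages.
totals = {size: sum(stages) for size, stages in timings.items()}
hours = {size: minutes / 60 for size, minutes in totals.items()}

for size in timings:
    print(f"{size:>4} edges: {totals[size]:g} min ({hours[size]:.1f} h)")
```

Under that assumption the 100-billion-edge graph takes a little under nine hours end to end, with graph partitioning accounting for most of the time.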

More benchmark details and results are available in our KDD 2024 paper.

Conclusion

GraphStorm 0.3 is published under the Apache-2.0 license to help you tackle your large-scale graph ML challenges, and now offers native support for multi-task learning and new APIs to customize pipelines and other components of GraphStorm. Refer to the GraphStorm GitHub repository and documentation to get started.


About the authors

Xiang Song is a senior applied scientist at AWS AI Research and Education (AIRE), where he develops deep learning frameworks including GraphStorm, DGL and DGL-KE. He led the development of Amazon Neptune ML, a new capability of Neptune that uses graph neural networks for graphs stored in a graph database. He is now leading the development of GraphStorm, an open-source graph machine learning framework for enterprise use cases. He received his Ph.D. in computer systems and architecture at Fudan University, Shanghai, in 2014.

Jian Zhang is a senior applied scientist who has been using machine learning techniques to help customers solve various problems, such as fraud detection, decoration image generation, and more. He has successfully developed graph-based machine learning, particularly graph neural network, solutions for customers in China, USA, and Singapore. As an enlightener of AWS’s graph capabilities, Zhang has given many public presentations about the GNN, the Deep Graph Library (DGL), Amazon Neptune, and other AWS services.

Florian Saupe is a Principal Technical Product Manager at AWS AI/ML research supporting science teams like the graph machine learning group, and ML Systems teams working on large scale distributed training, inference, and fault resilience. Before joining AWS, Florian led technical product management for automated driving at Bosch, was a strategy consultant at McKinsey & Company, and worked as a control systems/robotics scientist, a field in which he holds a PhD.

Few-shot prompt engineering and fine-tuning for LLMs in Amazon Bedrock

This blog is part of the series, Generative AI and AI/ML in Capital Markets and Financial Services.

Company earnings calls are crucial events that provide transparency into a company’s financial health and prospects. Earnings reports detail a firm’s financials over a specific period, including revenue, net income, earnings per share, balance sheet, and cash flow statement. Earnings calls are live conferences where executives present an overview of results, discuss achievements and challenges, and provide guidance for upcoming periods.

These disclosures are vitally important for capital markets, significantly impacting stock prices. Investors and analysts closely watch key metrics like revenue growth, earnings per share, margins, cash flow, and projections to assess performance against peers and industry trends. The rate of growth and profit margins influence the premium and multiplier that investors are willing to pay for a company’s stock, ultimately affecting stock returns and price movements.

Earnings calls also allow investors to look for new clues about a company’s future. Companies often release information about new products, cutting-edge technology, mergers and acquisitions, and investments in new market themes and trends during these events. Such details can signal potential growth opportunities for investors, analysts, and portfolio managers.

Traditionally, earnings call scripts have followed similar templates, which makes drafting them from scratch each quarter a repetitive task. Generative artificial intelligence (AI) models can learn these templates and produce coherent scripts when given quarterly financial data. With generative AI, companies can streamline the process of creating first drafts of earnings call scripts for a new quarter using repeatable templates and information about specific performance and business highlights. The initial draft of a large language model (LLM) generated earnings call script can then be refined and customized using feedback from the company’s executives.

Amazon Bedrock offers a straightforward way to build and scale generative AI applications with foundation models (FMs) and LLMs. Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API. Model customization helps you deliver differentiated and personalized user experiences. To customize models for specific tasks, you can privately fine-tune FMs using your own labeled datasets in just a few quick steps.

In this post, we showcase how to generate the first draft of an earnings call script for the new quarter using LLMs. We demonstrate two methods to generate an earnings call script with LLMs: few-shot learning and fine-tuning. We assess the generated earnings call scripts and the applied methods from different dimensions—comprehensiveness, hallucinations, writing style, ease of use, and cost—and present our findings.

Solution overview

We apply two methods to generate the first draft of an earnings call script for the new quarter using LLMs:

  • Prompt engineering with few-shot learning – We use examples of the past earnings scripts with Anthropic Claude 3 Sonnet on Amazon Bedrock to generate an earnings call script for a new quarter.
  • Fine-tuning – We fine-tune Meta Llama 2 70B on Amazon Bedrock using input/output labeled data from the past earnings scripts and use the customized model to generate an earnings call script for a new quarter.

Both methods involve utilizing a consistent dataset of earnings call transcripts across multiple quarters. We use several past years of quarterly earnings calls, with one quarter set aside, which was used as ground truth for testing and comparison.

The process starts by retrieving the earnings call transcripts from the past quarters to the recent quarter. The next step involves selecting multiple scripts from the previous quarters to serve as few-shot learning examples as well as input/output dataset for fine-tuning. The script for the most recent quarter is held out for validation and evaluation of generated scripts. The generated script is evaluated by comparing it with the actual script for the quarter, which was initially kept aside.
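
The hold-out split described above can be sketched in a few lines. This is a minimal illustration under our own assumptions: the `Transcript` record and the `split_transcripts` helper are hypothetical names, and the quarters in the example are placeholders rather than the actual dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    year: int
    quarter: int  # 1-4
    text: str


def split_transcripts(transcripts):
    """Return (history, held_out): all past quarters, and the most recent one."""
    if len(transcripts) < 2:
        raise ValueError("need at least one past quarter plus one held-out quarter")
    # Sort chronologically so the last element is the most recent quarter.
    ordered = sorted(transcripts, key=lambda t: (t.year, t.quarter))
    return ordered[:-1], ordered[-1]


# Placeholder transcripts; the text would be the full earnings call script.
transcripts = [
    Transcript(2022, 4, "..."),
    Transcript(2021, 1, "..."),
    Transcript(2023, 3, "..."),
]
history, held_out = split_transcripts(transcripts)
```

Here `history` would feed the few-shot examples and the fine-tuning dataset, while `held_out` is used only to evaluate the generated script.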

The following diagram illustrates the solution architecture and workflow for both methods.

In the following sections, we discuss the workflows of each method in more detail.

Few-shot learning with Anthropic Claude 3 Sonnet on Amazon Bedrock

The prompt engineering for few-shot learning using Anthropic Claude 3 Sonnet is divided into four sections, as shown in the following figure. Three sections have constant instructions to the LLM based on assigning the LLM a role, instructions on style and tone of narrative, and examples for earnings calls from past quarters for few-shot learning. The fourth section has information on financial performance, results, and business highlights for the current quarter for which earnings calls are to be generated by the LLM.

We used Anthropic Claude 3 Sonnet to generate an earnings call for a new quarter using earnings calls from past quarters. The following is an example of our few-shot learning along with prompt instructions:

Section A: Overall prompt instructions (context)
 
You are the CEO and CFO of Any Company preparing to present the quarterly earnings report to investors. Draft a comprehensive earnings call script that covers the key financial metrics, business highlights, and future outlook for the given quarter. Provide details on revenue, operating income, segment performance, and important strategic initiatives or product launches during the quarter.

Section B: Specific guidance for the earnings script (context)
 
The earnings script should be written in a formal, investor-friendly tone suitable for a public earnings call. Use clear and concise language to explain financial performance and business developments. Aim to strike a balance between providing sufficient details and keeping the script reasonably concise. Incorporate specific data points and figures but avoid overwhelming with excessive numerical minutiae. The overall structure should flow logically, covering key topics like revenue, operating income, segment highlights, strategic priorities, and forward-looking guidance. Use the following 5 instructions when generating results for the earnings call script.

1. Provide a clear structure by organizing the content into logical sections, such as financial highlights, segment performance, operational metrics, strategic initiatives, and a forward-looking view. 
2. Include granular details and insights into the factors impacting performance, such as customer behavior trends, supply chain improvements, cost optimization efforts, and any other relevant context etc.
3. Substantiate your commentary with specific data points and percentages to lend credibility to your statements.
4. Offer a comprehensive forward-looking view by discussing capital investments, preparedness for upcoming events or seasons, and the long-term strategic focus or priorities.
5. Maintain a measured, objective, and analytical tone throughout the content, avoiding overly conversational or casual language.

Section C: Example Scripts from past quarters (for Few Shot/ Chain-of-thought) 

The example scripts from past quarters provide a reference for the structure, tone, and level of detail expected in an earnings call script. Use these examples to understand how to present financial data, highlight key business initiatives, and address investor concerns or questions. However, ensure that the script for current specific Quarter is tailored to the specific financial performance and business events of that quarter.
<example>
Amazon Earnings call transcript for Q1 2021 ...

Amazon Earnings call transcript for Q2 2021 ...
</example>

Section D: Financial data for quarter for which script is required (context)

<financial_data>

Provide the actual financial results for the specific quarter, including:
Total revenue and year-over-year growth rate
Revenue breakdown by key segments (e.g. AWS, Online Stores, etc.)
Operating income (total and by segment if available)
Any key operating metrics (e.g. Prime membership, third-party seller metrics, etc.)
Notes on significant factors impacting results (e.g. foreign exchange, product launches, one-time events)
Forward-looking guidance on revenue, operating income for next quarter
Highlight key business developments, product launches or strategic priorities for the quarter :

</financial_data>
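
As a rough sketch, the four sections can be assembled into one prompt and wrapped in a request body in the Anthropic Messages format that Claude 3 models accept on Amazon Bedrock. The helper names and the placeholder strings are hypothetical, and the closing tags `</example>` and `</financial_data>` are our assumption about the intended delimiters. The code only builds the request; sending it requires a Bedrock runtime client and the model ID for Claude 3 Sonnet.

```python
import json


def build_few_shot_prompt(role_instructions, style_guidance, examples, financial_data):
    """Concatenate Sections A-D into a single user prompt."""
    example_block = "\n\n".join(examples)
    return (
        f"{role_instructions}\n\n"
        f"{style_guidance}\n\n"
        f"<example>\n{example_block}\n</example>\n\n"
        f"<financial_data>\n{financial_data}\n</financial_data>"
    )


def build_claude_request(prompt, max_tokens=4096):
    """Build a JSON request body in the Anthropic Messages format."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
    })


# Placeholder inputs; real values would be the full section texts shown above.
prompt = build_few_shot_prompt(
    role_instructions="You are the CEO and CFO of Any Company ...",
    style_guidance="The earnings script should be written in a formal tone ...",
    examples=["Transcript for Q1 2021 ...", "Transcript for Q2 2021 ..."],
    financial_data="Total revenue and year-over-year growth rate ...",
)
body = build_claude_request(prompt)
# The body could then be sent with the Bedrock runtime client, for example:
# boto3.client("bedrock-runtime").invoke_model(modelId=..., body=body)
```

Keeping Sections A to C constant and swapping only Section D means the same function can be reused for every new quarter.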

Fine-tune Meta Llama 2 70B on Amazon Bedrock

In this section, we present our approach to improving the quality of generated earnings call scripts by fine-tuning an LLM. We chose to adapt the Meta Llama 2 70B model, which is powerful and known for its strong performance across various natural language tasks, to the specific domain of earnings call scripts.

The following diagram illustrates the workflow for our fine-tuning method.

To prepare the training data, we collected a comprehensive dataset of real earnings call transcripts from Q1 2021 to Q4 2022 for Amazon.com. This focused dataset allows the model to better learn the company’s domain-specific knowledge and terminology. The time span also makes sure the model can learn from recent trends and patterns in earnings communications.

Amazon Bedrock offers a model customization feature that enables you to directly use your own data to customize a wide variety of models. This feature not only helps improve model performance on specific tasks but also allows the model to better understand company-specific domain knowledge and terms, ultimately creating a better user experience.

To fine-tune a text-to-text model, you need to prepare training and optional validation datasets by creating a JSONL file with multiple JSON lines. Each JSON line is a sample containing both a prompt and completion field. In our use case, the prompt contains the prompt template, which includes key financial data for that quarter, and the completion field contains the actual earnings call transcript for that quarter.

We use the following prompt template:

{"prompt": "Section A: Overall prompt instructions (context)… Section B: Specific guidance for the earnings script (context)… Section D: Financial data for Q1 2021 for which script is required (context) The financial data for {time_period} is:
<financial_data>{Section D}</financial_data> Please generate the earning report for {time_period} to the investors, based on the information provided above. Don't make up any information. ", "completion": "Real earnings call script for Q1 2021"}

The training data is prepared in JSONL format, with each line representing an earnings call for a quarter:

{"prompt": "<prompt1>", "completion": "<expected generated text>"}
{"prompt": "<prompt2>", "completion": "<expected generated text>"}
{"prompt": "<prompt3>", "completion": "<expected generated text>"}
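
A minimal sketch of producing such a file with the Python standard library follows. The `to_jsonl` helper and the placeholder records are hypothetical. Because `json.dumps` escapes newlines inside strings, a multi-paragraph transcript still occupies exactly one line of the file.

```python
import json


def to_jsonl(records):
    """Serialize (prompt, completion) pairs as one JSON object per line."""
    lines = [
        json.dumps({"prompt": prompt, "completion": completion})
        for prompt, completion in records
    ]
    return "\n".join(lines) + "\n"


# Placeholder records; each pair corresponds to one past quarter.
records = [
    ("<prompt for Q1 2021>", "<transcript for Q1 2021>\nSecond paragraph..."),
    ("<prompt for Q2 2021>", "<transcript for Q2 2021>"),
]
jsonl_text = to_jsonl(records)

# The text could then be written to disk before uploading, for example:
# with open("train.jsonl", "w", encoding="utf-8") as f:
#     f.write(jsonl_text)
```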

When the dataset is ready, we upload it to Amazon Simple Storage Service (Amazon S3) and set up a customization job in Amazon Bedrock. The training time varies from minutes to hours, depending on the size of the training data and the selected model. After the training job is complete, you must purchase Provisioned Throughput to use the model and generate future earnings call scripts. You can select the No Commitment option for Provisioned Throughput, which is billed on an hourly basis.
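
If you prefer the API to the console, the customization job can be described as a set of keyword arguments for the `create_model_customization_job` operation of the Bedrock control-plane client. The sketch below only builds those arguments. Every value in the example (job name, role ARN, bucket paths, base model identifier, and the hyperparameter name and value) is a hypothetical placeholder; check the supported hyperparameters and model identifiers for your chosen model before use.

```python
def build_customization_job_args(job_name, custom_model_name, role_arn,
                                 base_model_id, training_s3_uri, output_s3_uri,
                                 hyperparameters):
    """Assemble keyword arguments for a Bedrock fine-tuning job."""
    return {
        "jobName": job_name,
        "customModelName": custom_model_name,
        "roleArn": role_arn,
        "baseModelIdentifier": base_model_id,
        "customizationType": "FINE_TUNING",
        "trainingDataConfig": {"s3Uri": training_s3_uri},
        "outputDataConfig": {"s3Uri": output_s3_uri},
        # The API expects hyperparameter values as strings.
        "hyperParameters": {k: str(v) for k, v in hyperparameters.items()},
    }


# All values below are placeholders for illustration only.
job_args = build_customization_job_args(
    job_name="earnings-call-fine-tune",
    custom_model_name="earnings-call-llama2",
    role_arn="arn:aws:iam::123456789012:role/ExampleBedrockRole",
    base_model_id="<base-model-id>",
    training_s3_uri="s3://example-bucket/train.jsonl",
    output_s3_uri="s3://example-bucket/output/",
    hyperparameters={"epochCount": 2},  # assumed name; verify for your model
)
# The job could then be started with:
# boto3.client("bedrock").create_model_customization_job(**job_args)
```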

For inference, because some language models require a clear separation between the input prompt and expected output during fine-tuning, we need to add a special delimiting key before providing the input to the model. Specifically, for the Meta Llama 2 70B model, we add the key \n\nResponse:\n after the input prompt. This delimiter helps the model distinguish where the prompt ends and the expected response should begin, allowing it to generate more accurate outputs. The prompt would look as follows:

Prompt:
{User_Input_Prompt}

Response:

By providing this formatted prompt during inference, the fine-tuned Meta Llama 2 70B model can better understand the input context and generate a more relevant earnings call script as the response.

For better performance, you can use the same prompt template with the current quarter’s financial data (without the few-shot learning examples), format it with the delimiter, and send it to the customized model to generate the final earnings call script for that quarter.
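
The delimiter formatting can be wrapped in a small helper, sketched below. The helper name is hypothetical, and the exact delimiter characters are an assumption taken from the prompt layout shown earlier; confirm them against the fine-tuning documentation for your model.

```python
# Assumed delimiter, based on the "Prompt: ... Response:" layout shown above.
RESPONSE_DELIMITER = "\n\nResponse:\n"


def format_for_inference(user_prompt, delimiter=RESPONSE_DELIMITER):
    """Wrap the prompt so the model can tell where its response should begin."""
    return f"Prompt:\n{user_prompt.strip()}{delimiter}"


formatted = format_for_inference("  Please generate the earnings call script.  ")
```

Stripping surrounding whitespace before appending the delimiter keeps the separator consistent across calls.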

Evaluation of few-shot prompt engineering and fine-tuning

We evaluated the generated earnings call transcripts from both methods (few-shot prompt engineering and fine-tuning) using two different approaches:

  • Evaluated by a human reviewer
  • Evaluated by comparing three variations using an LLM (Anthropic Claude 3 Sonnet)

Evaluated by human reviewer

The following table summarizes a human reviewer’s evaluation.

Note that two factors contributed to the differences: the approach (few-shot learning versus fine-tuning) and the model (Anthropic Claude 3 Sonnet versus Meta Llama 2 70B). Consequently, the results cannot be interpreted as a straightforward comparison of models. We recommend exploring both approaches with your specific use case and data, and then evaluating the outcomes with subject matter experts from the relevant business department.

| Factor | Fine-Tuned Model | Few-shot Prompt Engineering |
| --- | --- | --- |
| Comprehensiveness | The script covers most of the key points provided in the prompts, although it ignored a few details. For example, it misses the point that the growth in advertising was primarily driven by using machine learning models to improve relevancy of ads. | The script covers key points provided in the prompts. |
| Hallucination | Two instances. (1) “This growth was driven by strong demand for our Prime Day event, which saw record-breaking sales and attracted millions of new Prime members.” (2) “This growth was driven by strong demand in our key markets, including India and Japan.” | Once. (1) “In North America, revenue grew 11% year-over-year to $87.9 billion, fueled by continued robust demand and greater purchase frequency by Prime Members.” |
| Writing style | (1) This script uses mostly objective and precise language, which is consistent with the real earnings call. Still, it has subjective expressions such as “a huge success,” and imprecise expressions such as “double digit growth.” (2) The language offers less variations. For example, it uses the format of “This ___ was driven by ___” 10 times without variations. (3) The model generated some additional sentences. For example, “Now, let’s turn to our forward guidance. At this time, we’re not providing specific revenue or operating income guidance for the fourth quarter.” | The real earnings call uses precise and objective language, while this script uses more metaphoric expressions such as “laser-focused” and “made further strides,” as well as subjective expressions such as “invest prudently” and “disciplined execution.” |
| Ease of Use | (1) Fine-tuning a model in Amazon Bedrock gives the option of following steps on the Amazon Bedrock console or apply coding to interact with LLMs on Amazon Bedrock through the API. (2) The fine-tuning process generally takes longer compared to few-shot prompt engineering based on the same documents. (3) Fine-tuning requires preparing data in input/output format (JSON files) for training the selected model. (4) If a new document is added, the whole fine-tuned model needs to be updated by going through the same fine-tuning process. | (1) Amazon Bedrock allows users to give instructions and example data to an LLM as is using both the UI or creating reproducible codes. (2) If a new document is added, the user only needs to add to the prompt an example for few-shot learning or prompt instructions. Overall, few-shot prompt engineering is easier to implement, compared to fine-tuning a model. |
| Cost | Monthly cost incurred for fine-tuning = Fine-tuning training cost for the model (priced by number of tokens for training data) + custom model storage per month + hourly cost (or Provisioned Throughput cost for time commitment) of custom model inference. | Priced by number of input (few-shot prompts and examples) and output tokens for the model. |

The cost comparison can be further evaluated by the frequency of usage, as shown in the following table.

| Method | One-Time Cost | Recurring Cost | Inference Cost |
| --- | --- | --- | --- |
| Fine-Tuning | Priced by the number of tokens for training data | Custom model storage cost per month | Custom model inference cost (hourly or Provisioned Throughput commitment) |
| Few-Shot Prompt Engineering | N/A | N/A | Priced by number of input (prompts and examples) and output tokens |
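
To make the cost structure concrete, the following sketch turns the two pricing models into simple formulas. The function names are hypothetical, and every number in the example is a made-up placeholder rather than actual AWS pricing; substitute current prices for your Region and models.

```python
def fine_tuning_monthly_cost(training_tokens, price_per_1k_training_tokens,
                             storage_per_month, inference_hours, price_per_hour,
                             include_training=True):
    """One-time training cost (optional) + monthly storage + hourly inference."""
    training = 0.0
    if include_training:
        training = training_tokens / 1000 * price_per_1k_training_tokens
    return training + storage_per_month + inference_hours * price_per_hour


def few_shot_monthly_cost(calls, input_tokens_per_call, output_tokens_per_call,
                          price_per_1k_input, price_per_1k_output):
    """Token-priced cost: the few-shot examples are paid for on every call."""
    per_call = (input_tokens_per_call / 1000 * price_per_1k_input
                + output_tokens_per_call / 1000 * price_per_1k_output)
    return calls * per_call


# Placeholder prices and volumes for illustration only, not AWS pricing.
ft_cost = fine_tuning_monthly_cost(
    training_tokens=200_000, price_per_1k_training_tokens=0.01,
    storage_per_month=2.0, inference_hours=4, price_per_hour=25.0,
)
fs_cost = few_shot_monthly_cost(
    calls=4, input_tokens_per_call=50_000, output_tokens_per_call=4_000,
    price_per_1k_input=0.004, price_per_1k_output=0.016,
)
```

Because few-shot cost scales with the number of calls while fine-tuning adds fixed training and storage costs, the comparison depends heavily on how often scripts are generated.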

Evaluated by comparing three variations using an LLM

We tested the following variations:

  • Variation A – Earnings call transcript from few-shot learning with Anthropic Claude 3 Sonnet
  • Variation B – Earnings call transcript with fine-tuned Meta Llama 2 70B
  • Variation C – Actual earnings call transcript for the quarter

The following table summarizes the key similarities and differences between the three variations of the Amazon Q3 2023 earnings call transcript. Variation A and Variation B have two main differences: different approaches (few-shot learning vs. fine-tuning) and different models (Anthropic Claude 3 Sonnet vs. Meta Llama 2 70B).

| Category | Identified Factor | Result Summaries |
| --- | --- | --- |
| Similarities | Financial Metrics | All variations report strong financial results, with revenue growth around 11% year-over-year and significant increases in operating income. |
| | Business Highlights | They highlight the success of Prime Day as a major driver of sales and Prime member growth. The transcripts mention continued growth in third-party seller services, advertising, and AWS. |
| | Management Focus | There is a focus on improving operational efficiency, cost optimization, and supply chain/delivery improvements. |
| | Innovation and Partnerships | Generative AI initiatives and partnerships (such as Anthropic, Amazon Bedrock, and Amazon CodeWhisperer) are discussed in relation to AWS. |
| Dissimilarities | Level of Financial Detail | Variation A provides more detailed financials (exact revenue, operating income figures) than B and C. |
| | Narrative/Commentary Style | Variation B has more personal commentary from “Jeff Bezos” and “Brian Olsavsky” compared to A and C’s more generic and impersonal style. |
| | Level of Business Detail | Variation C goes into more specifics on initiatives like regionalization, inventory optimization, and cost reduction efforts. Variation A discusses priorities and forward-looking initiatives in more depth compared to B and C. |
| | Forward Guidance | Only Variation C mentions actual forward guidance on capital investments for 2023. |

We can also look at the differences between A and C and between B and C to see how each generated result compares with the actual earnings call script.

| Identified Factor | Difference between A & C | Difference between B & C |
| --- | --- | --- |
| Financial Details | A lacks some of the specific financial details and figures present in the actual script. | B is more similar to the actual script in terms of providing segment-wise financial figures and percentages. |
| Depth of Content | A mentions broad themes and priorities, whereas C dives deeper into operational metrics, cost savings initiatives, and strategic updates. | C provides additional details on topics like free cash flow, capital investments, and strategic initiatives like generative AI. |

Overall, although the core financial highlights are similar, there are nuances in the depth of details provided and the narrative and commentary style across the three variations.

Conclusion

Generating high-quality earnings call script drafts using LLMs is a promising approach that can streamline the process for companies. Both the few-shot prompt engineering and fine-tuning methods demonstrated the ability to produce scripts covering key financial metrics, business updates, and forward-looking guidance. Each method has its own nuances. However, there are trade-offs in terms of comprehensiveness, hallucinations, writing style, ease of implementation, and cost that companies must evaluate based on their specific needs and priorities. As language models continue advancing, further research in customizing and refining these models for the financial services and capital markets domain could unlock even more value for financial communications processes.

This blog presents a framework for two different approaches: few-shot prompt engineering and fine-tuning with Large Language Models (LLMs), followed by an evaluation of the results. The findings should not be interpreted as prescriptive recommendations for favoring one approach over the other, as the choice depends on the specific content and prompts. Additionally, the results should not be construed as a direct comparison of LLMs, as the methodologies employed with each LLM differ, making it an apples-to-oranges comparison. As LLMs continue to advance, we anticipate further improvements in their output quality.

As next steps, you can use Amazon Bedrock to explore your own data and use cases. You can engage in few-shot prompt engineering and fine-tuning methods with different LLMs on Amazon Bedrock, using your specific data securely and privately. Furthermore, you can evaluate the results of these methods by collaborating with subject matter experts or using evaluation frameworks, enabling you to assess the performance and suitability of the methods and LLMs on Amazon Bedrock for your particular use case. You can try out and compare the results, and either use prompt engineering or deploy your own fine-tuned model to generate the earnings calls tied to your company. You can also evaluate both approaches for any related use case.

Refer to Prompt engineering guidelines and Custom models for more information about these two methods. To learn more about applying generative AI for investment research, please refer to AI-powered assistants for investment research with multi-modal data: An application of Agents for Amazon Bedrock.

Refer to this blog to find out more about empowering analysts to perform financial statement analysis, hypothesis testing, and cause-effect analysis with Amazon Bedrock, Anthropic Claude 3 Sonnet, and prompt engineering.


About the Authors

Sovik Kumar Nath is an AI/ML and Generative AI senior solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He has double master's degrees from the University of South Florida and the University of Fribourg, Switzerland, and a bachelor's degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers leverage GenAI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a Ph.D. degree in Electrical Engineering. Outside of work, she loves traveling, working out and exploring new things.

Jia (Vivian) Li is a Senior Solutions Architect at AWS, specializing in AI/ML. She currently supports customers in the financial industry. Prior to joining AWS in 2022, she had 7 years of experience helping enterprise customers use AI/ML in the cloud to drive business results. Vivian has a BS from Peking University and a PhD from the University of Southern California. In her spare time, she enjoys water activities and hiking in the beautiful mountains of her home state, Colorado.

ConvKGYarn: Spinning Configurable and Scalable Conversational Knowledge Graph QA Datasets with Large Language Models

The rapid evolution of Large Language Models (LLMs) and conversational assistants necessitates dynamic, scalable, and configurable conversational datasets for training and evaluation. These datasets must accommodate diverse user interaction modes, including text and voice, each presenting unique modeling challenges. Knowledge Graphs (KGs), with their structured and evolving nature, offer an ideal foundation for current and precise knowledge. Although human-curated KG-based conversational datasets exist, they struggle to keep pace with the rapidly changing user information needs. We present…Apple Machine Learning Research

Streamline insurance underwriting with generative AI using Amazon Bedrock – Part 1

Underwriting is a fundamental function within the insurance industry, serving as the foundation for risk assessment and management. Underwriters are responsible for evaluating insurance applications, determining the level of risk associated with each applicant, and making decisions on whether to accept or reject the application based on the insurer’s guidelines and risk appetite.

In this post, we discuss how to use AWS generative artificial intelligence (AI) solutions like Amazon Bedrock to improve the underwriting process, including rule validation, underwriting guidelines adherence, and decision justification. We’ve also provided an accompanying GitHub repo so you can try the solution.

The underwriting process typically involves several key steps:

  • Gathering and verifying information – Underwriters collect and review various data points about the applicant, such as age, health status, occupation, and lifestyle habits for life insurance, or property location, construction type, and safety features for property insurance
  • Risk assessment – Underwriters analyze the potential risk of insuring the applicant using statistical models, actuarial data, and their own expertise
  • Premium determination – Based on the risk assessment, underwriters calculate the appropriate premium for the desired coverage, aiming to strike a balance between competitive pricing and ensuring the insurer’s profitability
  • Policy customization – Underwriters may tailor insurance policies to meet the specific needs of applicants while aligning with the insurer’s risk management strategy
  • Decision-making – After assessing the risk and determining the appropriate premium, underwriters decide whether to accept or reject the application

Effective underwriting is crucial for the financial stability and profitability of insurance companies. By accurately assessing risk and setting appropriate premiums, underwriters help insurers maintain a balanced risk portfolio and avoid adverse selection of potential policy holders.

Challenges in document understanding for underwriting

Document understanding is a critical and complex aspect of the underwriting process that poses significant challenges for insurers. Underwriters must review and analyze a wide range of documents submitted by applicants, and the manual extraction of relevant information is a time-consuming and error-prone task. The challenges in document understanding can be broadly categorized into three areas:

  • Rule validation – Verifying that the information provided in the documents adheres to the insurer’s underwriting guidelines. This is a complex task when faced with unstructured data, varying document formats, and erroneous data.
  • Underwriting guidelines adherence – Consistently applying the insurer’s underwriting guidelines across all decisions is crucial for maintaining fairness and regulatory compliance. However, manual interpretation can lead to inconsistencies and potential human bias. Also, inconsistent data can lead to flawed rule applications, especially when dealing with large volumes of information.
  • Decision justification – Providing clear and concise explanations for underwriting decisions, especially in cases where an application is denied or offered modified terms or exceptions. This can be time-consuming and may lack the necessary clarity and objectivity.

The impact of these challenges on the underwriting process is significant. Manual data extraction and analysis can slow down the workflow, leading to longer processing times and lower customer retention. Errors in data interpretation or inconsistencies in applying guidelines can result in incorrect risk assessments, premium leakage, and lost customers for the insurer.

To address these challenges, insurers are increasingly turning to advanced technologies such as machine learning, natural language processing, and intelligent document processing solutions.

However, implementing these technologies has been challenging for carriers. Building rules and pipelines for each document or insurance product may require dedicated teams, subject matter expertise in new technologies, and security and compliance controls. Additionally, traditional approaches lack the contextual understanding that underwriting requires, which makes existing solutions fragile. In the next section, we explore how generative AI and Amazon Bedrock can help insurers overcome these challenges and streamline the underwriting process through intelligent document understanding and automation.

How generative AI and Amazon Bedrock help solve these challenges

One of the key advantages of generative AI is its ability to understand and interpret context within documents. Unlike traditional rule-based systems that rely on strict pattern matching, generative AI models can grasp the nuances and semantics of language, allowing them to extract meaningful insights even from complex and varied document formats. This contextual understanding is particularly valuable in underwriting, where the interpretation of information often requires domain-specific knowledge and reasoning.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Amazon Bedrock simplifies the deployment, scaling, implementation, and management of generative AI models for insurers. With Amazon Bedrock, insurers can easily integrate pre-trained models or custom-built models into their existing underwriting workflows and systems, without the need for extensive ML expertise or infrastructure management. Using the power of AI to automate tedious and time-consuming tasks enables underwriters to focus on their core competencies.

To equip FMs with up-to-date and proprietary information, such as underwriting manuals, you can use Retrieval Augmented Generation (RAG), a technique that fetches data from company data sources and enriches the prompt to provide more relevant and accurate responses. Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you implement the entire RAG workflow, from ingestion to retrieval and prompt augmentation, without having to build custom integrations to data sources and manage data flows.

In this solution, we use the knowledge base capability offered by Amazon Bedrock to enhance the reasoning and decision-making process of the generative AI models. Knowledge Bases for Amazon Bedrock allows us to ingest and incorporate relevant underwriting guidelines and manuals into the models’ knowledge base. Knowledge Bases for Amazon Bedrock simplifies the integration process by eliminating the need for custom integrations with data sources and the management of complex data flows. It streamlines the ingestion and retrieval of underwriting manuals, so models have access to the most current and relevant information. We can fetch specific information from the ingested underwriting manuals and enrich the prompts provided to the models. This makes sure the models have access to the most up-to-date and relevant information, enabling them to provide more accurate and context-aware responses. Knowledge Bases for Amazon Bedrock provides a crucial advantage by allowing insurers to infuse their proprietary domain knowledge and underwriting policies into the generative AI models. This empowers the models to make decisions that are fully aligned with the insurer’s risk management strategies, guidelines, and regulatory requirements.
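
The retrieve-then-augment idea behind RAG can be illustrated with a toy example. This sketch does not use Knowledge Bases for Amazon Bedrock: a simple keyword-overlap ranking stands in for the managed retrieval, and the guideline passages, function names, and prompt wording are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, passages, top_k=2):
    """Toy retriever: rank passages by the number of words shared with the query."""
    query_words = tokenize(query)
    ranked = sorted(passages,
                    key=lambda p: len(query_words & tokenize(p)),
                    reverse=True)
    return ranked[:top_k]


def augment_prompt(question, passages):
    """Enrich the prompt with the retrieved guideline passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Use only the guidelines below to answer.\n\n"
        f"Guidelines:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up guideline passages for illustration only.
guidelines = [
    "Applicants must hold a valid driver's license that has not expired.",
    "Property coverage requires a completed safety inspection.",
    "Life insurance applicants must disclose tobacco use.",
]
question = "Is an expired driver's license acceptable?"
top_passages = retrieve(question, guidelines, top_k=1)
rag_prompt = augment_prompt(question, top_passages)
```

In a real deployment, the managed knowledge base would replace `retrieve` with semantic search over the ingested underwriting manuals, while the prompt enrichment step stays conceptually the same.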

Generative AI and Amazon Bedrock can address specific challenges in document understanding for underwriting:

  • Rule validation – Generative AI models can automatically validate the information provided in application documents against an insurer’s underwriting guidelines. By using techniques like RAG or in-context prompting, these models can extract relevant information from documents and compare it against predefined rules, flagging any discrepancies or non-compliance. This reduces the risk of errors and provides consistency in the underwriting process.
  • Underwriting guidelines adherence – Generative AI enables insurers to embed their underwriting guidelines directly into the prompts or instructions provided to the models. By engineering these prompts, insurers can align their AI-driven decision-making process with the company’s risk management strategy. This approach minimizes inconsistencies and potential bias in underwriting decisions.
  • Decision justification – Generative AI models can generate clear and concise explanations for underwriting decisions, providing transparency and objectivity in the process. These models can articulate the reasoning behind each decision based on the information extracted from documents and the insurer’s guidelines, and cite the source documents used in the decision. This makes it straightforward for underwriters to review predictions, and improves communication with applicants, auditors, and regulators.

By adopting generative AI and Amazon Bedrock, insurers can enhance underwriting efficiency, reduce processing times, minimize errors, adhere to fairness and regulatory compliance, and improve transparency and customer satisfaction. In this post, we show a simple use case of validating documents against a set of underwriting guidelines, and in future posts, we will show more complex scenarios across a large corpus of documents, and more advanced underwriting rules.

Solution overview

The following diagram illustrates the automated process for verifying driver’s license records and validating underwriting rules using various AWS services.

The solution includes the following steps:

  1. Users upload an image of a driver’s license record to an Amazon Simple Storage Service (Amazon S3) bucket. The bucket is configured to send event notifications to Amazon EventBridge.
  2. An EventBridge rule is configured to start an AWS Step Functions state machine when objects are uploaded to the S3 bucket.
  3. EventBridge sends the event data to the Step Functions workflow, which will orchestrate multiple AWS services to perform the required tasks for underwriting rules validation.
  4. The state machine starts and runs a series of event-driven steps:
    1. The workflow begins with the “Base64 Image Encoding” state, which encodes an image of the uploaded driver’s license into Base64 format.
    2. The Base64 encoding is then passed to the “Classification” state, which invokes Anthropic Claude 3 Haiku on Amazon Bedrock to classify the image as a driver’s license.
    3. Based on the classification result, the workflow decides whether to proceed using the “Choice (YES or NO)” state.
    4. If classified as a driver’s license, the workflow proceeds to the “Parallel” state to run two Amazon Bedrock tasks in parallel. If not classified as a driver’s license, the workflow will fail.
    5. Under the “Parallel” state, two tasks are run simultaneously:
      1. The first task proceeds to the “Extract Name and License #” workflow state, which uses Amazon Bedrock to invoke Anthropic Claude 3 Haiku to extract the name and the driver’s license number from the image. The name and the license number are then passed to the “Call DMV API with License Info” state, an AWS Lambda function that integrates with the relevant Department of Motor Vehicles (DMV) API to retrieve the driving record.
      2. The second task under the “Parallel” state performs a “Retrieve Information from Underwriting Manual” action to obtain the underwriting rules applicable for a driver to get insurance.
    6. The retrieved underwriting rules information is then passed to the Lambda function “Combine Retrieved information,” which compiles all the relevant rules to be validated into a single body of text.
    7. The final step comprises two tasks. The Lambda function “Generate Final Prompt” creates the prompt used to verify the driving record report against the underwriting manual. This prompt is then used to invoke an Amazon Bedrock model in the “Get Final Result from Bedrock” state, which generates the final report with the rules validation and recommendations.
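
The “Base64 Image Encoding” and “Classification” states described above can be sketched as follows. This is a minimal illustration, not the repository's code: it builds a request body in the Anthropic Messages format used for Claude 3 models on Amazon Bedrock and interprets a YES/NO answer for the “Choice” state. The function names, the classification prompt, and the `max_tokens` value are assumptions.

```python
import base64
import json


def build_classification_request(image_bytes: bytes, media_type: str = "image/jpeg") -> str:
    """Encode an image as Base64 and wrap it in a Claude 3 Messages request body."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 5,  # assumed: a one-word answer is enough
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {"type": "base64", "media_type": media_type, "data": encoded},
                    },
                    # Hypothetical prompt wording
                    {"type": "text", "text": "Is this image a driver's license? Answer YES or NO."},
                ],
            }
        ],
    }
    return json.dumps(body)


def is_drivers_license(model_text: str) -> bool:
    """Map the model's answer to the branch taken by the Choice state."""
    return model_text.strip().upper().startswith("YES")


# Placeholder bytes stand in for the uploaded image.
request_body = build_classification_request(b"placeholder-image-bytes")
print(len(request_body), is_drivers_license("YES"))
```

In a deployed workflow, the returned string would be passed as the `body` of an Amazon Bedrock runtime `invoke_model` call; that call is left out here so the sketch runs without AWS credentials.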

By combining these AWS services and taking advantage of the capabilities of the Anthropic Claude 3 Haiku model, this solution offers a streamlined and intelligent approach to processing driver’s license records for underwriting rules validation purposes. It automates various tasks, reduces manual effort, and enhances the accuracy and efficiency of the underwriting process.

Prerequisites

You need to have the following to run the solution:

  • An AWS account
  • Basic understanding of how to download a repo from GitHub
  • Basic knowledge of running a command on a terminal
  • Underwriting guidelines

Deploy the solution

You can download all the necessary code with instructions from the GitHub repo. Follow the instructions in the GitHub repo README to deploy the solution.

Test the solution

To test the solution, upload a sample driver’s license to the underwriting document bucket.

To find the URL of the underwriting document bucket, follow these steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose the stack GenAIUnderwritingValidationStack.
  3. On the Outputs tab, note the value for UnderwritingBucketURL.

To upload the sample driver’s license to the underwriting document bucket, follow these steps:

  1. On the Amazon S3 console, navigate to the underwriting-document-bucket using the UnderwritingBucketURL.
  2. Choose Upload.
  3. Select the sample driver’s license and choose Upload.

To review the workflow of the Step Functions state machine, follow these steps:

  1. On the Step Functions console, choose State machines in the navigation pane.
  2. Select UnderwritingValidationStateMachine and choose View details.
  3. Select the state machine and review the graph, event, and state views for more details.

Clean up

After you try out the solution, follow the cleanup instructions in the GitHub repo README to avoid accruing charges.

Pricing

This solution is composed of four primary services: Amazon Bedrock, Amazon S3, EventBridge, and Step Functions. We discuss On-Demand Amazon Bedrock pricing in this post. For the other services, review the service’s pricing page.

With On-Demand mode, you pay only for what you use, with no time-based term commitments. For Anthropic Claude 3 models, you’re charged for every input token processed and every output token generated.
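
Because charges accrue per input and output token, a per-request estimate is simple arithmetic. The sketch below assumes prices are quoted per 1,000 tokens; the function name and the example prices are placeholders, not quoted rates, so check the Amazon Bedrock pricing page for current values.

```python
def estimate_on_demand_cost(
    input_tokens: int,
    output_tokens: int,
    price_in_per_1k: float,
    price_out_per_1k: float,
) -> float:
    """Return the estimated cost in USD of one request."""
    input_cost = input_tokens / 1000 * price_in_per_1k
    output_cost = output_tokens / 1000 * price_out_per_1k
    return input_cost + output_cost


# Placeholder prices (USD per 1,000 tokens), for illustration only.
cost = estimate_on_demand_cost(
    input_tokens=2000, output_tokens=500,
    price_in_per_1k=0.00025, price_out_per_1k=0.00125,
)
print(f"Estimated cost per request: ${cost:.6f}")
```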

As shown in the following graph, pricing varies across the Anthropic models: Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus.

Claude 3 Haiku is Anthropic’s fastest, most compact model for near-instant responsiveness. Claude 3 Sonnet strikes the ideal balance between intelligence and speed—particularly for enterprise workloads. This solution uses the sophisticated vision capabilities of Haiku to process photos of drivers’ licenses and uses Sonnet to perform RAG-powered rule validation of a driver’s license record against an underwriting manual document.

Conclusion

In this post, we explored the critical and complex challenges of document understanding within the underwriting process for insurers. Manually extracting relevant information from applicant documents, validating adherence to underwriting guidelines, and providing clear justifications for decisions is time-consuming and error-prone, and can lead to inconsistencies. Generative AI and Amazon Bedrock offer a powerful solution to help overcome these obstacles. We discussed how the reasoning and contextual understanding capabilities of generative AI models allow them to accurately interpret complex documents and extract meaningful insights aligned with an insurer’s specific domain knowledge (such as property and casualty, healthcare, and so on) and corresponding guidelines. We provided a reference architecture that uses Amazon Bedrock FMs and RAG capabilities using Knowledge Bases for Amazon Bedrock, along with orchestration services such as Step Functions, that allow insurers to improve automation in key underwriting tasks like rules validation.

Additionally, you learned about how you can use AWS generative AI solutions to extract relevant information, compare it against defined rules, and flag any non-compliance issues automatically. You can use this innovative approach to improve underwriting efficiency, reduce processing times, minimize human error, achieve fairness and regulatory compliance, and improve transparency with applicants. We showed how insurers can adopt generative AI and Amazon Bedrock to modernize their underwriting processes through intelligent document understanding and automation, gaining a competitive edge through mitigating risks more effectively.

Lastly, we offered a working solution with code you can deploy within your sandbox environment to accelerate the development of your own intelligent document understanding solution using AWS generative AI.


About the Authors

Paul Min is a Solutions Architect at AWS, where he works with customers to advance their mission and accelerate their cloud adoption. He is passionate about helping customers reimagine what’s possible with generative AI on AWS. Outside of work, Paul enjoys spending time with his wife and golfing.

Alfredo Castillo is a Sr. Solutions Architect at AWS, where he works with Financial Services customers on all aspects of internet-scale distributed systems, and specializes in Machine Learning, Natural Language Processing, Intelligent Document Processing, and GenAI. Alfredo has a background in both electrical engineering and computer science. He is passionate about family, technology, and endurance sports.

Max Tybar is a Solutions Architect at AWS with a background in computer science and application development. He enjoys leveraging DevOps practices to architect and build reliable cloud infrastructure that helps solve customer problems. His personal interests lie around leveraging Machine Learning and High-Performance Computing to help solve complex problems faced by Financial Service customers in Banking, Capital Markets and Life Insurance.

Raj Pathak is a Principal Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Generative AI, Natural Language Processing, Intelligent Document Processing, and MLOps.

Import a fine-tuned Meta Llama 3 model for SQL query generation on Amazon Bedrock

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API. Amazon Bedrock also provides a broad set of capabilities needed to build generative AI applications with security, privacy, and responsible AI practices.

Some FMs are publicly available, which allows for customization tailored to specific use cases and domains. However, deploying customized FMs to support generative AI applications in a secure and scalable manner isn’t a trivial task. Hosting large models involves complexity around the selection of instance type and deployment parameters. To address this challenge, AWS recently announced the preview of Amazon Bedrock Custom Model Import, a feature that you can use to import customized models created in other environments—such as Amazon SageMaker, Amazon Elastic Compute Cloud (Amazon EC2) instances, and on premises—into Amazon Bedrock. This feature abstracts the complexity of the deployment process through simple APIs for model deployment and invocation. Currently, Custom Model Import supports importing custom weights for selected model architectures (Meta Llama 2 and Llama 3, Flan, and Mistral) and precisions (FP32, FP16, and BF16), and serving the models on demand and with provisioned throughput.

Customizing FMs can unlock significant value by tailoring their capabilities to specific domains or tasks. This is the first in a series of posts about model customization scenarios that can be imported into Amazon Bedrock to simplify the process of building scalable and secure generative AI applications. By demonstrating the process of deploying fine-tuned models, we aim to empower data scientists, ML engineers, and application developers to harness the full potential of FMs while addressing unique application requirements.

In this post, we demonstrate the process of fine-tuning Meta Llama 3 8B on SageMaker to specialize it in the generation of SQL queries (text-to-SQL). Meta Llama 3 8B is a relatively small model that offers a balance between performance and resource efficiency. AWS customers have explored fine-tuning Meta Llama 3 8B for the generation of SQL queries—especially when using non-standard SQL dialects—and have requested methods to import their customized models into Amazon Bedrock to benefit from the managed infrastructure and security that Amazon Bedrock provides when serving those models.

Solution overview

We walk through the steps of fine-tuning an FM using SageMaker, and importing and evaluating the fine-tuned FM for SQL query generation using Amazon Bedrock. The complete flow is shown in the following figure and covers the following steps:

  1. The user invokes a SageMaker training job to fine-tune the model using QLoRA and store the weights in an Amazon Simple Storage Service (Amazon S3) bucket in the user’s account.
  2. When the fine-tuning job is complete, the user runs the model import job using the Amazon Bedrock console. This step will run Steps 3–5 automatically.
  3. Amazon Bedrock service starts an import job in an AWS operated deployment account.
  4. Model artifacts are copied from the user’s account into an AWS managed S3 bucket.
  5. When the import job is complete, the fine-tuned model will be made available to be invoked.

Bedrock custom model import architecture

All data remains within the selected AWS Region, the model artifacts are imported into the AWS operated deployment account using a VPC endpoint, and you can encrypt your model data with your own AWS Key Management Service (AWS KMS) keys. The scripts for fine-tuning and evaluation are available on the GitHub repository.

A copy of your model artifacts is stored in an AWS operated deployment account. This copy will remain until the custom model is deleted. Deleting artifacts in the user’s account won’t delete the model or the artifacts in the AWS operated account. If different versions of a model are imported into Amazon Bedrock, each version will be managed as an independent project with its own set of artifacts. You can apply tags to models and import jobs to keep track of different projects and versions.

Meta Llama 3 8B is a gated model on Hugging Face, which means that users must be granted access before they’re allowed to download and customize the model. Sign in to your Hugging Face account, read the Meta Llama 3 Acceptable Use Policy, and submit your contact information to be granted access. This process might take a couple of hours.

We use the sql-create-context dataset available on Hugging Face for fine-tuning. The dataset contains 78,577 tuples of context (table schema), question (query expressed in natural language), and answer (SQL query). Refer to the licensing information regarding this dataset before proceeding further.

We use Amazon SageMaker Studio to create a remote fine-tuning job, which will run as a SageMaker training job. SageMaker Studio is a single web-based interface for end-to-end machine learning (ML) development. If you need help configuring your SageMaker Studio domain and your JupyterLab environment, see Launch Amazon SageMaker Studio. The training job will use QLoRA and the PyTorch FullyShardedDataParallel API (FSDP) to fine-tune the Meta Llama 3 model. QLoRA quantizes a pretrained language model to 4 bits and attaches smaller low-rank adapters (LoRA), which are fine-tuned with our training data. PyTorch FSDP is a parallelism technique that shards the model across GPUs for efficient training. See the following notebook for the complete code sample.

Data preparation

In the data preparation stage, we use the following prompt template to insert specific instructions for interpreting the context and fulfilling the request, and store the modified training dataset as JSON files that are uploaded to Amazon S3:

system_message = """You are a powerful text-to-SQL model. Your job is to answer questions about a database."""

def create_conversation(record):
    sample = {"messages": [
        {"role": "system", "content": system_message + f"""You can use the following table schema for context: {record["context"]}"""},
        {"role": "user", "content": f"""Return the SQL query that answers the following question: {record["question"]}"""},
        {"role" : "assistant", "content": f"""{record["answer"]}"""}
    ]}
    return sample
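
To make the template concrete, the following sketch applies it to a single record and serializes the result as one JSON line. It repeats the template above so that it runs on its own; the record is a made-up example in the dataset's context/question/answer shape, not an actual dataset row.

```python
import json

system_message = """You are a powerful text-to-SQL model. Your job is to answer questions about a database."""


def create_conversation(record):
    # Same template as above, repeated so the sketch is self-contained
    sample = {"messages": [
        {"role": "system", "content": system_message + f"""You can use the following table schema for context: {record["context"]}"""},
        {"role": "user", "content": f"""Return the SQL query that answers the following question: {record["question"]}"""},
        {"role": "assistant", "content": f"""{record["answer"]}"""}
    ]}
    return sample


# Hypothetical record for illustration only
record = {
    "context": "CREATE TABLE employees (name VARCHAR, salary INTEGER)",
    "question": "What is the highest salary?",
    "answer": "SELECT MAX(salary) FROM employees",
}

sample = create_conversation(record)
print(json.dumps(sample))  # one JSON object per training example
```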

Fine-tune Meta Llama 3 8B model

Refer to the run_fsdp_qlora.py file defined in the notebook for a full description of the fine-tuning script. The following snippets describe the configuration of the QLoRA job:

if script_args.use_qlora:
    print(f"Using QLoRA - {torch_dtype}")
    quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch_dtype,
            bnb_4bit_quant_storage=quant_storage_dtype,
        )
else:
    quantization_config = None

peft_config = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.05,
    r=16,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
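
To give a sense of scale for these settings, the sketch below counts the trainable parameters a rank-16 adapter adds to a single square projection, using the 4096 hidden size listed in the model's config.json, and computes the `lora_alpha / r` factor that standard LoRA applies to the adapter output. The helper name is hypothetical, and the calculation covers one layer only, not the whole model.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank matrices: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


r, lora_alpha = 16, 8      # values from the LoraConfig above
hidden_size = 4096         # from the Meta Llama 3 8B config.json

full_params = hidden_size * hidden_size
adapter_params = lora_trainable_params(hidden_size, hidden_size, r)
scaling = lora_alpha / r   # factor applied to the adapter output in standard LoRA

print(f"Full projection:  {full_params:,} parameters")
print(f"LoRA adapter:     {adapter_params:,} parameters "
      f"({adapter_params / full_params:.2%} of the projection)")
print(f"Adapter scaling:  {scaling}")
```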

The trainer class is based on Supervised Fine-tuning Trainer (SFT Trainer) from Hugging Face, which is an API to create your SFT models and train them with a few lines of code:

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    dataset_text_field="text",
    eval_dataset=test_dataset,
    peft_config=peft_config,
    max_seq_length=script_args.max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={
        "add_special_tokens": False,  # We template with special tokens
        "append_concat_token": False,  # No need to add additional separator token
    },
)

Once the adapter is trained, it is merged with the original model before persisting the weights. Custom Model Import does not support LoRA adapters at the moment.

model = model.merge_and_unload()
model.save_pretrained(
    sagemaker_save_dir, safe_serialization=True, max_shard_size="2GB"
)
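
Conceptually, merging folds the scaled low-rank update into the base weight, so the merged layer gives the same output as the base layer plus adapter. The toy sketch below shows this with tiny hand-picked matrices in plain Python; it ignores quantization details and is not how `merge_and_unload` is implemented. All names and values are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, a, b, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy values: a 2x2 base weight with a rank-1 adapter
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]       # r x d_in
b = [[0.5], [1.0]]     # d_out x r
x = [[1.0], [1.0]]     # input column vector

merged = merge_lora(w, a, b, lora_alpha=2, r=1)

# Output of base layer plus adapter, computed without merging
adapter_out = matmul(b, matmul(a, x))
unmerged_out = [[base[0] + 2 * ad[0]] for base, ad in zip(matmul(w, x), adapter_out)]

print(merged)                                # merged weight matrix
print(matmul(merged, x) == unmerged_out)     # both paths agree
```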

For this use case, we use an ml.g5.12xlarge instance, which has four NVIDIA A10G GPUs. The key configurations are as follows:

from huggingface_hub import HfFolder
from sagemaker.huggingface import HuggingFace

# job_name, role, and hyperparameters are assumed to be defined earlier in the notebook
huggingface_estimator = HuggingFace(
    entry_point          = 'run_fsdp_qlora.py',   # training script
    source_dir           = 'scripts/trl/',        # directory that includes all the files needed for training
    instance_type        = 'ml.g5.12xlarge',      # instance type used for the training job
    instance_count       = 1,                     # number of instances used for training
    max_run              = 2*24*60*60,            # maximum runtime in seconds (days * hours * minutes * seconds)
    base_job_name        = job_name,              # name of the training job
    role                 = role,                  # IAM role used by the training job to access AWS resources, e.g. S3
    volume_size          = 300,                   # size of the EBS volume in GB
    transformers_version = '4.36.0',              # transformers version used in the training job
    pytorch_version      = '2.1.0',               # PyTorch version used in the training job
    py_version           = 'py310',               # Python version used in the training job
    hyperparameters      = hyperparameters,       # hyperparameters passed to the training job
    disable_output_compression = True,            # skip output compression to save training time and cost
    distribution         = {"torch_distributed": {"enabled": True}},
    environment          = {
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache",   # cache models in /tmp
        "HF_TOKEN": HfFolder.get_token(),         # Hugging Face token used for downloading base models
        "ACCELERATE_USE_FSDP": "1",
        "FSDP_CPU_RAM_EFFICIENT_LOADING": "1"
    },
)

In our testing, the training job completed two epochs in approximately 2.5 hours on a single ml.g5.12xlarge instance, which incurred approximately $18 for training cost. After training is complete, model weights in the Hugging Face safetensors format, the tokenizer, and the configuration file will be uploaded to the S3 bucket defined in the training script. This path should be stored to be used as the base directory for the import job in the next section.

s3_files_path = huggingface_estimator.model_data["S3DataSource"]["S3Uri"]

The configuration file config.json will inform Amazon Bedrock how to load the weights from the safetensors files. Some parameters to keep in mind are the model_type, which must be one of the types currently supported by Amazon Bedrock, max_position_embeddings, which sets the maximum length of input sequence that the model can handle, the model dimensions (hidden_size, intermediate_size, num_hidden_layers, and num_attention_heads), and rotary position embedding (RoPE) parameters, which describe the encoding of position information. See the following configuration:

{
  "_name_or_path": "meta-llama/Meta-Llama-3-8B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.40.2",
  "use_cache": true,
  "vocab_size": 128256
}
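
One way to sanity-check a config.json before import is to derive the parameter count from its dimensions. The sketch below does this for the values shown above, assuming the standard Llama layout (grouped-query attention, a gated three-matrix MLP, two RMSNorm weights per layer, and no bias terms, consistent with `attention_bias` being false). The function name is hypothetical.

```python
# Subset of the configuration shown above
config = {
    "hidden_size": 4096,
    "intermediate_size": 14336,
    "num_attention_heads": 32,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "tie_word_embeddings": False,
    "vocab_size": 128256,
}


def count_llama_params(cfg: dict) -> int:
    """Count weights implied by a Llama-style config (assumes no bias terms)."""
    h = cfg["hidden_size"]
    head_dim = h // cfg["num_attention_heads"]
    kv_dim = head_dim * cfg["num_key_value_heads"]

    attention = 2 * h * h + 2 * h * kv_dim     # q and o projections, k and v projections
    mlp = 3 * h * cfg["intermediate_size"]     # gate, up, and down projections
    norms = 2 * h                              # two RMSNorm weights per layer
    per_layer = attention + mlp + norms

    embeddings = cfg["vocab_size"] * h
    lm_head = 0 if cfg["tie_word_embeddings"] else embeddings
    final_norm = h
    return cfg["num_hidden_layers"] * per_layer + embeddings + lm_head + final_norm


total = count_llama_params(config)
print(f"Parameters: {total:,}")
print(f"Approximate size at 2 bytes per weight: {total * 2 / 2**30:.1f} GiB")
```

A count near 8 billion is consistent with the model name; a large mismatch would suggest that a dimension in the file is wrong.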

Import the fine-tuned model into Amazon Bedrock

To import the fine-tuned Meta Llama 3 model into Amazon Bedrock, complete the following steps:

  1. On the Amazon Bedrock console, choose Imported models on the navigation pane.
  2. Choose Import model.

  3. For Model name, enter llama-3-8b-text-to-sql.
  4. For Model import settings, enter the Amazon S3 location from the previous steps.
  5. Choose Import model.
    The model import job should take 15–18 minutes to complete.
  6. When it’s done, choose Models to see your model.
  7. Copy the model Amazon Resource Name (ARN) so you can invoke the model with the AWS SDK in the next section.
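
The console steps above could also be scripted. The following is a sketch, assuming the boto3 `bedrock` client's `create_model_import_job` operation and an IAM role that Amazon Bedrock can assume to read the S3 location; the helper names, job name, role ARN, and S3 URI are placeholders, not values from this post.

```python
def build_import_job_request(job_name: str, model_name: str, role_arn: str, s3_uri: str) -> dict:
    """Assemble the arguments for a model import job."""
    return {
        "jobName": job_name,
        "importedModelName": model_name,
        "roleArn": role_arn,
        "modelDataSource": {"s3DataSource": {"s3Uri": s3_uri}},
    }


def start_import_job(request: dict, region_name: str) -> str:
    """Submit the import job and return its ARN (requires AWS credentials)."""
    import boto3  # imported here so the request builder works without boto3 installed

    bedrock = boto3.client("bedrock", region_name=region_name)
    response = bedrock.create_model_import_job(**request)
    return response["jobArn"]


# Placeholder values for illustration only
request = build_import_job_request(
    job_name="llama-3-8b-text-to-sql-import",
    model_name="llama-3-8b-text-to-sql",
    role_arn="arn:aws:iam::111122223333:role/ExampleBedrockImportRole",
    s3_uri="s3://example-bucket/path/to/model/",
)
print(request)
# job_arn = start_import_job(request, region_name="us-east-1")
```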

Evaluate SQL queries generated by the fine-tuned model

In this section, we provide two examples to evaluate the SQL queries generated by the fine-tuned model: one using the Amazon Bedrock Text Playground and one using a large language model (LLM) as a judge.

Using the Amazon Bedrock Text Playground

You can test the model using the Amazon Bedrock Text Playground. For optimal results, use the same prompt template used to preprocess your training data:

<s>[INST] <<SYS>>You are a powerful text-to-SQL model. Your job is to answer questions about a database. You can use the following table schema for context: CREATE TABLE table_name_11 (tournament VARCHAR)<</SYS>>

[INST]Human: Return the SQL query that answers the following question: Which Tournament has A in 1987?[/INST]

Assistant:

The following animation shows the results.

Using LLM as a judge

In the same example notebook, we used the Amazon Bedrock InvokeModel API to call our imported model on demand to generate SQL queries for records in our test dataset. We use the same prompt template used with the training data in the fine-tuning step. The imported model will only support parameters that were supported by the base model (max_tokens, top_p, and temperature). Imported models don’t support penalty terms (repetition_penalty or length_penalty) or the use of token sampling instead of greedy decoding (do_sample). See the following code:

import json

# `client` (a boto3 "bedrock-runtime" client) and `model_id` (the ARN of the
# imported model) are assumed to be defined earlier in the notebook.
def get_sql_query(system_prompt, user_question):
    """
    Generate a SQL query using the fine-tuned Llama 3 8B model.
    Remember to use the same template used in fine-tuning.
    """
    formatted_prompt = f"<s>[INST] <<SYS>>{system_prompt}<</SYS>>\n\n[INST]Human: {user_question}[/INST]\n\nAssistant:"
    native_request = {
        "prompt": formatted_prompt,
        "max_tokens": 100,
        "top_p": 0.9,
        "temperature": 0.1
    }
    response = client.invoke_model(modelId=model_id,
                                   body=json.dumps(native_request))
    # The generated text is in the first entry of the "outputs" list
    response_text = json.loads(response.get('body').read())["outputs"][0]["text"]

    return response_text

After we generate model predictions, we use a different (more powerful) model to act as a judge and evaluate our fine-tuned model responses. For this example, we use the Anthropic Claude 3 Sonnet LLM on Amazon Bedrock to measure the similarity between the desired answer and the predicted answer using the following prompt:

formatted_prompt = f"""You are a data science teacher that is introducing students to SQL. Consider the following question and schema:
<question>{question}</question>
<schema>{db_schema}</schema>
    
Here is the correct answer:
<correct_answer>{correct_answer}</correct_answer>
    
Here is the student's answer:
<student_answer>{test_answer}</student_answer>

Please provide a numeric score from 0 to 100 on how well the student's answer matches the correct answer for this question.
The score should be high if the answers say essentially the same thing.
The score should be lower if some parts are missing, or if extra unnecessary parts have been included.
The score should be 0 for an entirely wrong answer. Put the score in <SCORE> XML tags.
Do not consider your own answer to the question, but instead score based only on the correct answer above.
"""

The predicted score based on our holdout split of the dataset was 96.65%, which is excellent for a small model tuned to a specific task.
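
Because the judge is asked to put its score in <SCORE> tags, an aggregate can be computed by parsing each response and averaging. The sketch below shows one way to do that; the function names, the choice to skip responses without a valid score, and the sample responses are assumptions, not the notebook's code.

```python
import re
from statistics import mean

SCORE_PATTERN = re.compile(r"<SCORE>\s*(\d+(?:\.\d+)?)\s*</SCORE>", re.IGNORECASE)


def extract_score(judge_text: str):
    """Return the score inside <SCORE> tags, or None if missing or out of range."""
    match = SCORE_PATTERN.search(judge_text)
    if match is None:
        return None
    score = float(match.group(1))
    return score if 0 <= score <= 100 else None


def average_score(judge_outputs) -> float:
    """Average the valid scores, skipping responses that have none."""
    scores = [s for s in map(extract_score, judge_outputs) if s is not None]
    if not scores:
        raise ValueError("no valid scores found")
    return mean(scores)


# Made-up judge responses for illustration
judge_outputs = [
    "The queries are equivalent. <SCORE>100</SCORE>",
    "An extra column is selected. <SCORE> 90 </SCORE>",
    "The response did not include a score.",
]
print(average_score(judge_outputs))
```

Skipping unparseable responses keeps the average from being skewed by formatting failures, but it is worth tracking how many were skipped.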

Clean up

The model will spin down to zero after a period of no activity and your cost will stop accruing. However, we recommend deleting the imported model using the Amazon Bedrock console. Remember to also delete model artifacts from your S3 bucket when the fine-tuned model is no longer needed to prevent incurring costs.

Conclusion

This post presented an overview of the process of fine-tuning a small model using SageMaker to help generate more accurate SQL queries based on questions asked in natural language and then importing the fine-tuned model into Amazon Bedrock using the Custom Model Import feature. After we imported the model, it was made available on demand through the Amazon Bedrock Playground and the InvokeModel API, which was used to evaluate the performance of the fine-tuned model against a holdout dataset using an LLM as a judge.

The following are recommended best practices that may be helpful when using fine-tuned FMs for code generation tasks:

  • Select a dataset that is relevant and diverse enough for your code generation task
  • Monitor the training job and PEFT parameters to prevent overfitting and catastrophic forgetting
  • Preprocess training data with a consistent instruction template
  • Store model weights using safetensors for fast loading
  • Invoke the model using the same instruction template used in fine-tuning, using only inference parameters that are supported by the base model and the Custom Model Import feature in Amazon Bedrock

Explore the Amazon Bedrock Custom Model Import feature as a way to deploy FMs fine-tuned for code generation tasks in a secure and scalable manner. Visit our GitHub repository to explore samples prepared for fine-tuning and importing models from various families.


About the Authors

Evandro Franco is a Sr. AI/ML Specialist Solutions Architect at Amazon Web Services. He helps AWS customers overcome business challenges related to AI/ML on top of AWS. He has more than 18 years of experience working with technology, from software development, infrastructure, and serverless to machine learning.

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Jay Pillai is a Principal Solution Architect at Amazon Web Services. In this role, he functions as the Global Generative AI Lead Architect and also the Lead Architect for Supply Chain Solutions with AABG. As an Information Technology Leader, Jay specializes in artificial intelligence, data integration, business intelligence, and user interface domains. He has 23 years of extensive experience working with several clients across supply chain, legal technologies, real estate, financial services, insurance, payments, and market research business domains.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on the serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in Generative AI, Artificial Intelligence, Machine Learning, and System Design. He is passionate about developing state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Ragha Prasad is a Principal Engineer and a founding member of Amazon Bedrock, where he has had the privilege to listen to customer needs first-hand and understands what it takes to build and launch scalable and secure Gen AI products. Prior to Bedrock, he worked on numerous products in Amazon, ranging from devices to Ads to Robotics.

August Adventures Await: 18 New Games Coming to GeForce NOW

Members can choose their own adventure with GeForce NOW bringing 18 new games to the cloud in August — including Square Enix’s fantasy role-playing game Visions of Mana when it launches on PC Thursday, Aug. 29.

From cozy games to thrilling battles, there’s something for everyone. Dive into the latest titles and experience powerful performance across all devices — start with the six games available to stream this week.

Plus, the limited-time GeForce NOW Summer Sale continues, offering a 50% discount on new one-month and six-month Ultimate and Priority memberships. Check it out before the deal ends on Sunday, Aug. 18.

Awesome August

Stormgate on GeForce NOW
Fight for the future.

Plunge into the heart of battle with Stormgate, a newly released real-time strategy game from Frost Giant Studios, which is renowned for its work on popular games StarCraft II and Warcraft III. In single-player or multiplayer mode, fight demonic invaders, build bases and command armies to save humanity. Get immersed in a rich storyline, explore diverse factions and experience a blend of new and classic real-time strategy mechanics.

Members can check out the following new additions this week:

  • Stormgate Early Access (New release on Steam, July 30)
  • Space for Sale (New release on Steam, July 30)
  • Cyber Knights: Flashpoint (Steam)
  • Dark and Darker (Steam)
  • Kunitsu-Gami: Path of the Goddess (Xbox, available on PC Game Pass)
  • Kunitsu-Gami: Path of the Goddess Demo (Steam and Xbox)

And here’s a preview of what’s coming later this month:

  • Warhammer 40,000: Speed Freeks (New release on Steam, Aug. 6)
  • Ratten Reich (New release on Steam, Aug. 9)
  • Level Zero Extraction (New release on Steam, Aug. 13)
  • shapez 2 (New release on Steam, Aug. 15)
  • Akimbot (New release on Steam, Aug. 29)
  • Gori: Cuddly Carnage (New release on Steam, Aug. 29)
  • MEMORIAPOLIS (New release on Steam, Aug. 29)
  • Visions of Mana (New release on Steam, Aug. 29)
  • Breachway (New release on Steam, Aug. 30)
  • Star Wars Outlaws (New release on Ubisoft, Aug. 30)
  • Avatar: Frontiers of Pandora (Steam)
  • Heading Out (Steam)
  • Nine Sols (Steam)
  • Saturnalia (Steam)
  • We Were Here Too (Steam)

Jammin’ July

In addition to the 22 games announced last month, six more joined the GeForce NOW library:

  • Cricket 24: The Official Game of the Ashes (New release on Xbox and available on PC Game Pass, July 9)
  • The Elder Scrolls V: Skyrim (Steam)
  • The Elder Scrolls V: Skyrim Special Edition (Steam, Epic Games Store and Xbox, available on PC Game Pass)
  • Kunitsu-Gami: Path of the Goddess (New release on Steam, July 18)
  • Nobody Wants to Die (New release on Steam, July 17)
  • The Settlers: New Allies (Steam)

HAWKED and Flintlock: The Siege of Dawn (Xbox) were included in the July games list. HAWKED will no longer be added to GeForce NOW, while Flintlock: The Siege of Dawn will be added at another time. Stay tuned to GFN Thursday for more updates.

Starting in November, GeForce NOW will transition away from updating the GeForce NOW Windows and macOS apps for legacy operating systems, including Windows 7, Windows 8.1 and macOS 10.11-10.14. Members on these systems can still enjoy streaming on play.geforcenow.com via supported web browsers.

What are you planning to play this weekend? Let us know on X or in the comments below.

What’s Your Story: Emre Kiciman

In the Microsoft Research Podcast series What’s Your Story, Johannes Gehrke explores the who behind the technical and scientific advancements helping to reshape the world. A systems expert whose 10 years with Microsoft spans research and product, Gehrke talks to members of the company’s research community about what motivates their work and how they got where they are today. 

In this episode, Gehrke is joined by Senior Principal Research Manager Emre Kiciman. Kiciman’s work in causal machine learning has resulted in tools for finding meaning in data, including the DoWhy library for modeling and testing causal assumptions, and his study of AI is focused on advancing toward systems that not only are more secure but are as positive in their impact as possible. In this episode, Kiciman shares how a side business pursued by his dad opened the door to computing; why his PhD adviser strongly recommended not using the words “artificial intelligence” in his thesis; and the moments that precipitated his moves from systems and networking to computational social science and now causal analysis and large-scale AI applications.

Learn more:

Emre Kiciman at Microsoft Research 

AI Controller Interface: Generative AI with a lightweight, LLM-integrated VM 
Microsoft Research blog, February 2024 

AICI: Prompts as (Wasm) Programs
GitHub repo 

AI Frontiers: The future of causal reasoning with Emre Kiciman and Amit Sharma 
Microsoft Research Podcast, June 2023 

Modeling the Data-Generating Process is Necessary for Out-of-Distribution Generalization 
Publication, January 2023

An Open Source Ecosystem for Causal Machine Learning
PyWhy.org 

U Rank Demo Screencast 
September 2008 

Transcript

[TEASER]     

[MUSIC PLAYS UNDER DIALOGUE]

EMRE KICIMAN: I think it’s really important for people to find passion and joy in the work that they do. At some point, do the work for the work’s sake. I think this will drive you through the challenges that you’ll inevitably face with any sort of project and give you the persistence that you need to really have the impact that you want to have. 

[TEASER ENDS]  

JOHANNES GEHRKE: Microsoft Research works at the cutting edge. But how much do we know about the people behind the science and technology that we create? This is What’s Your Story, and I’m Johannes Gehrke. In my 10 years with Microsoft, across product and research, I’ve been continuously excited and inspired by the people I work with, and I’m curious about how they became the talented and passionate people they are today. So I sat down with some of them. Now, I’m sharing their stories with you. In this podcast series, you’ll hear from them about how they grew up, the critical choices that shaped their lives, and their advice to others looking to carve a similar path.   

[MUSIC FADES] 


In this episode, I’m talking with Emre Kiciman, the senior principal research manager leading the AI for Industry research team at Microsoft Research Redmond. After completing a PhD in systems and networking in 2005, Emre began his career with Microsoft Research in the same area, studying reliability in large-scale internet services. Exposure to social data inspired him to refocus his research pursuits: his recent work in causal analysis—including DoWhy, a Python library for causal inference—is helping to connect the whats and whys in the abundance of data that exists. Meanwhile, his work with large language models is geared toward making AI systems more secure and maximizing their benefit to society. Here’s my conversation with Emre, beginning with some of his work at Microsoft Research and how he landed in computer science. 

GEHRKE: Welcome to What’s Your Story. So can you just tell us a little bit about what you do at MSR [Microsoft Research]?

KICIMAN: Sure. I work primarily on two areas at the moment, I guess. One is causal analysis, where we work on trying to answer cause-and-effect questions from data in a wide variety of domains, kind of, building that horizontal platform. And I work a lot recently, especially with this large language model focus, on the security of AI-driven systems: how do we make sure that these AI systems that we’re building are not opening up new vulnerabilities to attackers? 

GEHRKE: Super interesting. And maybe we can start out even before we go more in depth into that by, you know, how did you actually end up in computer science? I learned that you grew up in Berkeley. 

KICIMAN: Yeah, on average, I like to say.  

GEHRKE: On average? [LAUGHTER] 

KICIMAN: So I moved to the US with my parents when I was 2 years old, and we lived in El Cerrito, a small town just north of Berkeley. And then around middle school age, we moved to Piedmont, just south of Berkeley. So on average, yes, I grew up in Berkeley, and I did end up going there for college. And you asked about how I got into computer science. When I was probably around third or fourth grade, my dad, who was a civil engineer, decided that he wanted to start a business on the side, and he loved software engineering and wanted to build software to help automate a lot of the more cumbersome design tasks in the design of steel connections, and so he wrote … he bought a PC and brought it home and started working on his work. But then that was also my opportunity to learn what a computer was. 

GEHRKE: So that was your first computer? Was it an x86? 

KICIMAN: Yes, it was an IBM PC, the first x86, the one before the 286. And—it wasn’t the very original PC. It did have a CGA—color graphics adapter—so we could have four colors at once.  

GEHRKE: Nice. 

KICIMAN: And, yeah, that’s … it came with—luckily for me, I guess—it came with a BASIC manual. So reading that manual is how I learned how to program. 

GEHRKE: And this is the typical IBM white box with a monitor on top of it and a floppy drive, or how should I picture it? 

KICIMAN: Yeah, two floppy drives …  

GEHRKE: Two floppy drives? OK …  

KICIMAN: Two floppy drives, yeah, so you could copy from one to the other.  

GEHRKE: Five and a quarter or three and a half? 

KICIMAN: Five and a quarter, yeah, yeah. The loud, clickety-clack keyboard and, yeah, a nice monitor. So not the green and black; the one that could display the colors. And, yeah, had a lot of fun with programming. 

GEHRKE: So what were some of the first things that you wrote? 

KICIMAN: A lot of the first ones were just the examples from the book, the for loops, for example. But then after that, I started getting into some of the, you know, building, like, little mini painting tools. You know, you could move a cursor around the screen, click a button and paint to fill in a region, and then save the commands that you did to make graphics. Eventually, that actually turned into, like, a friend and I really enjoyed playing computer games, so we had in our mind we’re going to build a computer game. 

GEHRKE: Who doesn’t think that.  

KICIMAN: Of course, right? 

GEHRKE: Of course … 

KICIMAN: And so we had, like, a “choose your own adventure”–style program. I think we had maybe even four or five screens you could step through, right. And he was able to get some boxes, and we printed some manuals even. We had big plans, but then we didn’t know what to do, how to finish the game, how to get it out there, so … but we had a lot of fun.  

GEHRKE: Wow, that sounds amazing. 

KICIMAN: Really fond memories, yeah. 

GEHRKE: That sounds amazing. And then you went to Berkeley afterwards? Is that how you realized your passion, or how do you decide to study computer science?

KICIMAN: Yeah … so from that age, I was set on computing. I think my parents were a bit of a devil’s advocate. They wanted me to consider my options. So I did consider, like, mechanical engineering or industrial engineering in, like, maybe junior year of high school, but it never felt right. I went into computing, had a very smooth transition into Berkeley. They have a local program where students from the local high school can start to take college classes early. So I’d even started taking some computer classes and then just went right into my freshman year. 

GEHRKE: Sounds like a very smooth transition. Anything bumpy? Anything bumpy on the ride out there, or …?  

KICIMAN: Nothing really, nothing really bumpy. I had one general engineering class that somehow got on my schedule at 8 AM freshman year. 

GEHRKE: [LAUGHS] That’s a tough one.  

KICIMAN: That’s a tough one, yeah. And so there were a few weeks I didn’t attend class, and I knew there was a midterm coming up, so I show up. Because, you know, next week, there’s a midterm. I better figure out what they’re, what they’re learning. And I come in a couple minutes late because it’s, even though I’m intending to go, it’s still an 8 AM class. I show up a few minutes late, and everyone is heads down writing on pieces of paper. The whole room is quiet. And the TA gives me a packet and says, you might as well start now. “Oh no.” And I’m like freaking out. Like this is, this is a bad dream. [LAUGHS] And I’m flipping through … not only do I not know how to answer the questions; I don’t understand the questions, like the vocabulary. It’s only been three weeks. How did they learn so much? And then I noticed that it’s an open-book exam and I don’t have my book on top of it, like … but what I didn’t notice and what became apparent in about 20 minutes … the TA clapped his hands, and said, “All right, everyone, put it down. We’ll go over the answers now.” It was a practice. 

GEHRKE: Oh, lucky you. 

KICIMAN: Oh, my god, yes. So I did nothing but study for that exam for the next week and did fine on it. 

GEHRKE: So you didn’t have to drop the class or anything like that? 

KICIMAN: No, no, no. I studied enough that I did reasonably, you know, reasonably well.  

GEHRKE: At what point in time was it clear to you that you wanted to do a PhD or that you wanted to continue your studies? 

KICIMAN: I tried to explore a lot during my undergrad, so I did go off to industry for a summer internship. Super fun.  

GEHRKE: Where did you, where did you work?  

KICIMAN: It was Netscape. 

GEHRKE: Oh Netscape. 

KICIMAN: And it was a joint project with IBM. 

GEHRKE: Which year was that in? 

KICIMAN: This would have been ’90, around ’93.[1]

GEHRKE: ’93 … OK, so the very early days of Netscape, actually. 

KICIMAN: Yeah, yeah. They were building Netscape Navigator 4, and the project I was on was Netscape Navigator for OS/2.  

GEHRKE: OK.

KICIMAN: IBM’s OS/2 had come out and was doing poorly against NT, and they wanted to raise its profile. And this team of 20 people were really just focused on getting this out there. And so I always thought of, you know—and I was an OS/2 user already, which is how I got onto that project. 

GEHRKE: OK … And how was the culture there, or …?  

KICIMAN: The culture, it’s what you would think of as a startup culture. You know, they gave out all their meals. There was lots of fun events. You know, dentists came into the parking lot like once a month or something like that. 

GEHRKE: Dentist?  

KICIMAN: There was, like, a yeah, it was, yeah, you know, everyone’s working too much at the office, so the company wanted to make things easy.  

GEHRKE: That sounds great. 

KICIMAN: But the next summer then, I did a research internship, a research assistantship, at Berkeley. I worked with Randy Katz and Eric Brewer and got into, you know, trying to understand cellphone networks and what they were thinking about, you know, cloud infrastructure for new cellular technologies. 

GEHRKE: And Eric Brewer, was he, at that point in time, already running Inktomi, or … ? 

KICIMAN: He was already running Inktomi. Yeah, yeah, he’d already started it. I don’t think it was public yet at the time, but maybe getting there.  

GEHRKE: OK. Well, this was right at the beginning when, like, all the, you know, cloud infrastructure was defined and, you know, a lot of the basics were set. So you did this internship then in your, after your junior year, the second one?  

KICIMAN: Yeah, after my junior year. It was then senior year, and it was time to apply for, you know, what’s going to come after college. And I knew it … after that assistantship at Berkeley, I knew I was going to go do a PhD. 

GEHRKE: So what is the thing about the internship that made you want to stay in research? 

KICIMAN: Oh, it’s just the … it gave a vision of the future. Like, we were playing with, like, you know, there were people in the lab playing with video over the internet and, you know, teleconferencing, and just seeing that, it felt like you were seeing into the future and diving deep technically across the stack in a way that the industry internship hadn’t done. And so that part of it and obviously lots of particulars. You know, lots of internships do go very deep in industry, as well, but that’s what struck me, is that, kind of, wanting to learn was the big driver.  

GEHRKE: And what excited you about systems as compared to something that’s more applications-oriented or more touching the user? I feel like systems you always have to have this, kind of, drive for infrastructure and for scale and for, you know, building the foundation as compared to, like, directly impacting the user. 

KICIMAN: I think the way I think about systems today—and I can’t remember what it was about systems then. I’d always done operating … like, operating systems was one of my first upper-division courses at Berkeley and everything. So, like, I certainly enjoyed it a lot. But the way I think about systems now—and I think I do bring systems thinking to a lot of the work I do, even in AI and responsible AI—is the way you structure software, it feels like you should be making a statement about what the underlying problem is, what is the component you should be building from an elegance or first-principles perspective. But really, it’s about the people who are going to be using and building and maintaining that system. You want to componentize it so that the teams who are going to be building the bigger thing can work independently, revise and update their software without having to coordinate every little thing. I think that’s where that systems thinking comes in for me, is what’s the right abstraction that’s going to decouple folks from each other. 

GEHRKE: That’s a really great analogy because the way it was once told to me was that systems is really about discovering the beauty in large software. Because once you touch the user, you, sort of, have to do whatever is necessary to, you know, make the user happy. But in the foundations, you should have simplicity; you should have ease; you should have elegance. Is that how you think about it? 

KICIMAN: I do think about those aspects, but it’s for a purpose. You know, you want the elegance and the simplicity so that you can have, you know, one team working on Layer 1 of the stack, another team working on Layer 2 of the stack, and you don’t want them to have to talk to each other every 10 minutes when they’re making any change to any line of code, right. And so thinking about, what is the more fundamental layer of abstraction that lets these people work on separate problems? That’s what’s important to me. And, of course, like, that then interplays with people’s interests and expertise. And as people’s expertise evolves, that might mean that that has implications for the design of your system.  

GEHRKE: And so you’re, OK, you’re an undergrad. You have done this research experience; you now apply. So now you go to grad school. Do you do anything fun between your undergrad and grad school? 

KICIMAN: No, I went straight in. 

GEHRKE: Right straight in?  

KICIMAN: Right straight in. I did my PhD at Stanford. So I went, you know, a little way to school.

GEHRKE: To a rival school, isn’t it? Isn’t it a big rival school? 

KICIMAN: To a rival school. Well, the undergrad school wins. I think that’s the general rule of thumb. But I did continue working with folks at Berkeley. So my adviser was also from Berkeley and so …  

GEHRKE: Who was your adviser? 

KICIMAN: My adviser was Armando Fox, …  

GEHRKE: OK, yeah. Mm-hmm.  

KICIMAN: and we had a … 

GEHRKE: Recovery-oriented computing? 

KICIMAN: Yes, exactly. Recovery-oriented computing. And the other person on the recovery-oriented computing project …  

GEHRKE: Dave Patterson …  

KICIMAN: … was Dave Patterson, yeah. 

GEHRKE: So it was really a true, sort of, Stanford-Berkeley joint project in a way?

KICIMAN: Yes, yeah. And that was my PhD. The work I did then was the first work to apply machine learning to the problem of fault detection and diagnosis in large-scale systems. I worked with two large companies—one of them was Amazon; one of them was anonymous—to test out these ideas in more realistic settings. And then I did a lot of open-source work with J2EE to demonstrate how you can trace the behavior of a system and build up models of its behavior and detect anomalies. Funnily enough, I know this is going to sound a little alien to us now maybe in today’s world: Dave and Armando would not let me use the phrase “artificial intelligence” anywhere in my thesis because they were worried I would not be able to get a job. 

GEHRKE: I see. Because that was, sort of, one of … I mean, AI goes through these hype cycles and then, you know, the winters again, and so this was one of the winter times? 

KICIMAN: This was definitely a wintertime. I was able to use the phrase “machine learning” in the body of the thesis, but I had to make up something about statistical monitoring for the title. 

GEHRKE: So what is the actual final title of your thesis, if you remember it? 

KICIMAN: “Statistical monitoring for fault detection and diagnosis in large-scale internet services” or something like that. 

GEHRKE: Makes sense. 

KICIMAN: Yeah. 

GEHRKE: So you replaced AI with statistical modeling and then everything [turned out all right]? 

KICIMAN: Yes, yeah. Everything … then it didn’t sound too hype-y. 

GEHRKE: And then after your PhD, you went straight to MSR, is that right? 

KICIMAN: Yeah. I mean, so here I’m coming out of my PhD with a focus on academic-style research for large-scale systems. Kind of boxed myself in a little bit. No university has a large-scale internet service, and most large-scale internet service companies don’t have research arms. So Microsoft Research was actually the perfect fit for this work. And when I got here, I started diving in and actually expanding a little bit and thinking about what are the end-to-end reliability issues with our services. So assume that the back end is running well. What else could go wrong that’s going to get in the way of the user? So I had one project going on, wide area network reliability with David Maltz, and one project …  

GEHRKE: Who is now CVP in Azure.  

KICIMAN: Who’s now, yeah, leading Azure network—the head of Azure networking. And one project on how we can monitor the behavior of our JavaScript applications that were just starting to become big. Like around then is when, you know, the first 10,000-line, 100,000-line-of-code JavaScript applications [were] appearing, and we had no idea whether they were actually running correctly, right? They’re running on someone else’s browser and someone else’s operating system. We didn’t know.  

GEHRKE: A big one at that point in time, I think was Gmail, right? This was, sort of, a really big one. But did we have any big ones in Microsoft? 

KICIMAN: Gmail was the first big one in the industry. 

GEHRKE: Hotmail, was it also Java, based in JavaScript? 

KICIMAN: Hotmail was not initially JavaScript based. The biggest one at that time was our maps. Not Bing maps, but whatever we called it.  

GEHRKE: MSN maps, or …  

KICIMAN: Probably something like that, yeah, yeah.  

GEHRKE: I see. And so you applied your techniques to that code base and tried to find a lot of bugs? 

KICIMAN: Yeah, this project was … and this was about data gathering, right, so I’m still thinking about it from the perspective of how do I analyze data to tell me what’s going on. We had data for the wide area network, but these web applications, we didn’t have any. So I’m, like, I’m going to build this infrastructure, collect the data, so that in a couple years, I can analyze it. And so what I wrote was a proxy that sat on the side of the IIS server and just dynamically instrumented all the JavaScript that got shipped out. And the idea was that no one user was going to pay the cost of the instrumentation, but everyone would pay a little small percentage, and then you could collect it in the back end to get the full complete picture.
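
The sampling idea described in this answer can be illustrated with a minimal Python sketch. This is not the proxy Kiciman built; the function names, the 10% sample rate, and the hash-based assignment are all assumptions made for illustration. Each user receives instrumentation for only a small subset of functions, yet the union across many users can cover the whole application.

```python
import hashlib


def functions_to_instrument(function_names, user_id, sample_rate=0.1):
    """Pick a small, deterministic subset of functions to instrument for one user.

    Hypothetical helper: hashing (user_id, function name) gives each pair a
    stable pseudo-random bucket in [0, 1), so the same user always gets the
    same subset and no shared state is needed on the server.
    """
    chosen = []
    for name in function_names:
        digest = hashlib.sha256(f"{user_id}:{name}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        if bucket < sample_rate:
            chosen.append(name)
    return chosen


def coverage(function_names, user_ids, sample_rate=0.1):
    """Fraction of functions instrumented for at least one user."""
    covered = set()
    for user_id in user_ids:
        covered.update(functions_to_instrument(function_names, user_id, sample_rate))
    return len(covered) / len(function_names)


if __name__ == "__main__":
    funcs = [f"fn_{i}" for i in range(50)]   # stand-ins for JavaScript functions
    users = [f"user_{i}" for i in range(200)]
    print("functions instrumented for one user:", len(functions_to_instrument(funcs, users[0])))
    print("coverage across all users:", coverage(funcs, users))
```

The design choice worth noting is that per-user cost depends only on the sample rate, while overall coverage grows with the number of users, which is the trade-off the answer describes.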

GEHRKE: Right. It’s so interesting because, I mean, in those days, right, you still thought maybe in terms of years and so on, right. I mean, you’ve said, well, I instrumented, then maybe in a year, I have some data. And today it happens that I instrument, and tomorrow I have enough data to make a decision on an A/B test and so on, right. It was a very different time, right. And also, it was probably a defining time for Microsoft because we moved into online services, right. We moved into large-scale internet services. So it must have been exciting to be in the middle of all of this. 

KICIMAN: It really was. I mean, there was a lot of change happening both inside Microsoft and outside Microsoft. That’s when … soon after this is when social networking started to become big, right. You started seeing Facebook and Twitter show up, and search became a bigger deal for Microsoft when we started investing in Windows Live and then Bing, and that’s actually … my manager, Yi-Min Wang, actually joined up with Harry Shum to create the Internet Services Research Center with the specific focus of helping Bing. And so that also shifted my focus a little bit and so had me looking more at some of the social data that would, kind of, take my trajectory on a little bit further.

GEHRKE: Right. I mean, so you’re unique in that, you know, people very often, they come in here and, you know, they’re specialists in systems, and they branch out within systems a little bit and, you know, of course, move with time. Maybe now they do, you know, AI infrastructure. But you have really moved quite a bit, right. I mean, you did your PhD on systems … I mean, systems and AI really, the way I understand it. Then you worked here a little bit more on systems in wide area and large-scale systems. But then, you know, you really became also an expert in causality and looked at, sort of, the social side. And now you, of course, have started to move very deeply into LLMs. So rather than talking about the topics itself, how do you decide? How do you make these decisions? How do you … you know, you’re a world expert on x, and how do you, in some sense, throw it all away and go to y? Do you decide one day, “I’m interested in y”? Do you, sort of, shift over time a little bit? How do you do it?

KICIMAN: I’ve done it, I think, two or maybe three times, depending on if you count now, and some transitions have gone better than others. I think my transition from systems to social data and computational social science, it was driven by a project that we did for search at the time. Shuo Chen, another researcher here at Microsoft Research, built a web application that lets you give very concrete feedback back to Windows Live. You could drag and drop the results around and say, this is what I wanted it to look like. And this made, you know, feedback much more actionable and helped really understand DSATs and where they’re coming from. DSAT being dissatisfactions. And I looked at that and I was like, I want to be able to move search results around and share with my friends. And I, kind of, poked at Shuo, you know, asked him if he would build this, and he said no. He said he’s busy. So eventually, I—because I knew something about JavaScript applications—decided to just drop things and spend six months building out this application. So I built out this social search application where you could drag and drop search results around, share it with your friends, and we put it out, actually. We got it deployed as an external service. We had maybe 10,000 people kick the tires.  

GEHRKE: Within Microsoft or …?  

KICIMAN: No, externally.  

GEHRKE: OK.  

KICIMAN: Yeah. There was a great headline that, like, Google then fast followed with a similar feature, and the headline was like, Google fast follows, basically, on Microsoft. Our PR folks were very excited about that. I say this all … I mean, it’s all history now. But certainly, it was fun at the time. But now we’re … I’m giving this demo, this talk, about this prototype that we built and what we’re learning about, you know, what’s in people’s way, what’s friction, what do they like and not like, etc. And I’m standing up and, you know, giving this presentation, this demo, and someone says, hey could you, could you go back to, you know, go back in the browser? On the bottom right corner, it says Mike did something on this search page; he edited some search results. Could you click on that? I want to know what he did. I’m like, OK, yeah, sure. I click on it. And [it’s like], OK, that’s great. That’s, that’s really interesting. And this happened multiple times. Like, in a formal presentation, for someone to interrupt you and ask a personal question just out of their own curiosity, that’s what showed me … that’s what got me really thinking deeply about the value of this social data and, like, why is it locked up in a very specific interface. What else could you do with this data if it’s so engaging, so fascinating, that people are willing to interrupt a speaker for some totally irrelevant, basically, question? And that’s when I switched to really trying to figure out what to do with social data. 

GEHRKE: I see. So it was this, kind of, really personal experience of people being so excited about that social interaction on the demos that you’re giving. 

KICIMAN: Exactly. They cared about their friends and what their friends did, and that was super clear.

GEHRKE: So, so coming back, let’s go there in a second, but coming back to the story that you told, you said you had 10,000 external users. 

KICIMAN: Yeah.

GEHRKE: So I’m still, you know, also always trying to learn what we can do better because we sometimes have prototypes that are incredibly valuable. They’re prototypes that have fans; they’re prototypes that, you know, the fans even want to contribute. But then somehow, we get stuck in the middle; and they don’t scale, and they don’t become a business. What happened with that?

KICIMAN: Yeah. 

GEHRKE: Also in [retrospect], … 

KICIMAN: In retrospect … 

GEHRKE: … what, what … should we have done something different, or did it live up to its potential? 

KICIMAN: I think we learned something. I think that there were a couple of things we learned. One was that, you know, every extra click that people wanted to do, you know, took the number of interactions down by, you know, an order of magnitude. So starring something and bringing it to the top, that was very popular. Dragging and dropping? Little bit less so. Dragging and dropping from one search to a different search? So maybe I’ll search for, you know, “Johannes,” find your homepage, and then drag and drop it to, like, people’s, you know, publications list to, like, keep an eye on or something. Like that, almost never. And people were very wary about editing the page. Like, what if I make a mistake? What if it’s just, just me, like, who wants this, and I’m messing up search for the rest of the world? And it’s like, no, no, it’s just your friends, like just you and your friends who are going to see this. And so we learned a lot about people’s mental models and, like, what stood in the way of, you know, interactions on the web. There were lots of challenges to doing this at scale. I mean, we needed, for example, a way of tracking users. We needed a way of very quickly, within 100 milliseconds, getting information about a user’s past edits to search pages into, you know, into memory if we were going to do this for real on Windows Live. And we just didn’t have the infrastructure.

GEHRKE: I see. And those problems were hard in those days. 

KICIMAN: Yeah. A prototype is fine. People, you know, will handle a little bit of latency if it’s a research prototype, but for everyday use, you need something more. 

GEHRKE: And there was no push to try it, to land it somehow, or what … ?  

KICIMAN: There were big pushes, but the infrastructure, it was really … 

GEHRKE: I see. It was really an infrastructure problem, then? 

KICIMAN: Yeah, yeah. 

GEHRKE: OK. Interesting because it sounds to me like, wow, there’s an exciting research problem there; now you need the infrastructure to try to make all of these things really, really fast. It’s always fascinating to see, you know, where things get stuck and how they, how they proceed. 

KICIMAN: Yeah, I think it’d be a lot easier to build that—from an infrastructure point of view—today. But, of course, then there’s lots of other questions, like is this really what, you know, the best thing to do. Like I mentioned, Google had this fast follow feature. They also removed it afterwards, as well.  

GEHRKE: OK. Yeah, hindsight is always, you know, twenty-twenty. So, OK, so you’re now starting to move into social computing, right, and trying to understand more about social interactions between users. How did you end up in causality, and then how did you make the switch to LLMs? And maybe even more about this; I mean, I understand here this was, sort of, this personal story that you really saw that, you know, the audience was really asking you about what’s happening here and that, sort of, motivated you. Was it always this personal drive, or was it always others who pulled you? And how did you make these switches? 

KICIMAN: I think the switch from systems into social, it was about trying to get closer to problems that really mattered to people. I really enjoy working on systems problems, but oftentimes, they feel like they’re in the back end. And so I wanted something where, you know, even if I’m not the domain expert working on something, I can feel like I’m making a contribution to that problem. The transition with social data then into causality and, um, and LLMs, that was a bit smoother. So working with social data, trying to understand what it meant and what it said about the world in aggregate, was super-fascinating problems. So much information is embedded in the digital traces that people leave behind. But it was really difficult for people to come to solid conclusions. So there was one conference I went to where almost every presentation that day gave some fascinating insight. This is how people make friendships. This is how, you know, we’re seeing, like, signs of disease spread in, you know, through real-world interactions as they’re in social data. Here’s how people spend their time. And then people would, and then people would close; their conclusion slide every time was, “And, of course, correlation is not causation, so anything could actually be happening.” Like, that is such, that is such a bummer. Like, beautiful theory, great understanding. You spent so much time. I feel like I got some insight. And then you pull the rug out and say, but maybe not. And I’d heard about this work on … that there was work on causal analysis and that there were certain conditions and ways to get actual learned causal relationships from data. So that’s the day I decided I’m going to go figure out what that is and how to apply it to social data for these types of questions. 
And I went out, and the first work there was a collaboration with Munmun De Choudhury, faculty at Georgia Tech, looking at online traces related to mental health and suicidal ideation and trying to understand what some of the factors were in a more, in a more solid and causal fashion. And so this really became, like, this was … this interest in computational social science really ended up branching out into two areas. One, obviously, I’m caring about, what can we learn about the world? Part of this is, of course, thinking deeply about the implications of AI on society, like what is it going to mean that we have this data for all of these, you know, societal challenges? And then causality. So the AI and its implications on society is what led towards the work on the security of AI systems and now security of AI as it relates to large language models. And then causality was the other branch that split off from there. Both of them really stemming from this desire to see that we have a positive impact with AI.

GEHRKE: So you mentioned that, you know, you were sitting in these talks and people are talking about the correlation, and now you finally have this new tool, which is causation. So what are some of the examples where, you know, with correlation you came out with answer A, but now causation gave you some better, some real deep insights? 

KICIMAN: I haven’t gone looking to refute studies, so … 

GEHRKE: I see. OK.  

KICIMAN: … but there are many well-known studies in the past where people have made mistakes because they didn’t account for the right confounding variables. Ronny Kohavi has a great list of these on one of his websites. But a fun one is a study that came out in the late ’90s on the influence of night lights on myopia in children. So this was a big splash. I think it made it to like Newsweek or 60 Minutes and stuff, that if you have night lights in the house, your kids are more likely to need glasses. And this was wrong. 

GEHRKE: My parents told me all the time, don’t read in bed, you know, with your flashlight because your eyes are going to get bad. 

KICIMAN: Yes.  

GEHRKE: That’s the story basically, right? 

KICIMAN: This was, yeah, the night lights that plug in the wall.  

GEHRKE: But that’s the …  

KICIMAN: That’s the idea, the same thing. 

GEHRKE: The same thing, right. 

KICIMAN: And so these people analyzed a bunch of data, and they found that there was a correlation, and they said that, you know, it’s a cause; you know, this is a cause. And the problem was that they didn’t account for the parents’ myopia. Apparently, parents who had myopia were more likely to install night lights. And then you have the genetic factor then actually causing the myopia. Very simple. But, you know, people had to replicate this study to, you know, to realize it was a mistake. Others were things like correlations, I think, around vitamin C that have been reported repeatedly and then refuted in randomized controlled trials. But there’s many of these. Medicine, in particular, has a long history of false correlations leading people astray. 
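
The confounding pattern described here can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the probabilities are invented for illustration and are not taken from the night-light study or from anything said in the interview. In the simulated data, parental myopia raises both the chance of installing a night light and the chance of the child being myopic, while the night light itself has no effect at all.

```python
import random


def simulate(n=100_000, seed=0):
    """Simulate families as (parent_myopic, night_light, child_myopic) tuples.

    All probabilities below are hypothetical, chosen only to illustrate confounding.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        parent_myopic = rng.random() < 0.3
        # Assumption: myopic parents install night lights more often.
        night_light = rng.random() < (0.7 if parent_myopic else 0.2)
        # Assumption: child myopia depends only on the parents, never on the light.
        child_myopic = rng.random() < (0.5 if parent_myopic else 0.1)
        rows.append((parent_myopic, night_light, child_myopic))
    return rows


def myopia_rate(rows, night_light, parent_myopic=None):
    """Share of myopic children among rows matching the given filters."""
    selected = [
        child
        for parent, light, child in rows
        if light == night_light and (parent_myopic is None or parent == parent_myopic)
    ]
    return sum(selected) / len(selected)


rows = simulate()

# Naive comparison: ignores the parents, so the night light looks harmful.
naive_gap = myopia_rate(rows, True) - myopia_rate(rows, False)

# Adjusted comparison: compare within each parental group (stratification).
adjusted_gaps = {
    parent: myopia_rate(rows, True, parent) - myopia_rate(rows, False, parent)
    for parent in (True, False)
}

print(f"naive gap in child myopia rate: {naive_gap:.3f}")
for parent, gap in adjusted_gaps.items():
    print(f"gap when parent_myopic={parent}: {gap:.3f}")
```

In this toy setup the naive gap is clearly positive, while the gaps within each parental group are close to zero, which is the sense in which accounting for the confounder makes the apparent effect disappear.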

GEHRKE: Do you have a story where here at Microsoft your work in causation had a really big impact? 

KICIMAN: You know, the one—it’s still ongoing—but one of the ones that I’m really excited about now, and thinking also from the broader societal impact lens, is a collaboration with Ranveer Chandra and his group. So with a close collaborator at MSR India, Amit Sharma, we’ve developed a connection between representation learning and underlying causal representation of the data-generating process that’s driving something. So if you imagine, like, we want to learn a classifier on an object, on an image, and we want that classifier to generalize to other settings, there’s lots of reasons why this can go wrong. You know, you have, you know, like a classic example is the question of, is this picture showing you a camel, or is it showing you a cow? The classifier is much more likely to look at the background, and if it’s green grass, it’s probably a cow. If it’s sandy desert, it’s probably a camel. But then you fail if you look at a camel in the zoo or a cow on a beach, right. So how do you make sure that you’re looking at the real features? People have developed algorithms for these. But no algorithm actually is robust across all the different kinds of distribution shifts that people see in the real world. Some algorithms work on these kinds of distribution shifts. Some algorithms work on those kinds of distribution shifts. And it was a bit of an interesting, I think, puzzle as to why. And so we realized that these distribution shifts, if you look at them from a causal perspective, you can see that the algorithms are actually imposing different statistical independence constraints. And you can read those statistical independence constraints off of a causal graph. And the reason that some algorithms worked well in some settings was that the underlying causal graph implied a different set of statistical independence constraints in that setting. And so that algorithm was the right one for that setting. 
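
The camel-versus-cow failure mode can be sketched with a toy simulation. This is not the adaptive algorithm described in the conversation; it only illustrates, under invented assumptions, why a classifier that leans on the background breaks under distribution shift while one that uses a feature caused by the animal itself stays stable. The feature names, the 85% reliability of the shape feature, and the background probabilities are all hypothetical.

```python
import random


def make_images(n, p_typical_background, seed):
    """Generate toy 'images' as (label, shape, background) tuples.

    p_typical_background is the chance a cow appears on grass or a camel on sand.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.choice(["cow", "camel"])
        other = "camel" if label == "cow" else "cow"
        # Assumption: the shape feature is caused by the animal and is 85% reliable everywhere.
        shape = label if rng.random() < 0.85 else other
        typical, atypical = ("grass", "sand") if label == "cow" else ("sand", "grass")
        # The background is only correlated with the label, and that correlation can shift.
        background = typical if rng.random() < p_typical_background else atypical
        data.append((label, shape, background))
    return data


def predict_by_background(shape, background):
    """Shortcut classifier: green grass means cow, sandy desert means camel."""
    return "cow" if background == "grass" else "camel"


def predict_by_shape(shape, background):
    """Classifier that relies only on the feature tied to the animal itself."""
    return shape


def accuracy(data, predict):
    """Fraction of examples where the prediction matches the true label."""
    return sum(predict(shape, bg) == label for label, shape, bg in data) / len(data)


# Training-like setting: backgrounds almost always match the animal.
train = make_images(10_000, p_typical_background=0.95, seed=1)
# Shifted setting: camels in the zoo, cows on the beach.
shifted = make_images(10_000, p_typical_background=0.10, seed=2)

results = {
    name: (accuracy(train, fn), accuracy(shifted, fn))
    for name, fn in [("background", predict_by_background), ("shape", predict_by_shape)]
}
for name, (train_acc, shifted_acc) in results.items():
    print(f"{name:10s} train={train_acc:.2f} shifted={shifted_acc:.2f}")
```

The background classifier looks better in the training-like setting but collapses once the correlation flips, whereas the shape classifier performs about the same in both.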
If you have a different causal graph with different statistical independence constraints, the other algorithm was better. And so now you can see that no one algorithm is going to work well across all of them. So we built an adaptive algorithm that looks at the causal graph, picks the right statistical independencies, and applies them, and now what we’re doing with this algorithm is we’re applying it to satellite imagery to help us build a more generalizable, more robust model of carbon in farm fields so we can remotely sense and predict what the carbon level is in a field. And so, the early results …  

GEHRKE: And that’s important for what?

KICIMAN: And so this is important because soil is seen as a very promising method for sequestering carbon from a climate change perspective. And it’s also the more carbon there is … the higher your soil carbon, usually the healthier the soil is, as well. It’s able to absorb more water, so less flooding; your crops are more productive because of the microbial growth that’s happening. And so people want to adopt policies and methods that increase the soil carbon in the fields for all of these reasons. But measuring soil carbon is really intensive. You have to go sample it, take it off to a lab, and it’s too expensive for people to do regularly. And so if we can develop remote-sensing methods that are able to take a satellite image and, you know, really robustly predict what the real soil carbon measurement would be, that’s really game changing. That’s something that, you know, will help us evaluate policies and whether they’re working; help us evaluate, you know, what the right practices should be for a particular field. So I’m really excited about that.  

GEHRKE: That’s really exciting. You’d mentioned when we talked before that you’d benefited in your career from several good mentors. How do you think about mentoring, and what are the ways that you benefited from it? And how do you, you know, live that now in your daily life as you’re a mentor now to the next generation? 

KICIMAN: Yeah, the way I look at all the people, and there’s so many, who have, you know, given me a hand and advice, you know, along the way, I often find I pick up on some attributes of my mentors, of a particular mentor, and find that it’s something that I want to emulate. So recognizing, you know, everyone is complicated and no one is perfect, but, you know, there’s so many ways that, you know, individuals get things right and trying to understand what it is that they’re doing right and how I can try and repeat that for, like you said, the next generation, I think, is really, really important. One story, for example: around 2008, while I was still working on large-scale internet services, I was going around the company to, kind of, get a sense of, you know, what’s the current state of the reliability of our services and how we architect them and run them. And so I was talking to developers and architects and Ops folks around the company, and James Hamilton was a great mentor at that moment, helping me to connect, helping suggest questions that I might ask. 

GEHRKE: So he was working on SQL Server reliability, right, at that point in time or on Windows reliability? 

KICIMAN: He was already starting to move over into datacenter reliability. I think at the time, right before he moved over to the research side of things, I think he was one of the heads of the, of our enterprise email businesses, and then he came over to research to focus on, I think, datacenters in general. And, yeah, and he just donated so much of his time. He was so generous with, you know, reviewing this large report that I was writing and just helping me out with insights. That struck me as, like … he’s a very busy person. He’s doing all this stuff, and he’s spending, you know, I sent him an email with, you know, 15 pages, and he responds with feedback within a couple of hours every morning. That was astonishing to me, especially in hindsight, and so … but that kind of generosity of time and trying to help direct people’s work in a way that’s going to be most impactful for what they want to achieve, that’s something I try and emulate today. 

GEHRKE: So, so, you know, you’ve benefited from a lot of great mentors and you said you’re now also a mentor to others. Do you have any last piece of advice for any of our listeners? 

KICIMAN: I think it’s really important for people to find passion and joy in the work that they do and, at some point, do the work for the work’s sake. I think this will drive you through the challenges that you’ll inevitably face with any sort of project and give you the persistence that you need to really have the impact that you want to have. 

GEHRKE: Well, thanks for that advice. And thanks for being in What’s Your Story, Emre. 

KICIMAN: Thanks very much, Johannes. Great to be here.  

[MUSIC] 

To learn more about Emre or to see photos of Emre as a child in California, visit aka.ms/ResearcherStories. 

[MUSIC FADES] 


[1] Kiciman later noted the year he interned at Netscape was 1997. 

The post What’s Your Story: Emre Kiciman appeared first on Microsoft Research.
