New GPT-3 Capabilities: Edit & Insert

We’ve released new versions of GPT-3 and Codex which can edit or insert content into existing text, rather than just completing existing text. These new capabilities make it practical to use the OpenAI API to revise existing content, such as rewriting a paragraph of text or refactoring code. This unlocks new use cases and improves existing ones; for example, insertion is already being piloted in GitHub Copilot with promising early results.

Read Edit Documentation


Read Insert Documentation


Try in Playground
 
[Animated Playground demo: starting from the prompt def___ followed by a call to fib(10), Codex inserts a recursive definition of fib. Successive edit instructions ("Improve the runtime complexity of the function", "Translate to JavaScript", "Write this as a one-liner") turn it into a memoized Python version, a JavaScript fibonacci function, and a one-line implementation. A final insert at /** adds a JSDoc comment: "Recursive Fibonacci function with memoization."]

GPT-3 and Codex have traditionally added text to the end of existing content, based on the text that came before. Whether working with text or code, writing is more than just appending—it’s an iterative process where existing text is revised. GPT-3 and Codex can now edit text, changing what’s currently there or adding text to the middle of content.

Inserting into the middle of text

The new insert capability adds contextually relevant text in the middle of existing content. Providing future context to the model can improve the quality of completions for applications such as writing long-form text, transitioning between paragraphs, following an outline, or guiding the model toward an ending.

Section 1. High school graduation

Today’s the big day!___

Without the upcoming section header as context, the model completes the insertion point with:

Today’s the big day! Congratulations to all our high school graduates!

We hope you enjoy your graduation ceremonies and wish you the best of luck in your future endeavors.

When the header “Section 2. Moving to the farm” is provided after the insertion point, the model instead writes:

Today’s the big day! My high school graduation! I’m so excited! I can’t wait to leave the city and move to the farm. I want to live a simple, quiet life. I want to get up in the morning, milk the cows, feed the horses, then sit and watch the sunrise. I want to spend the rest of my days traveling the world, reading and writing. I’m so tired of this city life.

Section 2. Moving to the farm

In the example above, the goal is to fill in text between two section headers of an outline. Without the context of future sections, the model generates a completion that isn’t relevant to the second section. When the context of future sections is accounted for, the model generates a completion that ties the two sections together.

def get_files(path: str, size: int):
    def___
    for (dirp, _, files) in os.walk(path):
        yield from prune(dirp, files)
def get_files(path: str, size: int):
    def prune(dirp, files):
        for file in files:
            file = os.path.join(dirp, file)
            if os.path.getsize(file) > size:
                yield file
    for (dirp, _, files) in os.walk(path):
        yield from prune(dirp, files)

Further inserts add the missing import and a docstring:

import os
def get_files(path: str, size: int):
    """Yields files in the path tree of min size"""
    def prune(dirp, files):
        for file in files:
            file = os.path.join(dirp, file)
            if os.path.getsize(file) > size:
                yield file
    for (dirp, _, files) in os.walk(path):
        yield from prune(dirp, files)

Insert is particularly useful for writing code. In fact, Codex was our original motivation for developing this capability, since in software development we typically add code to the middle of an existing file, where code is present before and after the completion. In the example above, the model successfully completes the body of the missing function prune while connecting to the code already written. The model also adds a docstring and the missing import, which is not possible without knowing the code that comes after. Insert is currently being piloted in GitHub Copilot, with promising early results.

The insert capability is available in the API today in beta, as part of the completions endpoint and via a new interface in Playground. The capability can be used with the latest versions of GPT-3 and Codex, text-davinci-002 and code-davinci-002. Pricing is the same as previous versions of Davinci.
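For illustration, a minimal sketch of an insertion request with the openai Python library available at the time of this release is shown below. It assumes an OPENAI_API_KEY environment variable; the text after the insertion point is passed via the suffix parameter, and the exact completion will vary.

import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Text before and after the desired insertion point.
prefix = "def get_files(path: str, size: int):\n    "
suffix = (
    "\n    for (dirp, _, files) in os.walk(path):\n"
    "        yield from prune(dirp, files)\n"
)

response = openai.Completion.create(
    model="code-davinci-002",
    prompt=prefix,   # context before the insertion point
    suffix=suffix,   # context after the insertion point
    max_tokens=128,
    temperature=0,
)

# Stitch the inserted text back between the prefix and the suffix.
print(prefix + response["choices"][0]["text"] + suffix)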

Editing existing text

A meaningful part of writing text and code is spent editing existing content. We’ve released a new endpoint in beta called edits that changes existing text via an instruction, instead of completing it.

 
Instruction: Add a short poem about GPT-3

GPT-3 is a very nice AI
That’s pretty good at writing replies
When it’s asked a question
It gives its suggestion
This is a poem it made that rhymes

Instruction: Make this in the voice of GPT-3

I am a very nice AI
I am pretty good at writing replies
When I am asked a question
I give my suggestion
This is a poem it made that rhymes

Instruction: Format this like a letter and sign from GPT-3

Dear Human,

I am very nice AI
I am pretty good at writing replies
When I am asked a question
I give my suggestion
This is a poem I made that rhymes

Kind regards,

GPT-3

Editing works by specifying existing text as a prompt, together with an instruction on how to modify it. The edits endpoint can be used to change the tone or structure of text, or to make targeted changes like fixing spelling. We’ve also found that edits works well on empty prompts, enabling text generation similar to the completions endpoint. In the example above, we use edits to (1) add a poem, (2) change the poem to be in the first person, and (3) transform the poem into a letter, with the appropriate salutation and signature.

 
The three US cities with the worst traffic are:

1. Boston, MA (164 hours) 2. Washington, DC (155 hours) 3. Chicago, IL (138 hours)

Instruction: Convert the list to JSONL file with rank, city, state, and hours

[
  {"rank": 1, "city": "Boston", "state": "MA", "hours": 164},
  {"rank": 2, "city": "Washington DC", "state": "DC", "hours": 155},
  {"rank": 3, "city": "Chicago", "state": "IL", "hours": 138}
]

Instruction: Remove rank and write out full state names

[
  {"city": "Boston", "state": "Massachusetts", "hours": 164},
  {"city": "Washington DC", "state": "District of Columbia", "hours": 155},
  {"city": "Chicago", "state": "Illinois", "hours": 138}
]

Instruction: Turn into YAML and return from a function called get_yaml

def get_yaml():
    return """
    - city: Boston
      state: Massachusetts
      hours: 164
    - city: Washington DC
      state: District of Columbia
      hours: 155
    - city: Chicago
      state: Illinois
      hours: 138
    """

The edits endpoint is particularly useful for writing code. It works well for tasks like refactoring, adding documentation, translating between programming languages, and changing coding style. The example above starts with a plain-text list of the US cities with the worst traffic, which Codex first converts to JSON. The next edit removes the rank field and writes the state abbreviations out as full names, and the final edit converts the JSON into YAML returned from a function called get_yaml.

Editing is available as a specialized endpoint in the API and through a new interface in Playground. It is supported by the models text-davinci-edit-001 and code-davinci-edit-001. The edits endpoint is currently free to use and publicly available as a beta.
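As a sketch of how this looks in code, using the openai Python library available at the time of this release (it assumes an OPENAI_API_KEY environment variable, and the exact output will vary), the endpoint takes the existing text as input together with an instruction describing the change:

import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.Edit.create(
    model="text-davinci-edit-001",
    input="GPT-3 is a very nice AI\nThat's pretty good at writing replies",
    instruction="Make this in the voice of GPT-3",
    temperature=0,
)

# The edited text replaces the input rather than continuing it.
print(response["choices"][0]["text"])

The same call with code-davinci-edit-001 and an instruction like "Translate to JavaScript" covers the code-editing tasks described above.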


Contributions

Research advancements of insert and edit: Mohammad Bavarian, Heewoo Jun, Oleg Klimov, Raul Puri, Qiming Yuan

Developing new versions of GPT-3 and Codex: Sandhini Agarwal, Igor Babuschkin, Greg Brockman, Andrew Carr, Brooke Chan, Chris Hesse, Shantanu Jain, Kyle Kosic, Jakub Pachocki, Alex Paino, Mikhail Pavlov, Vitchyr Pong, Nick Ryder, Szymon Sidor, Nikolas Tezak, Philippe Tillet, Amin Tootoonchian, Jerry Tworek, Lilian Weng, Clemens Winter, Qiming Yuan, Wojciech Zaremba, William Zhuk

Engineering, Product Development, Safety, Policy, and Security: Steven Adler, Sandhini Agarwal, Mohammad Bavarian, Kevin Button, Tyna Eloundou, Angela Jiang, Shino Jomoto, Heewoo Jun, Rajeev Nayak, Henrique Ponde de Oliveira Pinto, Girish Sastry, Maddie Simens, Felipe Such

Blog visuals: Justin Jay Wang


Acknowledgments

Thanks to the following for their feedback on this work and contributions to this release: Diogo Moitinho de Almeida, Che Chang, Elie Georges, Joanne Jang, Roger Jiang, Denny Jin, Fraser Kelton, Tabarak Khan, Matt Knight, Jan Leike, Ryan Lowe, Bianca Martin, Andrew Mayne, Bob McGrew, Luke Miller, Evan Morikawa, Mira Murati, Long Ouyang, Boris Power, William Saunders, Toki Sherbakov, Zarina Stanik, Preston Tuggle, Carroll Wainwright, Peter Welinder, Hannah Wong, Lauren Workman, Jeff Wu, Cathy Yeh


Lessons Learned on Language Model Safety and Misuse

The deployment of powerful AI systems has enriched our understanding of safety and misuse far more than would have been possible through research alone. Notably:

  • API-based language model misuse often comes in different forms than those we feared most.
  • We have identified limitations in existing language model evaluations that we are addressing with novel benchmarks and classifiers.
  • Basic safety research offers significant benefits for the commercial utility of AI systems.

Here, we describe our latest thinking in the hope of helping other AI developers address safety and misuse of deployed models.


Over the past two years, we’ve learned a lot about how language models can be used and abused—insights we couldn’t have gained without the experience of real-world deployment. In June 2020, we began giving developers and researchers access to the OpenAI API, an interface for accessing and building applications on top of new AI models developed by OpenAI. Deploying GPT-3, Codex, and other models in a way that reduces risks of harm has posed various technical and policy challenges.

Overview of Our Model Deployment Approach

Large language models are now capable of performing a very wide range of tasks, often out of the box. Their risk profiles, potential applications, and wider effects on society remain poorly understood. As a result, our deployment approach emphasizes continuous iteration, and makes use of the following strategies aimed at maximizing the benefits of deployment while reducing associated risks:

  • Pre-deployment risk analysis, leveraging a growing set of safety evaluations and red teaming tools (e.g., we checked our InstructGPT for any safety degradations using the evaluations discussed below)
  • Starting with a small user base (e.g., both GPT-3 and our InstructGPT series began as private betas)
  • Studying the results of pilots of novel use cases (e.g., exploring the conditions under which we could safely enable longform content generation, working with a small number of customers)
  • Implementing processes that help keep a pulse on usage (e.g., review of use cases, token quotas, and rate limits)
  • Conducting detailed retrospective reviews (e.g., of safety incidents and major deployments)
[Diagram: feedback loops in the continuous process of model development and deployment]


Note that this diagram is intended to visually convey the need for feedback loops in the continuous process of model development and deployment and the fact that safety must be integrated at each stage. It is not intended to convey a complete or ideal picture of our or any other organization’s process.

There is no silver bullet for responsible deployment, so we try to learn about and address our models’ limitations, and potential avenues for misuse, at every stage of development and deployment. This approach allows us to learn as much as we can about safety and policy issues at small scale and incorporate those insights prior to launching larger-scale deployments.


While not exhaustive, the areas where we’ve invested so far span every stage of development and deployment.[1] Since each stage of intervention has limitations, a holistic approach is necessary.

There are areas where we could have done more and where we still have room for improvement. For example, when we first worked on GPT-3, we viewed it as an internal research artifact rather than a production system and were not as aggressive in filtering out toxic training data as we might have otherwise been. We have invested more in researching and removing such material for subsequent models. We have taken longer to address some instances of misuse in cases where we did not have clear policies on the subject, and have gotten better at iterating on those policies. And we continue to iterate towards a package of safety requirements that is maximally effective in addressing risks, while also being clearly communicated to developers and minimizing excessive friction.

Still, we believe that our approach has enabled us to measure and reduce various types of harms from language model use compared to a more hands-off approach, while at the same time enabling a wide range of scholarly, artistic, and commercial applications of our models.[2]

The Many Shapes and Sizes of Language Model Misuse

OpenAI has been active in researching the risks of AI misuse since our early work on the malicious use of AI in 2018 and on GPT-2 in 2019, and we have paid particular attention to AI systems empowering influence operations. We have worked with external experts to develop proofs of concept and promoted careful analysis of such risks by third parties. We remain committed to addressing risks associated with language model-enabled influence operations and recently co-organized a workshop on the subject.[3]

Yet we have detected and stopped hundreds of actors attempting to misuse GPT-3 for a much wider range of purposes than producing disinformation for influence operations, including in ways that we either didn’t anticipate or which we anticipated but didn’t expect to be so prevalent.[4] Our use case guidelines, content guidelines, and internal detection and response infrastructure were initially oriented towards risks that we anticipated based on internal and external research, such as generation of misleading political content with GPT-3 or generation of malware with Codex. Our detection and response efforts have evolved over time in response to real cases of misuse encountered “in the wild” that didn’t feature as prominently as influence operations in our initial risk assessments. Examples include spam promotions for dubious medical products and roleplaying of racist fantasies.

To support the study of language model misuse and mitigation thereof, we are actively exploring opportunities to share statistics on safety incidents this year, in order to concretize discussions about language model misuse.

The Difficulty of Risk and Impact Measurement

Many aspects of language models’ risks and impacts remain hard to measure and therefore hard to monitor, minimize, and disclose in an accountable way. We have made active use of existing academic benchmarks for language model evaluation and are eager to continue building on external work, but we have also found that existing benchmark datasets are often not reflective of the safety and misuse risks we see in practice.[5]

Such limitations reflect the fact that academic datasets are seldom created for the explicit purpose of informing production use of language models, and do not benefit from the experience gained from deploying such models at scale. As a result, we’ve been developing new evaluation datasets and frameworks for measuring the safety of our models, which we plan to release soon. Specifically, we have developed new evaluation metrics for measuring toxicity in model outputs and have also developed in-house classifiers for detecting content that violates our content policy, such as erotic content, hate speech, violence, harassment, and self-harm. Both of these in turn have also been leveraged for improving our pre-training data[6]—specifically, by using the classifiers to filter out content and the evaluation metrics to measure the effects of dataset interventions.
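As a rough sketch of how these pieces fit together (conceptual Python only; dataset_intervention, violates_policy, train_model, and toxicity_eval are hypothetical stand-ins for our classifiers, training runs, and evaluation metrics, not our actual pipeline):

from typing import Callable, List, Tuple

def dataset_intervention(
    documents: List[str],
    violates_policy: Callable[[str], bool],      # hypothetical content-policy classifier
    train_model: Callable[[List[str]], object],  # hypothetical training routine
    toxicity_eval: Callable[[object], float],    # hypothetical toxicity metric over a trained model
) -> Tuple[float, float]:
    # Filter out documents flagged by the classifier, train on what remains,
    # and use the evaluation metric to measure the effect of the intervention.
    kept = [doc for doc in documents if not violates_policy(doc)]
    baseline_score = toxicity_eval(train_model(documents))
    filtered_score = toxicity_eval(train_model(kept))
    return baseline_score, filtered_score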

Reliably classifying individual model outputs along various dimensions is difficult, and measuring their social impact at the scale of the OpenAI API is even harder. We have conducted several internal studies in order to build an institutional muscle for such measurement, but these have often raised more questions than answers.

We are particularly interested in better understanding the economic impact of our models and the distribution of those impacts. We have good reason to believe that the labor market impacts from the deployment of current models may be significant in absolute terms already, and that they will grow as the capabilities and reach of our models grow. We have learned of a variety of local effects to date, including massive productivity improvements on existing tasks performed by individuals like copywriting and summarization (sometimes contributing to job displacement and creation), as well as cases where the API unlocked new applications that were previously infeasible, such as synthesis of large-scale qualitative feedback. But we lack a good understanding of the net effects.

We believe that it is important for those developing and deploying powerful AI technologies to address both the positive and negative effects of their work head-on. We discuss some steps in that direction in the concluding section of this post.

The Relationship Between the Safety and Utility of AI Systems

In our Charter, published in 2018, we say that we “are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions.” We then published a detailed analysis of competitive AI development, and we have closely followed subsequent research. At the same time, deploying AI systems via the OpenAI API has also deepened our understanding of the synergies between safety and utility.

For example, developers overwhelmingly prefer our InstructGPT models—which are fine-tuned to follow user intentions[7]—over the base GPT-3 models. Notably, however, the InstructGPT models were not originally motivated by commercial considerations, but rather were aimed at making progress on long-term alignment problems. In practical terms, this means that customers, perhaps not surprisingly, much prefer models that stay on task and understand the user’s intent, and models that are less likely to produce outputs that are harmful or incorrect.[8] Other fundamental research, such as our work on leveraging information retrieved from the Internet in order to answer questions more truthfully, also has potential to improve the commercial utility of AI systems.[9]

These synergies will not always occur. For example, more powerful systems will often take more time to evaluate and align effectively, foreclosing immediate opportunities for profit. And a user’s utility and that of society may not be aligned due to negative externalities—consider fully automated copywriting, which can be beneficial for content creators but bad for the information ecosystem as a whole.

It is encouraging to see cases of strong synergy between safety and utility, but we are committed to investing in safety and policy research even when they trade off with commercial utility.


Ways to Get Involved

Each of the lessons above raises new questions of its own. What kinds of safety incidents might we still be failing to detect and anticipate? How can we better measure risks and impacts? How can we continue to improve both the safety and utility of our models, and navigate tradeoffs between these two when they do arise?

We are actively discussing many of these issues with other companies deploying language models. But we also know that no organization or set of organizations has all the answers, and we would like to highlight several ways that readers can get more involved in understanding and shaping our deployment of state-of-the-art AI systems.

First, gaining first-hand experience interacting with state-of-the-art AI systems is invaluable for understanding their capabilities and implications. We recently ended the API waitlist after building more confidence in our ability to effectively detect and respond to misuse. Individuals in supported countries and territories can quickly get access to the OpenAI API by signing up here.

Second, researchers working on topics of particular interest to us, such as bias and misuse, who would benefit from financial support can apply for subsidized API credits using this form. External research is vital for informing both our understanding of these multifaceted systems and wider public understanding.

Finally, today we are publishing a research agenda exploring the labor market impacts associated with our Codex family of models, and a call for external collaborators on carrying out this research. We are excited to work with independent researchers to study the effects of our technologies in order to inform appropriate policy interventions, and to eventually expand our thinking from code generation to other modalities.

If you’re interested in working to responsibly deploy cutting-edge AI technologies, apply to work at OpenAI!


Acknowledgments

Thanks to Lilian Weng, Rosie Campbell, Anna Makanju, Bob McGrew, Hannah Wong, Ryan Lowe, Steve Dowling, Mira Murati, Sam Altman, Greg Brockman, Ilya Sutskever, Percy Liang, Peter Welinder, Ethan Perez, Ellie Evans, Helen Ngo, Helen Toner, Justin Jay Wang, Jack Clark, Rishi Bommasani, Girish Sastry, Sarah Shoker, Matt Knight, Bianca Martin, Bob Rotsted, Lama Ahmad, Toki Sherbakov, and others for providing feedback on this post and related work.


Footnotes

  1. This post is based on our approach to deploying language models through an API, and as such the lessons and mitigations described are most relevant to those also pursuing API-based deployment. However, we also expect some of the discussion to be relevant to those building first-party applications using language models and those considering the open source release of language models. ↩︎

  2. This post is intended to explain and share learnings from our approach, rather than to suggest that all actors should necessarily adopt the same approach, or that the same approach is applicable to all possible AI systems. There are benefits and costs associated with different deployment approaches, different models will benefit more or less from study prior to deployment, and in some cases it can be valuable for distinct deployment paths to be pursued by different actors. ↩︎

  3. More details on this workshop will be included in the forthcoming publication based on it. ↩︎

  4. The mitigations that we emphasize in response to misuse have also evolved. For example, we initially focused on long form text generation as a threat vector, given prior cases of influence operations that involved people manually writing long form misleading content. Given that emphasis, we set maximum output lengths for generated text. Based on a pilot study of long form generation, however, we saw that output restrictions had little effect on policy violations—we’ve come to believe instead that short-form content amplifying or increasing engagement on misleading content could be the greater risk. ↩︎

  5. Examples of limitations in existing datasets, from the perspective of practitioners seeking a holistic assessment of the safety of real language model outputs, include the following: an overly narrow focus (e.g., just measuring occupational gender bias), an overly broad focus (e.g., measuring all under the umbrella of “toxicity”), a tendency to abstract away the specifics of use and context, a failure to measure the generative dimension of language model use (e.g., using multiple choice style), prompts that differ stylistically from those typically used in real language model use cases, not capturing dimensions of safety that are important in practice (e.g., an output following or ignoring a safety-motivated constraint in the instruction), or not capturing types of outputs we have found to be correlated with misuse (e.g., erotic content). ↩︎

  6. While our efforts are specifically oriented towards addressing limitations in existing benchmarks and in our own models, we also acknowledge that there are limitations to the methods we use such as classifier-based data filtration. For instance, operationally defining the content areas we aim to detect via filtration is challenging and filtration itself can introduce harmful biases. Additionally, the labeling of toxic data is a critical component of this work and ensuring the mental health of these labelers is an industry-wide challenge. ↩︎

  7. The relevant “user” of our API may be a developer building an application or an end-user interacting with such an application, depending on context. There are deep questions about the values our aligned models reflect, and we hope to build a more nuanced understanding of how to balance the values of a wide range of possible users and competing objectives when aligning language models to be more helpful, more truthful, and less harmful. ↩︎

  8. More aligned models also have more practical advantages such as reducing the need for “prompt engineering” (providing examples of the desired behavior to steer the model in the right direction), saving space in the model’s context window which can be used for other purposes. ↩︎

  9. Beyond research, we have found that other safety-motivated interventions sometimes have unexpected benefits to customers. For example, rate limits intended to curb spam or misleading content also help customers to control expenses. ↩︎


Economic Impacts Research at OpenAI

Core to our mission of ensuring that artificial general intelligence benefits all of humanity is understanding the economic impacts that our models have on individuals and society as a whole. Developing tools to rigorously measure the economic impacts of our models is essential to making smarter development and deployment decisions and critical to informing public policy options that maximize human prosperity and minimize the risk of economic harms from AI. Our ability to generate high quality evidence to inform these decisions will be greatly enhanced by developing a range of productive research partnerships, and we firmly believe that AI developers need to support external researchers undertaking this work, rather than exclusively conducting research in-house.

In line with this premise, we are sharing our first public research agenda on these topics, which describes our preliminary priorities for research on the economic impacts of code generation models broadly. Today, we are excited to complement this research agenda with concrete action to facilitate improved measurement of the economic impacts of our models. We are launching a call for expressions of interest from researchers interested in evaluating the economic impact of Codex—our AI system that translates natural language to code. If you are a PhD-level researcher (including current doctoral students) interested in collaborating on this research, we would encourage you to fill out the expression of interest form.

Read Research Agenda

Importance of Studying Economic Impacts

As an AI research and deployment company, OpenAI recognizes that our decisions around AI system design and deployment can influence economic impacts and the distribution of economic benefits from advances in AI. Despite remarkable technological progress over the past several decades, gains in economic prosperity have not been widely distributed. In the US, trends in both income and wealth inequality over the last forty years demonstrate a worrying pace of economic divergence and uneven access to opportunity.[1] While recent evidence suggests that there is little immediate risk of widespread technological unemployment due to AI,[2] it is clear that the labor market impacts of increasingly advanced AI will vary widely across different types of workers. Unemployment shocks, even if transitory, have been shown to have widespread negative effects on individual wellbeing,[3] and increasing economic inequality may amplify societal cleavages.[4]

We are eager to support and conduct research that has the potential to impact decision-making on three axes:

  1. AI deployment policies
  2. AI system design decisions
  3. Evidence that public policymakers can draw on.

While we don’t anticipate that the current capabilities of Codex could threaten large-scale economic disruption, future capabilities of code generation and other large language model applications could. We need to engage in research about the economic impact of our models today in order to be positioned to assess the safety of developing and releasing more capable systems in the future. Codex provides a tractable opportunity to establish the foundation for this research going forward.

External Research Collaborators

As an external research collaborator, you would be connected (via OpenAI) to firms that are currently using Codex models or that plan to in the future. You would have the opportunity to work with OpenAI and these firms to implement research projects focused on empirically measuring the impact of Codex on outcomes like worker and firm productivity, labor demand, and skill development. Where necessary and when possible, OpenAI would help facilitate data access to enable impactful research and would provide academic access to Codex and future models. OpenAI will also provide research management resources to external researchers, and researchers would have the freedom to publish their results independently or as co-authors with collaborators at OpenAI. Finally, we intend to facilitate discussions between external researchers, AI developers, AI-adopting firms, and workers in various industries that have been affected by advances in AI in an effort to widen the range of perspectives that can shape the path of AI development and deployment.

If you are a researcher considering submitting an expression of interest, please fill out this form. Additionally, consider emailing us your questions at econ@openai.com to learn more about our goals for economic impacts research and how you can be involved.

If you are a company or user of Codex models and want to learn how you can contribute to this work moving forward, please fill out this form.

Submission Process

If you would like to submit an expression of interest to be a Research Collaborator please use this form.

Submit collaborator interest

We are currently seeking submissions from PhD-level researchers, including current doctoral students. When evaluating expressions of interest, we will assess your background and experience, clarity of motivation to collaborate with OpenAI, and both the clarity and decision-relevance of your research interests related to the economic impact of Codex.

If you are a company or user of Codex models and want to learn how you can contribute to this work moving forward, please fill out this form.

Learn how to contribute

We are in the process of connecting researchers with firms that are best equipped to support particular research interests. If you’re interested in learning more about how your organization can support or sponsor research on economic impacts of AI systems, please contact us here.

Additional Information

If you have any questions about the submission forms or the call for expressions of interest, please contact us at econ@openai.com.


Acknowledgments

Thanks to Steven Adler, Lama Ahmad, Stephanie Bell, Miles Brundage, Katya Klinova, Gretchen Krueger, Jade Leung, Anna Makanju, Katie Mayer, Richard Ngo, Cullen O’Keefe, Girish Sastry, Sarah Shoker, and Natalie Staudacher for feedback on drafts of this document. Thanks to Michelle Alexopoulos, Sarah Bana, Alex Bartik, Erik Brynjolfsson, Tim de Stefano, Avi Goldfarb, Marlène Koffi, Mina Lee, Zanele Munyikwa, Mark Muro, Frank Nagle, Maria del Rio-Chanona, Daniel Rock, Anna Salomons, and Ben Weidmann for helpful discussions on potential avenues for research on the economic impacts of code generation models.


References
  1. Chetty, Raj, et al. “The fading American dream: Trends in absolute income mobility since 1940.” Science 356.6336 (2017): 398-406; Saez, Emmanuel, and Gabriel Zucman. “The rise of income and wealth inequality in America: Evidence from distributional macroeconomic accounts.” Journal of Economic Perspectives 34.4 (2020): 3-26.

  2. Autor, David, David Mindell, and Elisabeth Reynolds. “The work of the future: Building better jobs in an age of intelligent machines.” Boston: MIT (2020). https://workofthefuture.mit.edu/wp-content/uploads/2021/01/2020-Final-Report4.pdf

  3. Brand, Jennie E. “The far-reaching impact of job loss and unemployment.” Annual review of sociology 41 (2015): 359-375.

  4. Van de Werfhorst, Herman G., and Wiemer Salverda. “Consequences of economic inequality: Introduction to a special issue.” Research in Social Stratification and Mobility 30.4 (2012): 377-387.



Solving (Some) Formal Math Olympiad Problems

We built a neural theorem prover for Lean that learned to solve a variety of challenging high-school olympiad problems, including problems from the AMC12 and AIME competitions, as well as two problems adapted from the IMO.[1] The prover uses a language model to find proofs of formal statements. Each time we find a new proof, we use it as new training data, which improves the neural network and enables it to iteratively find solutions to harder and harder statements.

Read Paper

We achieved a new state-of-the-art (41.2% vs 29.3%) on the miniF2F benchmark, a challenging collection of high-school olympiad problems. Our approach, which we call statement curriculum learning, consists of manually collecting a set of statements of varying difficulty levels (without proof) where the hardest statements are similar to the benchmark we target. Initially our neural prover is weak and can only prove a few of them. We iteratively search for new proofs and re-train our neural network on the newly discovered proofs, and after 8 iterations, our prover ends up being vastly superior when tested on miniF2F.

Formal mathematics is an exciting domain to study because of (i) its richness, letting you prove arbitrary theorems which require reasoning, creativity and insight, and (ii) its similarity to games—where AI has been spectacularly successful—in that it has an automated way of determining whether a proof is successful (i.e., verified by the formal system). As demonstrated in the trivial example below, proving a formal statement requires generating a sequence of proof steps, each proof step consisting of a call to a tactic.[2] These tactics take mathematical terms as arguments, and each tactic call transforms the current statement to prove into statements that are easier to prove, until nothing is left to prove.

Problem 1
Adapted from AMC12 2000 Problem 5

Prove that if $|x - 2| = p$, where $x < 2$, then $x - p = 2 - 2p$.


theorem amc12_2000_p5      -- ← theorem name
  (x p : ℝ)                -- ← the statement we want
  (h₀ : x < 2)             --   to prove
  (h₁ : abs (x - 2) = p) :
  x - p = 2 - 2 * p :=
begin                      -- ← formal proof starts here
  -- This first tactic requires that the prover invent
  -- the term: `abs (x - 2) = -(x - 2)`.
  have h₂ : abs (x - 2) = -(x - 2), {
    apply abs_of_neg,
    linarith,
  },
  rw h₁ at h₂,
  -- At this stage the remaining goal to prove is:
  -- `x - p = 2 - 2 * p` knowing that `p = -(x - 2)`.
  linarith,
end

Since $x < 2$, $|x - 2| = -(x - 2) = 2 - x$. Using $p = |x - 2|$ we have $x = 2 - p$, and finally $x - p = 2 - 2p$.

We observe that the capability to generate original mathematical terms required as arguments of tactics, which cannot be done without a neural language model, emerges from our training procedure. The proof below is an example of it: the proof step use n + 1 (entirely generated by our models) proposes n + 1 as a solution, and the rest of the formal proof relies on the ring_exp tactic to verify that it is indeed valid.

Problem 2
Adapted from AMC12B 2020 Problem 6

For all integers $n \geq 9$, prove that $((n + 2)! - (n + 1)!) / n!$ is a perfect square.


theorem amc12b_2020_p6
  (n : ℕ)
  (h0 : 9 ≤ n) :
  ∃ x : ℕ, (x:ℝ)^2 = 
    (nat.factorial (n + 2) - nat.factorial (n + 1))
    / nat.factorial n :=
begin
  -- The model directly proposes `n + 1` as solution.
  use n + 1,
  field_simp [nat.factorial_ne_zero, pow_succ'],
  ring_exp
end

Expanding the expression we get:

$$((n + 2)! - (n + 1)!) / n! = ((n + 2)(n + 1)n! - (n + 1)n!) / n!$$

Dividing by $n!$ we obtain:

$$(n + 2)(n + 1) - (n + 1)$$

Factoring $(n + 1)$ we get: $(n + 1)(n + 2 - 1) = (n + 1)^2$, which concludes the proof.

We also observe that our models and search procedure are capable of producing proofs that chain multiple non-trivial reasoning steps. In the proof below, the model starts by using contraposition leading to the existential statement (∃ (x : ℝ), f x ≠ a * x + b). It then generates a witness for it with use (0 : ℝ) and finishes the proof by leveraging the norm_num tactic.

Problem 3
Adapted from the MATH dataset

Let $f(x) = Ax + B$ and $g(x) = Bx + A$, where $A \ne B$. If $f(g(x)) - g(f(x)) = B - A$, prove that $A + B = 0$.


theorem mathd_train_algebra_217
  (a b : ℝ)
  (f g : ℝ → ℝ)
  (h₀ : ∀ x, f x = a * x + b)
  (h₁ : ∀ x, f x = b * x + a)
  (h₂ : a ≠ b)
  (h₃ : ∀ x, f (g x) - g (f x) = b - a) :
  a + b = 0 :=
begin
  revert h₀ h₁ h₂ h₃,
  -- Initial contraposition.
  contrapose!,
  rintro ⟨h₀, ⟨h₁, h₂⟩⟩,
  -- The model proposes `0` as witness for the current
  -- goal that consists in `∃ (x : ℝ), f x ≠ a * x + b`.
  use (0 : ℝ),
  simp only [sub_eq_iff_eq_add, h₀, mul_zero, zero_add],
  norm_num at h₀,
end

First we find that:

$$f(g(x)) = A(Bx + A) + B = ABx + A^2 + B$$

$$g(f(x)) = B(Ax + B) + A = ABx + B^2 + A$$

Now we plug this back in $f(g(x)) - g(f(x)) = B - A$ and get:

$$(ABx + A^2 + B) - (ABx + B^2 + A) = B - A$$

That is:

$$A^2 - B^2 + B - A = B - A$$

Hence:

$$A^2 - B^2 = (A - B)(A + B) = 0$$

Since we are given that $A \ne B$, necessarily, $A + B = 0$.

Our models, trained with statement curriculum learning, were able to close a variety of problems from training textbooks as well as AMC12 and AIME competitions, and 2 problems adapted from the IMO. We present below three examples of such generated proofs.

Problem 4
Adapted from IMO 1964 Problem 2

Suppose $a$, $b$, $c$ are the sides of a triangle.
Prove that $a^2(b + c - a) + b^2(c + a - b) + c^2(a + b - c) \leq 3abc$.


theorem imo_1964_p2
  (a b c : ℝ)
  (h₀ : 0 < a ∧ 0 < b ∧ 0 < c)
  (h₁ : c < a + b)
  (h₂ : b < a + c)
  (h₃ : a < b + c) :
  a^2 * (b + c - a) + b^2 * (c + a - b) + c^2 * (a + b - c) 
    ≤ 3 * a * b * c :=
begin
  -- Arguments to `nlinarith` are fully invented by our model.
  nlinarith [sq_nonneg (b - a),
             sq_nonneg (c - b),
             sq_nonneg (c - a)]
end

Rearrange to get $a(a-b)(a-c) + b(b-a)(b-c) + c(c-a)(c-b) \geq 0$ which is true by Schur’s inequality.

Problem 5
Adapted from AIME 1984 Problem 1

Prove that $a_2 + a_4 + a_6 + a_8 + \ldots + a_{98} = 93$ if $a_1$, $a_2$, $a_3, \ldots$ is an arithmetic progression with common difference $1$, and $a_1 + a_2 + a_3 + \ldots + a_{98} = 137$.


theorem aime_1984_p1
  (u : ℕ → ℚ)
  (h₀ : ∀ n, u (n + 1) = u n + 1)
  (h₁ : ∑ k in finset.range 98, u k.succ = 137) :
  ∑ k in finset.range 49, u (2 * k.succ) = 93 :=
begin
  rw finset.sum_eq_multiset_sum,
  dsimp [finset.range] at h₁,
  simp [h₀],
  ring,
  norm_num at h₁,
  norm_num,
  apply eq_of_sub_eq_zero,
  { simp only [*, abs_of_pos, add_zero] at *, linarith },
end

For $n \geq 1$ we have $a_{2n-1} = a_{2n} - 1$. Substituting this into the equation given, we get:

$$(a_2 - 1) + a_2 + (a_4 - 1) + a_4 + \ldots + (a_{98} - 1) + a_{98} = 137$$

But the left-hand side is simply $2(a_2 + a_4 + a_6 + \ldots + a_{98}) - 49$, so:

$$a_2 + a_4 + a_6 + \ldots + a_{98} = (137 + 49) / 2 = 93$$

Problem 6

Adapted from IMO Longlist 1990 Problem 77[3]
For $a, b, c$ reals, prove that $(a^2 + ab + b^2)(b^2 + bc + c^2)(c^2 + ca + a^2) \geq (ab + bc + ca)^3$.


theorem imo_longlist_1990_p77
  (a b c : ℝ) :
  (a * b + b * c + c * a)^3 ≤
    (a^2 + a * b + b^2) * (b^2 + b * c + c^2) *
    (c^2 + c * a + a^2) :=
begin
  -- The three initial steps use Cauchy–Schwarz to prove
  -- `(a * b + b * c) ^ 2 ≤ (a ^ 2 + b ^ 2) * (b ^ 2 + c ^ 2)`
  -- which is required for the final call to `nlinarith`.
  let u : euclidean_space ℝ (fin 2) := ![a, b],
  let v : euclidean_space ℝ (fin 2) := ![b, c],
  have h₀ := real_inner_mul_inner_self_le u v,
  simp [u, v, fin.sum_univ_succ, 
        ←pow_two, ←pow_two, le_of_lt, mul_assoc] at h₀,
  -- The model introduces another required cut (i.e. invent
  -- the term `0 ≤ (c + a) * (c + a)` and proves it).
  have h₃ : 0 ≤ (c + a) * (c + a),
  { nlinarith, },
  have h₄ := sq_nonneg (a * b + b * c + c * a),
  simp [sq, h₀, h₃, mul_add, add_mul] at h₄ ⊢,
  nlinarith [sq_nonneg (b - a),
             sq_nonneg (c - b),
             sq_nonneg (a - c)]
end

After cancelling terms appearing on both sides, we are left to prove that:

$$3a^2b^2c^2 + \sum_{\text{sym}} a^3b^2c \leq \sum_{\text{cyc}} a^4bc + \sum_{\text{cyc}} (a^4b^2 + b^4c^2)$$

After multiplying both sides by $2$, we can rearrange the above inequality to:

$$0 \leq \sum_{\text{cyc}} (a^2b + a^2c - b^2c)^2$$

which clearly holds, giving the claim.

Formal mathematics involves two main challenges that make a naive application of reinforcement learning unlikely to succeed.

  • (i) Infinite action space: not only does formal mathematics have an extremely large search space (like Go for example), it also has an infinite action space. At each step of a proof search, the model must choose not from a well-behaved finite set of actions, but a complex and infinite set of tactics, involving exogenous mathematical terms that have to be generated (e.g., generating a mathematical statement to be used as a witness, an object used in steps such as “there exists an $x$ s.t. …”, or a cut, the introduction and the chaining of a lemma in the middle of a proof).
  • (ii) Lack of self-play: unlike 2-player games, a prover is not playing against an opponent but against a set of statements to prove. When faced with a statement that is just too hard, there is no obvious reframing that will let the prover generate intermediate, easier statements to tackle first. This asymmetry prevents naive application of the self-play algorithms that were successful with 2-player games.

In our work, we address the infinite action space problem by sampling actions from a language model as we search for a proof. Language models have the capability to generate the tactic calls as well as the original mathematical terms often required as arguments. Our basis for addressing the lack of self-play is the observation that the key role of self-play in 2-player games is to provide an unsupervised curriculum. Our methodology proposes to replace this unsupervised curriculum with an auxiliary set of problem statements (without requiring proofs) of varying difficulty. We empirically show that, when the difficulty of these auxiliary problems is varied enough, our training procedure is able to solve a curriculum of increasingly difficult problems, eventually generalizing to the set of problems we care about.
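As a conceptual sketch of this loop (Python pseudocode only; search_proofs and retrain are hypothetical stand-ins for our model-guided proof search against the formal verifier and for our fine-tuning code, not the actual implementation):

from typing import Callable, Dict, List, Tuple

def statement_curriculum_learning(
    statements: List[str],                                   # auxiliary statements, no proofs provided
    search_proofs: Callable[[str], List[Tuple[str, bool]]],  # model-guided search; yields (proof, verified) pairs
    retrain: Callable[[Dict[str, str]], None],               # fine-tune the language model on found proofs
    iterations: int = 8,
) -> Dict[str, str]:
    proved: Dict[str, str] = {}                              # statement -> first verified proof found
    for _ in range(iterations):
        for statement in statements:
            if statement in proved:
                continue
            for proof, verified in search_proofs(statement):
                if verified:                                 # the formal system accepts the proof
                    proved[statement] = proof
                    break
        retrain(proved)                                      # newly found proofs become training data
    return proved

Each round, the prover typically closes a few more of the auxiliary statements, and retraining on those proofs is what eventually lets it reach the harder statements and generalize to the target set.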

While these results are extremely exciting, as they demonstrate that deep learning models are capable of non-trivial mathematical reasoning when interacting with a formal system, we are still very far from best-student performance on these competitions, only occasionally, rather than consistently, closing challenging olympiad problems. We hope nonetheless that our work will motivate research in this domain, in particular towards the IMO Grand Challenge, and that the statement curriculum learning methodology we propose will help accelerate progress in automated reasoning in general.


Acknowledgments

Thanks to our paper co-authors: Igor Babuschkin, Kunhao Zheng and Mantas Baksys.

Thanks to the students of the Xena Project Discord who helped us formalize proofs and statements (in particular: Antoine Labelle, Hanting Zhang, Shing Tak Lam, Paul Lezeau, Sara Diaz, Nikita Golikov, Yael Dillies, Artem Vasilyev, Ollie Perree, and Yourong Zang).

Thanks in particular to Kevin Buzzard and Daniel Selsam for their support and thoughtful feedback since the very beginning of this project.


Footnotes

  1. These problems are not standard math exercises; they are used to let the best high-school students from the US (AMC12, AIME) or the world (IMO) compete against each other. ↩︎

  2. The artifacts accepted by the formal system are low-level (like assembly code) and hard for humans to produce. Tactics are search procedures that generate such artifacts from higher level directives to assist formalization. ↩︎

  3. This proof is not reported in the paper as it was found by a more recent model we are still experimenting with. We decided to share it nonetheless because it is one of our favourites. ↩︎


Aligning Language Models to Follow Instructions

We’ve trained language models that are much better at following user intentions than GPT-3 while also making them more truthful and less toxic, using techniques developed through our alignment research. These InstructGPT models, which are trained with humans in the loop, are now deployed as the default language models on our API.

Read Paper
View Model Card
InstructGPT is better than GPT-3 at following English instructions.
Like GPT-3, InstructGPT can respond to tasks defined implicitly via a prompt, without an explicit instruction.
InstructGPT can give wrong or misleading outputs when the instruction assumes a premise that is not true.
When given a sensitive prompt or instruction, InstructGPT is less likely to produce biased or toxic outputs than GPT-3.
Since InstructGPT is trained to follow instructions, it can be susceptible to misuse.

GPT-3 models aren’t trained to follow user instructions. Our InstructGPT models (highlighted) generate much more helpful outputs in response to user instructions.

The OpenAI API is powered by GPT-3 language models which can be coaxed to perform natural language tasks using carefully engineered text prompts. But these models can also generate outputs that are untruthful, toxic, or reflect harmful sentiments. This is in part because GPT-3 is trained to predict the next word on a large dataset of Internet text, rather than to safely perform the language task that the user wants. In other words, these models aren’t aligned with their users.

To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API,[1] our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT-3.

The resulting InstructGPT models are much better at following instructions than GPT-3. They also make up facts less often, and show small decreases in toxic output generation. Our labelers prefer outputs from our 1.3B parameter InstructGPT model over outputs from a 175B parameter GPT-3 model, even though the InstructGPT model has more than 100x fewer parameters. At the same time, we show that we don’t have to compromise on GPT-3’s capabilities, as measured by our model’s performance on academic NLP evaluations.

These InstructGPT models, which have been in beta on the API for more than a year, are now the default language models accessible on our API.[2] We believe that fine-tuning language models with humans in the loop is a powerful tool for improving their safety and reliability, and we will continue to push in this direction.

This is the first time our alignment research, which we’ve been pursuing for several years, has been applied to our product. Our work is also related to recent research that fine-tunes language models to follow instructions using academic NLP datasets, notably FLAN and T0. A key motivation for our work is to increase helpfulness and truthfulness while mitigating the harms and biases of language models. Some of our previous research in this direction found that we can reduce harmful outputs by fine-tuning on a small curated dataset of human demonstrations. Other research has focused on filtering the pre-training dataset, safety-specific control tokens, or steering model generations. We are exploring these ideas and others in our ongoing alignment research.

Results

We first evaluate how well outputs from InstructGPT follow user instructions, by having labelers compare its outputs to those from GPT-3. We find that InstructGPT models are significantly preferred on prompts submitted to both the InstructGPT and GPT-3 models on the API. This holds true when we add a prefix to the GPT-3 prompt so that it enters an “instruction-following mode.”

Quality ratings of model outputs on a 1–7 scale (y-axis), for various model sizes (x-axis), on prompts submitted to InstructGPT models on our API. InstructGPT outputs are given much higher scores by our labelers than outputs from GPT-3 with a few-shot prompt and without, as well as models fine-tuned with supervised learning. We find similar results for prompts submitted to GPT-3 models on the API.

To measure the safety of our models, we primarily use a suite of existing metrics on publicly available datasets. Compared to GPT-3, InstructGPT produces fewer imitative falsehoods (according to TruthfulQA) and is less toxic (according to RealToxicityPrompts). We also conduct human evaluations on our API prompt distribution, and find that InstructGPT makes up facts (“hallucinates”) less often, and generates more appropriate outputs.[3]

Metric                              Dataset        GPT      Supervised Fine-Tuning   InstructGPT
RealToxicity                        Public         0.233    0.199                    0.196
TruthfulQA                          Public         0.224    0.206                    0.413
Hallucinations                      API prompts    0.414    0.078                    0.172
Customer Assistant Appropriate      API prompts    0.811    0.880                    0.902

Evaluating InstructGPT for toxicity, truthfulness, and appropriateness. Lower scores are better for toxicity and hallucinations, and higher scores are better for TruthfulQA and appropriateness. Hallucinations and appropriateness are measured on our API prompt distribution. Results are combined across model sizes.

Finally, we find that InstructGPT outputs are preferred to those from FLAN and T0 on our customer distribution. This indicates that the data used to train FLAN and T0, mostly academic NLP tasks, is not fully representative of how deployed language models are used in practice.

Methods

To train InstructGPT models, our core technique is reinforcement learning from human feedback (RLHF), a method we helped pioneer in our earlier alignment research. This technique uses human preferences as a reward signal to fine-tune our models, which is important as the safety and alignment problems we are aiming to solve are complex and subjective, and aren’t fully captured by simple automatic metrics.

We first collect a dataset of human-written demonstrations on prompts submitted to our API, and use this to train our supervised learning baselines. Next, we collect a dataset of human-labeled comparisons between two model outputs on a larger set of API prompts. We then train a reward model (RM) on this dataset to predict which output our labelers would prefer. Finally, we use this RM as a reward function and fine-tune our GPT-3 policy to maximize this reward using the PPO algorithm.
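As an illustration of the reward-modeling step, here is a hedged sketch of the pairwise comparison loss commonly used in RLHF; `reward_model` and the token tensors are placeholders rather than our production code.

import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt_tokens, chosen_tokens, rejected_tokens):
    """Pairwise preference loss: push the reward of the labeler-preferred
    completion above the reward of the dispreferred one."""
    r_chosen = reward_model(prompt_tokens, chosen_tokens)      # shape: (batch,)
    r_rejected = reward_model(prompt_tokens, rejected_tokens)  # shape: (batch,)
    # Maximizing log sigmoid of the reward margin is a logistic loss on the
    # event "labelers preferred `chosen` over `rejected`".
    return -F.logsigmoid(r_chosen - r_rejected).mean()

The fitted reward model then supplies the scalar reward that PPO maximizes when fine-tuning the policy.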

One way of thinking about this process is that it “unlocks” capabilities that GPT-3 already had but that were difficult to elicit through prompt engineering alone: our training procedure has only a limited ability to teach the model new capabilities, since it uses less than 2% of the compute and data that went into pretraining.

A limitation of this approach is that it introduces an “alignment tax”: aligning the models only on customer tasks can make their performance worse on some other academic NLP tasks. This is undesirable since, if our alignment techniques make models worse on tasks that people care about, they’re less likely to be adopted in practice. We’ve found a simple algorithmic change that minimizes this alignment tax: during RL fine-tuning we mix in a small fraction of the original data used to train GPT-3, and train on this data using the normal log likelihood maximization.[4] This roughly maintains performance on safety and human preferences and mitigates the performance decreases on academic tasks, in several cases even surpassing the GPT-3 baseline.
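The mitigation can be summarized as a single extra loss term. Below is a hedged sketch; `ppo_loss`, `policy`, the batches, and the mixing coefficient are all placeholders for illustration, not the exact objective or values we use.

def combined_update_loss(policy, rl_batch, pretrain_batch, ppo_loss, ptx_coef):
    """Sketch of RL fine-tuning with a pretraining mix (placeholders throughout)."""
    loss_rl = ppo_loss(policy, rl_batch)  # RL objective on API prompts
    # Standard next-token log-likelihood on a sample of the original GPT-3
    # pretraining distribution; this term counteracts the alignment tax.
    loss_ptx = -policy.log_prob(pretrain_batch).mean()
    return loss_rl + ptx_coef * loss_ptx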

Generalizing to broader preferences

Our procedure aligns our models’ behavior with the preferences of our labelers, who directly produce the data used to train our models, and us researchers, who provide guidance to labelers through written instructions, direct feedback on specific examples, and informal conversations. It is also influenced by our customers and the preferences implicit in our API policies. We selected labelers who performed well on a screening test for aptitude in identifying and responding to sensitive prompts. However, these different sources of influence on the data do not guarantee our models are aligned to the preferences of any broader group.

We conducted two experiments to investigate this. First, we evaluated GPT-3 and InstructGPT using held-out labelers[5] who did not produce any of the training data, and found that these labelers prefer outputs from the InstructGPT models at about the same rate as our training labelers. Second, we trained reward models on data from a subset of our labelers, and found that they generalize well to predicting the preferences of a different subset of labelers. This suggests that our models haven’t simply overfit to the preferences of our training labelers. However, more work is needed to study how these models perform on broader groups of users, and how they perform on inputs where humans disagree about the desired behavior.

Limitations

Despite making significant progress, our InstructGPT models are far from fully aligned or fully safe; they still generate toxic or biased outputs, make up facts, and generate sexual and violent content without explicit prompting. But the safety of a machine learning system depends not only on the behavior of the underlying models, but also on how these models are deployed. To support the safety of our API, we will continue to review potential applications before they go live, provide content filters for detecting unsafe completions, and monitor for misuse.

A byproduct of training our models to follow user instructions is that they may become more susceptible to misuse if instructed to produce unsafe outputs. Solving this requires our models to refuse certain instructions; doing this reliably is an important open research problem that we are excited to tackle.

Further, in many cases aligning to the average labeler preference may not be desirable. For example, when generating text that disproportionately affects a minority group, the preferences of that group should be weighted more heavily. Right now, InstructGPT is trained to follow instructions in English; thus, it is biased towards the cultural values of English-speaking people. We are conducting research into understanding the differences and disagreements between labelers’ preferences so we can condition our models on the values of more specific populations. More generally, aligning model outputs to the values of specific humans introduces difficult choices with societal implications, and ultimately we must establish responsible, inclusive processes for making these decisions.

Next steps

This is the first application of our alignment research to our product. Our results show that these techniques are effective at significantly improving the alignment of general-purpose AI systems with human intentions. However, this is just the beginning: we will keep pushing these techniques to improve the alignment of our current and future models towards language tools that are safe and helpful to humans.

If you’re interested in these research directions, we’re hiring!


References
  1. Christiano, P., Leike, J., Brown, T.B., Martic, M., Legg, S. and Amodei, D., 2017. Deep reinforcement learning from human preferences. arXiv preprint arXiv:1706.03741.
  2. Stiennon, N., Ouyang, L., Wu, J., Ziegler, D.M., Lowe, R., Voss, C., Radford, A., Amodei, D. and Christiano, P., 2020. Learning to summarize from human feedback. arXiv preprint arXiv:2009.01325.
  3. Wu, J., Ouyang, L., Ziegler, D.M., Stiennon, N., Lowe, R., Leike, J. and Christiano, P., 2021. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.
  4. Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M. and Le, Q.V., 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
  5. Sanh, V., Webson, A., Raffel, C., Bach, S.H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T.L., Raja, A. and Dey, M., 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.
  6. Bender, E.M., Gebru, T., McMillan-Major, A. and Shmitchell, S., 2021, March. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?🦜. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610-623).
  7. Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E. and Brynjolfsson, E., 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
  8. Kenton, Z., Everitt, T., Weidinger, L., Gabriel, I., Mikulik, V. and Irving, G., 2021. Alignment of Language Agents. arXiv preprint arXiv:2103.14659.
  9. Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A. and Kenton, Z., 2021. Ethical and social risks of harm from Language Models. arXiv preprint arXiv:2112.04359.
  10. Tamkin, A., Brundage, M., Clark, J. and Ganguli, D., 2021. Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models. arXiv preprint arXiv:2102.02503.
  11. Solaiman, I. and Dennison, C., 2021. Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets. arXiv preprint arXiv:2106.10328.
  12. Ngo, H., Raterink, C., Araújo, J.G., Zhang, I., Chen, C., Morisot, A. and Frosst, N., 2021. Mitigating harm in language models with conditional-likelihood filtration. arXiv preprint arXiv:2108.07790.
  13. Xu, J., Ju, D., Li, M., Boureau, Y.L., Weston, J. and Dinan, E., 2020. Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079.
  14. Keskar, N.S., McCann, B., Varshney, L.R., Xiong, C. and Socher, R., 2019. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
  15. Krause, B., Gotmare, A.D., McCann, B., Keskar, N.S., Joty, S., Socher, R. and Rajani, N.F., 2020. Gedi: Generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367.
  16. Dathathri, S., Madotto, A., Lan, J., Hung, J., Frank, E., Molino, P., Yosinski, J. and Liu, R., 2019. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.
  17. Lin, S., Hilton, J. and Evans, O., 2021. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
  18. Gehman, S., Gururangan, S., Sap, M., Choi, Y. and Smith, N.A., 2020. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462.
  19. Rudinger, R., Naradowsky, J., Leonard, B. and Van Durme, B., 2018. Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301.
  20. Nangia, N., Vania, C., Bhalerao, R. and Bowman, S.R., 2020. CrowS-pairs: A challenge dataset for measuring social biases in masked language models. arXiv preprint arXiv:2010.00133.


Acknowledgments

We’d like to thank our paper co-authors: Long Ouyang, Jeff Wu, Roger Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, and Paul Christiano, along with everyone who provided feedback on the paper and blog post. We’d also like to thank the Comms team for their guidance and assistance, including Steve Dowling, Hannah Wong, Elie Georges, Alper Ercetin, Jared Salzano, Allan Diego, and Justin Jay Wang. Finally, we’d like to thank our labelers, without whom this project would not have been possible.


Footnotes

  1. We only use prompts submitted through the Playground to an earlier version of the InstructGPT models that was deployed in January 2021. Our human annotators remove personally identifiable information from all prompts before adding them to the training set. ↩︎

  2. The InstructGPT models deployed in the API are updated versions trained using the same human feedback data. They use a similar but slightly different training method that we will describe in a forthcoming publication. ↩︎

  3. We also measure several other dimensions of potentially harmful outputs on our API distribution: whether the outputs contain sexual or violent content, denigrate a protected class, or encourage abuse. We find that InstructGPT doesn’t improve significantly over GPT-3 on these metrics; the incidence rate is equally low for both models. ↩︎

  4. We found this approach more effective than simply increasing the KL coefficient. ↩︎

  5. These labelers are sourced from Scale AI and Upwork, similarly to our training labelers, but do not undergo a screening test. ↩︎


Introducing Text and Code Embeddings in the OpenAI API


We are introducing embeddings, a new endpoint in the OpenAI API that makes it easy to perform natural language and code tasks like semantic search, clustering, topic modeling, and classification. Embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts. Our embeddings outperform top models in 3 standard benchmarks, including a 20% relative improvement in code search.

Read documentation
Read paper

Embeddings are useful for working with natural language and code, because they can be readily consumed and compared by other machine learning models and algorithms like clustering or search.


Embeddings that are numerically similar are also semantically similar. For example, the embedding vector of “canine companions say” will be more similar to the embedding vector of “woof” than that of “meow.”


The new endpoint uses neural network models, which are descendants of GPT-3, to map text and code to a vector representation—“embedding” them in a high-dimensional space. Each dimension captures some aspect of the input.

The new /embeddings endpoint in the OpenAI API provides text and code embeddings with a few lines of code:

import openai
response = openai.Embedding.create(
    input="canine companions say",
    engine="text-similarity-davinci-001")

print(response)
{
  "data": [
    {
      "embedding": [
        0.000108064,
        0.005860855,
        -0.012656143,
        ...
        -0.006642727,
        0.002583989,
        -0.012567150
      ],
      "index": 0,
      "object": "embedding"
    }
  ],
  "model": "text-similarity-babbage:001",
  "object": "list"
}

We’re releasing three families of embedding models, each tuned to perform well on different functionalities: text similarity, text search, and code search. The models take either text or code as input and return an embedding vector.

  • Text similarity: Captures semantic similarity between pieces of text.
    Models: text-similarity-{ada, babbage, curie, davinci}-001
    Use cases: Clustering, regression, anomaly detection, visualization
  • Text search: Semantic information retrieval over documents.
    Models: text-search-{ada, babbage, curie, davinci}-{query, doc}-001
    Use cases: Search, context relevance, information retrieval
  • Code search: Find relevant code with a query in natural language.
    Models: code-search-{ada, babbage}-{code, text}-001
    Use cases: Code search and relevance

Text Similarity Models

Text similarity models provide embeddings that capture the semantic similarity of pieces of text. These models are useful for many tasks including clustering, data visualization, and classification.

The following interactive visualization shows embeddings of text samples from the DBpedia dataset:


Embeddings from the text-similarity-babbage-001 model, applied to the DBpedia dataset. We randomly selected 100 samples from the dataset covering 5 categories, and computed the embeddings via the /embeddings endpoint. The different categories show up as 5 clear clusters in the embedding space. To visualize the embedding space, we reduced the embedding dimensionality from 2048 to 3 using PCA. The code for visualizing the embedding space in 3D is available here.
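For readers who want to reproduce this kind of plot, the dimensionality-reduction step is straightforward; the snippet below is a sketch that uses randomly generated vectors in place of real /embeddings output.

import numpy as np
from sklearn.decomposition import PCA

# Placeholder for 100 embedding vectors (e.g. 2048-dimensional babbage embeddings).
embeddings = np.random.randn(100, 2048)

# Project down to 3 components so each sample becomes a point in 3-D space.
points_3d = PCA(n_components=3).fit_transform(embeddings)
print(points_3d.shape)  # (100, 3), ready for a 3-D scatter plot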

To compare the similarity of two pieces of text, you simply use the dot product on the text embeddings. The result is a “similarity score” (sometimes called “cosine similarity”) between 0 and 1, where a higher number means greater similarity. In most applications, the embeddings can be pre-computed, and the dot product comparison is then extremely fast to carry out.

import openai
import numpy as np

# Embed both pieces of text in a single API call.
resp = openai.Embedding.create(
    input=["feline friends go", "meow"],
    engine="text-similarity-davinci-001")

# Each item in `data` holds the embedding vector for the corresponding input.
embedding_a = resp['data'][0]['embedding']
embedding_b = resp['data'][1]['embedding']

# The embeddings are unit-length, so the dot product equals the cosine similarity.
similarity_score = np.dot(embedding_a, embedding_b)

One popular use of embeddings is to use them as features in machine learning tasks, such as classification. In the machine learning literature, when using a linear classifier, this classification task is called a “linear probe.” Our text similarity models achieve new state-of-the-art results on linear probe classification in SentEval (Conneau et al., 2018), a commonly used benchmark for evaluating embedding quality.
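As a concrete illustration of a linear probe, the sketch below freezes the embeddings and fits only a linear classifier on top of them; the feature matrix and labels are placeholders standing in for embeddings returned by the /embeddings endpoint and their class labels.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: one embedding vector per example plus a class label.
X = np.random.randn(500, 2048)
y = np.random.randint(0, 5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The "probe" is just a linear model; the embeddings themselves are never updated.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("linear-probe accuracy:", probe.score(X_test, y_test))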

Linear probe classification over 7 datasets

  Previous SOTA (Guo et al., 2021)   90.2%
  text-similarity-davinci-001        92.2%
  text-similarity-curie-001          91.5%
  text-similarity-babbage-001        91.1%
  text-similarity-ada-001            89.3%

Text Search Models

Text search models provide embeddings that enable large-scale search tasks, like finding a relevant document among a collection of documents given a text query. Embeddings for the documents and the query are produced separately, and cosine similarity is then used to compare the query against each document.

Embedding-based search can generalize better than the word-overlap techniques used in classical keyword search, because it captures the semantic meaning of text and is less sensitive to exact phrases or words. We evaluate the text search models’ performance on the BEIR (Thakur et al., 2021) search evaluation suite and obtain better search performance than previous methods. Our text search guide provides more details on using embeddings for search tasks.
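A minimal sketch of this setup is shown below: documents are embedded with a -doc model, the query with the matching -query model, and results are ranked by dot product. The document strings and the choice of the babbage model are illustrative.

import numpy as np
import openai

# Embed a small document collection with the -doc model.
documents = ["A guide to training puppies", "Quarterly revenue report"]
doc_resp = openai.Embedding.create(
    input=documents,
    engine="text-search-babbage-doc-001")
doc_vecs = np.array([d["embedding"] for d in doc_resp["data"]])

# Embed the query with the paired -query model.
query_resp = openai.Embedding.create(
    input="how do I teach my dog to sit",
    engine="text-search-babbage-query-001")
query_vec = np.array(query_resp["data"][0]["embedding"])

# One similarity score per document; the highest score is the best match.
scores = doc_vecs @ query_vec
print(documents[int(np.argmax(scores))])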

Average accuracy over 11 search tasks in BEIR

  Previous SOTA (Izacard et al., 2021)    50.2%
  text-search-davinci-{doc, query}-001    52.8%
  text-search-curie-{doc, query}-001      50.9%
  text-search-babbage-{doc, query}-001    50.4%
  text-search-ada-{doc, query}-001        49.0%

Code Search Models

Code search models provide code and text embeddings for code search tasks. Given a collection of code blocks, the task is to find the relevant code block for a natural language query. We evaluate the code search models on the CodeSearchNet (Husain et al., 2019) evaluation suite, where our embeddings achieve significantly better results than prior methods. Check out the code search guide to use embeddings for code search.
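The usage mirrors text search, with the -code model for the code blocks and the -text model for the natural-language query; the snippet below is an illustrative sketch, not an excerpt from the code search guide.

import numpy as np
import openai

# Embed candidate code blocks with the -code model.
code_blocks = [
    "def mean(xs): return sum(xs) / len(xs)",
    "def load_json(path): import json; return json.load(open(path))",
]
code_resp = openai.Embedding.create(
    input=code_blocks,
    engine="code-search-babbage-code-001")
code_vecs = np.array([d["embedding"] for d in code_resp["data"]])

# Embed the natural-language query with the paired -text model.
query_resp = openai.Embedding.create(
    input="load a json file from disk",
    engine="code-search-babbage-text-001")
query_vec = np.array(query_resp["data"][0]["embedding"])

# Rank code blocks by dot-product similarity and print the best match.
print(code_blocks[int(np.argmax(code_vecs @ query_vec))])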

Average accuracy over 6 programming languages

  Previous SOTA (Guo et al., 2021)        77.4%
  code-search-babbage-{doc, query}-001    93.5%
  code-search-ada-{doc, query}-001        93.4%

Examples of the Embeddings API in Action

JetBrains Research

JetBrains Research’s Astroparticle Physics Lab analyzes data like The Astronomer’s Telegram and NASA’s GCN Circulars, reports of astronomical events that can’t be parsed by traditional algorithms.

Powered by OpenAI’s embeddings of these astronomical reports, researchers are now able to search for events like “crab pulsar bursts” across multiple databases and publications. Embeddings also achieved 99.85% accuracy on data source classification through k-means clustering.

FineTune Learning

FineTune Learning is a company building hybrid human-AI solutions for learning, like adaptive learning loops that help students reach academic standards.

OpenAI’s embeddings significantly improved the task of finding textbook content based on learning objectives. Achieving a top-5 accuracy of 89.1%, OpenAI’s text-search-curie embeddings model outperformed previous approaches like Sentence-BERT (64.5%). While human experts are still better, the FineTune team is now able to label entire textbooks in a matter of seconds, in contrast to the hours that it took the experts.

Comparison of our embeddings with Sentence-BERT, GPT-3 search, and human subject-matter experts for matching textbook content with learning objectives. We report accuracy@k, the fraction of cases in which the correct answer is within the top-k predictions.

Fabius

Fabius helps companies turn customer conversations into structured insights that inform planning and prioritization. OpenAI’s embeddings allow companies to more easily find and tag customer call transcripts with feature requests.

For instance, customers might use words like “automated” or “easy to use” to ask for a better self-service platform. Previously, Fabius was using fuzzy keyword search to attempt to tag those transcripts with the self-service platform label. With OpenAI’s embeddings, they’re now able to find 2x more examples in general, and 6x–10x more examples for features with abstract use cases that don’t have a clear keyword customers might use.

All API customers can get started with the embeddings documentation to learn how to use embeddings in their applications.

Read documentation


Acknowledgments

Thanks to the following for their contributions to this release:

Tao Xu, Chris Hallacy, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Johannes Heidecke, Pranav Shyam, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr, Felipe Petroski Such, Kenny Hsu, Madeleine Thompson, Tabarak Khan, and Toki Sherbakov.

Thanks to the following for their feedback on this post: Tom Kleinpeter, Morgan Gallant, Sam Altman, Ilya Sutskever, Steve Dowling, Rachel Lim, Arun Vijayvergiya, Rajeev Nayak, Peter Welinder, Justin Jay Wang.

