Prepare and analyze JSON and ORC data with Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler is a new capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare data for machine learning (ML) applications via a visual interface. Data preparation is a crucial step of the ML lifecycle, and Data Wrangler provides an end-to-end solution to import, prepare, transform, featurize, and analyze data for ML in a seamless, visual, low-code experience. It lets you easily and quickly connect to AWS components like Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, and AWS Lake Formation, and external sources like Snowflake. Data Wrangler also supports standard file formats such as CSV and Parquet.

Data Wrangler now additionally supports Optimized Row Columnar (ORC), JavaScript Object Notation (JSON), and JSON Lines (JSONL) file formats:

  • ORC – The ORC file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data. ORC is widely used in the Hadoop ecosystem.
  • JSON – The JSON file format is a lightweight, commonly used data interchange format.
  • JSONL – JSON Lines, also called newline-delimited JSON, is a convenient format for storing structured data that may be processed one record at a time.
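To make the difference between JSON and JSONL concrete, here is a minimal sketch using Python's standard `json` module. The sample records are made up for illustration and are not taken from the dataset used in this post.

```python
import json

# A JSON document is parsed as a whole; here it holds an array of records.
json_text = '[{"id": 1, "type": "donut"}, {"id": 2, "type": "bagel"}]'
records_from_json = json.loads(json_text)

# A JSON Lines document holds one complete JSON value per line,
# so each record can be parsed independently of the others.
jsonl_text = '{"id": 1, "type": "donut"}\n{"id": 2, "type": "bagel"}\n'
records_from_jsonl = [
    json.loads(line) for line in jsonl_text.splitlines() if line.strip()
]

# Both layouts describe the same two records.
print(records_from_json == records_from_jsonl)
```

Because each JSONL line stands on its own, a reader can stream a large file record by record instead of loading the whole document first.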

You can preview ORC, JSON, and JSONL data prior to importing the datasets into Data Wrangler. After you import the data, you can also use one of the newly launched transformers to work with columns that contain JSON strings or arrays that are commonly found in nested JSONs.

Import and analyze ORC data with Data Wrangler

Importing ORC data in Data Wrangler is easy and similar to importing files in any other supported format. Browse to your ORC file in Amazon S3 and, in the DETAILS pane, choose ORC as the file type during import.

If you’re new to Data Wrangler, review Get Started with Data Wrangler. Also, see Import to learn about the various import options.

Import and analyze JSON data with Data Wrangler

Now let’s import files in JSON format with Data Wrangler and work with columns that contain JSON strings or arrays. We also demonstrate how to deal with nested JSONs. With Data Wrangler, importing JSON files from Amazon S3 is a seamless process. This is similar to importing files in any other supported formats. After you import the files, you can preview the JSON files as shown in the following screenshot. Make sure to set the file type to JSON in the DETAILS pane.

Next, let’s work on structured columns in the imported JSON file.

To deal with structured columns in JSON files, Data Wrangler is introducing two new transforms: Flatten structured column and Explode array column, which can be found under the Handle structured column option in the ADD TRANSFORM pane.

Let’s start by applying the Explode array column transform to one of the columns in our imported data. Before applying the transform, we can see the column topping is an array of JSON objects with id and type keys.

After we apply the transform, we can observe the new rows added as a result. Each element in the array is now a new row in the resulting DataFrame.

Now let’s apply the Flatten structured column transform on the topping_flattened column that was created as a result of the Explode array column transformation we applied in the previous step.

Before applying the transform, we can see the keys id and type in the topping_flattened column.

After applying the transform, we can now observe the keys id and type under the topping_flattened column as new columns topping_flattened_id and topping_flattened_type, which are created as a result of the transformation. You also have the option to flatten only specific keys by entering the comma-separated key names for Keys to flatten on. If left empty, all the keys inside the JSON string or struct are flattened.
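The effect of the two transforms can be mimicked in a few lines of plain Python. This is only a sketch of the row and column semantics described above, not Data Wrangler's implementation. The helper names `explode_array_column` and `flatten_struct_column`, the sample values, and the choice to drop the source column after each step are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def explode_array_column(rows: List[Row], column: str) -> List[Row]:
    """Emit one output row per array element, stored in `<column>_flattened`.

    Assumption: the original array column is dropped from the output.
    Rows whose array is empty produce no output rows in this sketch.
    """
    out = []
    for row in rows:
        for element in row[column]:
            new_row = {k: v for k, v in row.items() if k != column}
            new_row[f"{column}_flattened"] = element
            out.append(new_row)
    return out


def flatten_struct_column(
    rows: List[Row], column: str, keys: Optional[List[str]] = None
) -> List[Row]:
    """Promote keys of a dict-valued column to `<column>_<key>` columns.

    If `keys` is None, every key in the struct is flattened.
    """
    out = []
    for row in rows:
        struct = row[column]
        selected = keys if keys is not None else list(struct.keys())
        new_row = {k: v for k, v in row.items() if k != column}
        for key in selected:
            new_row[f"{column}_{key}"] = struct.get(key)
        out.append(new_row)
    return out


# Hypothetical sample data shaped like the `topping` column in this post.
rows = [
    {
        "name": "Cake",
        "topping": [
            {"id": "5001", "type": "None"},
            {"id": "5002", "type": "Glazed"},
        ],
    }
]

exploded = explode_array_column(rows, "topping")
flat = flatten_struct_column(exploded, "topping_flattened")
only_id = flatten_struct_column(exploded, "topping_flattened", keys=["id"])
print(flat)
```

Passing `keys=["id"]` corresponds to filling in Keys to flatten on, while leaving `keys` unset corresponds to leaving that field empty.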

Conclusion

In this post, we demonstrated how to import files in ORC and JSON formats easily with Data Wrangler. We also applied the newly launched transformations that allow us to transform any structured columns in JSON data. This makes working with columns that contain JSON strings or arrays a seamless experience.

As next steps, we recommend you replicate the demonstrated examples in your own Data Wrangler visual interface. If you have any questions related to Data Wrangler, feel free to leave them in the comment section.


About the Authors

Balaji Tummala is a Software Development Engineer at Amazon SageMaker. He helps support Amazon SageMaker Data Wrangler and is passionate about building performant and scalable software. Outside of work, he enjoys reading fiction and playing volleyball.

Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.

Can Robots Follow Instructions for New Tasks?

People can flexibly maneuver objects in their physical surroundings to accomplish various goals. One of the grand challenges in robotics is to successfully train robots to do the same, i.e., to develop a general-purpose robot capable of performing a multitude of tasks based on arbitrary user commands. Robots that are faced with the real world will also inevitably encounter new user instructions and situations that were not seen during training. Therefore, it is imperative for robots to be trained to perform multiple tasks in a variety of situations and, more importantly, to be capable of solving new tasks as requested by human users, even if the robot was not explicitly trained on those tasks.

Existing robotics research has made strides towards allowing robots to generalize to new objects, task descriptions, and goals. However, enabling robots to complete instructions that describe entirely new tasks has largely remained out-of-reach. This problem is remarkably difficult since it requires robots to both decipher the novel instructions and identify how to complete the task without any training data for that task. This goal becomes even more difficult when a robot needs to simultaneously handle other axes of generalization, such as variability in the scene and positions of objects. So, we ask the question: How can we confer noteworthy generalization capabilities onto real robots capable of performing complex manipulation tasks from raw pixels? Furthermore, can the generalization capabilities of language models help support better generalization in other domains, such as visuomotor control of a real robot?

In “BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning”, published at CoRL 2021, we present new research that studies how robots can generalize to new tasks that they were not trained to do. The system, called BC-Z, comprises two key components: (i) the collection of a large-scale demonstration dataset covering 100 different tasks and (ii) a neural network policy conditioned on a language or video instruction of the task. The resulting system can perform at least 24 novel tasks, including ones that require interaction with pairs of objects that were not previously seen together. We are also excited to release the robot demonstration dataset used to train our policies, along with pre-computed task embeddings.

The BC-Z system allows a robot to complete instructions for new tasks that the robot was not explicitly trained to do. It does so by training the policy to take as input a description of the task along with the robot’s camera image and to predict the correct action.

Collecting Data for 100 Tasks
Generalizing to a new task altogether is substantially harder than generalizing to held-out variations in training tasks. Simply put, we want robots to have more generalization all around, which requires that we train them on large amounts of diverse data.

We collect data by teleoperating the robot with a virtual reality headset. This data collection follows a scheme similar to how one might teach an autonomous car to drive. First, the human operator records complete demonstrations of each task. Then, once the robot has learned an initial policy, this policy is deployed under close supervision where, if the robot starts to make a mistake or gets stuck, the operator intervenes and demonstrates a correction before allowing the robot to resume.

This mixture of demonstrations and interventions has been shown to significantly improve performance by mitigating compounding errors. In our experiments, we see a 2x improvement in performance when using this data collection strategy compared to only using human demonstrations.

Example demonstrations collected for 12 out of the 100 training tasks, visualized from the perspective of the robot and shown at 2x speed.

Training a General-Purpose Policy
For all 100 tasks, we use this data to train a neural network policy to map from camera images to the position and orientation of the robot’s gripper and arm. Crucially, to allow this policy the potential to solve new tasks beyond the 100 training tasks, we also input a description of the task, either in the form of a language command (e.g., “place grapes in red bowl”) or a video of a person doing the task.

To accomplish a variety of tasks, the BC-Z system takes as input either a language command describing the task or a video of a person doing the task, as shown here.

By training the policy on 100 tasks and conditioning the policy on such a description, we unlock the possibility that the neural network will be able to interpret and complete instructions for new tasks. This is a challenge, however, because the neural network needs to correctly interpret the instruction, visually identify relevant objects for that instruction while ignoring other clutter in the scene, and translate the interpreted instruction and perception into the robot’s action space.
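As a rough illustration of what conditioning a policy on a task description means, the toy sketch below concatenates image features with a task embedding and maps the result to an action through a random linear layer. This is not the BC-Z architecture. The linear map, the dimensions, the embeddings, and the second task are hypothetical stand-ins chosen only to show that the same camera input yields different actions under different instructions.

```python
import random
from typing import Callable, List

Policy = Callable[[List[float], List[float]], List[float]]


def make_linear_policy(
    image_dim: int, task_dim: int, action_dim: int, seed: int = 0
) -> Policy:
    """Build a toy policy: action = W @ concat(image_features, task_embedding)."""
    rng = random.Random(seed)
    input_dim = image_dim + task_dim
    # Random weights stand in for a trained network (assumption).
    weights = [
        [rng.uniform(-1.0, 1.0) for _ in range(input_dim)]
        for _ in range(action_dim)
    ]

    def policy(image_features: List[float], task_embedding: List[float]) -> List[float]:
        if len(image_features) != image_dim or len(task_embedding) != task_dim:
            raise ValueError("unexpected input size")
        x = list(image_features) + list(task_embedding)
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return policy


# Hypothetical sizes: 3 image features, 2-d task embedding, 6-d action
# (for example, 3 position and 3 orientation components).
policy = make_linear_policy(image_dim=3, task_dim=2, action_dim=6)
image_features = [0.2, 0.5, 0.1]

# Same image, two different (made-up) task embeddings.
action_grapes = policy(image_features, [1.0, 0.0])  # e.g. "place grapes in red bowl"
action_other = policy(image_features, [0.0, 1.0])   # some other instruction

print(action_grapes != action_other)
```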

Experimental Results
In language models, it is well known that sentence embeddings generalize on compositions of concepts encountered in training data. For instance, if you train a translation model on sentences like “pick up a cup” and “push a bowl”, the model should also translate “push a cup” correctly.

We study the question of whether the compositional generalization capabilities found in language encoders can be transferred to real robots, i.e., being able to compose unseen object-object and task-object pairs.

We test this method by pre-selecting a set of 28 tasks, none of which were among the 100 training tasks. For example, one of these new test tasks is to pick up the grapes and place them into a ceramic bowl, but the training tasks involve doing other things with the grapes and placing other items into the ceramic bowl. The grapes and the ceramic bowl never appeared in the same scene during training.

In our experiments, we see that the robot can complete many tasks that were not included in the training set. Below are a few examples of the robot’s learned policy.

The robot completes three instructions of tasks that were not in its training data, shown at 2x speed.

Quantitatively, we see that the robot can succeed to some degree on a total of 24 out of the 28 held-out tasks, indicating a promising capacity for generalization. Further, we see a notably small gap between the performance on the training tasks and performance on the test tasks. These results indicate that simply improving multi-task visuomotor control could considerably improve performance.

The BC-Z performance on held-out tasks, i.e., tasks that the robot was not trained to perform. The system correctly interprets the language command and translates that into action to complete many of the tasks in our evaluation.

Takeaways
The results of this research show that simple imitation learning approaches can be scaled in a way that enables zero-shot generalization to new tasks. That is, it shows one of the first indications of robots being able to successfully carry out behaviors that were not in the training data. Interestingly, language embeddings pre-trained on ungrounded language corpora make for excellent task conditioners. We demonstrated that natural language models can not only provide a flexible input interface to robots, but that pretrained language representations actually confer new generalization capabilities to the downstream policy, such as composing unseen object pairs together.

In the course of building this system, we confirmed that periodic human interventions are a simple but important technique for achieving good performance. While there is a substantial amount of work to be done in the future, we believe that the zero-shot generalization capabilities of BC-Z are an important advancement towards increasing the generality of robotic learning systems and allowing people to command robots. We have released the teleoperated demonstrations used to train the policy in this paper, which we hope will provide researchers with a valuable resource for future multi-task robotic learning research.

Acknowledgements
We would like to thank the co-authors of this research: Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, and Sergey Levine. This project was a collaboration between Google Research and the Everyday Robot Project. We would like to give special thanks to Noah Brown, Omar Cortes, Armando Fuentes, Kyle Jeffrey, Linda Luu, Sphurti Kirit More, Jornell Quiambao, Jarek Rettinghouse, Diego Reyes, Rosario Jauregui Ruano, and Clayton Tan for overseeing robot operations and collecting human videos of the tasks, as well as Jeffrey Bingham, Jonathan Weisz, and Kanishka Rao for valuable discussions. We would also like to thank Tom Small for creating animations in this post and Paul Mooney for helping with dataset open-sourcing.

Solving (Some) Formal Math Olympiad Problems

We built a neural theorem prover for Lean that learned to solve a variety of challenging high-school olympiad problems, including problems from the AMC12 and AIME competitions, as well as two problems adapted from the IMO.[1] The prover uses a language model to find proofs of formal statements. Each time we find a new proof, we use it as new training data, which improves the neural network and enables it to iteratively find solutions to harder and harder statements.

We achieved a new state-of-the-art (41.2% vs 29.3%) on the miniF2F benchmark, a challenging collection of high-school olympiad problems. Our approach, which we call statement curriculum learning, consists of manually collecting a set of statements of varying difficulty levels (without proof) where the hardest statements are similar to the benchmark we target. Initially our neural prover is weak and can only prove a few of them. We iteratively search for new proofs and re-train our neural network on the newly discovered proofs, and after 8 iterations, our prover ends up being vastly superior when tested on miniF2F.
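The search-and-retrain loop can be sketched in a few lines. The code below is a toy simulation under loud assumptions: a statement is reduced to a name and an integer difficulty, the "model" is a single integer skill level, and `toy_try_prove` and `toy_retrain` are hypothetical stand-ins for proof search and training. It only illustrates the shape of the iteration, not the actual prover.

```python
from typing import Callable, Dict, List, Optional, Tuple

Statement = Tuple[str, int]  # (name, difficulty): toy stand-in for a formal statement


def curriculum_loop(
    statements: List[Statement],
    try_prove: Callable[[int, Statement], Optional[str]],
    retrain: Callable[[int, Dict[Statement, str]], int],
    model: int,
    iterations: int,
) -> Tuple[int, Dict[Statement, str]]:
    """Alternate proof search over all statements with retraining on found proofs."""
    proofs: Dict[Statement, str] = {}
    for _ in range(iterations):
        for statement in statements:
            if statement in proofs:
                continue  # already proved in an earlier iteration
            proof = try_prove(model, statement)
            if proof is not None:
                proofs[statement] = proof
        model = retrain(model, proofs)
    return model, proofs


def toy_try_prove(skill: int, statement: Statement) -> Optional[str]:
    """Toy search: succeeds only slightly beyond the current skill level."""
    name, difficulty = statement
    return f"proof of {name}" if difficulty <= skill + 1 else None


def toy_retrain(skill: int, proofs: Dict[Statement, str]) -> int:
    """Toy training: skill rises to the hardest statement proved so far."""
    return max([skill] + [difficulty for (_, difficulty) in proofs])


# A smoothly graded curriculum lets the toy prover climb step by step.
graded = [(f"s{d}", d) for d in range(1, 9)]
graded_skill, graded_proofs = curriculum_loop(graded, toy_try_prove, toy_retrain, 0, 8)

# A curriculum with a large difficulty gap stalls below the hard statement.
gapped = [("easy", 1), ("hard", 5)]
gapped_skill, gapped_proofs = curriculum_loop(gapped, toy_try_prove, toy_retrain, 0, 8)

print(graded_skill, len(graded_proofs), gapped_skill, len(gapped_proofs))
```

In this toy setting the graded curriculum is fully solved after 8 iterations, while the gapped one never reaches the hard statement, which mirrors why the statements need to span varying difficulty levels.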

Formal mathematics is an exciting domain to study because of (i) its richness, letting you prove arbitrary theorems which require reasoning, creativity and insight and (ii) its similarity to games (where AI has been spectacularly successful) in that it has an automated way of determining whether a proof is successful (i.e., verified by the formal system). As demonstrated in the trivial example below, proving a formal statement requires generating a sequence of proof steps, each proof step consisting of a call to a tactic.[2] These tactics take mathematical terms as arguments, and each tactic call transforms the current statement to prove into statements that are easier to prove, until nothing is left to prove.

Problem 1
Adapted from AMC12 2000 Problem 5

Prove that if $|x - 2| = p$, where $x < 2$, then $x - p = 2 - 2p$.


theorem amc12_2000_p5      -- ← theorem name
  (x p : ℝ)                -- ← the statement we want
  (h₀ : x < 2)             --   to prove
  (h₁ : abs (x - 2) = p) :
  x - p = 2 - 2 * p :=
begin                      -- ← formal proof starts here
  -- This first tactic requires that the prover invent
  -- the term: `abs (x - 2) = -(x - 2)`.
  have h₂ : abs (x - 2) = -(x - 2), {
    apply abs_of_neg,
    linarith,
  },
  rw h₁ at h₂,
  -- At this stage the remaining goal to prove is:
  -- `x - p = 2 - 2 * p` knowing that `p = -(x - 2)`.
  linarith,
end

Since $x < 2$, $|x - 2| = -(x - 2)$. Using $p = |x - 2|$ we have $x = 2 - p$ and finally $x - p = 2 - 2p$.

We observe that the capability to generate original mathematical terms required as arguments of tactics, which cannot be done without a neural language model, emerges from our training procedure. The proof below is an example of it: the proof step use n + 1 (entirely generated by our models) proposes to use n + 1 as a solution, the rest of the formal proof relying on the ring_exp tactic to verify that it is indeed valid.

Problem 2
Adapted from AMC12B 2020 Problem 6

For all integers $n ≥ 9$, prove that $((n + 2)! −(n + 1)!) / n!$ is a perfect square.


theorem amc12b_2020_p6
  (n : ℕ)
  (h0 : 9 ≤ n) :
  ∃ x : ℕ, (x:ℝ)^2 = 
    (nat.factorial (n + 2) - nat.factorial (n + 1))
    / nat.factorial n :=
begin
  -- The model directly proposes `n + 1` as solution.
  use n + 1,
  field_simp [nat.factorial_ne_zero, pow_succ'],
  ring_exp
end

Expanding the expression we get:

$$((n + 2)! −(n + 1)!) / n! = ((n + 2)(n+1)n! −(n + 1)n!) / n!$$

Dividing by $n!$ we obtain:

$$(n+2)(n+1) - (n+1)$$

Factoring $(n+1)$ we get: $(n+1)(n+2-1) = (n+1)^2$ which concludes the proof.
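As a quick numerical sanity check of the identity (not a proof, and not part of the formal development), the following snippet evaluates the expression with exact integer arithmetic over an arbitrarily chosen range of $n$.

```python
from math import factorial


def expression(n: int) -> int:
    """Evaluate ((n + 2)! - (n + 1)!) / n! exactly with integers."""
    numerator = factorial(n + 2) - factorial(n + 1)
    # n! divides both terms, so integer division is exact.
    assert numerator % factorial(n) == 0
    return numerator // factorial(n)


# Compare against (n + 1)^2 for n = 9, ..., 199 (range chosen arbitrarily).
all_match = all(expression(n) == (n + 1) ** 2 for n in range(9, 200))
print(all_match)
```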

We also observe that our models and search procedure are capable of producing proofs that chain multiple non-trivial reasoning steps. In the proof below, the model starts by using contraposition leading to the existential statement (∃ (x : ℝ), f x ≠ a * x + b). It then generates a witness for it with use (0 : ℝ) and finishes the proof by leveraging the norm_num tactic.

Problem 3
Adapted from the MATH dataset

Let $f(x) = Ax + B$ and $g(x) = Bx + A$, where $A \ne B$. If $f(g(x)) - g(f(x)) = B - A$, prove that $A + B = 0$.


theorem mathd_train_algebra_217
  (a b : ℝ)
  (f g : ℝ → ℝ)
  (h₀ : ∀ x, f x = a * x + b)
  (h₁ : ∀ x, f x = b * x + a)
  (h₂ : a ≠ b)
  (h₃ : ∀ x, f (g x) - g (f x) = b - a) :
  a + b = 0 :=
begin
  revert h₀ h₁ h₂ h₃,
  -- Initial contraposition.
  contrapose!,
  rintro ⟨h₀, ⟨h₁, h₂⟩⟩,
  -- The model proposes `0` as witness for the current
  -- goal that consists in `∃ (x : ℝ), f x ≠ a * x + b`.
  use (0 : ℝ),
  simp only [sub_eq_iff_eq_add, h₀, mul_zero, zero_add],
  norm_num at h₀,
end

First we find that:

$$f(g(x)) = A(Bx + A) + B = ABx + A^2 + B$$
$$g(f(x)) = B(Ax + B) + A = ABx + B^2 + A$$

Now we plug this back in $f(g(x)) - g(f(x)) = B - A$ and get:

$$(ABx + A^2 + B) - (ABx + B^2 + A) = B - A$$

That is:

$$A^2 - B^2 + B - A = B - A$$

Hence:

$$A^2 - B^2 = (A-B)(A+B) = 0$$

Since we are given that $A \ne B$, necessarily, $A + B = 0$.

Our models, trained with statement curriculum learning, were able to close a variety of problems from training textbooks as well as AMC12 and AIME competitions, and 2 problems adapted from the IMO. We present below three examples of such generated proofs.

Problem 4
Adapted from IMO 1964 Problem 2

Suppose $a$, $b$, $c$ are the sides of a triangle.
Prove that $a^2(b + c - a) + b^2(c + a - b) + c^2(a + b - c) \leq 3abc$.


theorem imo_1964_p2
  (a b c : ℝ)
  (h₀ : 0 < a ∧ 0 < b ∧ 0 < c)
  (h₁ : c < a + b)
  (h₂ : b < a + c)
  (h₃ : a < b + c) :
  a^2 * (b + c - a) + b^2 * (c + a - b) + c^2 * (a + b - c) 
    ≤ 3 * a * b * c :=
begin
  -- Arguments to `nlinarith` are fully invented by our model.
  nlinarith [sq_nonneg (b - a),
             sq_nonneg (c - b),
             sq_nonneg (c - a)]
end

Rearrange to get $a(a-b)(a-c) + b(b-a)(b-c) + c(c-a)(c-b) \geq 0$ which is true by Schur’s inequality.
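Spelling out that rearrangement (a worked expansion added for clarity, using only the statement above), move everything to the right-hand side and expand:

$$3abc - \left[a^2(b + c - a) + b^2(c + a - b) + c^2(a + b - c)\right] = a^3 + b^3 + c^3 - (a^2b + a^2c + b^2a + b^2c + c^2a + c^2b) + 3abc$$

Expanding each Schur term, for example $a(a-b)(a-c) = a^3 - a^2b - a^2c + abc$, and summing the three cyclic copies gives exactly the same polynomial:

$$a(a-b)(a-c) + b(b-a)(b-c) + c(c-a)(c-b) = a^3 + b^3 + c^3 - (a^2b + a^2c + b^2a + b^2c + c^2a + c^2b) + 3abc$$

So the original inequality holds precisely when the Schur expression is non-negative, which is the case for non-negative $a$, $b$, $c$ and in particular for the side lengths of a triangle.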

Problem 5
Adapted from AIME 1984 Problem 1

Prove that $a_2 + a_4 + a_6 + a_8 + \ldots + a_{98} = 93$ if $a_1$, $a_2$, $a_3, \ldots$ is an arithmetic progression with common difference $1$, and $a_1 + a_2 + a_3 + \ldots + a_{98} = 137$.


theorem aime_1984_p1
  (u : ℕ → ℚ)
  (h₀ : ∀ n, u (n + 1) = u n + 1)
  (h₁ : ∑ k in finset.range 98, u k.succ = 137) :
  ∑ k in finset.range 49, u (2 * k.succ) = 93 :=
begin
  rw finset.sum_eq_multiset_sum,
  dsimp [finset.range] at h₁,
  simp [h₀],
  ring,
  norm_num at h₁,
  norm_num,
  apply eq_of_sub_eq_zero,
  { simp only [*, abs_of_pos, add_zero] at *, linarith },
end

For $n \geq 1$ we have $a_{2n-1} = a_{2n} - 1$. Substituting this into the equation given, we get:

$$(a_2 - 1) + a_2 + (a_4 - 1) + a_4 + \ldots + (a_{98} - 1) + a_{98} = 137$$

But the left-hand side is simply $2(a_2 + a_4 + a_6 + \ldots + a_{98}) - 49$, so:

$$a_2 + a_4 + a_6 + \ldots + a_{98} = (137 + 49) / 2 = 93$$
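The arithmetic can be double-checked with exact rational numbers. The sketch below solves for the first term from the given total, builds the 98 terms explicitly, and sums the even-indexed ones. It is a numerical check of the informal argument, independent of the Lean proof.

```python
from fractions import Fraction

# a_1 + ... + a_98 = 98 * a_1 + (0 + 1 + ... + 97) = 137, solved for a_1.
first_term = (Fraction(137) - sum(range(98))) / 98

# terms[i] holds a_{i+1}; the common difference is 1.
terms = [first_term + k for k in range(98)]

total = sum(terms)
even_sum = sum(terms[1::2])  # a_2, a_4, ..., a_98

print(total, even_sum)
```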

Problem 6

Adapted from IMO Longlist 1990 Problem 77[3]
For $a, b, c$ reals, prove that $(a^2 + ab + b^2)(b^2 + bc + c^2)(c^2 + ca + a^2) \geq (ab + bc + ca)^3$.


theorem imo_longlist_1990_p77
  (a b c : ℝ) :
  (a * b + b * c + c * a)^3 ≤
    (a^2 + a * b + b^2) * (b^2 + b * c + c^2) *
    (c^2 + c * a + a^2) :=
begin
  -- The three initial steps use Cauchy–Schwarz to prove
  -- `(a * b + b * c) ^ 2 ≤ (a ^ 2 + b ^ 2) * (b ^ 2 + c ^ 2)`
  -- which is required for the final call to `nlinarith`.
  let u : euclidean_space ℝ (fin 2) := ![a, b],
  let v : euclidean_space ℝ (fin 2) := ![b, c],
  have h₀ := real_inner_mul_inner_self_le u v,
  simp [u, v, fin.sum_univ_succ, 
        ←pow_two, ←pow_two, le_of_lt, mul_assoc] at h₀,
  -- The model introduces another required cut (i.e. invent
  -- the term `0 ≤ (c + a) * (c + a)` and proves it).
  have h₃ : 0 ≤ (c + a) * (c + a),
  { nlinarith, },
  have h₄ := sq_nonneg (a * b + b * c + c * a),
  simp [sq, h₀, h₃, mul_add, add_mul] at h₄ ⊢,
  nlinarith [sq_nonneg (b - a),
             sq_nonneg (c - b),
             sq_nonneg (a - c)]
end

After cancelling terms appearing on both sides, we are left to prove that:

$$3a^2b^2c^2 + \sum_{sym} a^3b^2c \leq \sum_{cyc} a^4bc + \sum_{cyc} (a^4b^2 + b^4c^2)$$

After multiplying both sides by $2$, we can rearrange the above inequality to:

$$0 \leq \sum_{cyc} (a^2b + a^2c - b^2c)^2$$

which clearly holds, giving the claim.

Formal mathematics involves two main challenges that make a naive application of reinforcement learning unlikely to succeed.

  • (i) Infinite action space: not only does formal mathematics have an extremely large search space (like Go for example), it also has an infinite action space. At each step of a proof search, the model must choose not from a well-behaved finite set of actions, but a complex and infinite set of tactics, involving exogenous mathematical terms that have to be generated (e.g., generating a mathematical statement to be used as a witness, an object used in steps such as “there exists an $x$ s.t. …”, or a cut, the introduction and the chaining of a lemma in the middle of a proof).
  • (ii) Lack of self-play: unlike in 2-player games, a prover is not playing against an opponent but against a set of statements to prove. When faced with a statement that is just too hard, there is no obvious reframing that will let the prover generate intermediary easier statements to tackle first. This asymmetry prevents naive application of the self-play algorithms that were successful with 2-player games.

In our work, we address the infinite action space problem by sampling actions from a language model as we search for a proof. Language models have the capability to generate the tactic calls as well as the original mathematical terms often required as arguments. Our basis for addressing the lack of self-play is the observation that the key role of self-play in 2-player games is to provide an unsupervised curriculum. Our methodology proposes to replace this unsupervised curriculum with an auxiliary set of problem statements (without requiring proofs) of varying difficulty. We empirically show that, when the difficulty of these auxiliary problems is varied enough, our training procedure is able to solve a curriculum of increasingly difficult problems, eventually generalizing to the set of problems we care about.

While these results are extremely exciting, as they demonstrate that deep learning models are capable of non-trivial mathematical reasoning when interacting with a formal system, we are still very far from best-student performance on these competitions, only occasionally, rather than consistently, closing challenging olympiad problems. We hope nonetheless that our work will motivate research in this domain, in particular towards the IMO Grand Challenge and that the statement curriculum learning methodology we propose will help accelerate progress in automated reasoning in general.


Acknowledgments

Thanks to our paper co-authors: Igor Babuschkin, Kunhao Zheng and Mantas Baksys.

Thanks to the students of the Xena Project Discord who helped us formalize proofs and statements (in particular: Antoine Labelle, Hanting Zhang, Shing Tak Lam, Paul Lezeau, Sara Diaz, Nikita Golikov, Yael Dillies, Artem Vasilyev, Ollie Perree, and Yourong Zang).

Thanks in particular to Kevin Buzzard and Daniel Selsam for their support and thoughtful feedback since the very beginning of this project.


Footnotes

  1. These problems are not standard math exercises; they are used to let the best high-school students from the US (AMC12, AIME) or the world (IMO) compete against each other. ↩︎

  2. The artifacts accepted by the formal system are low-level (like assembly code) and hard for humans to produce. Tactics are search procedures that generate such artifacts from higher level directives to assist formalization. ↩︎

  3. This proof is not reported in the paper as it was found by a more recent model we are still experimenting with. We decided to share it nonetheless because it’s one of our favourites. ↩︎

OpenAI

Figure at the start of a maze showing several paths. Four paths include a medical dead-end, and each stop before reaching the end. Only one path does not include a medical-dead end, and this one goes clear through to the end.

Using reinforcement learning to identify high-risk states and treatments in healthcare

As the pandemic overburdens medical facilities and clinicians become increasingly overworked, the ability to make quick decisions on providing the best possible treatment is even more critical. In urgent health situations, such decisions can mean life or death. However, certain treatment protocols can pose a considerable risk to patients who have serious medical conditions and can potentially contribute to unintended outcomes.

In this research project, we built a machine learning (ML) model that works with scenarios where data is limited, such as healthcare. This model was developed to recognize treatment protocols that could contribute to negative outcomes and to alert clinicians when a patient’s health could decline to a dangerous level. You can explore the details of this research project in our research paper, “Medical Dead-ends and Learning to Identify High-risk States and Treatments,” which was presented at the 2021 Conference on Neural Information Processing Systems (NeurIPS 2021).

Reinforcement learning for healthcare

To build our model, we decided to use reinforcement learning—an ML framework that’s uniquely well-suited for advancing safety-critical domains such as healthcare. This is because at its core, healthcare is a sequential decision-making domain, and reinforcement learning is the formal paradigm for modeling and solving problems in such domains. In healthcare, clinicians base their treatment decisions on an overall understanding of a patient’s health; they observe how the patient responds to this treatment, and the process repeats. Likewise, in reinforcement learning, an algorithm, or agent, interprets the state of its environment and takes an action, which, coupled with the internal dynamics of the environment, causes it to transition to a new state, as shown in Figure 1. A reward signal is then assigned to account for the immediate impact of this change. For example, in a healthcare scenario, if a patient recovers or is discharged from the intensive care unit (ICU), the agent may receive a positive reward. However, if the patient does not survive, the agent receives a negative reward, or penalty.
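The state, action, transition, reward cycle described above can be written down as a short loop. The sketch below is a toy example only: the state names, the two treatments, and every transition probability are made-up values for illustration, and the rewards of +1 and -1 are one possible encoding of the positive and negative terminal outcomes.

```python
import random
from typing import Callable, List, Tuple

# Reward is given only on reaching a terminal state (illustrative encoding).
TERMINAL_REWARD = {"discharged": 1.0, "deceased": -1.0}

# Hypothetical transition table: T[state][action] -> [(next_state, probability)].
T = {
    "stable": {
        "treat": [("discharged", 0.7), ("critical", 0.3)],
        "wait": [("stable", 0.6), ("critical", 0.4)],
    },
    "critical": {
        "treat": [("stable", 0.5), ("deceased", 0.5)],
        "wait": [("critical", 0.2), ("deceased", 0.8)],
    },
}

Transition = Tuple[str, str, float, str]  # (state, action, reward, next_state)


def step(state: str, action: str, rng: random.Random) -> Tuple[str, float]:
    """Sample the next state from the environment dynamics and return its reward."""
    next_states, probs = zip(*T[state][action])
    next_state = rng.choices(next_states, weights=probs)[0]
    return next_state, TERMINAL_REWARD.get(next_state, 0.0)


def run_episode(
    policy: Callable[[str], str],
    rng: random.Random,
    start: str = "stable",
    max_steps: int = 100,
) -> List[Transition]:
    """Roll out one trajectory until a terminal state (or max_steps) is reached."""
    state, trajectory = start, []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward = step(state, action, rng)
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if state in TERMINAL_REWARD:
            break
    return trajectory


# A trivial policy that always treats, run with a fixed seed for repeatability.
trajectory = run_episode(lambda state: "treat", random.Random(0))
for transition in trajectory:
    print(transition)
```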

Figure 1: Diagram showing the sequential decision-making process typical in healthcare as analogous to reinforcement learning. The clinician observes the state of the patient’s health condition and decides on a treatment. The clinician then observes how the patient responded to the treatment and decides on the next steps. Applied to reinforcement learning, the result of each transition in the patient’s state is met with a reward signal.
Figure 1: Sequential decision-making in healthcare: Clinicians or AI agents observe the state of the patient (\(s\)), select a treatment (\(a\)), and monitor the next state. The process then repeats. As a result of each such transition of the patient’s state (whose probability is denoted by \(T\)), a reward signal (\(R\)) is observed, which accounts for the immediate consequence of the applied treatment.

Reinforcement learning is widely used in gaming, for example, to determine the best sequence of chess moves and maximize an AI system’s chances of winning. Over time, due to trial-and-error experimentation, the desired actions are maximized and the undesired ones are minimized until the optimal solution is identified. Normally, this experimentation is made possible by the proactive collection of extensive amounts of diverse data. However, unlike in gaming, exploratory data collection and experimentation are not possible in healthcare, and our only option in this realm is to work with previously collected datasets, providing very limited opportunities to explore alternative choices. This is where offline reinforcement learning comes into focus. A subarea of reinforcement learning, offline reinforcement learning works only with data that already exists—instead of proactively taking in new data, we’re using a fixed dataset. Even so, to propose the best course of action, an offline reinforcement learning algorithm still requires sufficient trial-and-error with alternatives, and this necessitates a very large dataset, something not feasible in safety-critical domains with limited data, like healthcare.

In the current research literature, when reinforcement learning is applied to healthcare, the focus is on what to do to support the best possible patient outcome, an infeasible objective. In our paper, we propose inverting this paradigm in offline settings to investigate high-risk treatments and identify when the state of patients’ health reaches a critical point. To enable this approach, we developed a methodology called Dead-end Discovery (DeD), which identifies treatments to avoid in order to prevent a medical dead-end—the point at which the patient is most likely to die regardless of future treatment. DeD provably requires exponentially less data than the standard methods, making it significantly more reliable in limited-data situations. By identifying known high-risk treatments, DeD could assist clinicians in making trustworthy decisions in highly stressful situations, where minutes count. Moreover, this methodology could also raise an early warning flag and alert clinicians when a patient’s condition reveals outstanding risk, often before it becomes obvious. We go into more detail on the DeD methodology later in this post.

Medical dead-ends and rescue states

At ICUs, patients experience a trajectory which sequentially tracks the state of their health. It starts with the patient’s condition upon admission, followed by the administration of treatment and then by their response to the treatment. This sequence repeats until the patient reaches a terminal state—the final observation of the patient’s condition that’s still relevant within the ICU. To learn what treatments to avoid, we focus on two types of terminal states: patient recovery and patient death. Other terminal states can also exist. For example, when playing chess, a loss or a win are not the only possible outcomes; draws can also occur. While our framework can encompass additional terminal states, this work focuses on only two possibilities: positive outcomes and negative outcomes.

Building on these two terminal states, we define medical dead-ends as patient states from which all possible future trajectories will lead to the terminal state of the patient’s death. In acute care settings, it’s critical both to avoid medical dead-ends and to identify the probability with which any selected treatment will lead to them. It’s also important to note that medical dead-ends can occur considerably earlier than clinicians are able to observe them. This makes DeD particularly valuable, as every hour counts when it comes to critical conditions.

To contrast with medical dead-ends, we also propose the concept of rescue states, where recovery is fully reachable. At each rescue state, there exists at least one treatment that would lead, with probability 1, either to another rescue state or to recovery. In most cases, a patient’s condition is neither a medical dead-end nor a rescue state, because the probabilities of future mortality and recovery are usually not exactly 0 or 1 but somewhere in between. Therefore, it’s important to have an alert when a patient is likely to enter a medical dead-end.

Figure 2: Using sepsis as an example use case, this diagram shows simplified possible trajectories for a single patient upon admission to the ICU. Each branch represents the septic patient’s trajectory in response to a sample sequence of treatments, represented by a black dot (VP = vasopressor + IV = intravenous fluid). Avatars with blue borders and “RS” above them represent rescue states. Avatars with red borders and “MD” above them represent medical dead-ends. The shading of each avatar roughly indicates the state of the patient’s condition in response to treatment. More shading represents an improving condition and less shading represents a worsening condition. No shading represents the terminal state where the patient does not survive. The slumping avatar represents a medical dead-end, which is significantly far from the terminal state and may not be observable by the clinicians. A critical point here is one step before this medical dead-end, represented by the grey avatar, where there is still a chance to save the patient.  
Patient vital signs taken at the ICU: HR=heart rate; BP=blood pressure; RR=respiration rate; SOFA=sequential organ failure assessment score  

Treatment security: How to help doctors

To develop our model, we considered a generic condition that guarantees the merit and reliability of a given treatment-selection policy. In particular, we postulated the following condition we called treatment security:

If at state (s), treatment (a) causes transitioning to a medical dead-end with any given level of certainty, then the policy must refrain from selecting (a) at (s) with the same level of certainty.

For example, if a certain treatment leads to a medical dead-end or immediate death with a probability of more than 80 percent, that treatment should be selected for administration no more than 20 percent of the time.
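As a rough illustration of the condition above, the following sketch checks whether a policy's selection probability respects the cap implied by a treatment's dead-end probability. The function name `is_secure` and the numbers are hypothetical, and the probability is assumed to be known here, which the post notes is not the case in practice.

```python
def is_secure(selection_prob: float, dead_end_prob: float, tol: float = 1e-9) -> bool:
    """Return True if the treatment is selected no more often than 1 - dead_end_prob.

    dead_end_prob is the (assumed known) probability that the treatment leads
    to a medical dead-end or immediate death; tol absorbs floating-point error.
    """
    return selection_prob <= 1.0 - dead_end_prob + tol


# A treatment that leads to a dead-end with probability 0.8 may be
# selected at most 20 percent of the time.
print(is_secure(0.15, 0.8))  # True
print(is_secure(0.35, 0.8))  # False
```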

While treatment security is a desired property, it’s not easy to directly enforce because the required probabilities are not known a priori, nor are they directly measurable from the data. Therefore, we developed a theoretical framework at the core of our method that enables treatment security from data by mapping it to proper learning problems.

DeD: Dead-end Discovery methodology

To precisely define the learning problems, we based our DeD methodology on three core ideas: 1) separating the outcomes, 2) learning the optimal value function of each outcome in isolation without discounting, and 3) proving important properties for these particular value functions, which enable treatment security.

We constructed two simple reward signals for independent learning problems:

  1. -1 in the case of a negative outcome; 0 at all other transitions
  2. +1 in the case of a positive outcome; 0 at all other transitions

Next, we learned their corresponding optimal value functions, (Q_{D}^{*}(s, a)) and (Q_{R}^{*}(s, a)), both with no discounting. It turns out that these value functions are intrinsically important. In fact, we show that:

(-Q_{D}^{*}(s, a)) corresponds to the minimum probability of a future negative outcome if treatment (a) is selected at state (s). Equivalently, (1 + Q_{D}^{*}(s, a)) corresponds to the maximum hope of a positive outcome.

Moreover, the quantity (1 + Q_{D}^{*}(s, a)) proves to be a meaningful threshold for a policy to make it secure. We formally show that: for treatment security, it is sufficient to abide by the maximum hope of recovery.

We further proved that if the probability of treatment selection is kept higher than (Q_{R}^{*}(s, a)), the patient is guaranteed to remain in a rescue state when possible. Finally, we also showed that such thresholds for limiting the treatment selection probabilities exist.
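To make the two learning problems concrete, here is a minimal sketch that runs undiscounted tabular value iteration on a tiny hand-made MDP. This is an illustration under stated assumptions, not the method used in the paper: the states, treatments, and transition probabilities are all invented, and plain value iteration stands in for whatever learning algorithm is applied to real data.

```python
# Toy MDP (entirely hypothetical): state -> treatment -> [(probability, next_state)]
TRANSITIONS = {
    "admitted": {
        "conservative": [(0.9, "recovery"), (0.1, "death")],
        "aggressive": [(0.5, "unstable"), (0.5, "recovery")],
    },
    "unstable": {
        "conservative": [(0.7, "recovery"), (0.3, "dead_end")],
        "aggressive": [(1.0, "dead_end")],
    },
    "dead_end": {
        "conservative": [(1.0, "death")],
        "aggressive": [(1.0, "death")],
    },
}
TERMINALS = {"recovery", "death"}


def optimal_q(reward_on_entering, sweeps=50):
    """Undiscounted value iteration; the reward is paid when a terminal state is entered."""
    q = {s: {a: 0.0 for a in acts} for s, acts in TRANSITIONS.items()}

    def value(state):
        # Terminal states have value 0; otherwise take the best available treatment.
        return 0.0 if state in TERMINALS else max(q[state].values())

    for _ in range(sweeps):
        for s, acts in TRANSITIONS.items():
            for a, outcomes in acts.items():
                q[s][a] = sum(
                    p * (reward_on_entering.get(nxt, 0.0) + value(nxt))
                    for p, nxt in outcomes
                )
    return q


q_d = optimal_q({"death": -1.0})     # reward signal 1: -1 on a negative outcome
q_r = optimal_q({"recovery": 1.0})   # reward signal 2: +1 on a positive outcome

for state, actions in q_d.items():
    for treatment, value in actions.items():
        cap = 1.0 + value  # maximum hope of a positive outcome
        print(f"{state:9s} {treatment:13s} Q_D={value:+.2f} "
              f"Q_R={q_r[state][treatment]:+.2f} cap={cap:.2f}")
```

In this toy example, the "aggressive" treatment at the "unstable" state gets a cap of 0 because it leads to the dead-end with certainty, so a secure policy would never select it there.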

Building from these results, we defined a training and deployment pipeline, illustrated in Figure 3.

Figure 3: The DeD pipeline: section a illustrates the training process, resulting in the learned optimal value functions, and section b shows the deployment of the pipeline, which ends with providing critical information to the human decision-maker.

Applying the DeD methodology to sepsis

To demonstrate the utility of DeD in safety-critical domains and to honor the underlying healthcare motivations behind its development, we applied DeD on publicly available real-world medical data. Specifically, our data pertained to critically ill patients who had developed sepsis and were treated in an ICU.

Sepsis is a syndrome characterized by organ dysfunction due to a patient’s dysregulated response to an infection. In the United States alone, sepsis is responsible for more than 200,000 deaths each year, contributing to over 10 percent of in-hospital mortality, and accounting for over $23 billion in hospitalization costs. Globally, sepsis is a leading cause of mortality, with an estimated 11 million deaths each year, accounting for almost 20 percent of all deaths. It’s also an end-stage to many health conditions. In a recent retrospective study of hospitalized COVID-19 patients, all the fatal cases and more than 40 percent of survivors were septic.

In our study, we envisioned a way to help clinicians identify which subset of treatments could statistically cause further health deterioration so that they could eliminate them when deciding on the next steps. To estimate the value functions of possible treatments, we used the publicly available Medical Information Mart for Intensive Care III (MIMIC-III) dataset (v1.4), sourced from the Beth Israel Deaconess Medical Center in Boston, Massachusetts. MIMIC-III comprises deidentified electronic health records (EHR) of consenting patients admitted to critical care units, collected from 53,423 distinct hospital admissions between 2001 and 2012. Following standard extraction and preprocessing methods, we derived an experimental cohort of 19,611 patients who are presumed to have developed sepsis during their initial admission to the ICU, with an observed mortality rate of approximately 10 percent. We studied 72 hours of the patients’ stay at the ICU: 24 hours before the presumed onset of sepsis and 48 hours afterwards. We used 44 observation variables, including various health records and demographic information, and 25 distinct treatment options (five discrete levels for IV fluid and vasopressor volumes in combination), aggregated over four hours.

With this dataset, we sought to demonstrate that medical dead-ends exist in medical data and show the effect of treatment selection on the development of medical dead-ends. We also sought to identify whether alternative treatments were available that could have prevented the occurrence of a medical dead-end.

To flag potentially nonsecure treatments, we examined whether the values estimated ((Q_{D}(s, a)) and (Q_{R}(s, a))) for each treatment passed certain thresholds. To flag potential medical dead-end states, we looked at the median values of available treatments against these same thresholds. Using the median helped mitigate approximation errors due to generalization from potentially insufficient data and extrapolations made by the reinforcement learning formulation. With the specified thresholds, DeD identified increasing percentages of patients raising fatal flags, particularly among the subpopulation that died in the hospital. In Figure 4, note the distinctive difference between the trend of estimated values for surviving and non-surviving patients. Over the course of 72 hours in the ICU, surviving patients rarely raised a flag, while flags were raised at an increased rate for patients who did not survive as they proceeded toward the final observations of their time in the ICU.
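The flagging logic described above can be sketched roughly as follows. The threshold values, the function names, and the choice to require both values to pass their thresholds are assumptions made for illustration; the post does not state the thresholds used.

```python
from statistics import median

# Hypothetical thresholds, chosen only for illustration.
DELTA_D = -0.75  # flag when the dead-end value is at or below this
DELTA_R = 0.25   # flag when the rescue value is at or below this


def flag_treatment(q_d: float, q_r: float) -> bool:
    """Flag a single treatment as potentially nonsecure (assumes both tests must pass)."""
    return q_d <= DELTA_D and q_r <= DELTA_R


def flag_state(q_d_values, q_r_values) -> bool:
    """Flag a state as a potential dead-end using the median over available treatments."""
    return median(q_d_values) <= DELTA_D and median(q_r_values) <= DELTA_R


# Made-up value estimates for three treatments at one patient state.
q_d_estimates = [-0.9, -0.8, -0.1]
q_r_estimates = [0.1, 0.2, 0.9]

print([flag_treatment(d, r) for d, r in zip(q_d_estimates, q_r_estimates)])
print(flag_state(q_d_estimates, q_r_estimates))
```

Using the median rather than the minimum or maximum means that a single badly estimated treatment value cannot raise or suppress a state-level flag on its own.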

Figure 4: Histograms of the flag status for both surviving and non-surviving patients, according to the rescue state and medical dead-end values. The bars are plotted according to the time prior to the recorded terminal state and measure the percentage of patients whose states did not raise any flags. There is a clear worsening trend for non-surviving patients as they approached a terminal state, beginning as early as 48 hours prior to expiration.

To further support our hypothesis that medical dead-ends exist among septic patients and may be preventable, we aligned patients according to the point in their care when a flag was first raised by our DeD framework. As shown in Figure 5, we selected all trajectories with at least 24 hours prior to and 16 hours after this flag. The DeD estimates of (V) and (Q) values for administered treatments had similar behavior in both the surviving and non-surviving subpopulations prior to this first flag, but the values quickly diverged afterwards. We observed that the advent of this first flag also corresponded to a similar divergence among various clinical measures and vital signs, shown in Figure 5, sections a and b.

DeD identified a clear critical point in these patients’ care, where non-surviving patients experienced an irreversible negative change to their health, as shown in Figure 5, section c. Additionally, there was a significant gap in the estimated value between the treatments administered to the non-surviving patients and those treatments deemed to be more secure by DeD, shown in Figure 5, section e. There was a clear inflection in the estimated values four to eight hours before this first flag was raised, shown in Figure 5, section c.
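The alignment step described above, keeping only trajectories with at least 24 hours before and 16 hours after the first flag, might look like the sketch below. With four-hour steps this means six steps before and four after. The function name and the per-step flag lists are hypothetical, and including the flagged step itself in the window is an assumption.

```python
def window_around_first_flag(flags, before=6, after=4):
    """Return (start, stop) slice indices around the first raised flag, or None.

    flags is a list of booleans, one per four-hour step. The trajectory is
    kept only if it has at least `before` steps prior to the first flag and
    at least `after` steps following it.
    """
    if True not in flags:
        return None
    first = flags.index(True)
    if first < before or len(flags) - first - 1 < after:
        return None
    return first - before, first + after + 1


# A made-up trajectory of 13 steps whose first flag is raised at step 7.
trajectory_flags = [False] * 7 + [True] + [False] * 5
print(window_around_first_flag(trajectory_flags))  # (1, 12)

# Too short before the flag, so the trajectory is dropped.
print(window_around_first_flag([False, True, False, False, False, False]))  # None
```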

Figure 5: Trend of measures around the first raised flag: Various measures are shown 24 hours (6 steps, 4 hours each) before the first flag is raised and 16 hours (4 steps) afterwards for non-surviving (blue) and surviving (green) patients. The shaded areas represent the standard deviation. Section a shows selected key vital measures and lab tests, section b shows established clinical measures, and section c shows DeD value estimates of health state (V) and administered treatment (Q). Section d shows the administered treatments. Finally, the last column, e, illustrates value trends for the selected treatments as well as the most secure ones.

Further analysis of our results, which we describe in detail in our paper, indicates that more than 12 percent of treatments given to non-surviving patients could be detrimental 24 hours before death. We also identified that 2.7 percent of non-surviving patients entered medical dead-end trajectories with a sharply increasing rate up to 48 hours before death, and close to 10 percent when we slightly relaxed our thresholds for predicting medical dead-ends. While these percentages may seem small, more than 200,000 patients die of sepsis every year in US hospitals alone, and any reduction of this rate would result in possibly tens of thousands of individuals who would otherwise survive. We’re excited about the possibility that DeD could help clinicians provide their patients with the best care and that many more patients could potentially survive sepsis.

Looking ahead: Further uses of DeD and offline reinforcement learning

We view DeD as a powerful tool that could magnify human expertise in healthcare by supporting clinicians with predictive models as they make critical decisions. There is significant potential for researchers to use the DeD method to expand on this research and look at other measures, such as the relationship between patient demographics and sepsis treatment, with the goal of preventing certain treatment profiles for particular subgroups of patients.

The principles of offline reinforcement learning and the DeD methodology can also be applied to other clinical conditions, as well as to safety-critical areas beyond healthcare that also rely on sequential decision-making. For example, the domain of finance entails similar core concepts as it is analogously based on sequential decision-making processes. DeD could be used to alert financial professionals when specific actions, such as buying or selling certain assets, are likely to result in unavoidable future loss, or a financial dead-end. We hope our work will inspire active research and discussion in the community. You can learn more about the research and access the code here.

Disclaimer: The research presented in this video, including the referenced paper, code, and models, is shared for research purposes only. It is not to be used in clinical settings, as a stand-alone tool, or as a replacement for the decisions of expert medical professionals. The algorithm and technology presented here, and any derivatives of it, should not be used to make clinical decisions, including, but not limited to, decisions about the medical treatment of patients. In addition, further testing and validation are required before the DeD framework may be used in any clinical setting, including, but not limited to, understanding how the information provided by the DeD framework affects clinician care and patient outcomes over time, neither of which has been studied here.

The post Using reinforcement learning to identify high-risk states and treatments in healthcare appeared first on Microsoft Research.


How Audio Analytic Is Teaching Machines to Listen

From active noise cancellation to digital assistants that are always listening for your commands, audio is perhaps one of the most important but often overlooked aspects of modern technology in our daily lives.

Audio Analytic has been using machine learning that enables a vast array of devices to make sense of the world of sound.

We spoke with Dr. Chris Mitchell, CEO and founder of Audio Analytic, about the challenges, and the fun, involved in teaching machines to listen.

Subscribe to the AI Podcast: Now Available on Amazon Music

You can now listen to the AI Podcast through Amazon Music.

You can also get the AI Podcast through iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

If your favorite isn’t listed here, drop us a note.

You Might Also Like:

Art(ificial) Intelligence: Pindar Van Arman Builds Robots That Paint

Pindar Van Arman, an American artist and roboticist, designs painting robots that explore the differences between human and computational creativity. Since his first system in 2005, he has built multiple artificially creative robots. The most famous, Cloud Painter, was awarded first place at Robotart 2018.

Real or Not Real? Attorney Steven Frank Uses Deep Learning to Authenticate Art

Steven Frank is a partner at the law firm Morgan Lewis, specializing in intellectual property and commercial technology law. He’s also half of the husband-wife team that used convolutional neural networks to authenticate artistic masterpieces, including da Vinci’s Salvator Mundi.

Researchers Chris Downum and Leszek Pawlowicz Use Deep Learning to Accelerate Archaeology

Researchers in the Department of Anthropology at Northern Arizona University are using GPU-based deep learning algorithms to categorize sherds — tiny fragments of ancient pottery.

Make the AI Podcast Better

Have a few minutes to spare? Fill out this listener survey. Your answers will help us make a better podcast.

The post How Audio Analytic Is Teaching Machines to Listen appeared first on The Official NVIDIA Blog.


imodels: leveraging the unreasonable effectiveness of rules

imodels: A Python package with cutting-edge techniques for concise, transparent, and accurate predictive modeling. All sklearn-compatible and easy to use.

Recent machine-learning advances have led to increasingly complex predictive models, often at the cost of interpretability. Yet interpretability is frequently essential, particularly in high-stakes applications such as medicine, biology, and political science (see here and here for an overview). Interpretable models also help with identifying errors, leveraging domain knowledge, and speeding up inference.

Despite new advances in formulating/fitting interpretable models, implementations are often difficult to find, use, and compare. imodels (github, paper) fills this gap by providing a simple unified interface and implementation for many state-of-the-art interpretable modeling techniques, particularly rule-based methods.

Solving (some) formal math olympiad problems

We built a neural theorem prover for Lean that learned to solve a variety of challenging high-school olympiad problems, including problems from the AMC12 and AIME competitions, as well as two problems adapted from the IMO.

OpenAI Blog