At this year’s ACL, Amazon researchers won an outstanding-paper award for showing that knowledge distillation using contrastive decoding in the teacher model and counterfactual reasoning in the student model improves the consistency of “chain of thought” reasoning.
Enel automates large-scale power grid asset management and anomaly detection using Amazon SageMaker
This is a guest post by Mario Namtao Shianti Larcher, Head of Computer Vision at Enel.
Enel, which started as Italy’s national entity for electricity, is today a multinational company present in 32 countries and the first private network operator in the world with 74 million users. It is also recognized as the first renewables player, with 55.4 GW of installed capacity. In recent years, the company has invested heavily in the machine learning (ML) sector, developing strong in-house know-how that has enabled it to realize very ambitious projects such as the automatic monitoring of its 2.3 million kilometers of distribution network.
Every year, Enel inspects its electricity distribution network with helicopters, cars, or other means; takes millions of photographs; and reconstructs a 3D image of its network as a point cloud obtained using LiDAR technology.
Examination of this data is critical for monitoring the state of the power grid, identifying infrastructure anomalies, and updating databases of installed assets, and it allows granular control of the infrastructure down to the material and status of the smallest insulator installed on a given pole. Given the amount of data (more than 40 million images each year just in Italy), the number of items to be identified, and their specificity, a completely manual analysis is very costly in terms of both time and money, and error-prone. Fortunately, thanks to enormous advances in the world of computer vision and deep learning and the maturity and democratization of these technologies, it’s possible to automate this expensive process partially or even completely.
Of course, the task remains very challenging, and, like all modern AI applications, it requires computing power and the ability to handle large volumes of data efficiently.
Enel built its own ML platform (internally called the ML factory) on Amazon SageMaker. The platform is now the standard solution for building and training models at Enel across different use cases and digital hubs (business units), with tens of ML projects being developed on Amazon SageMaker Training, Amazon SageMaker Processing, and other AWS services such as AWS Step Functions.
Enel collects imagery and data from two different sources:
- Aerial network inspections:
- LiDAR point clouds – They have the advantage of being an extremely accurate and geo-localized 3D reconstruction of the infrastructure, and therefore are very useful for calculating distances or taking measurements with an accuracy not obtainable from 2D image analysis.
- High-resolution images – These images of the infrastructure are taken within seconds of each other. This makes it possible to detect elements and anomalies that are too small to be identified in the point cloud.
- Satellite images – Although these can be more affordable than a power line inspection (some are available for free or for a fee), their resolution and quality are often not on par with images taken directly by Enel. The characteristics of these images make them useful for certain tasks, like evaluating forest density and macro-category or finding buildings.
In this post, we discuss the details of how Enel uses these three sources, and share how Enel automates its large-scale power grid asset management and anomaly detection process using SageMaker.
Analyzing high-resolution photographs to identify assets and anomalies
As with other unstructured data collected during inspections, the photographs taken are stored on Amazon Simple Storage Service (Amazon S3). Some of these are manually labeled with the goal of training different deep learning models for different computer vision tasks.
Conceptually, the processing and inference pipeline follows a hierarchical approach with multiple steps: first, the regions of interest in the image are identified; then these are cropped and the assets within them are detected; finally, each asset is classified according to its material or the presence of anomalies. Because the same pole often appears in more than one image, it’s also necessary to group its images to avoid duplicates, an operation called reidentification.
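To make the hierarchical flow more concrete, the following minimal Python sketch shows how such a pipeline could be structured. Everything here is illustrative: the model objects (ROI detector, asset detector, asset classifier, and reidentification encoder) are hypothetical callables supplied by the caller, and this is not Enel’s actual implementation.

from typing import Callable, Dict, List

def crop(image, box):
    """Crop a bounding box (x1, y1, x2, y2) from an image array (H x W x C)."""
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]

def run_inspection_pipeline(
    images: List,
    roi_detector: Callable,      # image -> list of ROI bounding boxes (e.g., poles)
    asset_detector: Callable,    # ROI crop -> list of (asset_box, asset_type) pairs
    asset_classifier: Callable,  # asset crop -> dict of material / anomaly labels
    reid_encoder: Callable,      # ROI crop -> embedding used to group views of the same pole
) -> List[Dict]:
    results = []
    for image in images:
        # 1) Identify regions of interest and 2) crop them
        for roi_box in roi_detector(image):
            roi = crop(image, roi_box)
            # 3) Detect assets inside the ROI and 4) classify each one
            assets = []
            for asset_box, asset_type in asset_detector(roi):
                labels = asset_classifier(crop(roi, asset_box))
                assets.append({"type": asset_type, **labels})
            # Keep an embedding so that views of the same pole can be grouped later (reidentification)
            results.append({"embedding": reid_encoder(roi), "assets": assets})
    return results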
For all these tasks, Enel uses the PyTorch framework and the latest architectures for image classification and object detection, such as EfficientNet/EfficientDet, or others for the semantic segmentation of certain anomalies, such as oil leaks on transformers. For the reidentification task, when it can’t be done geometrically because camera parameters are missing, they use SimCLR-based self-supervised methods or Transformer-based architectures. It would be impossible to train all these models without access to a large number of instances equipped with high-performance GPUs, so all the models were trained in parallel using Amazon SageMaker Training jobs with GPU-accelerated ML instances. Inference has the same structure and is orchestrated by a Step Functions state machine that governs several SageMaker processing and training jobs, which, despite the name, are as usable for inference as for training.
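As an illustration of how one of these models could be trained on a GPU instance with a SageMaker Training job, the following snippet uses the PyTorch estimator from the SageMaker Python SDK. The script name, source directory, hyperparameters, S3 paths, and instance type are placeholders rather than Enel’s actual configuration.

import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes this runs inside SageMaker Studio or a notebook instance

estimator = PyTorch(
    entry_point="train_detector.py",   # hypothetical PyTorch training script
    source_dir="src",
    role=role,
    framework_version="1.13",
    py_version="py39",
    instance_type="ml.g4dn.xlarge",    # GPU-accelerated ML instance
    instance_count=1,
    hyperparameters={"epochs": 20, "batch-size": 16},
    sagemaker_session=session,
)

# Launch the job without blocking; starting one such job per model trains them in parallel
estimator.fit(
    {"train": "s3://my-bucket/train", "validation": "s3://my-bucket/val"},
    wait=False,
)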
The following is a high-level architecture of the ML pipeline with its main steps.
This diagram shows the simplified architecture of the ODIN image inference pipeline, which extracts and analyzes ROIs (such as electricity posts) from dataset images. The pipeline further drills down on ROIs, extracting and analyzing electrical elements (transformers, insulators, and so on). After the components (ROIs and elements) are finalized, the reidentification process begins: images and poles in the network map are matched based on 3D metadata. This allows the clustering of ROIs referring to the same pole. After that, anomalies get finalized and reports are generated.
Extracting precise measurements using LiDAR point clouds
High-resolution photographs are very useful, but because they’re 2D, it’s impossible to extract precise measurements from them. LiDAR point clouds come to the rescue here, because they are 3D and each point in the cloud has a position with an associated error of less than a few centimeters.
However, in many cases, a raw point cloud is not useful, because you can’t do much with it if you don’t know whether a set of points represents a tree, a power line, or a house. For this reason, Enel uses KPConv, a semantic point cloud segmentation algorithm, to assign a class to each point. After the cloud is classified, it’s possible to determine, for example, whether vegetation is too close to the power line, or to measure the tilt of poles. Thanks to the flexibility of SageMaker services, the pipeline for this solution is not much different from the one already described; the only difference is that in this case GPU instances are needed for inference as well.
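As a simplified illustration of the kind of measurement this enables, the sketch below uses a k-d tree to find vegetation points that lie within a safety clearance of conductor points. The class IDs and the clearance threshold are made-up placeholders, not values from Enel’s pipeline.

import numpy as np
from scipy.spatial import cKDTree

# Hypothetical class IDs assigned by the semantic segmentation model
VEGETATION, CONDUCTOR = 3, 5
CLEARANCE_M = 4.0  # placeholder safety distance in meters

def vegetation_too_close(points: np.ndarray, labels: np.ndarray, clearance: float = CLEARANCE_M) -> np.ndarray:
    """Return the vegetation points lying within `clearance` meters of any conductor point.

    points: (N, 3) array of x, y, z coordinates; labels: (N,) array of per-point class IDs.
    """
    conductors = points[labels == CONDUCTOR]
    vegetation = points[labels == VEGETATION]
    if len(conductors) == 0 or len(vegetation) == 0:
        return np.empty((0, 3))
    # Distance from every vegetation point to its nearest conductor point
    distances, _ = cKDTree(conductors).query(vegetation, k=1)
    return vegetation[distances < clearance]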
The following are some examples of point cloud images.
Looking at the power grid from space: Mapping vegetation to prevent service disruptions
Inspecting the power grid with helicopters and other means is generally very expensive and can’t be done too frequently. On the other hand, having a system to monitor vegetation trends over short time intervals is extremely useful for optimizing one of the most expensive processes of an energy distributor: tree pruning. This is why Enel also included satellite image analysis in its solution: a multitask approach identifies where vegetation is present, its density, and the type of plants grouped into macro classes.
For this use case, after experimenting with different resolutions, Enel concluded that the free Sentinel-2 images provided by the Copernicus program had the best cost-benefit ratio. In addition to vegetation, Enel also uses satellite imagery to identify buildings, which is useful information for understanding whether there are discrepancies between their presence and where Enel delivers power, and therefore any irregular connections or problems in the databases. For the latter use case, the 10-meter resolution of Sentinel-2 is not sufficient, so paid images with a resolution of 50 centimeters are purchased. This solution also doesn’t differ much from the previous ones in terms of services used and flow.
The following is an aerial picture with identification of assets (pole and insulators).
Angela Italiano, Director of Data Science at ENEL Grid, says,
“At Enel, we use computer vision models to inspect our electricity distribution network by reconstructing 3D images of our network using tens of millions of high-quality images and LiDAR point clouds. The training of these ML models requires access to a large number of instances equipped with high-performance GPUs and the ability to handle large volumes of data efficiently. With Amazon SageMaker, we can quickly train all of our models in parallel without needing to manage the infrastructure as Amazon SageMaker training scales the compute resources up and down as needed. Using Amazon SageMaker, we are able to build 3D images of our systems, monitor for anomalies, and serve over 60 million customers efficiently.”
Conclusion
In this post, we saw how a top player in the energy world like Enel used computer vision models and SageMaker training and processing jobs to solve one of the main problems of managing an infrastructure of this colossal size: keeping track of installed assets and identifying anomalies and sources of danger for a power line, such as vegetation growing too close to it.
Learn more about the related features of SageMaker.
About the Authors
Mario Namtao Shianti Larcher is the Head of Computer Vision at Enel. He has a background in mathematics and statistics and deep expertise in machine learning and computer vision, and he leads a team of over ten professionals. Mario’s role entails implementing advanced solutions that effectively utilize the power of AI and computer vision to leverage Enel’s extensive data resources. In addition to his professional endeavors, he nurtures a personal passion for both traditional and AI-generated art.
Cristian Gavazzeni is a Senior Solution Architect at Amazon Web Services. He has more than 20 years of experience as a pre-sales consultant focusing on Data Management, Infrastructure and Security. During his spare time he likes playing golf with friends and travelling abroad with only fly and drive bookings.
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to deeply understand their business and technical needs and design AI and machine learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.
Using societal context knowledge to foster the responsible application of AI
AI-related products and technologies are constructed and deployed in a societal context: that is, a dynamic and complex collection of social, cultural, historical, political and economic circumstances. Because societal contexts by nature are dynamic, complex, non-linear, contested, subjective, and highly qualitative, they are challenging to translate into the quantitative representations, methods, and practices that dominate standard machine learning (ML) approaches and responsible AI product development practices.
The first phase of AI product development is problem understanding, and this phase has tremendous influence over how problems (e.g., increasing cancer screening availability and accuracy) are formulated for ML systems to solve, as well as many other downstream decisions, such as dataset and ML architecture choice. When the societal context in which a product will operate is not articulated well enough to result in robust problem understanding, the resulting ML solutions can be fragile and even propagate unfair biases.
When AI product developers lack access to the knowledge and tools necessary to effectively understand and consider societal context during development, they tend to abstract it away. This abstraction leaves them with a shallow, quantitative understanding of the problems they seek to solve, while product users and society stakeholders — who are proximate to these problems and embedded in related societal contexts — tend to have a deep qualitative understanding of those same problems. This qualitative–quantitative divergence in ways of understanding complex problems that separates product users and society from developers is what we call the problem understanding chasm.
This chasm has repercussions in the real world: for example, it was the root cause of the racial bias discovered in a widely used healthcare algorithm intended to solve the problem of choosing patients with the most complex healthcare needs for special programs. Incomplete understanding of the societal context in which the algorithm would operate led system designers to form incorrect and oversimplified causal theories about what the key problem factors were. Critical socio-structural factors, including lack of access to healthcare, lack of trust in the healthcare system, and underdiagnosis due to human bias, were left out, while spending on healthcare was highlighted as a predictor of complex health need.
To bridge the problem understanding chasm responsibly, AI product developers need tools that put community-validated and structured knowledge of societal context about complex societal problems at their fingertips — starting with problem understanding, but also throughout the product development lifecycle. To that end, Societal Context Understanding Tools and Solutions (SCOUTS) — part of the Responsible AI and Human-Centered Technology (RAI-HCT) team within Google Research — is a dedicated research team focused on the mission to “empower people with the scalable, trustworthy societal context knowledge required to realize responsible, robust AI and solve the world’s most complex societal problems.” SCOUTS is motivated by the significant challenge of articulating societal context, and it conducts innovative foundational and applied research to produce structured societal context knowledge and to integrate it into all phases of the AI-related product development lifecycle. Last year we announced that Jigsaw, Google’s incubator for building technology that explores solutions to threats to open societies, leveraged our structured societal context knowledge approach during the data preparation and evaluation phases of model development to scale bias mitigation for their widely used Perspective API toxicity classifier. Going forward SCOUTS’ research agenda focuses on the problem understanding phase of AI-related product development with the goal of bridging the problem understanding chasm.
Bridging the AI problem understanding chasm
Bridging the AI problem understanding chasm requires two key ingredients: 1) a reference frame for organizing structured societal context knowledge and 2) participatory, non-extractive methods to elicit community expertise about complex problems and represent it as structured knowledge. SCOUTS has published innovative research in both areas.
An illustration of the problem understanding chasm.
A societal context reference frame
An essential ingredient for producing structured knowledge is a taxonomy for creating the structure to organize it. SCOUTS collaborated with other RAI-HCT teams (TasC, Impact Lab), Google DeepMind, and external system dynamics experts to develop a taxonomic reference frame for societal context. To contend with the complex, dynamic, and adaptive nature of societal context, we leverage complex adaptive systems (CAS) theory to propose a high-level taxonomic model for organizing societal context knowledge. The model pinpoints three key elements of societal context and the dynamic feedback loops that bind them together: agents, precepts, and artifacts.
- Agents: These can be individuals or institutions.
- Precepts: The preconceptions — including beliefs, values, stereotypes and biases — that constrain and drive the behavior of agents. An example of a basic precept is that “all basketball players are over 6 feet tall.” That limiting assumption can lead to failures in identifying basketball players of smaller stature.
- Artifacts: Agent behaviors produce many kinds of artifacts, including language, data, technologies, societal problems and products.
The relationships between these entities are dynamic and complex. Our work hypothesizes that precepts are the most critical element of societal context, and we highlight the problems people perceive and the causal theories they hold about why those problems exist as particularly influential precepts that are core to understanding societal context. For example, in the case of the racial bias in a medical algorithm described earlier, the causal theory precept held by designers was that complex health problems would cause healthcare expenditures to go up for all populations. That incorrect precept directly led to the choice of healthcare spending as the proxy variable for the model to predict complex healthcare need, which in turn led to the model being biased against Black patients who, due to societal factors such as lack of access to healthcare and underdiagnosis due to bias, on average do not spend more on healthcare when they have complex healthcare needs. A key open question is: how can we ethically and equitably elicit causal theories from the people and communities who are most proximate to problems of inequity, and transform them into useful structured knowledge?
Illustrative version of the societal context reference frame.
Taxonomic version of the societal context reference frame.
Working with communities to foster the responsible application of AI to healthcare
Since its inception, SCOUTS has worked to build capacity in historically marginalized communities to articulate the broader societal context of the complex problems that matter to them using a practice called community based system dynamics (CBSD). System dynamics (SD) is a methodology for articulating causal theories about complex problems, both qualitatively as causal loop and stock and flow diagrams (CLDs and SFDs, respectively) and quantitatively as simulation models. The inherent support of visual qualitative tools, quantitative methods, and collaborative model building makes it an ideal ingredient for bridging the problem understanding chasm. CBSD is a community-based, participatory variant of SD specifically focused on building capacity within communities to collaboratively describe and model the problems they face as causal theories, directly without intermediaries. With CBSD we’ve witnessed community groups learn the basics and begin drawing CLDs within 2 hours.
Data 4 Black Lives community members learning system dynamics.
There is huge potential for AI to improve medical diagnosis. But the safety, equity, and reliability of AI-related health diagnostic algorithms depend on diverse and balanced training datasets. An open challenge in the health diagnostic space is the dearth of training sample data from historically marginalized groups. SCOUTS collaborated with the Data 4 Black Lives community and CBSD experts to produce qualitative and quantitative causal theories for the data gap problem. The theories include critical factors that make up the broader societal context surrounding health diagnostics, including cultural memory of death and trust in medical care.
The figure below depicts the causal theory generated during the collaboration described above as a CLD. It hypothesizes that trust in medical care influences all parts of this complex system and is the key lever for increasing screening, which in turn generates data to overcome the data diversity gap.
Causal loop diagram of the health diagnostics data gap.
These community-sourced causal theories are a first step to bridge the problem understanding chasm with trustworthy societal context knowledge.
Conclusion
As discussed in this blog, the problem understanding chasm is a critical open challenge in responsible AI. SCOUTS conducts exploratory and applied research in collaboration with other teams within Google Research and with external community and academic partners across multiple disciplines to make meaningful progress toward solving it. Going forward, our work will focus on three key elements, guided by our AI Principles:
- Increase awareness and understanding of the problem understanding chasm and its implications through talks, publications, and training.
- Conduct foundational and applied research for representing and integrating societal context knowledge into AI product development tools and workflows, from conception to monitoring, evaluation and adaptation.
- Apply community-based causal modeling methods to the AI health equity domain to realize impact and build society’s and Google’s capability to produce and leverage global-scale societal context knowledge to realize responsible AI.
SCOUTS flywheel for bridging the problem understanding chasm.
Acknowledgments
Thank you to John Guilyard for graphics development, everyone in SCOUTS, and all of our collaborators and sponsors.
Efficiently train, tune, and deploy custom ensembles using Amazon SageMaker
Artificial intelligence (AI) has become an important and popular topic in the technology community. As AI has evolved, we have seen different types of machine learning (ML) models emerge. One approach, known as ensemble modeling, has been rapidly gaining traction among data scientists and practitioners. In this post, we discuss what ensemble models are and why their usage can be beneficial. We then provide an example of how you can train, optimize, and deploy your custom ensembles using Amazon SageMaker.
Ensemble learning refers to the use of multiple learning models and algorithms to obtain more accurate predictions than any single, individual learning algorithm could. Ensembles have proven effective in diverse applications and learning settings such as cybersecurity [1] and fraud detection, remote sensing, predicting the best next steps in financial decision-making, medical diagnosis, and even computer vision and natural language processing (NLP) tasks. We tend to categorize ensembles by the techniques used to train them, their composition, and the way they merge the different predictions into a single inference. These categories include:
- Boosting – Training multiple weak learners sequentially, where examples misclassified by previous learners in the sequence are given a higher weight as input to the next learner, thereby creating a stronger learner. Examples include AdaBoost, Gradient Boosting, and XGBoost.
- Bagging – Uses multiple models to reduce the variance of a single model. Examples include Random Forest and Extra Trees.
- Stacking (blending) – Often uses heterogeneous models, where the predictions of each individual estimator are stacked together and used as input to a final estimator that handles the prediction. This final estimator’s training process often uses cross-validation.
There are multiple methods of combining the predictions into the single one that the model finally produces, for example, using a meta-estimator such as a linear learner, a voting method that makes a prediction based on majority voting for classification tasks, or ensemble averaging for regression.
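As a small, generic illustration of these combination strategies (not the custom approach built later in this post), the scikit-learn snippet below averages two regressors with VotingRegressor and stacks them behind a linear meta-estimator with StackingRegressor; the base estimators and dataset are arbitrary examples.

from sklearn.datasets import load_diabetes
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
    VotingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
base_estimators = [
    ("gbr", GradientBoostingRegressor(n_estimators=100)),
    ("rf", RandomForestRegressor(n_estimators=100)),
]

# Ensemble averaging: the final prediction is the mean of the base predictions
averaging = VotingRegressor(estimators=base_estimators)

# Stacking: a linear meta-estimator learns how to combine the base predictions
stacking = StackingRegressor(estimators=base_estimators, final_estimator=LinearRegression())

for name, model in [("averaging", averaging), ("stacking", stacking)]:
    scores = cross_val_score(model, X, y, scoring="neg_root_mean_squared_error", cv=5)
    print(f"{name}: RMSE = {-scores.mean():.2f}")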
Although several libraries and frameworks provide implementations of ensemble models, such as XGBoost, CatBoost, or scikit-learn’s random forest, in this post we focus on bringing your own models and using them as a stacking ensemble. However, instead of using dedicated resources for each model (dedicated training and tuning jobs and hosting endpoints per model), we train, tune, and deploy a custom ensemble (multiple models) using a single SageMaker training job and a single tuning job, and deploy to a single endpoint, thereby reducing possible cost and operational overhead.
BYOE: Bring your own ensemble
There are several ways to train and deploy heterogeneous ensemble models with SageMaker: you can train each model in a separate training job and optimize each model separately using Amazon SageMaker Automatic Model Tuning. When hosting these models, SageMaker provides various cost-effective ways to host multiple models on the same tenant infrastructure. Detailed deployment patterns for this kind of setting can be found in Model hosting patterns in Amazon SageMaker, Part 1: Common design patterns for building ML applications on Amazon SageMaker. These patterns include using multiple endpoints (one for each trained model), a single multi-model endpoint, or even a single multi-container endpoint where the containers can be invoked individually or chained in a pipeline. All these solutions include a meta-estimator (for example, in an AWS Lambda function) that invokes each model and implements the blending or voting function.
However, running multiple training jobs might introduce operational and cost overhead, especially if your ensemble requires training on the same data. Similarly, hosting different models on separate endpoints or containers and combining their prediction results for better accuracy requires multiple invocations, and therefore introduces additional management, cost, and monitoring efforts. For example, SageMaker supports ensemble ML models using Triton Inference Server, but this solution requires the models or model ensembles to be supported by the Triton backend. It also requires extra effort from the customer to set up the Triton server and extra learning to understand how different Triton backends work. Therefore, customers often prefer a more straightforward way to implement solutions where they only need to send the invocation once to the endpoint and have the flexibility to control how the results are aggregated to generate the final output.
Solution overview
To address these concerns, we walk through an example of ensemble training using a single training job, optimizing the model’s hyperparameters and deploying it using a single container to a serverless endpoint. We use two models for our ensemble stack: CatBoost and XGBoost (both of which are boosting ensembles). For our data, we use the diabetes dataset [2] from the scikit-learn library. It consists of 10 features (age, sex, body mass index, blood pressure, and six blood serum measurements), and our model predicts the disease progression 1 year after the baseline features were collected (a regression model).
The full code repository can be found on GitHub.
Train multiple models in a single SageMaker job
For training our models, we use SageMaker training jobs in Script mode. With Script mode, you can write custom training (and later inference) code while using SageMaker framework containers. Framework containers enable you to use ready-made environments managed by AWS that include all necessary configuration and modules. To demonstrate how you can customize a framework container, as an example, we use the pre-built SKLearn container, which doesn’t include the XGBoost and CatBoost packages. There are two options to add these packages: either extend the built-in container to install CatBoost and XGBoost (and then deploy as a custom container), or use the SageMaker training job Script mode feature, which allows you to provide a requirements.txt file when creating the training estimator. The SageMaker training job installs the listed libraries in the requirements.txt file during runtime. This way, you don’t need to manage your own Docker image repository, and you get more flexibility for running training scripts that need additional Python packages.
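For example, a requirements.txt file for this use case only needs the two packages that the SKLearn container doesn’t ship with (the pinned versions below are illustrative, not prescriptive):

catboost==1.2.2
xgboost==1.7.6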
The following code block shows the code we use to start the training. The entry_point parameter points to our training script. We also use two of the SageMaker SDK API’s compelling features:
- First, we specify the local path to our source directory and dependencies in the source_dir and dependencies parameters, respectively. The SDK will compress and upload those directories to Amazon Simple Storage Service (Amazon S3), and SageMaker will make them available on the training instance under the working directory /opt/ml/code.
- Second, we use the SDK SKLearn estimator object with our preferred Python and framework version, so that SageMaker will pull the corresponding container. We have also defined a custom training metric ‘validation:rmse‘, which will be emitted in the training logs and captured by SageMaker. Later, we use this metric as the objective metric in the tuning job.
hyperparameters = {"num_round": 6, "max_depth": 5}

estimator_parameters = {
    "entry_point": "multi_model_hpo.py",
    "source_dir": "code",
    "dependencies": ["my_custom_library"],
    "instance_type": training_instance_type,
    "instance_count": 1,
    "hyperparameters": hyperparameters,
    "role": role,
    "base_job_name": "xgboost-model",
    "framework_version": "1.0-1",
    "keep_alive_period_in_seconds": 60,
    "metric_definitions": [
        # The regex must match the value printed by the training script (see below)
        {"Name": "validation:rmse", "Regex": "validation-rmse:([0-9\.]+)"}
    ],
}

estimator = SKLearn(**estimator_parameters)
Next, we write our training script (multi_model_hpo.py). Our script follows a simple flow: capture hyperparameters with which the job was configured and train the CatBoost model and XGBoost model. We also implement a k-fold cross validation function. See the following code:
if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    # SageMaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument("--output-data-dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--train", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--validation", type=str, default=os.environ["SM_CHANNEL_VALIDATION"])
    .
    .
    .
    """
    Train the CatBoost model
    """
    K = args.k_fold

    catboost_hyperparameters = {
        "max_depth": args.max_depth,
        "eta": args.eta,
    }

    rmse_list, model_catboost = cross_validation_catboost(train_df, K, catboost_hyperparameters)
    .
    .
    .
    """
    Train the XGBoost model
    """
    hyperparameters = {
        "max_depth": args.max_depth,
        "eta": args.eta,
        "objective": args.objective,
        "num_round": args.num_round,
    }

    rmse_list, model_xgb = cross_validation(train_df, K, hyperparameters)
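The cross-validation helpers themselves are omitted above. The following is a minimal sketch of what a function like cross_validation could look like for the XGBoost model; it assumes the target column of train_df is named "target" and returns the per-fold validation RMSEs together with the model trained on the last fold, which is just one reasonable convention, not necessarily the one used in the full repository.

import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def cross_validation(train_df, k, hyperparameters, target_col="target"):
    """K-fold cross-validation for XGBoost; returns per-fold RMSEs and the last fold's model."""
    features = [c for c in train_df.columns if c != target_col]
    params = {key: value for key, value in hyperparameters.items() if key != "num_round"}
    rmse_list, model = [], None
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(train_df):
        train_fold, val_fold = train_df.iloc[train_idx], train_df.iloc[val_idx]
        dtrain = xgb.DMatrix(train_fold[features], label=train_fold[target_col])
        dval = xgb.DMatrix(val_fold[features], label=val_fold[target_col])
        model = xgb.train(params, dtrain, num_boost_round=int(hyperparameters["num_round"]))
        preds = model.predict(dval)
        rmse_list.append(mean_squared_error(val_fold[target_col], preds, squared=False))
    return rmse_list, model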
After the models are trained, we calculate the mean of both the CatBoost and XGBoost predictions. The result, pred_mean, is our ensemble’s final prediction. Then, we determine the mean_squared_error against the validation set. val_rmse is used for the evaluation of the whole ensemble during training. Notice that we also print the RMSE value in a pattern that fits the regex we used in the metric_definitions. Later, SageMaker Automatic Model Tuning will use that to capture the objective metric. See the following code:
pred_mean = np.mean(np.array([pred_catboost, pred_xgb]), axis=0)
val_rmse = mean_squared_error(y_validation, pred_mean, squared=False)
print(f"Final evaluation result: validation-rmse:{val_rmse}")
Finally, our script saves both model artifacts to the output folder located at /opt/ml/model.
When a training job is complete, SageMaker packages and copies the content of the /opt/ml/model directory as a single object in compressed TAR format to the S3 location that you specified in the job configuration. In our case, SageMaker bundles the two models in a TAR file and uploads it to Amazon S3 at the end of the training job. See the following code:
model_file_name = 'catboost-regressor-model.dump'
# Save CatBoost model
path = os.path.join(args.model_dir, model_file_name)
print('saving model file to {}'.format(path))
model.save_model(path)
.
.
.
# Save XGBoost model
model_location = args.model_dir + "/xgboost-model"
pickle.dump(model, open(model_location, "wb"))
logging.info("Stored trained model at {}".format(model_location))
In summary, you should notice that in this procedure we downloaded the data one time and trained two models using a single training job.
Automatic ensemble model tuning
Because we’re building a collection of ML models, exploring all of the possible hyperparameter permutations is impractical. SageMaker offers Automatic Model Tuning (AMT), which looks for the best model hyperparameters by focusing on the most promising combinations of values within ranges that you specify (it’s up to you to define the right ranges to explore). SageMaker supports multiple optimization methods for you to choose from.
We start by defining the two parts of the optimization process: the objective metric and the hyperparameters we want to tune. In our example, we use the validation RMSE as the target metric, and we tune eta and max_depth (for other hyperparameters, refer to XGBoost Hyperparameters and CatBoost hyperparameters):
from sagemaker.tuner import (
    IntegerParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

hyperparameter_ranges = {
    "eta": ContinuousParameter(0.2, 0.3),
    "max_depth": IntegerParameter(3, 4)
}

metric_definitions = [{"Name": "validation:rmse", "Regex": "validation-rmse:([0-9\.]+)"}]

objective_metric_name = "validation:rmse"
We also need to ensure in the training script that our hyperparameters are not hardcoded and are pulled from the SageMaker runtime arguments:
catboost_hyperparameters = {
    "max_depth": args.max_depth,
    "eta": args.eta,
}
SageMaker also writes the hyperparameters to a JSON file, which can be read from /opt/ml/input/config/hyperparameters.json on the training instance.
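If you prefer to read that file directly instead of declaring argparse arguments, a minimal example looks like the following. Note that SageMaker serializes every hyperparameter value as a string, so you need to cast the values yourself:

import json

with open("/opt/ml/input/config/hyperparameters.json") as f:
    hp = json.load(f)  # for example, {"max_depth": "5", "eta": "0.2", ...}

max_depth = int(hp["max_depth"])  # values arrive as strings and must be cast
eta = float(hp["eta"])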
As we did for CatBoost, we also capture the hyperparameters for the XGBoost model (notice that objective and num_round aren’t tuned):

hyperparameters = {
    "max_depth": args.max_depth,
    "eta": args.eta,
    "objective": args.objective,
    "num_round": args.num_round,
}
Finally, we launch the hyperparameter tuning job using these configurations:
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions=metric_definitions,  # defined earlier
    max_jobs=4,
    max_parallel_jobs=2,
    objective_type='Minimize'
)

tuner.fit({"train": train_location, "validation": validation_location}, include_cls_metadata=False)
When the job is complete, you can retrieve the values for the best training job (with minimal RMSE):
job_name=tuner.latest_tuning_job.name
attached_tuner = HyperparameterTuner.attach(job_name)
attached_tuner.describe()["BestTrainingJob"]
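You can also pull every trial of the tuning job, with its hyperparameters and final objective value, into a pandas DataFrame through the tuner’s analytics helper, which is handy for a quick comparison:

# Inspect all training jobs launched by the tuner, best (lowest RMSE) first
results_df = attached_tuner.analytics().dataframe()
print(results_df.sort_values("FinalObjectiveValue").head())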
For more information on AMT, refer to Perform Automatic Model Tuning with SageMaker.
Deployment
To deploy our custom ensemble, we need to provide a script to handle the inference request and configure SageMaker hosting. In this example, we used a single file that includes both the training and inference code (multi_model_hpo.py). SageMaker uses the code under if __name__ == "__main__" for the training, and the functions model_fn, input_fn, and predict_fn when deploying and serving the model.
Inference script
As with training, we use the SageMaker SKLearn framework container with our own inference script. The script will implement three methods required by SageMaker.
First, the model_fn method reads our saved model artifact files and loads them into memory. In our case, the method returns our ensemble as all_model, which is a Python list, but you can also use a dictionary with model names as keys.
def model_fn(model_dir):
    catboost_model = CatBoostRegressor()
    catboost_model.load_model(os.path.join(model_dir, model_file_name))

    model_file = "xgboost-model"
    model = pickle.load(open(os.path.join(model_dir, model_file), "rb"))

    all_model = [catboost_model, model]
    return all_model
Second, the input_fn method deserializes the request input data to be passed to our inference handler. For more information about input handlers, refer to Adapting Your Own Inference Container.
def input_fn(input_data, content_type):
    dtype = None
    payload = StringIO(input_data)
    return np.genfromtxt(payload, dtype=dtype, delimiter=",")
Third, the predict_fn method is responsible for getting predictions from the models. The method takes the model and the data returned from input_fn as parameters and returns the final prediction. In our example, we get the CatBoost result from the first member of the model list (model[0]) and the XGBoost result from the second member (model[1]), and we use a blending function that returns the mean of both predictions:
def predict_fn(input_data, model):
    predictions_catb = model[0].predict(input_data)

    dtest = xgb.DMatrix(input_data)
    predictions_xgb = model[1].predict(
        dtest,
        ntree_limit=getattr(model[1], "best_ntree_limit", 0),
        validate_features=False,
    )

    return np.mean(np.array([predictions_catb, predictions_xgb]), axis=0)
Now that we have our trained models and inference script, we can configure the environment to deploy our ensemble.
SageMaker Serverless Inference
Although there are many hosting options in SageMaker, in this example, we use a serverless endpoint. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic. This takes away the undifferentiated heavy lifting of managing servers. This option is ideal for workloads that have idle periods between traffic spurts and can tolerate cold starts.
Configuring the serverless endpoint is straightforward because we don’t need to choose instance types or manage scaling policies. We only need to provide two parameters: memory size and maximum concurrency. The serverless endpoint automatically assigns compute resources proportional to the memory you select. If you choose a larger memory size, your container has access to more vCPUs. You should always choose your endpoint’s memory size according to your model size. The second parameter we need to provide is maximum concurrency. For a single endpoint, this parameter can be set up to 200 (as of this writing, the limit for total number of serverless endpoints in a Region is 50). You should note that the maximum concurrency for an individual endpoint prevents that endpoint from taking up all the invocations allowed for your account, because any endpoint invocations beyond the maximum are throttled (for more information about the total concurrency for all serverless endpoints per Region, refer to Amazon SageMaker endpoints and quotas).
from sagemaker.serverless.serverless_inference_config import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144,
    max_concurrency=1,
)
Now that we configured the endpoint, we can finally deploy the model that was selected in our hyperparameter optimization job:
estimator=attached_tuner.best_estimator()
predictor = estimator.deploy(serverless_inference_config=serverless_config)
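To smoke-test the endpoint, you can send one CSV-formatted record that matches the input_fn shown earlier. The feature values below are arbitrary placeholders for the 10 diabetes features, and the response format depends on the container’s default output handler:

from sagemaker.serializers import CSVSerializer

predictor.serializer = CSVSerializer()  # input_fn above expects comma-separated values

# One record with 10 feature values (placeholder numbers, not real patient data)
sample = [0.038, 0.050, 0.062, 0.022, -0.044, -0.035, -0.043, -0.003, 0.020, -0.018]
print(predictor.predict(sample))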
Clean up
Even though serverless endpoints have zero cost when not being used, when you have finished running this example, you should make sure to delete the endpoint:
predictor.delete_endpoint()
Conclusion
In this post, we covered one approach to train, optimize, and deploy a custom ensemble. We detailed the process of using a single training job to train multiple models, how to use automatic model tuning to optimize the ensemble hyperparameters, and how to deploy a single serverless endpoint that blends the inferences from multiple models.
Using this method solves potential cost and operational issues. The cost of a training job is based on the resources you use for the duration of usage. By downloading the data only once for training the two models, we halved the job’s data download phase and the volume that stores the data, thereby reducing the training job’s overall cost. Furthermore, the AMT job ran four training jobs, each with the aforementioned reduced time and storage, so the saving is multiplied by four. Finally, with regard to model deployment on a serverless endpoint, because you also pay for the amount of data processed, by invoking the endpoint only once for two models, you pay half of the I/O data charges.
Although this post only showed the benefits with two models, you can use this method to train, tune, and deploy numerous ensemble models to see an even greater effect.
References
[1] Raj Kumar, P. Arun; Selvakumar, S. (2011). “Distributed denial of service attack detection using an ensemble of neural classifier”. Computer Communications. 34 (11): 1328–1341. doi:10.1016/j.comcom.2011.01.012.
[2] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani (2004). “Least Angle Regression.” Annals of Statistics (with discussion), 407–499. (https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
About the Authors
Melanie Li, PhD, is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers to build solutions leveraging the state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing machine learning solutions with best practices. In her spare time, she loves to explore nature outdoors and spend time with family and friends.
Uri Rosenberg is the AI & ML Specialist Technical Manager for Europe, Middle East, and Africa. Based out of Israel, Uri works to empower enterprise customers to design, build, and operate ML workloads at scale. In his spare time, he enjoys cycling, hiking, and minimizing RMSEs.
So, So Fresh: Play the Newest Games in the Cloud on Day One
It’s a party this GFN Thursday with several newly launched titles streaming on GeForce NOW. Revel in gaming goodness with Xenonauts 2, Viewfinder and Techtonica, among the four new games joining the cloud this week.
Portal fans, stay tuned — the Portal: Prelude RTX mod will be streaming on GeForce NOW to members soon.
Plus, find out how members can score an upcoming Guild Wars 2 premium reward.
Get ‘Em While They’re Hot!
Choose from over 1,600 games in the GeForce NOW library, starting off with the titles making their cloud debut. Be among the first to experience Xenonauts 2, Viewfinder and Techtonica from a high-performance GeForce RTX gaming rig in the cloud, without worrying about download times or system specs.
Take on a different perspective in Viewfinder, the new single-player game from Thunderful Publishing. Gamers can challenge perception, redefine reality and reshape the world around them with an instant camera. Capture pictures and bring them to life by placing them into the scene in this mind-bending reality adventure.
Those looking for something out of this world can check out Fire Hose Games’ Techtonica, set in a strangely beautiful, bioluminescent, mysterious subsurface alien universe. Play solo or with a buddy to build factories, gather resources, research new technologies and uncover long-forgotten secrets.
Fans of the Xenonauts series can look forward to the second entry in the franchise from Hooded Horse. In Xenonauts 2, work as the head of a multinational military organization tasked with eliminating an extraterrestrial threat. Play from the shadows to seek out and engage a growing alien presence.
Catch these titles fresh out of the oven, and upgrade to a premium membership for faster access than free members get.
Exclusive GeForce NOW Rewards
Starting next week, Ultimate and Priority members get an exclusive reward for the hit MMORPG Guild Wars 2. The “Always Prepared” and “Booster” bundles will bring premium members a combo of helpful tools, cosmetic items, a mini pet and more.
Upgrade to an Ultimate or Priority membership today, and visit the GeForce NOW Rewards portal to update the settings to receive special offers and in-game goodies. Better hurry — these rewards are available for a limited time on a first-come, first-served basis.
Grab them in time for the fourth expansion of Guild Wars 2, coming to GeForce NOW at launch on Tuesday, Aug. 22. The Secrets of the Obscure paid expansion includes a new storyline, powerful combat options, new mount abilities and more.
New Games, Who Dis?
Jump into the list of the four new games hitting GeForce NOW this week:
- Techtonica (New release on Steam, July 18)
- Viewfinder (New release on Steam, July 18)
- Xenonauts 2 (New release on Steam, July 18)
- Embr (Steam)
Before heading into the weekend, check out our question of the week. Let us know your answer on Twitter or in the comments below.
Which video game character do you identify with the most and why?
— NVIDIA GeForce NOW (@NVIDIAGFN) July 19, 2023
Collaborators: Gaming AI with Haiyan Zhang
Episode 143 | July 20, 2023
Transforming research ideas into meaningful impact is no small feat. It often requires the knowledge and experience of individuals from across disciplines and institutions. Collaborators, a new Microsoft Research Podcast series, explores the relationships—both expected and unexpected—behind the projects, products, and services being pursued and delivered by researchers at Microsoft and the diverse range of people they’re teaming up with.
In the world of gaming, Haiyan Zhang has situated herself where research meets real-world challenges, helping to bring product teams and researchers together to elevate the player experience with the latest AI advances even before the job became official with the creation of her current role, General Manager of Gaming AI. In this episode, she talks with host Dr. Gretchen Huizinga about the variety of expertise needed to avoid the discomfort experienced by players when they encounter a humanlike character displaying inhuman behavior, the potential for generative AI to make gaming better for both players and creators, and the games she grew up playing and what she plays now.
Learn more:
- Game Intelligence (group page)
- Project Paidia (project page)
- TrueSkill Ranking System (project page)
- TrueMatch Matchmaking System (project page)
- Grounded Conversational Characters (project page)
Transcript
[TEASER] [MUSIC PLAYS UNDER DIALOGUE]

HAIYAN ZHANG: And as animation rendering improves, we have the ability to get to a wholly life-like character. So how do we get there? You know, we’re working with animators. We want to bring in neuroscientists, game developers, user researchers. There’s a lot of great work happening across the games industry in machine learning research, looking at things like motion matching, helping to create different kinds of animations that are more human-like. And it is this bringing together of artistry and technology development that’s going to get us there.
GRETCHEN HUIZINGA: You’re listening to Collaborators, a Microsoft Research Podcast showcasing the range of expertise that goes into transforming mind-blowing ideas into world-changing technologies. I’m Dr. Gretchen Huizinga.
[MUSIC ENDS]
I’m delighted to be back in the booth today with Haiyan Zhang. When we last talked, Haiyan was Director of Innovation at Microsoft Research Cambridge in the UK and Technical Advisor to Lab Director Chris Bishop. In 2020, she moved across the pond and into the role of Chief of Staff for Microsoft Gaming. Now she’s the General Manager of Gaming AI at Xbox, and I normally say, let’s meet our collaborators right now, but today, we’ll be exploring several aspects of collaboration with a woman who has collaborating in her DNA. Welcome back, Haiyan!
ZHANG: Hi, Gretchen. It’s so good to be here. Thank you for having me on the show.
HUIZINGA: Listen, I left out a lot of things that you did before your research work in the UK, and I want you to talk about that in a minute, but first, help us understand your new role as GM of Gaming AI. What does that mean, and what do you do?
ZHANG: Right. Uh, so I started this role about a year ago, last summer, actually. And those familiar with Xbox will know that Xbox has been shipping AI for over a decade and in deep collaborations with Microsoft Research, with technologies like TrueSkill and TrueMatch, uh, that came out of the Cambridge lab in the UK. So we have a rich history of transferring machine learning research into applied product launch, and I think more recently, we know that AI is experiencing this step change in its capabilities, what it can empower people to do. And as we looked across all the various teams in Xbox working on machine learning models, looking at these new foundation models, this role was really to say, hey, let’s bring everybody together. Let’s bring it together and figure out what’s our strategy moving forward? Are there new connections for collaboration that we can form across teams from the platform services team to the game development teams? Are there new avenues that we should be exploring with AI to really accelerate how we make games and how we deliver those great game experiences to our players? So my role is really, let’s bring together that strategy for Xbox AI and then to look at new innovations, new incubation, that my team is spearheading, uh, in new areas for game development and our gaming platform.
HUIZINGA: Fascinating. Are you the first person to have this role? I mean, has there been a Gaming AI person before you?
ZHANG: Did you … did you get that from [LAUGHS] what I just … yeah, did you get that hint from what I just said?
HUIZINGA: Right!
ZHANG: The role didn’t exist before I came into the role last summer. And sometimes, you know, when you step into a role that didn’t exist before, a part of that is just to define what the role is by looking at what the organization needs. So you are totally right. This is a completely new role, and it is very timely because we are at that intersection where AI is really transformational. I mean, we’ve seen machine learning gain traction with deep learning, uh, being applied in many more areas, and now with foundation models, with these large language models, we’re just going to see this completely new set of capabilities emerge through technology that we want to be ready for in Xbox.
HUIZINGA: You know, I’m going to say a phrase I hate myself for saying, but I want to double click on that for a minute. [LAUGHS] Um, AI for a decade or more in Xbox … how much did the new advances in large learning models and the GPT phenom help and influence what you’re doing?
ZHANG: Well, so interestingly, Gretchen, I, I’ve actually been secretly doing this role for several years before they’ve made it official. So back in Cambridge in the UK, with Microsoft Research, uh, I was working with our various AI teams that were working within gaming. So you might have talked before to Katja Hofmann’s team working on reinforcement learning.
HUIZINGA: Absolutely.
ZHANG: The team working on Bayesian inference with TrueSkill and TrueMatch. So I was helping the whole lab think about, hey, how do we take these research topics, how do we apply them into gaming? Coming across the pond from the UK to Redmond, meeting with different folks such as Phil Spencer, the CEO of Microsoft Gaming, the leadership team of Xbox and really trying to champion, how do we get more, infuse more, AI into the Xbox ecosystem? Every year, I would work with colleagues in Xbox, and we would run an internal gaming and AI conference where we bring together the best minds of Microsoft Research and the product teams to really kind of meet in the middle and talk about, hey, here are some research topics; here are some product challenges. So I think for the last five or six years, I’ve, I’ve been vying for this role to be created. And finally, it happened!
HUIZINGA: Sort of you personify the role anyway, Haiyan. Was that conference kind of a hackathon kind of thing, or was it more formal and, um, bring minds together and … academic wise?
ZHANG: Right. We saw a space for both. That there needed to be more of a translation layer between, hey, here are some real product challenges we face in Xbox, both in terms of how we make the games and then how we build the networking platform that players join to access the games. So here are some real-world problems we face. And then to bring that together with, hey, here are some research topics that people might not have thought about applied to gaming. In recent times, we’ve looked at topics like imitation learning. You know, imitation learning can be applied to a number of different areas, but to apply that into video games, to say, hey, how can we take real player data and be able to try to develop AI agents that can play, uh, in the style of those human players, personalized for human players? You know, this is where I think the exciting work happens between problems in the real world and, uh, researchers that are looking at bigger research topics and are looking for those real-world problems.
HUIZINGA: Right. To kind of match them together. Well, speaking of all the things you’ve done, um, you’ve had sort of a Where in the World is Carmen Sandiego? kind of career path, um, everything from software engineering and user experience to hardware R&D, design thinking, blue-sky envisioning. Some of that you’ve just referred to in that spectrum of, you know, real-world problems and research topics. But now you’re doing gaming. So talk a little bit about the why and how of your technology trek, personally. How have the things you did before informed what you’re doing now? Sort of broad thinking.
ZHANG: Thanks, Gretchen. You know, I’ve been very lucky to work across a number of roles that I’ve had deep passion for and, you know, very lucky to work with amazing colleagues, both inside of Microsoft and outside of Microsoft, so it’s definitely not something by plan, so it’s more that just following my own passions has led me down this path, for better or for worse. [LAUGHTER] Um, so I mean, my career starts in software engineering. So I worked in, uh, Windows application development, looking at data warehousing, developing software for biomedical applications, and I think there was a point where, you know, I, I really loved software architecture and writing code, and I, I wanted to get to the why and the what of what we build that would then lead to how we build it. So I wanted to get to a discipline where I could contribute to why and what we build.
HUIZINGA: Yeah.
ZHANG: And that led me on a journey to really focus in on user experience and design, to say, hey, why do we build things? Why do we build a piece of software? Why do we build a tool? It’s really to aid people in whatever tasks that they’re doing. And so that user experience, that why, is, is going to be so important. Um, and so I went off and did a master’s in design in Italy and then pursued this intersection of user research, user experience, and technology.
HUIZINGA: A master’s degree in Italy … That just sounds idyllic. [LAUGHS] So as you see those things building and following your passions and now you’re here, do you find that it has informed the way you see things in your current role, and, and how might that play out? Maybe even one example of how you see something you did before just found its way into, hey, this is what’s happening now; that, that fits, that connects.
ZHANG: You know, I think in this role and also being a team leader and helping to empower a team of technologists and designers to do their best work in this innovation space, it’s kind of tough. You know, it … sometimes I find it’s really difficult to wear many hats at the same time. So when I’m looking at a problem – and although I have experience in writing software, doing user research, doing user experience design – it’s really hard to bring all of those things together into a singular conversation. So I’m either looking at a problem purely through the coding, purely through the technology development, or purely through the user research. So I haven’t, I haven’t actually figured out a way to integrate all of those things. So when I was in the UK, when I was more working with a smaller team and, uh, really driving the innovation on my own, it was probably easier to, uh, to bring everything together, but then I’m only making singular things. So, for example, you know, in the UK, I developed a watch that helped a young woman called Emma to overcome some of the tremor symptoms she has from her Parkinson’s disease, and I developed some software that helped a young woman – her name is Aman – who had memory loss, and the software helped her be able to listen to her classes – she was in high school – to listen to her classes, record notes, be able to reflect back on her notes. So innovation really comes in many forms: at the individual level, at the group level, at the society level. And I find it just really valuable to gain experience across a spectrum of design and technology, and in order to really make change at scale, I think the work is really, hey, how do we empower a whole team? How do we empower a whole village to work on this together? And that is a, a very unique skill set that I’m still on a journey to really grasp and, and, and learn together with all my colleagues.
HUIZINGA: Yeah. Well, let’s talk about gaming more specifically now and what’s going on in your world, and I’ll, I’ll start by saying that much of our public experience with AI began with games, where we’d see headlines like “AI Beats the World Chess Champion” or the World Go Champion or even, um, video games like Ms. Pac-Man. So in a sense, we’ve been conditioned to see AI as something sort of adversarial, and you’ve even said that, that it’s been a conceptual envisioning of it. But do you see a change in focus on the horizon, where we might see AI as less adversary and more collaborator, and what factors might facilitate that shift?
ZHANG: I think we could have a whole conversation about all the different aspects of popular culture that have made artificial intelligence exciting and personified. So I can think of movies like WarGames or Short Circuit, just like really fun explorations into what might happen if a computer program gained some expanded intelligence. So, yes, we are … I think we are primed, and I think this speaks to some core of the human psyche that we love toying with these ideas of these personalities that we don’t totally understand. I mean, I think the same applies for alien movies. You know, we have an alien and it is completely foreign! It has an intelligence that we don’t understand. And sometimes we do project that these foreign personalities might be dangerous in some way. I also think that machine learning/AI research comes in leaps and bounds by being able to define state-of-the-art benchmarks that everyone rallies around and is either trying to replicate or beat. So the establishment of human performance benchmarks, where we take a model and say it can perform this much better than the human benchmark, um, is a way for us to measure the progress of these models. So I think the establishment of these benchmarks has been really important in progressing AI research. Now we see these foundation models being able to be generalized across many different tasks and performing well at many different tasks, and we are entering this phase where we transition from making the technology to building the tools. So the technology was all about benchmarks: how do we just get the technology there to do the thing that we need it to do? To now: how do we mold the technology into tools that actually assist us in our daily lives? And I think this is where we see more HCI researchers, more designers, bringing their perspectives into the tool building, and we will see this transition from beating human benchmarks to assistive tools – copilots – that are going to aid us in work and in play.
HUIZINGA: Yeah. Well, you have a rich history with gaming, both personally and professionally. Let me just ask, when did you start being a gamer?
ZHANG: Oh my goodness. Yeah, I actually prefer the term “player” because I feel like “gamer” has a lot of baggage in popular culture …
HUIZINGA: I like that. It’s probably true. When did you start gaming?
ZHANG: I have very fond memories of playing Minesweeper on my dad’s university PC. And so it really started very early on for me. It’s funny. So I was an only child, so I spent a lot of time on my own. [LAUGHS]
HUIZINGA: I have one!
ZHANG: And one of the first hobbies I picked up was, uh, programming on our home PC, programming in BASIC.
HUIZINGA: Right …
ZHANG: And, you know, it’s very … it’s a simple language. I think I was about 11 or 12, and I think the first programs that I wrote were trying to replicate music or trying to be simple games, because that’s a way for, you know, me as a kid, to see some exciting things happening on the screen.
HUIZINGA: Right …
ZHANG: Um, so I’d say that that was when I was trying to make games with BASIC on a PC and also just playing these early games like Decathlon or King’s Quest. So I’ve always loved gaming. I probably have more affinity to, you know, games of my, my childhood. But yeah, so it started very early on, um, and then progressed; I’d say I probably played a little too much Minecraft in university, with many sleepless nights. That was probably not good. Um, and then I think it has transitioned into mobile games like Candy Crush, just quick sessions on a commute or something. And that’s why I prefer the term “player,” because there’s, there’s billions of people in the world who play games, and I don’t think they think of themselves as gamers.
HUIZINGA: Right.
ZHANG: But you know, if you’re playing Solitaire, Candy Crush, you are playing video games.
HUIZINGA: Well, that was a preface to the, to the second part of the question, which is something you said that really piqued my interest. That video games are an interactive art form expressed through human creativity. But let’s get back to this gaming AI thing and, and talk a little bit about how you see creativity being augmented with AI as a collaborative tool. What’s up in that world?
ZHANG: Right. Yeah, I, I, I really fundamentally believe that, you know, video games are an experience. They’re a way for players to enter a whole new world, to have new experiences, to meet people in all walks of life and cultures that they’ve not met before; either they are fictional characters or other players in different parts of the world. So being able to help game creators express their creativity through technology is fundamental to what we do at Xbox. With AI, there’s this just incredible potential for that assistance in creativity to be expanded and accelerated. For example, you know, in 1950, Claude Shannon, who was the inventor of information theory, wrote a paper about computer chess. And this is a seminal paper where he outlined what a modern computer chess program could be – before there was really wide proliferation of computers – and in it, he argued that there were innate advantages that humans had that a computer program could never achieve. Things like, humans have imagination; humans have flexibility and adaptability. Humans can learn on the fly. And I think in the last seven decades, we’ve now seen that AI is exhibiting potential for imagination and flexibility. And imagination is something that generative AI is starting to demonstrate to us, but purely in terms of being able to express something that we ask of it. So I think everybody has this experience of, hey, I’d really like to create this. I’d really like to express this. But I can’t draw. [LAUGHTER] I want to take that photo, but why do my photos look so bad? Your camera is so much better than mine! And giving voice to that creativity is, I think, the strength of generative AI. So developing these tools that aid that creativity will be key to bringing along the next generation of game creators.
HUIZINGA: So do you think we’re going to find ourselves in a place where we can accept a machine as a collaborator? I mean, I know that this is a theme, even in Microsoft Research, where we look at it as augmenting, not replacing, and collaborating, you know, not canceling. But there’s a shift for me in thinking of tools as collaborators. Do you feel like there’s a bridge that needs to be crossed for people to accept collaboration, or do they … or do you feel like it just is so much cooler what we can do with this tool that we’re all going to discover this is a great thing?
ZHANG: I think, in many ways, we already use tools as collaborators, uh, whether that’s a hammer, whether that’s Microsoft Word. I mean, I love the spell-check feature! Oh, geez! Um, so we already use tools to help us do our work – to make us work faster, type faster, um, check our grammar, check our spelling. This is the next step change, in that with these tools, the collaboration is going to be more complex, more sophisticated. I am somebody who welcomes that because I have so much work and need to get it done. At the same time, I really advocate for us, as a society, as our community, to have an open dialogue about it.
HUIZINGA: Yeah.
ZHANG: Because we should be talking about, what is it to be human, and how do we want to frame our work and our values moving forward given these new assistive tools? You know, when we talk about art, the assistance provided by generative AI will allow us to express, through words, what we want and then to have those words appear as an image on the page, and I think this will really challenge the art world to push artists beyond perhaps what we think of art today to new heights.
HUIZINGA: Yeah. Yeah. There’s a whole host of questions and issues and, um, concerns even that we’re going to face. And I think it may be a, like you say, a step change in even how we conceptualize what art … I mean, even now, Instagram filters, I mean, or Photoshop or any of the things that you could say, well, that’s not really the photo you took. But, um, well, listen, we’ve lived for a long time in an age of what I would call specialization or expertise, and we, we’ve embraced that, you know, talk to the experts, appeal to the experts. But the current zeitgeist in research is multidisciplinary, bringing many voices into the conversation. And even in video games, it’s multiplayer, multimodal. So talk for a minute about the importance of the prefix “multi” now and how more voices are better for collaborative innovation.
ZHANG: So I want to start with the word “culture,” because I fundamentally believe that when we have a team building an experience for our users and players, that team should reflect the diversity of those users and players, whether that’s diversity in terms of different abilities, in terms of cultural background. Teams that build these new experiences need to be inclusive, need to have many, many voices. So I want to start the “multi” discussion there, and I encourage every team building products to think about that diversity and to move to recruit towards diversity. Then, let’s talk about the technology: multi-modality. So games are a very rich medium. They combine 3D, 2D, behaviors, interactions, music, sounds, dialogue. And it’s only when these different modalities come together and converge and, and work in perfect synchronicity that you get that amazing immersive experience. And we’ve seen foundation models do great things with text, do amazing things with 2D, some 3D now we’re seeing, and this is what I’m trying to push, that we need that multi-modality in generative AI, and multi-modality in terms of bringing these different mediums together. Multi-disciplinary, you know, what I find interesting is that, um, foundation models, LLMs like GPT, are at the height of democratized technology. Writing natural language as a prompt to generate something is probably the simplest form of programming.
HUIZINGA: Oh interesting.
ZHANG: It does not get simpler than I literally write what I want, and the thing appears.
HUIZINGA: Wow.
ZHANG: So you can imagine that everybody in the world is going to be empowered with AI. And going from an idea, to defining that idea through natural language, to that idea becoming into reality, whether that’s it made an app for me, it made an image for me … so when this is happening, “multidisciplinary” is going to be the natural outcome of this, that a designer’s going to be able to make products. A coder is going to be able to make user experience. A user researcher is going to go from interviewing people to showing people prototypes with very little gap in order to further explore their topic. So I think we will get multidisciplinary because the technology will be democratized.
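Zhang’s point about prompts being the simplest form of programming can be made concrete with a small, purely illustrative sketch: the same intent expressed once as conventional code that spells out how to produce a result, and once as a natural-language prompt that only states what is wanted. The generate_image function below is a hypothetical stand-in for a text-to-image model, not any specific product’s API.

```python
# Illustration of "prompts as the simplest form of programming": the same intent,
# expressed once as conventional code and once as a natural-language prompt handed
# to a generative model. generate_image() is a hypothetical placeholder, not a
# real product's interface.

def draw_blue_circle_traditionally() -> str:
    # Conventional programming: the author spells out *how* to produce the result,
    # here by hand-building an SVG image.
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="256" height="256">'
        '<rect width="256" height="256" fill="white"/>'
        '<circle cx="128" cy="128" r="64" fill="blue"/>'
        "</svg>"
    )

def generate_image(prompt: str) -> str:
    # Placeholder so the sketch runs without a real model; a real system would
    # call a text-to-image model here.
    return f"<image generated from prompt: {prompt!r}>"

def draw_blue_circle_with_a_prompt() -> str:
    # Prompt-as-program: the author states *what* they want in plain language
    # and lets the model decide how.
    return generate_image("A simple flat illustration of a blue circle centered on a white background.")

if __name__ == "__main__":
    print(draw_blue_circle_traditionally())
    print(draw_blue_circle_with_a_prompt())
```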
HUIZINGA: Right. You know, as you talk, I’m, I’m thinking “prompts as programming” is a fascinating … I never thought of it that way, but that’s exactly it. And you think about the layers of, um, barriers to entry in technology, that this has democratized those entry points for people who say I can never learn to code. Um, but if you can talk or type, you can! [LAUGHS] So that’s really cool. So we don’t have time to cover all the amazing scientists and collaborators working on specific projects in AI and gaming, but it would be cool if you’d give us a little survey course on some of the interesting, uh, research that’s being done in this area. Can you just name one or two or maybe three things that you think are really cool and interesting that are, um, happening in this world?
ZHANG: Well, I definitely want to give kudos to all of my amazing research collaborators working with large language models, with other machine learning approaches like reinforcement learning, imitation learning. You know, we know that in a video game, the dialogue and the story is key. And, you know, I work with an amazing research team with, uh, Bill Dolan, um, Sudha Rao, who are leading the way in natural language processing and looking at grounded conversational players in games. So how do we bring a large language model like GPT into a game? How do we make it fun? How do we actually tell a story? You know, as I said, technology is becoming democratized, so it’s going to be easy to put GPT into games, but ultimately, how do we make that into an experience that’s really valuable for our players? On the reinforcement learning/imitation learning front, we’ve talked before about Project Paidia. How do we develop new kinds of AI agents that play games, that can help test games, that can really make games more fun by playing those games like real human players? That is our ultimate goal.
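As a rough illustration of what a “grounded conversational player” could look like in code, the sketch below conditions an NPC’s reply on the live game state as well as the recent dialogue, which is the grounding idea Zhang describes. It is a minimal, assumption-laden sketch, not the research team’s implementation; llm_complete is a hypothetical placeholder for whatever language-model API a game would actually call, and the names GameState, NPC, and Mira are invented for the example.

```python
# Minimal sketch of a "grounded" conversational NPC: the character's replies are
# conditioned on live game state, not just the player's last message.
# llm_complete() is a hypothetical stand-in for a real LLM API.

from dataclasses import dataclass, field

@dataclass
class GameState:
    location: str
    quest_stage: str
    inventory: list = field(default_factory=list)

@dataclass
class NPC:
    name: str
    persona: str                          # fixed character description the model must stay in
    history: list = field(default_factory=list)

    def reply(self, player_utterance: str, state: GameState) -> str:
        # Ground the model: persona + current world state + recent dialogue.
        prompt = (
            f"You are {self.name}, {self.persona}\n"
            f"World state: location={state.location}, quest={state.quest_stage}, "
            f"player inventory={state.inventory}\n"
            f"Recent dialogue: {self.history[-6:]}\n"
            f"Player says: {player_utterance}\n"
            f"{self.name} replies (stay in character; only mention things present in the world state):"
        )
        text = llm_complete(prompt)       # hypothetical LLM call
        self.history.append((player_utterance, text))
        return text

def llm_complete(prompt: str) -> str:
    # Placeholder so the sketch runs without a real model.
    return "Ah, traveller, the bridge you seek lies beyond the eastern gate."

if __name__ == "__main__":
    state = GameState(location="village square", quest_stage="find the bridge", inventory=["lantern"])
    guide = NPC(name="Mira", persona="a retired cartographer who guides newcomers.")
    print(guide.reply("Where should I go next?", state))
```

The design choice the sketch highlights is the one Zhang raises: putting an LLM in a game is easy, but constraining it with the world state so the conversation stays consistent with the game is where the experience is made or broken.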
HUIZINGA: You know, again, every time you talk, something comes into my head, and I’m thinking GPTs for NPCs. [LAUGHS]
ZHANG: I like that.
HUIZINGA: Um, and I have to say I’m, I’m not a game player. I’m not the target market. But I did watch that movie Free Guy and got exposure to the non … what is it called?
ZHANG: Non-player characters.
HUIZINGA: That’s the one. Ryan Reynolds plays that, and that was a really fun experience. Well, I want to come back to the multi-modality dimension of the technical aspects of gaming and AI’s role in helping make animated characters more life-like. And that’s key for a lot of gaming companies is, how real does it feel? So what different skills and expertise go into creating a world where players can maintain their sense of immersion and avoid the uncanny valley? Who are you looking to collaborate with to make that happen?
ZHANG: Right. So the uncanny valley is this space where you are creating a virtual character and you bring together animation, whether it’s facial animation or body movements, with sound, with their voice, with eye movement, with how they interact with the player, and the human brain has an ability to subconsciously recognize another human being. And when you play with that deep-seated instinct and you create a virtual character, but the virtual character kind of slightly doesn’t move in the right way, their eyes don’t blink in the right way, they don’t talk in the right way, it, it triggers, in the deep part of someone’s brain, a discomfort. And this is what we call the uncanny valley. And there are many games that we know are stylized worlds and are fictional, so we try to get the characters to a place where the player knows that it’s not a real person, but they’re happy to be immersed in this environment. And as animation rendering improves, we have the ability to get to a wholly life-like character. So how do we get there? You know, we’re working with animators. We want to bring in neuroscientists, game developers, user researchers. There’s a lot of great work happening across the games industry in machine learning research, looking at things like motion matching, helping to create different kinds of animations that are more human-like. And it is this bringing together of artistry and technology development that’s going to get us there.
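Motion matching, which Zhang mentions, is at its core a nearest-neighbor search over a database of captured animation frames: at each frame the system picks the database frame whose features (current pose plus desired trajectory) best match the character’s state and continues playback from there. The sketch below shows only that core idea under simplifying assumptions; production systems add blending, tags, and acceleration structures, and the 32-dimensional random features here are invented purely for illustration.

```python
# Minimal sketch of the core idea behind motion matching: brute-force nearest-neighbor
# search over a database of animation-frame features. Illustrative only, not a
# production implementation.

import numpy as np

class MotionDatabase:
    def __init__(self, features: np.ndarray):
        # features[i] = feature vector for frame i, e.g. joint positions/velocities
        # concatenated with a few future root-trajectory samples.
        self.features = features
        self.mean = features.mean(axis=0)
        self.std = features.std(axis=0) + 1e-8   # avoid division by zero

    def best_frame(self, query: np.ndarray) -> int:
        # Normalize so no single feature dominates, then pick the closest frame.
        q = (query - self.mean) / self.std
        db = (self.features - self.mean) / self.std
        distances = np.linalg.norm(db - q, axis=1)
        return int(np.argmin(distances))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    db = MotionDatabase(rng.normal(size=(10_000, 32)))   # 10k mocap frames, 32-D features
    current_pose_and_goal = rng.normal(size=32)          # built from live game state each frame
    frame = db.best_frame(current_pose_and_goal)
    print(f"Continue animation playback from database frame {frame}")
```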
HUIZINGA: Yeah. Yeah, yeah. Well, speaking of the uncanny valley – and I’ve experienced that on several animated movies where they … it’s just like, eww, that’s weird – um, and we, we have a quest to avoid it. We, we want them to be more real. And I guess what we’re aiming for is to get to the point in gaming where the experience is so realistic that you can’t tell the difference between the character in the game and humans. So I have to ask, and assume you’ve given some thought to it, what could possibly go wrong if indeed you get everything right?
ZHANG: So one thing is that video games are definitely a stylized visual art, so we also welcome game creators who want to create a completely cartoon universe, right?
HUIZINGA: Oh, interesting, yeah.
ZHANG: But for those creators that want to make that life-like visual experience, we want to have the technology ready. In terms of, hey, as you ask, what could possibly go wrong if indeed we got everything right, I think that throughout human history, we have a rich legacy of exploring new ideas and challenging topics through fiction. For example, the novels of Asimov, looking at robots and, and the laws of robotics, I believe that we can think of video games as another realm of fiction where we might be able to explore these ideas, where you can enter a world and say, hey, what if something went wrong when you have these life-like agents? It’s a safe place for players to be able to have those thoughts and experiences and to explore the different outcomes that might happen.
HUIZINGA: Yeah.
ZHANG: I also want to say that I think the bar for immersion might also move over time. So when you say, what could possibly go wrong if we got everything right? Well, it’s still a video game.
HUIZINGA: Yeah.
ZHANG: And it might look real, but it’s still in a game world. And I think once we experience that, the bar might shift. So for example, when the first black-and-white movies, uh, came out and people saw movies for the first time, I remember seeing a documentary where one of the first movies was somebody filming a train, a steam train, coming towards the camera, and the audience watching that jumped! They ran away because they thought there was a steam train coming at them. And I think since then, people have understood that that is not happening. But I think this bar of immersion that people have will move over time because ultimately it is not real. It is in a digital environment.
HUIZINGA: Right. Well, and that bar finds itself in many different milieus, as it were. Um, you know, radio came out with War of the Worlds, and everyone thought we were being invaded by aliens, but we weren’t. It was just fiction. Um, and we’re also dealing with both satire and misinformation in non-game places, so it’s up to humans, it sounds like, to sort of make the differentiation and adapt, which we’ve done for a long time.
ZHANG: I totally agree. And I think this podcast is a great starting point for us to have this conversation in society about topics like AGI, uh, topics like, hey, how is AI going to assist us and be copilots for us? We should be having more of that discussion.
HUIZINGA: And even specifically to gaming I think one of the things that I hope people will come away with is that we’re thinking deeply about the experiences. We want them to be good experiences, but we also want people to not become, you know, fooled by it and so on. So these are big topics, and again, we won’t solve anything here, but I’m glad you’re thinking about it. So people have called gaming — and sometimes gamers — a lot of things, but rarely do you hear words like welcoming, safe, friendly, diverse, fun, inviting. And yet Xbox has worked really hard to kind of earn a reputation for inclusivity and accessibility and make gaming for everyone. Now I sound like I’m doing a commercial for Xbox, and I’m really not. I think this has been something that’s been foregrounded in the conversation in Xbox. So talk about some of the things you’ve done to, as you put it, work with the community rather than just for the community.
ZHANG: Right. I mean, we estimate there are 3 billion players in the world, so gaming has really become democratized. You can access it on your phone, on your computer, on your console, on your TV directly. And that means that we have to be thinking about, how do we make gaming for everybody? For every type of player? We believe that AI has this incredible ability to make gaming more fun for more people. You know, when we think about games, hey, how do we make gaming more adaptive, more personalized to every player? If I find this game is too hard for me, maybe the game can adapt to my needs. Or if I have different abilities or if I have disabilities that prohibit my access to the game, maybe the game can change to allow me to play with my friends. So this is an area that we are actively exploring. And, you know, even without AI, Xbox has this rich history of thinking about players with disabilities and how we can bring features to allow more people to play. So for example, recently Forza racing created a feature to allow people with sight impairment to be able to drive in the game by using sound. So they introduced new 3D sounds into the game, where someone who cannot see or can only partially see the screen can actually hear the hairpin turns in the road in order to drive their car and play alongside their family and friends.
HUIZINGA: Right.
ZHANG: We’ve also done things like speech-to-text in Xbox Party Chat. How do we allow somebody who has a disability to be able to communicate across countries, across cultures, across languages with other players? And we are taking someone’s spoken voice, turning it into text chat so that everybody within that party can understand them and can be able to communicate with them.
HUIZINGA: Right. That’s interesting. So across countries, you could have people that don’t speak the same language be able to play together …
ZHANG: Right. Exactly.
HUIZINGA: …and communicate.
ZHANG: Yeah.
HUIZINGA: Wow. Well, one of the themes on this show is the path that technologies travel from mind to market, or lab to life, as I like to say. And we talk about the spectrum of work from blue-sky ideas to blockbuster products used by millions. But Xbox is no twinkle in anyone’s eye. It’s been around for a long time. Um, even so, there’s always new ideas that are making their way into established products and literally change the game, right? So without giving away any industry secrets, is there anything you can talk about that would give us a hint as to what might be coming soon to a console or a headset or a device near us?
ZHANG: Oh my goodness. You know I can’t give away any product secrets!
HUIZINGA: Dang it! I thought I would get you!
ZHANG: Um, I mean, I’m excited by our ability to bring more life-like visuals, more life-like behavior, allowing players to play games anywhere at a fidelity they’ve not seen before. I mean, these are all exciting futures for AI. At the same time, generative AI capabilities to really accelerate and empower game developers to make games at higher quality, at a faster rate, these are the things that the industry wants. How do we turn these models, these technologies, into real tools for game developers?
HUIZINGA: Yeah. I’m only thinking of players, and you’ve got this whole spectrum of potential participants.
ZHANG: I think the players benefit when the creators are empowered. It starts with the creators and helping them bring to life their games that the players ultimately experience.
HUIZINGA: Right. And even as you talk, I’m thinking there’s not a wide gap between creator and player. I mean, many of the creators are avid gamers themselves and …
ZHANG: And now when we see games like Minecraft or Roblox, where the tools of creativity are being brought to the players themselves and they can create their own experiences …
HUIZINGA: Yeah.
ZHANG: …I want to see more of those in the world, as well.
HUIZINGA: Exciting. Well, as we close, and I hate to close with you because you’re so much fun, um, I’d like to give you a chance to give an elevator pitch for your preferred future. I keep using that term, but I know we all think in terms of what, what we’d like to make in this world to make a mark for the next generation. You’ve already referred to that in this podcast. So we can close the circle, and I’ll ask you to do a bit of blue-sky envisioning again, back to your roots in your old life. What do video games look like in the future, and how would you like to have changed the gaming landscape with brilliant human minds and AI collaborators?
ZHANG: Oh my goodness. That’s such a tall order! Uh, I think ultimately my hope for the future is that we really explore our humanity and become even more human through the assistance of AI. And in gaming, how do we help people tell more human stories? How do we enable people to create stories, share those stories with others? Because ultimately, I think, since the dawn of human existence, we’ve been about storytelling. Sitting around a fire, drawing pictures on a cave wall, and now we are still there. How do we bring to life more stories? Because through these stories, we develop empathy for other people. We experience other lives, and that’s ultimately what’s going to make us better as a society, that empathy we develop, the mutual understanding and respect that we share. And I see AI as a tool to get us there.
HUIZINGA: From cave wall to console … Haiyan Zhang, it’s always a pleasure to talk to you. Thank you for joining us today.
ZHANG: Thank you so much, Gretchen. Thank you.
The post Collaborators: Gaming AI with Haiyan Zhang appeared first on Microsoft Research.
Custom instructions for ChatGPT
We’re rolling out custom instructions to give you more control over how ChatGPT responds. Set your preferences, and ChatGPT will keep them in mind for all future conversations.OpenAI Blog
Apple Natural Language Understanding Workshop 2023
Apple Machine Learning Research
Google DeepMind’s latest research at ICML 2023
Google DeepMind researchers are presenting more than 80 new papers at the 40th International Conference on Machine Learning (ICML 2023), taking place 23-29 July in Honolulu, Hawai’i.Read More