Abstracts: July 18, 2024

Microsoft Research Podcast - Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Senior Researcher Arindam Mitra joins host Gretchen Huizinga to discuss “AgentInstruct: Toward Generative Teaching with Agentic Flows.” In their paper, Mitra and his coauthors introduce an automated multi-agent framework for creating diverse, high-quality synthetic data at scale for language model post-training. In contrast to methods that create data from a seed set of existing prompts and responses, AgentInstruct uses raw data and specifications provided by model builders. The work—which post-trains a model, Orca-3, on AgentInstruct-generated data—is part of project Orca. Orca aims to develop techniques for creating small language models that can perform as well as large language models. Like Orca-3, the earlier Orca, Orca-2, and Orca-Math models show the effectiveness of leveraging synthetic data in training. 

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

I’m here today with Dr. Arindam Mitra, a senior researcher at Microsoft Research and the lead researcher for Microsoft’s Orca project. Dr. Mitra is coauthor of a paper called “AgentInstruct: Toward Generative Teaching with Agentic Flows.” Arindam, it’s a pleasure to have you on Abstracts today.

ARINDAM MITRA: Thank you, Gretchen.

HUIZINGA: So let’s start with a brief overview of your paper. What problem does your research address, and why does it matter?


MITRA: So the post-training phase is very important for language models. You can really improve the model a lot by creating high-quality synthetic data. The problem is, however, though, high-quality synthetic data creation requires lots of human effort and expertise. The problem that we’re trying to tackle is, how do you reduce human effort? How can you create high-quality data with really low amount of human effort? When you have a language model and, let’s say, you want to apply it somewhere, you might have to train a generic model before. Which could be small or big. Doesn’t matter. After that, you can specialize it on the domain that you are looking for, and when you want to do that—to make it really fast, this particular process—it’s best if you go for synthetic data. If you have a way to, actually, generate very high-quality synthetic data, you can fast-track this part of specialization process. Not only single model. So this year, you’re going to see a lot more multi-agent models. And when you are trying to build these multi-agent models, you’re fearing like, OK, it might increase the cost too much, the latency too much. So it’s also very much important that you have a multi-agent system and you can, sort of, replace some of those agents with specialized small models. And when you’re trying to address these goals, you want this process to be something which you know works fast. So that’s why we are trying to make sure we have a very good way to create synthetic data for your specific need.

HUIZINGA: No research exists in a vacuum, and most of it fills some kind of a gap. So tell us what’s already been done in this field and how this work is building on it.

MITRA: So previously, actually, we have seen that in post-training, the more data you have, the better the performance goes for the model you’re training. So what we wanted to test is how much we can scale and what happens if we scale a lot and lot. But we didn’t have the tools for it. So the other approaches people previously used was you had a small set of data and how do we expand this dataset into much larger and larger amount of data. That’s where people were mostly focusing. But it’s not that easy to create that initial seed set. [LAUGHTER] You need to be very expert. The way that we’re doing is, actually, rather you define what you want to create. Like, OK, you want to create tool-use data. So you say, OK, I have a bunch of tools, and I am looking for data in the scenarios where someone can just come give me a description and then maybe that person interact with the AI to figure out how to get the job done. It’s not a one-step thing. And maybe you also have a setting where it’s more like an app developer. You have a bunch of APIs in your phone. You just want to figure out which one is best for the user request, which came through voice command. So different scenarios could be there. So what we’re saying [is], OK, we are not going through the method where you have to come up with your initial own seed data and then we expand. It is more like you define what you want to do. It’s much more abstract. And then, we are, sort of, automating the effort of data creation. So this setting actually of synthetic data creation, we are referring [to] it as generative teaching, and that’s where we are, sort of, differing. So previously, it was more like expansion, and now we are trying from specification to the data that you need.

HUIZINGA: Gotcha. Well, talk a little bit more about your methodology and how you went about conducting this research.

MITRA: So first of all, what we are proposing actually is a multi-agent solution. So you start with first describing what you really need. So you describe in detail, like, I need data for this specific skill or this specific scenario. Then, what we do is like, OK, you have some unstructured data or raw data like text documents or code files that you gather from web with permissible license or use something that you own. We don’t care much about what the content is really. So it’s more like we got some random stuff, some random content. And then we’ll guide you how to convert this random something which is not meaningful for you into something which is meaningful for your data creation. For example, like, if you are creating data to teach how to use APIs, you might think about, you need lots of APIs and how do you get these APIs. So what we are saying is, like, we can take something like code and we’ll have agents which will convert these raw code files into list of APIs which is more like a library. So you create automatically this input that is very meaningful for data creation. And then once we have that, we have basically the seed instruction creation step based on your specification. Like, what do you want to create data for? So you have all these different scenarios, and we have multiple agents creating data for different scenarios. And then the last step is actually what we call refinement step. So it’s more like whatever data you created, we’ll go through them and we’ll make them better and better—improve the quality, improve the complexity, improve the trickiness, we’ll teach when not to answer, etc., etc. So make sure we cover the whole space. So by changing the stochastic seed, we are trying to cover the entire possible data space.

HUIZINGA: Right.

MITRA: So that’s the key thing. The way we, sort of, conducted this research is actually we defined 17 skills. Skills meaning reading comprehension, tool use, text modification, content creation, RAG (retrieval-augmented generation) … we have, like, list of 17 skills … conversation … and then we created one multi-agent flow for each of the skills and we generate data. So one key thing I want to highlight is, like, this work, compared to other work, it was not benchmark driven. We want to teach a skill. We don’t care which benchmarks we’re trying to evaluate it on. So we define the skill, like tool use means this to us, reading comprehension means this to us, text modification means this to us. And then we, sort of, generate the data to teach everything for that skill. And then what we did, we created actually 22 million instructions. And we had previously in Orca series, we had 3 million, around, instructions. So the 25 million is what we, sort of, have at the end. And that’s where we actually trained a Mistral model as of now. And we’re going to measure, like, how much we improve the Mistral model by this post-training.

HUIZINGA: Moving from methods to findings, I always look forward to the part of the research paper that finishes the sentence “and what we found was … ,” so give us a quick overview of your results. What did you find?

MITRA: Yes, so the results were actually very exciting for us. So Mistral 7B was our main, sort of, baseline because that’s where we’re trying to showcase, like, how much improvement we are getting. On the other side, we have, like, frontier models—ChatGPT, GPT-4. We want to also measure how far we are from those frontier models, so that’s, sort of, our evaluation setup. So on average actually, we got like 20 percent performance gain over the Mistral, and we evaluated that across 14 benchmarks that test reasoning, content creation, instruction following, format following, etc. But what was more important to us was to do a skill-specific evaluation because we are trying to teach certain skills, and we had, like, 17 skills as we mentioned earlier. So, for example, like, if you are focusing on reading comprehension as a skill, we took LSAT, SAT, and DROP, and many other benchmarks; we created a collection of reading comprehension-based benchmark. And there, we are observing, like, 20 percent improvement over Mistral, and what it means, like, we’re actually achieving GPT-4–level performance. Similarly, if I’m focusing on math skill, there are many datasets which test, like, elementary math, high school math, college-level math. And we improved actually across all these different levels of math. So we see from 40 percent to 150 percent of improvement on different benchmarks of math. So it was more like what we wanted to see. We’re not optimizing for a particular benchmark. We wanted to optimize the skill, and that’s what you’re observing. So you’re observing improvement in math across all these levels, from elementary to high school to college to middle school, etc., everything. The same goes for RAG, as well. We’re observing on RAG skill 92 percent, around, improvement over Mistral. The format following numbers are pretty interesting to us. So format following is very important for SLMs (small language models). You want to make these models practical. You want to make sure that they follow the format so you can parse the result. And we were able to take Mistral beyond Gemini Pro. So that was a very strong performance from the post-training that we did. For summarization, actually we were able to reduce the hallucination rate by 31 percent while achieving the GPT-4–level quality. So overall, all these results were, sort of, highlighting that the methodology that we have, which we’re calling AgentInstruct, is very promising.

HUIZINGA: I think it’s important to get practical and talk about real-world impact. So tell us who you think this research will benefit most and why.

MITRA: Yeah, so again the model builders will, sort of, find it most beneficial. So the significance of our work actually lies in the way we are trying to revolutionize the language model development through scalable, low-effort synthetic creation. And the scalable and low effort is, sort of, the key thing, right. We have shown that we can create very high-quality data. That’s what the numbers are telling us. We want to mention that this is very scalable and low effort, and that’s what we think might help the most for model builders.

HUIZINGA: So, Arindam, let’s borrow a phrase from the machine learning lexicon and go for a little one-shot learning here: if you had to boil down why your work is important, what’s the one thing you want our listeners to take away from this research?

MITRA: The key takeaway would be, like, the AgentInstruct method enables the generation of vast, diverse, and high-quality synthetic data with very minimal human input. So that’s one thing I would like to remember from this paper.

HUIZINGA: So as we close, talk briefly about the limitations that you encountered in this project and directions for future research. What are the outstanding challenges in this field, and what’s on your research agenda to overcome them?

MITRA: Yes, so we’re exploring further automation. But apart from making this data creation more automated and less human involvement needed, we’re trying to focus on two other aspects. One is automated model debugging, and the other is automated model repairing. So now that we have the ability to generate data for a particular skill, let’s say math, for model debugging, what we need is basically an error handler. Like something we can plug in which takes the question and the answer coming from a different model and verifies if the answer is correct or not. So that’s the part we’re working on right now, figuring out this error handler. And the second aspect is repairing. So once we have the error, we figure out, OK, this is where the model is struggling. How can we give feedback or how can we give more knowledge so it can basically correct those errors? So those are some things we’re working on right now.

[MUSIC PLAYS]

HUIZINGA: Well, Arindam Mitra, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts, or you can find a preprint on arXiv. See you next time on Abstracts!

[MUSIC FADES]


Samplable Anonymous Aggregation for Private Federated Data Analytics

We revisit the problem of designing scalable protocols for private statistics and private federated learning when each device holds its private data. Locally differentially private algorithms require little trust but are (provably) limited in their utility. Centrally differentially private algorithms can allow significantly better utility but require a trusted curator. This gap has led to significant interest in the design and implementation of simple cryptographic primitives that can allow central-like utility guarantees without having to trust a central server.
Our first contribution is to…

Apple Machine Learning Research

How Deloitte Italy built a digital payments fraud detection solution using quantum machine learning and Amazon Braket

As digital commerce expands, fraud detection has become critical in protecting businesses and consumers engaging in online transactions. Implementing machine learning (ML) algorithms enables real-time analysis of high-volume transactional data to rapidly identify fraudulent activity. This advanced capability helps mitigate financial risks and safeguard customer privacy within expanding digital markets.

Deloitte is a strategic global systems integrator with over 19,000 certified AWS practitioners across the globe. It continues to raise the bar through participation in the AWS Competency Program with 29 competencies, including Machine Learning.

This post demonstrates the potential for quantum computing algorithms paired with ML models to revolutionize fraud detection within digital payment platforms. We share how Deloitte built a hybrid quantum neural network solution with Amazon Braket to demonstrate the possible gains coming from this emerging technology.

The promise of quantum computing

Quantum computers harbor the potential to radically overhaul financial systems, enabling much faster and more precise solutions. Compared to classical computers, quantum computers are expected in the long run to have advantages in the areas of simulation, optimization, and ML. Whether quantum computers can provide a meaningful speedup to ML is an active topic of research.

Quantum computing can perform efficient near real-time simulations in critical areas such as pricing and risk management. Optimization models are key activities in financial institutions, aimed at determining the best investment strategy for a portfolio of assets, allocating capital, or achieving productivity improvements. Some of these optimization problems are nearly impossible for traditional computers to tackle, so approximations are used to solve the problems in a reasonable amount of time. Quantum computers could perform faster and more accurate optimizations without using any approximations.

Despite the long-term horizon, the potentially disruptive nature of this technology means that financial institutions are looking to get an early foothold in this technology by building in-house quantum research teams, expanding their existing ML COEs to include quantum computing, or engaging with partners such as Deloitte.

At this early stage, customers seek access to a choice of different quantum hardware and simulation capabilities in order to run experiments and build expertise. Braket is a fully managed quantum computing service that lets you explore quantum computing. It provides access to quantum hardware from IonQ, OQC, QuEra, Rigetti, and IQM; a variety of local and on-demand simulators, including GPU-enabled simulations; and infrastructure for running hybrid quantum-classical algorithms such as quantum ML. Braket is fully integrated with AWS services such as Amazon Simple Storage Service (Amazon S3) for data storage and AWS Identity and Access Management (IAM) for identity management, and you only pay for what you use.

In this post, we demonstrate how to implement a quantum neural network-based fraud detection solution using Braket and AWS native services. Although quantum computers can’t be used in production today, our solution provides a workflow that will seamlessly adapt and function as a plug-and-play system in the future, when commercially viable quantum devices become available.

Solution overview

The goal of this post is to explore the potential of quantum ML and present a conceptual workflow that could serve as a plug-and-play system when the technology matures. Quantum ML is still in its early stages, and this post aims to showcase the art of the possible without delving into specific security considerations. As quantum ML technology advances and becomes ready for production deployments, robust security measures will be essential. However, for now, the focus is on outlining a high-level conceptual architecture that can seamlessly adapt and function in the future when the technology is ready.

The following diagram shows the solution architecture for the implementation of a neural network-based fraud detection solution using AWS services. The solution is implemented using a hybrid quantum neural network. The neural network is built using the Keras library; the quantum component is implemented using PennyLane.

The workflow includes the following key components for inference (A–F in the diagram; items 1–6 below) and training (G–I; items 7–9):

  1. Ingestion – Real-time financial transactions are ingested through Amazon Kinesis Data Streams
  2. Preprocessing – AWS Glue streaming extract, transform, and load (ETL) jobs consume the stream to do preprocessing and light transforms
  3. Storage – Amazon S3 is used to store output artifacts
  4. Endpoint deployment – We use an Amazon SageMaker endpoint to deploy the models
  5. Analysis – Transactions along with the model inferences are stored in Amazon Redshift
  6. Data visualization – Amazon QuickSight is used to visualize the results of fraud detection
  7. Training data – Amazon S3 is used to store the training data
  8. Modeling – A Braket environment produces a model for inference
  9. Governance – Amazon CloudWatch, IAM, and AWS CloudTrail are used for observability, governance, and auditability, respectively

Dataset

For training the model, we used open source data available on Kaggle. The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset records transactions that occurred over a span of 2 days, during which there were 492 instances of fraud detected out of a total of 284,807 transactions. The dataset exhibits a significant class imbalance, with fraudulent transactions accounting for just 0.172% of the entire dataset. Because the data is highly imbalanced, various measures have been taken during data preparation and model development.

The dataset exclusively comprises numerical input variables, which have undergone a Principal Component Analysis (PCA) transformation for confidentiality reasons. In addition to these PCA components, the data includes three key fields:

  • Time – Time elapsed between each transaction and the first transaction in the dataset
  • Amount – Transaction amount
  • Class – Target variable, 1 for fraud or 0 for non-fraud

Data preparation

We split the data into training, validation, and test sets, and we define the target and feature sets, where Class is the target variable:

y_train = df_train['Class']
x_train = df_train.drop(['Class'], axis=1)
y_validation = df_validation['Class']
x_validation = df_validation.drop(['Class'], axis=1)
y_test = df_test['Class']
x_test = df_test.drop(['Class'], axis=1)
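The split that produces df_train, df_validation, and df_test isn't shown above. The following is a minimal sketch, assuming the Kaggle CSV has been loaded into a pandas DataFrame and using an illustrative, stratified 80/10/10 split so that each subset keeps a similar fraud ratio:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("creditcard.csv")  # the Kaggle credit card fraud dataset

# Hold out 20% of the data, then split it evenly into validation and test sets,
# stratifying on Class to preserve the class imbalance in each subset.
df_train, df_temp = train_test_split(df, test_size=0.2, stratify=df["Class"], random_state=42)
df_validation, df_test = train_test_split(df_temp, test_size=0.5, stratify=df_temp["Class"], random_state=42)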

The Class field assumes values 0 and 1. So that the targets match the network’s two-element output, we label encode and then one-hot encode the y sets:

lbl_clf = LabelEncoder()
y_train = lbl_clf.fit_transform(y_train)
y_train = tf.keras.utils.to_categorical(y_train)
# The same encoding is applied to y_validation and y_test before training and evaluation.

The encoding maps the values as follows: 0 to [1,0], and 1 to [0,1].

Finally, we apply scaling that standardizes the features by removing the mean and scaling to unit variance:

std_clf = StandardScaler()
x_train = std_clf.fit_transform(x_train)
# Fit the scaler on the training set only, then reuse it for the validation and test sets.
x_validation = std_clf.transform(x_validation)
x_test = std_clf.transform(x_test)

The functions LabelEncoder and StandardScaler are available in the scikit-learn Python library.

After all the transformations are applied, the dataset is ready to be the input of the neural network.

Neural network architecture

Based on several empirical tests, we composed the neural network architecture with the following layers:

  • A first dense layer with 32 nodes
  • A second dense layer with 9 nodes
  • A quantum layer as neural network output
  • Dropout layers with a rate of 0.3

We apply an L2 regularization on the first layer and both L1 and L2 regularization on the second one, to avoid overfitting. We initialize all the kernels using the he_normal function. The dropout layers are meant to reduce overfitting as well.

from tensorflow.keras.layers import Dense, Dropout

hidden = Dense(32, activation="relu", kernel_initializer='he_normal', kernel_regularizer=tf.keras.regularizers.l2(0.01))
out_2 = Dense(9, activation="relu", kernel_initializer='he_normal', kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.001, l2=0.001))
do = Dropout(0.3)

Quantum circuit

The first step to obtain the layer is to build the quantum circuit (or the quantum node). To accomplish this task, we used the Python library PennyLane.

PennyLane is an open source library that seamlessly integrates quantum computing with ML. It allows you to create and train hybrid quantum-classical models, where quantum circuits act as layers within classical neural networks. By combining quantum mechanics with classical ML frameworks like PyTorch, TensorFlow, and Keras, PennyLane lets you explore the frontier of quantum ML.

The design of the circuit is the most important part of the overall solution. The predictive power of the model depends entirely on how the circuit is built.

Qubits, the fundamental units of information in quantum computing, are entities that behave quite differently from classical bits. Unlike classical bits that can only represent 0 or 1, qubits can exist in a superposition of both states simultaneously, enabling quantum parallelism and faster calculations for certain problems.

We decide to use only three qubits, a small number but sufficient for our case.

We instantiate the qubits as follows:

import pennylane as qml

num_wires = 3
dev = qml.device('default.qubit', wires=num_wires)

‘default.qubit’ is the PennyLane qubit simulator. To run on a real quantum computer, you can replace the second line with the following code:

device_arn = "arn:aws:braket:eu-west-2::device/qpu/ionq/Aria-1"
dev = qml.device('braket.aws.qubit', device_arn=device_arn, wires=num_wires)

device_arn can be the ARN of any device supported by Braket (for a list of supported devices, refer to Amazon Braket supported devices).

We defined the quantum node as follows:

@qml.qnode(dev, interface="tf", diff_method="backprop")
def quantum_nn(inputs, weights):
    qml.RY(inputs[0], wires=0)
    qml.RY(inputs[1], wires=1)
    qml.RY(inputs[2], wires=2)
    qml.Rot(weights[0] * inputs[3], weights[1] * inputs[4], weights[2] * inputs[5], wires=1)
    qml.Rot(weights[3] * inputs[6], weights[4] * inputs[7], weights[5] * inputs[8], wires=2)
    qml.CNOT(wires=[1, 2])
    qml.RY(weights[6], wires=2)
    qml.CNOT(wires=[0, 2])
    qml.CNOT(wires=[1, 2])
    return [qml.expval(qml.PauliZ(0)), qml.expval(qml.PauliZ(2))]

The inputs are the values yielded as output from the previous layer of the neural network, and the weights are the actual weights of the quantum circuit.

RY and Rot are rotation gates applied to the qubits; CNOT is a controlled bit-flip gate that entangles the qubits.

qml.expval(qml.PauliZ(0)) and qml.expval(qml.PauliZ(2)) are the measurements applied to qubit 0 and qubit 2, respectively, and these values will be the neural network output.

Diagrammatically, the circuit can be displayed as:

0: ──RY(1.00)──────────────────────────────────────╭●────┤  <Z>

1: ──RY(2.00)──Rot(4.00,10.00,18.00)──╭●───────────│──╭●─┤

2: ──RY(3.00)──Rot(28.00,40.00,54.00)─╰X──RY(7.00)─╰X─╰X─┤  <Z>

The transformations applied to qubit 0 are fewer than those applied to qubit 2. We made this choice because we want to separate the states of the qubits so that they yield different values when measured. Applying different transformations to qubits allows them to enter distinct states, resulting in varied outcomes when measurements are performed. This phenomenon stems from the principles of superposition and entanglement inherent in quantum mechanics.
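For reference, a text diagram like the one above can be produced with PennyLane's drawing utility. This is a side note rather than part of the solution; the sample values below are illustrative and appear to match the numbers shown in the diagram:

# Illustrative values: the diagram above is consistent with inputs 1–9 and weights 1–7.
sample_inputs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
sample_weights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

# qml.draw returns a text rendering of the circuit for the given arguments.
print(qml.draw(quantum_nn)(sample_inputs, sample_weights))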

After we define the quantum circuit, we define the quantum hybrid neural network:

def hybrid_model(num_layers, num_wires):
    # 7 trainable weights, matching the parameterized rotations in quantum_nn
    weight_shapes = {"weights": (7,)}
    qlayer = qml.qnn.KerasLayer(quantum_nn, weight_shapes, output_dim=2)
    hybrid_model = tf.keras.Sequential([hidden, do, out_2, do, qlayer])
    return hybrid_model

KerasLayer is the PennyLane function that turns the quantum circuit into a Keras layer.
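The model object that is compiled and trained in the next section isn't shown being created; presumably it comes from this helper, along the lines of the following (num_layers isn't actually used inside hybrid_model, so the value passed here is arbitrary):

model = hybrid_model(num_layers=1, num_wires=num_wires)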

Model training

After we have preprocessed the data and defined the model, it’s time to train the network.

A preliminary step is needed in order to deal with the unbalanced dataset. We define a weight for each class according to the inverse root rule:

import numpy as np

# y_train_list holds the original integer labels (before one-hot encoding)
class_counts = np.bincount(y_train_list)
class_frequencies = class_counts / float(len(y_train))
# Keras expects class_weight to be a dict mapping class index to weight
class_weights = dict(enumerate(1 / np.sqrt(class_frequencies)))

The weights are given by the inverse of the root of occurrences for each of the two possible target values.

We compile the model next:

model.compile(optimizer='adam', loss='MSE', metrics=[custom_metric])

custom_metric is a modified version of the precision metric; it includes a custom subroutine that postprocesses the quantum layer’s output into a form compatible with the optimizer.

For evaluating model performance on imbalanced data, precision is a more reliable metric than accuracy, so we optimize for precision. In fraud detection, every fraud alert triggers costly follow-up, so incorrectly flagging a valid transaction as fraudulent (a false positive) has real financial consequences. Precision evaluates the proportion of fraud alerts that are true positives, minimizing those costly false alarms.
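The implementation of custom_metric isn't shown in this post. As a rough sketch of what such a metric could look like (an assumption, not the authors' code), the quantum layer's PauliZ expectation values, which lie in [-1, 1], can be rescaled to [0, 1] and compared against the one-hot targets to compute precision for the fraud class:

import tensorflow as tf

def custom_metric(y_true, y_pred):
    # Rescale the PauliZ expectation values from [-1, 1] to [0, 1] so they are
    # comparable with the one-hot encoded targets (an assumed post-processing step).
    y_pred = (y_pred + 1.0) / 2.0
    # Predicted and true class indices: 0 = no fraud, 1 = fraud.
    pred_cls = tf.argmax(y_pred, axis=1)
    true_cls = tf.argmax(y_true, axis=1)
    # Precision for the fraud class: true positives / predicted positives.
    true_pos = tf.reduce_sum(tf.cast(tf.logical_and(tf.equal(pred_cls, 1), tf.equal(true_cls, 1)), tf.float32))
    pred_pos = tf.reduce_sum(tf.cast(tf.equal(pred_cls, 1), tf.float32))
    return true_pos / (pred_pos + tf.keras.backend.epsilon())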

Finally, we fit the model:

history = model.fit(x_train, y_train, epochs=30, batch_size=200, validation_data=(x_validation, y_validation), class_weight=class_weights, shuffle=True)

At each epoch, the weights of both the classic and quantum layer are updated in order to reach higher accuracy. At the end of the training, the network showed a loss of 0.0353 on the training set and 0.0119 on the validation set. When the fit is complete, the trained model is saved in .h5 format.

Model results and analysis

Evaluating the model is vital to gauge its capabilities and limitations, providing insights into the predictive quality and value derived from the quantum techniques.

To test the model, we make predictions on the test set:

preds = model.predict(x_test)

Because the neural network is a regression model, it yields for each record of x_test a two-element array, where each component can assume values between 0 and 1. Because we’re essentially dealing with a binary classification problem, the outputs should be as follows:

  • [1,0] – No fraud
  • [0,1] – Fraud

To convert the continuous values into binary classification, a threshold is necessary. Predictions that are equal to or above the threshold are assigned 1, and those below the threshold are assigned 0.

To align with our goal of optimizing precision, we chose the threshold value that results in the highest precision.
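As a concrete sketch of this post-processing (not code from the original solution; it assumes the second output column is the fraud score, that y_test was one-hot encoded like y_train, and it uses scikit-learn's precision_score):

import numpy as np
from sklearn.metrics import precision_score

threshold = 0.65  # one of the candidate thresholds evaluated below

# preds[:, 1] is the fraud score; values at or above the threshold map to 1 (fraud).
y_pred_binary = (preds[:, 1] >= threshold).astype(int)
y_true_binary = np.argmax(y_test, axis=1)  # recover class indices from the one-hot targets

print(precision_score(y_true_binary, y_pred_binary))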

The following table summarizes the mapping between various threshold values and the precision.

Class      Threshold = 0.65    Threshold = 0.70    Threshold = 0.75
No Fraud   1.00                1.00                1.00
Fraud      0.87                0.89                0.92

The model demonstrates almost flawless performance on the predominant non-fraud class, with precision and recall scores close to a perfect 1. Despite far less data, the model achieves precision of 0.87 for detecting the minority fraud class at a 0.65 threshold, underscoring performance even on sparse data. To efficiently identify fraud while minimizing incorrect fraud reports, we decide to prioritize precision over recall.

We also wanted to compare this model with a classic, neural network-only model to see whether the quantum layer is actually providing a gain. We built and trained an identical model in which the quantum layer is replaced by the following:

Dense(2, activation="softmax")

In the last epoch, the loss was 0.0119 and the validation loss was 0.0051.

The following table summarizes the mapping between various threshold values and the precision for the classic neural network model.

Class      Threshold = 0.65    Threshold = 0.70    Threshold = 0.75
No Fraud   1.00                1.00                1.00
Fraud      0.83                0.84                0.86

Like the quantum hybrid model, the model performance is almost perfect for the majority class and very good for the minority class.

The hybrid neural network has 1,296 parameters, whereas the classic one has 1,329. Comparing precision values, the quantum solution provides better results. The hybrid model, inheriting the quantum layer’s ability to explore high-dimensional spaces and its nonlinearity, generalizes the problem better using fewer parameters, resulting in better performance.

Challenges of a quantum solution

Although the adoption of quantum technology shows promise in providing organizations numerous benefits, practical implementation on large-scale, fault-tolerant quantum computers is a complex task and is an active area of research. Therefore, we should be mindful of the challenges that it poses:

  • Sensitivity to noise – Quantum computers are extremely sensitive to external factors (such as temperature) and require more attention and maintenance than traditional computers, and their calibration can drift over time. One way to minimize the effects of drift is by taking advantage of parametric compilation—the ability to compile a parametric circuit such as the one used here only one time, and feed it fresh parameters at runtime, avoiding repeated compilation steps. Braket automatically does this for you.
  • Dimensional complexity – The inherent nature of qubits, the fundamental units of quantum computing, introduces a higher level of intricacy compared to traditional binary bits employed in conventional computers. By harnessing the principles of superposition and entanglement, qubits possess an elevated degree of complexity in their design. This intricate architecture renders the evaluation of computational capacity a formidable challenge, because the multidimensional aspects of qubits demand a more nuanced approach to assessing their computational prowess.
  • Computational errors – Increased calculation errors are intrinsic to quantum computing’s probabilistic nature during the sampling phase. These errors could impact accuracy and reliability of the results obtained through quantum sampling. Techniques such as error mitigation and error suppression are actively being developed in order to minimize the effects of errors resulting from noisy qubits. To learn more about error mitigation, see Enabling state-of-the-art quantum algorithms with Qedma’s error mitigation and IonQ, using Braket Direct.

Conclusion

The results discussed in this post suggest that quantum computing holds substantial promise for fraud detection in the financial services industry. The hybrid quantum neural network demonstrated superior performance in accurately identifying fraudulent transactions, highlighting the potential gains offered by quantum technology. As quantum computing continues to advance, its role in revolutionizing fraud detection and other critical financial processes will become increasingly evident. You can extend the results of the simulation by using real qubits and testing various outcomes on real hardware available on Braket, such as those from IQM, IonQ, and Rigetti, all on demand, with pay-as-you-go pricing and no upfront commitments.

To prepare for the future of quantum computing, organizations must stay informed on the latest advancements in quantum technology. Adopting quantum-ready cloud solutions now is a strategic priority, allowing a smooth transition to quantum when hardware reaches commercial viability. This forward-thinking approach will provide both a technological edge and rapid adaptation to quantum computing’s transformative potential across industries. With an integrated cloud strategy, businesses can proactively get quantum-ready, primed to capitalize on quantum capabilities at the right moment. To accelerate your learning journey and earn a digital badge in quantum computing fundamentals, see Introducing the Amazon Braket Learning Plan and Digital Badge.

Connect with Deloitte to pilot this solution for your enterprise on AWS.


About the authors

Federica Marini is a Manager in Deloitte Italy’s AI & Data practice with strong experience as a business advisor and technical expert in the field of AI, generative AI, ML, and data. She addresses research and customer business needs with tailored data-driven solutions that provide meaningful results. She is passionate about innovation and believes digital disruption will require a human-centered approach to achieve its full potential.

Matteo Capozi is a Data and AI expert at Deloitte Italy, specializing in the design and implementation of advanced AI and generative AI models and quantum computing solutions. With a strong background in cutting-edge technologies, Matteo excels in helping organizations harness the power of AI to drive innovation and solve complex problems. His expertise spans industries, where he collaborates closely with executive stakeholders to achieve strategic goals and performance improvements.

Kasi Muthu is a senior partner solutions architect focusing on generative AI and data at AWS based out of Dallas, TX. He is passionate about helping partners and customers accelerate their cloud journey. He is a trusted advisor in this field and has plenty of experience architecting and building scalable, resilient, and performant workloads in the cloud. Outside of work, he enjoys spending time with his family.

Kuldeep Singh is a Principal Global AI/ML leader at AWS with over 20 years in tech. He skillfully combines his sales and entrepreneurship expertise with a deep understanding of AI, ML, and cybersecurity. He excels in forging strategic global partnerships, driving transformative solutions and strategies across various industries with a focus on generative AI and GSIs.

Amazon SageMaker unveils the Cohere Command R fine-tuning model

AWS announced the availability of the Cohere Command R fine-tuning model on Amazon SageMaker. This latest addition to the SageMaker suite of machine learning (ML) capabilities empowers enterprises to harness the power of large language models (LLMs) and unlock their full potential for a wide range of applications.

Cohere Command R is a scalable, frontier LLM designed to handle enterprise-grade workloads with ease. Cohere Command R is optimized for conversational interaction and long context tasks. It targets the scalable category of models that balance high performance with strong accuracy, enabling companies to move beyond proof of concept and into production. The model boasts high precision on Retrieval Augmented Generation (RAG) and tool use tasks, low latency and high throughput, a long 128,000-token context length, and strong capabilities across 10 key languages.

In this post, we explore the reasons for fine-tuning a model and the process of how to accomplish it with Cohere Command R.

Fine-tuning: Tailoring LLMs for specific use cases

Fine-tuning is an effective technique to adapt LLMs like Cohere Command R to specific domains and tasks, leading to significant performance improvements over the base model. Evaluations of the fine-tuned Cohere Command R model have demonstrated improved performance by over 20% across various enterprise use cases in industries such as financial services, technology, retail, healthcare, and legal. Because of its smaller size, a fine-tuned Cohere Command R model can be served more efficiently compared to models much larger than its class.

The recommendation is to use a dataset that contains at least 100 examples.

Cohere Command R uses a RAG approach, retrieving relevant context from an external knowledge base to improve outputs. However, fine-tuning allows you to specialize the model even further. Fine-tuning text generation models like Cohere Command R is crucial for achieving ultimate performance in several scenarios:

  •  Domain-specific adaptation – RAG models may not perform optimally in highly specialized domains like finance, law, or medicine. Fine-tuning allows you to adapt the model to these domains’ nuances for improved accuracy.
  • Data augmentation – Fine-tuning enables incorporating additional data sources or techniques, augmenting the model’s knowledge base for increased robustness, especially with sparse data.
  • Fine-grained control – Although RAG offers impressive general capabilities, fine-tuning permits fine-grained control over model behavior, tailoring it precisely to your desired task for ultimate precision.

The combined power of RAG and fine-tuned LLMs empowers you to tackle diverse challenges with unparalleled versatility and effectiveness. With the introduction of Cohere Command R fine-tuning on SageMaker, enterprises can now customize and optimize the model’s performance for their unique requirements. By fine-tuning on domain-specific data, businesses can enhance Cohere Command R’s accuracy, relevance, and effectiveness for their use cases, such as natural language processing, text generation, and question answering.

By combining the scalability and robustness of Cohere Command R with the ability to fine-tune its performance on SageMaker, AWS empowers enterprises to navigate the complexities of AI adoption and use its transformative power to drive innovation and growth across various industries and domains.

Customer data, including prompts, completions, custom models, and data used for fine-tuning or continued pre-training, remains private to customer AWS accounts and is never shared with third-party model providers.

Solution overview

In the following sections, we walk through the steps to fine-tune the Cohere Command R model on SageMaker. This includes preparing the data, deploying a model, preparing for fine-tuning, creating an endpoint for inference, and performing inference.

Prepare the fine-tuning data

Before you can start a fine-tuning job, you need to upload a dataset with training and (optionally) evaluation data.

First, make sure your data is in jsonl format. It should have the following structure:

  •  messages – This contains a list of messages of the conversation. A message consists of the following parts:
  •  role – This specifies the current speaker. You can pick from System, User, or Chatbot.
  •  content – This contains the content of the message.

The following is an example that trains a chatbot to answer questions. For the sake of readability, the document spans over multiple lines. For your dataset, make sure that each line contains one whole example.

{
  "messages": [
    {
      "role": "System",
      "content": "You are a large language model trained by Cohere."
    },
    {
      "role": "User",
      "content": "Hi! What were Time magazines top 10 cover stories in the last 10 years?"
    },
    {
      "role": "Chatbot",
      "content": "Time magazines top 10 cover stories in the last 10 years were:\n\n1. Volodymyr Zelenskyy\n2. Elon Musk\n3. Martin Luther King Jr.\n4. How Earth Survived\n5. Her Lasting Impact\n6. Nothing to See Here\n7. Meltdown\n8. Deal With It\n9. The Top of America\n10. Bitter Pill"
    },
    {
      "role": "User",
      "content": "Who is Volodymyr Zelenskyy?"
    },
    {
      "role": "Chatbot",
      "content": "Volodymyr Zelenskyy is a Ukrainian politician who was elected President of Ukraine in 2019. He previously served as the Minister of Internal Affairs in the government of Prime Minister Volodymyr Groysman."
    },
    {
      "role": "User",
      "content": "Thank you!"
    }
  ]
}

Deploy a model

Complete the following steps to deploy the model:

  1. On AWS Marketplace, subscribe to the Cohere Command R model

After you subscribe to the model, you can configure it and create a training job.

  2. Choose View in Amazon SageMaker.
  3. Follow the instructions in the UI to create a training job.

Alternatively, you can use the following example notebook to create the training job.

Prepare for fine-tuning

To fine-tune the model, you need the following:

  • Product ARN – This will be provided to you after you subscribe to the product.
  • Training dataset and evaluation dataset – Prepare your datasets for fine-tuning.
  • Amazon S3 location – Specify the Amazon Simple Storage Service (Amazon S3) location that stores the training and evaluation datasets.
  • Hyperparameters – Fine-tuning typically involves adjusting various hyperparameters like learning rate, batch size, number of epochs, and so on. You need to specify the appropriate hyperparameter ranges or values for your fine-tuning task.
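The example notebook handles job creation for you, but as a rough illustration of what happens under the hood, a SageMaker training job can be started from the subscribed algorithm with boto3. The ARN, bucket names, instance type, and hyperparameter names below are placeholders, not values taken from the Cohere listing:

import boto3

sm = boto3.client("sagemaker")

sm.create_training_job(
    TrainingJobName="cohere-command-r-finetune-demo",
    AlgorithmSpecification={
        # Placeholder: use the product ARN you receive after subscribing on AWS Marketplace.
        "AlgorithmName": "arn:aws:sagemaker:us-east-1:123456789012:algorithm/cohere-command-r-ft",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    InputDataConfig=[
        {
            "ChannelName": "training",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-bucket/cohere-finetune/train/",
                }
            },
        }
    ],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/cohere-finetune/output/"},
    ResourceConfig={"InstanceType": "ml.p4de.24xlarge", "InstanceCount": 1, "VolumeSizeInGB": 500},
    StoppingCondition={"MaxRuntimeInSeconds": 86400},
    # Placeholder hyperparameter names; consult the listing for the supported set.
    HyperParameters={"epochs": "1", "learning_rate": "0.01"},
)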

Create an endpoint for inference

When the fine-tuning is complete, you can create an endpoint for inference with the fine-tuned model. To create the endpoint, use the create_endpoint method. If the endpoint already exists, you can connect to it using the connect_to_endpoint method.
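The following is a minimal sketch of these two calls, assuming the cohere_aws client (the co object used in the rest of this post); the parameter names and values shown here are illustrative assumptions, so follow the example notebook for the exact signatures:

from cohere_aws import Client

co = Client(region_name="us-east-1")

# Illustrative call: deploy the fine-tuned model behind a real-time endpoint.
co.create_endpoint(
    arn="arn:aws:sagemaker:us-east-1:123456789012:model-package/cohere-command-r-ft-demo",  # placeholder ARN
    endpoint_name="cohere-command-r-ft-endpoint",
    instance_type="ml.p4de.24xlarge",
    n_instances=1,
)

# If the endpoint already exists, connect to it instead of creating a new one.
co.connect_to_endpoint(endpoint_name="cohere-command-r-ft-endpoint")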

Perform inference

You can now perform real-time inference using the endpoint. The following is the sample message that you use for input:

message = "Classify the following text as either very negative, negative, neutral, positive or very positive: mr. deeds is , as comedy goes , very silly -- and in the best way."
result = co.chat(message=message)
print(result)

The following screenshot shows the output of the fine-tuned model.


Optionally, you can also test the accuracy of the model using the evaluation data (sample_finetune_scienceQA_eval.jsonl).

Clean up

After you have completed running the notebook and experimenting with the Cohere Command R fine-tuned model, it is crucial to clean up the resources you have provisioned. Failing to do so may result in unnecessary charges accruing on your account. To prevent this, use the following code to delete the resources and stop the billing process:

co.delete_endpoint()
co.close()

Summary

Cohere Command R with fine-tuning allows you to customize your models to be performant for your business, domain, and industry. Alongside the fine-tuned model, users additionally benefit from Cohere Command R’s proficiency in the most commonly used business languages (10 languages) and RAG with citations for accurate and verified information. Cohere Command R with fine-tuning achieves high levels of performance with less resource usage on targeted use cases. Enterprises can see lower operational costs, improved latency, and increased throughput without extensive computational demands.

Start building with Cohere’s fine-tuning model in SageMaker today.


About the Authors

Shashi Raina is a Senior Partner Solutions Architect at Amazon Web Services (AWS), where he specializes in supporting generative AI (GenAI) startups. With close to 6 years of experience at AWS, Shashi has developed deep expertise across a range of domains, including DevOps, analytics, and generative AI.

James Yi is a Senior AI/ML Partner Solutions Architect in the Emerging Technologies team at Amazon Web Services. He is passionate about working with enterprise customers and partners to design, deploy and scale AI/ML applications to derive their business values. Outside of work, he enjoys playing soccer, traveling and spending time with his family.

Pradeep Prabhakaran is a Customer Solutions Architect at Cohere. In his current role at Cohere, Pradeep acts as a trusted technical advisor to customers and partners, providing guidance and strategies to help them realize the full potential of Cohere’s cutting-edge Generative AI platform. Prior to joining Cohere, Pradeep was a Principal Customer Solutions Manager at Amazon Web Services, where he led Enterprise Cloud transformation programs for large enterprises. Prior to AWS, Pradeep has held various leadership positions at consulting companies such as Slalom, Deloitte, and Wipro. Pradeep holds a Bachelor’s degree in Engineering and is based in Dallas, TX.

Derive meaningful and actionable operational insights from AWS Using Amazon Q Business

As a customer, you rely on Amazon Web Services (AWS) expertise to be available and understand your specific environment and operations. Today, you might implement manual processes to summarize lessons learned, obtain recommendations, or expedite the resolution of an incident. This can be time consuming, inconsistent, and not readily accessible.

This post shows how to use AWS generative artificial intelligence (AI) services, like Amazon Q Business, with AWS Support cases, AWS Trusted Advisor, and AWS Health data to derive actionable insights based on common patterns, issues, and resolutions while using the AWS recommendations and best practices enabled by support data. This post will also demonstrate how you can integrate these insights with your IT service management (ITSM) system (such as ServiceNow, Jira, and Zendesk), to allow you to implement recommendations and keep your AWS operations healthy.

Amazon Q Business is a fully managed, secure, generative-AI powered enterprise chat assistant that enables natural language interactions with your organization’s data. Ingesting data for support cases, Trusted Advisor checks, and AWS Health notifications into Amazon Q Business enables interactions through natural language conversations, sentiment analysis, and root cause analysis without needing to fully understand the underlying data models or schemas. The AI assistant provides answers along with links that point directly to the data sources. This allows you to easily identify and reference the underlying information sources that informed the AI’s response, providing more context and enabling further exploration of the topic if needed. Amazon Q Business integrates with ITSM solutions, allowing recommendations to be tracked and actioned within your existing workflows.

AWS Support offers a range of capabilities powered by technology and subject matter experts that support the success and operational health of your AWS environments. AWS Support provides you with proactive planning and communications, advisory, automation, and cloud expertise to help you achieve business outcomes with increased speed and scale in the cloud. These capabilities enable proactive planning for upcoming changes, expedited recovery from operational disruptions, and recommendations to optimize the performance and reliability of your AWS IT infrastructure.

This solution will demonstrate how to deploy Amazon Q Business and ingest data from AWS Support cases, AWS Trusted Advisor, and AWS Health using the provided code sample to generate insights based on your support data.

Overview of solution

Today, Amazon Q Business provides 43 connectors available to natively integrate with multiple data sources. In this post, we’re using the APIs for AWS Support, AWS Trusted Advisor, and AWS Health to programmatically access the support datasets and use the Amazon Q Business native Amazon Simple Storage Service (Amazon S3) connector to index support data and provide a prebuilt chatbot web experience. The AWS Support, AWS Trusted Advisor, and AWS Health APIs are available for customers with Enterprise Support, Enterprise On-Ramp, or Business support plans.

Q Support Insights (QSI) is the name of the solution provided in the code sample repository. QSI enables insights on your AWS Support datasets across your AWS accounts. The following diagram describes at a high level the QSI solution and components.

Figure 1: Overview of the QSI solution

There are two major components in the QSI solution. First, as illustrated in the Linked Accounts group in Figure 1, this solution supports datasets from linked accounts and aggregates your data using the various APIs, AWS Lambda, and Amazon EventBridge. Second, the support datasets from linked accounts are stored in a central S3 bucket that you own, as shown in the Data Collection Account group in Figure 1. These datasets are then indexed using the Amazon Q Business S3 connector.

Under the hood, the Amazon Q Business S3 connector creates a searchable index of your AWS Support datasets and gathers relevant details such as case titles, descriptions, best practices, keywords, and dates. The generative AI capabilities of Amazon Q Business enable it to synthesize insights and generate natural language responses available for users in the Amazon Q Business web chat experience. Amazon Q Business also supports plugins and actions so users can directly create tickets in the ITSM system without leaving the chat experience.

By default, Amazon Q Business will only produce responses using the data you’re indexing. This behavior is aligned with the use cases related to our solution. If needed, this response setting can be changed to allow Amazon Q to fall back to large language model (LLM) knowledge.

Walkthrough

The high-level steps to deploy the solution are the following:

  1. Create the necessary buckets to contain the support cases exports and deployment resources.
  2. Upload the support datasets (AWS Support cases, AWS Trusted Advisor, and AWS Health) to the S3 data source bucket.
  3. Create the Amazon Q Business application, the data source, and required components using deployment scripts.
  4. Optionally, configure ITSM integration by using one of the available Amazon Q Business built-in plugins.
  5. Synchronize the data source to index the data.
  6. Test the solution through chat.

The full guidance and deployment options are available in the aws-samples GitHub repository. The solution can be deployed in a single account or across accounts in AWS Organizations. In addition to the data security and protection Amazon Q Business supports, this solution integrates with your identity provider and respects access control lists (ACLs) so users get answers based on their unique permissions. This solution also provides additional controls to include or exclude specific accounts.

Prerequisites

For this solution to work, the following prerequisites are needed:

  • An AWS Support plan (Business, Enterprise On-Ramp, or Enterprise) so that the AWS Support, AWS Trusted Advisor, and AWS Health APIs are available
  • AWS IAM Identity Center configured in the us-east-1 or us-west-2 Region, the same Region where you will deploy the Amazon Q Business application
  • An S3 bucket in the data collection central account to store the support datasets

Create the Amazon Q Business application using the deployment scripts

Using the Amazon Q Business application creation module, you can set up and configure an Amazon Q Business application, along with its crucial components, in an automated manner. These components include an Amazon S3 data source connector, required IAM roles, and Amazon Q Business web experience.

Deploy the Amazon Q Business application

As stated in the preceding prerequisites section, IAM Identity Center must be configured in the same Region (us-east-1 or us-west-2) as your Amazon Q Business application.

To deploy and use the Amazon Q Business application, follow the steps described in the Amazon Q Business application creation module. The steps can be summarized as:

  1. Launch an AWS CloudShell in either the us-east-1 or us-west-2 Region in your data collection central account and clone the repository from GitHub.
  2. Navigate to the repository directory and run the deployment script, providing the required inputs when prompted. As stated in the prerequisites, an S3 bucket name is required in the data collection central account.
  3. After deployment, synchronize the data source, assign access to users and groups, and use the deployed web experience URL to interact with the Amazon Q Business application.

[Optional] Integrate your ITSM system

To integrate with your ITSM system, follow these steps:

  1. Within the Amazon Q Business application page, choose Plugins in the navigation pane and choose Add plugin.
  2. From the list of available plugins, select the one that matches your system. For example, Jira, ServiceNow, or Zendesk.
  3. Enter the details on the next screen (see Figure 2) for the Amazon Q Business application to make the connection. With this integration, users can log tickets to your IT teams directly from Amazon Q Business, based on data within the Amazon Q Business application.

Figure 2: The Amazon Q Business plug-in creation page

Support Collector

You can use the Support Collector module to set up and configure Amazon EventBridge to collect support-related data. This data includes information from AWS Support cases, AWS Trusted Advisor, and AWS Health. The collected data is then uploaded to a designated S3 bucket in the data collection account. The solution will retrieve up to 6 months of data by default, though you can change the timeframe to a maximum of 12 months.

Additionally, the Support Collector can synchronize with the latest updates on a daily basis, ensuring that your support data is always up to date. The Support Collector is configured through an AWS Lambda function and EventBridge, offering flexibility in terms of the data sources (AWS Support cases, AWS Trusted Advisor, and AWS Health) you want to include or exclude. You can choose data from one, two, or all three of these sources by configuring the appropriate scheduler.

Deploy the Support Collector

To deploy and use the Support Collector, follow the steps described in the Support Collector module.

The repository contains scripts and resources to automate the deployment of Lambda functions in designated member accounts. The deployed Lambda functions collect and upload AWS Support data (Support Cases, Health Events, and Trusted Advisor Checks) to an S3 bucket in the data collection central account. The collected data can be analyzed using Amazon Q Business.
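As an illustration of what such a collector function might look like (a simplified sketch, not the actual code from the repository; the bucket name and key prefix are placeholders), the AWS Support API can be queried with boto3 and the results written to the central S3 bucket:

import json
from datetime import datetime, timedelta

import boto3

# The AWS Support API is served from us-east-1 and requires a Business,
# Enterprise On-Ramp, or Enterprise Support plan.
support = boto3.client("support", region_name="us-east-1")
s3 = boto3.client("s3")

BUCKET = "my-data-collection-bucket"  # placeholder


def lambda_handler(event, context):
    # Collect cases from the last 6 months (the solution's default timeframe).
    after_time = (datetime.utcnow() - timedelta(days=180)).strftime("%Y-%m-%dT%H:%M:%SZ")

    # A production collector would paginate with nextToken; this sketch takes one page.
    response = support.describe_cases(afterTime=after_time, includeResolvedCases=True, maxResults=100)
    cases = response["cases"]

    # Upload the snapshot so the Amazon Q Business S3 connector can index it.
    key = f"support-cases/{datetime.utcnow():%Y-%m-%d}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(cases, default=str))
    return {"cases_collected": len(cases)}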

There are two deployment options:

  1. AWS Organizations (StackSet): Use this option if you have AWS Organizations set up and want to deploy in accounts under organizational units. It creates a CloudFormation StackSet in the central account to deploy resources (IAM roles, Lambda functions, and EventBridge) across member accounts.
  2. Manual deployment of individual accounts (CloudFormation): Use this option if you don’t want to use AWS Organizations and want to target a few accounts. It creates a CloudFormation stack in a member account to deploy resources (IAM roles, Lambda functions, and EventBridge).

After deployment, an EventBridge scheduler periodically invokes the Lambda function to collect support data and store it in the data collection S3 bucket. You can test the Lambda function with a custom payload. The deployment steps are fully automated using a shell script. The Q Support Insights (QSI) – AWS Support Collection Deployment guide, located in the src/support_collector subdirectory, outlines the steps to deploy the resources.
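
The CloudFormation templates in the repository are what actually create the schedule; as a rough sketch of the equivalent API call, a daily EventBridge Scheduler schedule that invokes a collector Lambda function might look like the following, with the schedule name and both ARNs as placeholders.

```python
# Rough sketch only; the repository's CloudFormation templates create the real schedule.
# The schedule name, Lambda function ARN, and IAM role ARN are placeholders.
import boto3

scheduler = boto3.client("scheduler")

scheduler.create_schedule(
    Name="qsi-support-collector-daily",      # placeholder name
    ScheduleExpression="rate(1 day)",        # daily synchronization, as described above
    FlexibleTimeWindow={"Mode": "OFF"},
    Target={
        "Arn": "arn:aws:lambda:us-east-1:111122223333:function:support-collector",       # placeholder
        "RoleArn": "arn:aws:iam::111122223333:role/scheduler-invoke-support-collector",  # placeholder
    },
)
```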

Amazon Q Business web experience

You can ask support-related questions using the Amazon Q Business web experience after the relevant support data has been collected in the S3 bucket and successfully indexed. For steps to configure and collect the data, see the preceding Support Collector section. Using the web experience, you can then ask questions as shown in the following demonstrations.

Demonstrations: using the Amazon Q Business web experience to get troubleshooting, performance, and operational recommendations

Figure 3: Using the Amazon Q Business web experience to get performance recommendations

Sample prompts

Try some of the following sample prompts:

  • I am having trouble with EKS add-on installation failures. It is giving ConfigurationConflict errors. Based on past support cases, please provide a resolution.
  • List AWS Account IDs with insufficient IPs
  • List health events with increased error rates
  • List services being deprecated this year
  • My Lambda function is running slow. How can I speed it up?

Clean up

After you’re done testing the solution, you can delete the resources to avoid incurring additional charges. See the Amazon Q Business pricing page for more information. Follow the instructions in the GitHub repository to delete the resources and corresponding CloudFormation templates.

Conclusion

In this post, you deployed a solution that indexes data from your AWS Support datasets stored in Amazon S3 and other AWS data sources like AWS Trusted Advisor and AWS Health. This demonstrates how to use new generative AI services like Amazon Q Business to find patterns across your most frequent issues and author new content such as internal documentation or an FAQ. Support data presents a valuable opportunity to proactively address and prevent recurring issues in your AWS environment by applying insights gained from past experiences, enabling a more resilient and optimized AWS experience tailored to your specific needs.

This solution can be expanded to include other internal data sources your company uses, letting your teams ask questions in natural language to uncover optimization opportunities they can implement.


About the authors

Chitresh Saxena is a Sr. Technical Account Manager specializing in generative AI solutions and dedicated to helping customers successfully adopt AI/ML on AWS. He excels at understanding customer needs and provides technical guidance to build, launch, and scale AI solutions that solve complex business problems.

Jonathan Delfour is a Principal Technical Account Manager supporting Energy customers, providing top-notch support as part of the AWS Enterprise Support team. His technical guidance and unwavering commitment to excellence ensure that customers can leverage the full potential of AWS, optimizing their operations and driving success.

Krishna Atluru is an Enterprise Support Lead at AWS. He provides customers with in-depth guidance on improving security posture and operational excellence for their workloads. Outside of work, Krishna enjoys cooking, swimming and travelling.

Arish Labroo is a Principal Specialist Technical Account Manager – Builder supporting large AWS customers. He is focused on building strategic tools that help customers get the most value out of Enterprise Support.

Manik Chopra is a Principal Technical Account Manager at AWS. He helps customers adopt AWS services and provides guidance in various areas around Data Analytics and Optimization. His areas of expertise include delivering solutions using Amazon QuickSight, Amazon Athena, and various other automation techniques. Outside of work, he enjoys spending time outdoors and traveling.


Research Focus: Week of July 15, 2024


Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.


MG-TSD: Advancing time series analysis with multi-granularity guided diffusion model

Diffusion probabilistic models have the capacity to generate high-fidelity samples for generative time series forecasting. However, they also present issues of instability due to their stochastic nature. In a recent article: MG-TSD: Advancing time series analysis with multi-granularity guided diffusion model, researchers from Microsoft present MG-TSD, a novel approach aimed at tackling this challenge.

The MG-TSD model employs multiple granularity levels within data to guide the learning process of diffusion models, yielding strong results without requiring additional data. For long-term forecasting, the researchers establish a new state-of-the-art methodology with notable relative improvements across six benchmarks, ranging from 4.7% to 35.8%.

The paper introducing this research, MG-TSD: Multi-Granularity Time Series Diffusion Models with Guided Learning Process, was presented at ICLR 2024.


Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference

Machine learning applications based on large language models (LLMs) have been widely deployed in consumer products. Increasing the model size and its training dataset has played an important role in this process. Since larger model size can bring higher model accuracy, it is likely that future models will also grow in size, which vastly increases the computational and memory requirements of LLMs.

Mixture-of-Experts (MoE) architecture, which can increase model size without proportionally increasing computational requirements, was designed to address this challenge. Unfortunately, MoE’s high memory demands and dynamic activation of sparse experts restrict its applicability to real-world problems. Previous solutions that offload MoE’s memory-hungry expert parameters to central processing unit (CPU) memory fall short.

In a recent paper: Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference, researchers from Microsoft address these challenges using algorithm-system co-design. Pre-gated MoE alleviates the dynamic nature of sparse expert activation, addressing the large memory footprint of MoEs while also sustaining high performance. The researchers demonstrate that pre-gated MoE improves performance, reduces graphics processing unit (GPU) memory consumption, and maintains model quality.
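
To make the "dynamic activation of sparse experts" concrete, the following minimal NumPy sketch shows conventional top-k MoE routing, where which experts (and therefore which parameters) a token needs only becomes known once the gate runs at inference time. It illustrates the baseline behavior the paper improves on, not the pre-gating algorithm itself; all shapes and values are made up.

```python
# Minimal sketch of conventional top-k MoE routing (the baseline behavior the
# paper improves on, not pre-gated MoE itself). Shapes and values are made up.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 64, 8, 2

x = rng.standard_normal(d_model)                               # one token's hidden state
gate_w = rng.standard_normal((n_experts, d_model))             # gating network weights
expert_w = rng.standard_normal((n_experts, d_model, d_model))  # per-expert weights

# The gate decides at run time which experts fire for this token, so which
# expert parameters must be resident in GPU memory is only known here.
logits = gate_w @ x
top = np.argsort(logits)[-k:]                              # indices of the k selected experts
scores = np.exp(logits[top]) / np.exp(logits[top]).sum()   # softmax over the selected experts

# Only the selected experts are evaluated; the rest stay idle (sparse activation).
y = sum(s * (expert_w[e] @ x) for s, e in zip(scores, top))
print("active experts:", top, "output norm:", float(np.linalg.norm(y)))
```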



What Matters in a Measure? A Perspective from Large-Scale Search Evaluation

Evaluation is a crucial aspect of information retrieval (IR) and has been thoroughly studied by academic and professional researchers for decades. Much of the research literature discusses techniques to produce a single number, reflecting the system’s performance: precision or cumulative gain, for example, or dozens of alternatives. Those techniques—metrics—are themselves evaluated, commonly by reference to sensitivity and validity.
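
As a quick reminder of what such single-number metrics look like in practice, here is a small sketch computing precision@k and discounted cumulative gain (DCG) for one ranked result list; the relevance judgments are invented purely for illustration.

```python
# Toy illustration of two classic single-number IR metrics; the graded
# relevance labels below are invented purely for the example.
import math

relevance = [3, 2, 0, 1, 0]   # judged relevance of the top results, best rank first

def precision_at_k(rels, k):
    """Fraction of the top-k results judged relevant (relevance > 0)."""
    return sum(1 for r in rels[:k] if r > 0) / k

def dcg_at_k(rels, k):
    """Discounted cumulative gain: graded relevance discounted by log2 of rank."""
    return sum(r / math.log2(rank + 2) for rank, r in enumerate(rels[:k]))

print(precision_at_k(relevance, 5))   # 0.6
print(dcg_at_k(relevance, 5))         # about 4.69
```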

To measure search in industry settings, many other aspects must be considered. For example, how much a metric costs; how robust it is to the happenstance of sampling; whether it is debuggable; and what is incentivized when a metric is taken as a goal. In a recent paper: What Matters in a Measure? A Perspective from Large-Scale Search Evaluation, researchers from Microsoft discuss what makes a search metric successful in large-scale settings, including factors which are not often canvassed in IR research, but which are important in “real-world” use. The researchers illustrate this discussion with examples from industrial settings and elsewhere and offer suggestions for metrics as part of a working system.


LordNet: An efficient neural network for learning to solve parametric partial differential equations without simulated data

Partial differential equations (PDEs) are ubiquitous in mathematically oriented scientific fields, such as physics and engineering. The ability to solve PDEs accurately and efficiently can empower deep understanding of the physical world. However, in many complex PDE systems, traditional solvers are too time-consuming. Recently, deep learning-based methods including neural operators have been successfully used to provide faster PDE solvers through approximating or enhancing conventional ones. However, this requires a large amount of simulated data, which can be costly to collect. This can be avoided by learning physics from the physics-constrained loss, also known as the mean squared residual (MSR) loss, which is constructed from the discretized PDE.

In a recent paper: LordNet: An efficient neural network for learning to solve parametric partial differential equations without simulated data, researchers from Microsoft investigate the physical information in the MSR loss, or long-range entanglements. They identify the challenge: the neural network must model the long-range entanglements in the spatial domain of the PDE, whose patterns vary. To tackle this challenge, they propose LordNet, a tunable and efficient neural network for modeling various entanglements. Their tests show that LordNet can be 40× faster than traditional PDE solvers. In addition, LordNet outperforms other modern neural network architectures in accuracy and efficiency with the smallest parameter size.
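
To make the MSR loss concrete, here is a hedged NumPy sketch for the 1D Poisson equation u''(x) = f(x) with zero boundary values: the loss is simply the mean squared residual of the finite-difference equation evaluated at a candidate solution, so no simulated reference solutions are required. The vector u below stands in for a neural network's prediction; the setup is illustrative and is not taken from the paper.

```python
# Toy illustration of a physics-constrained (MSR) loss for the 1D Poisson
# equation u''(x) = f(x) with zero Dirichlet boundaries, discretized by
# central finite differences. The vector u stands in for a network's output.
import numpy as np

n = 64
h = 1.0 / (n + 1)
x = np.linspace(h, 1 - h, n)              # interior grid points
f = np.sin(np.pi * x)                     # source term

def msr_loss(u, f, h):
    """Mean squared residual of the discretized PDE; needs no simulated data."""
    u_pad = np.concatenate(([0.0], u, [0.0]))                       # boundary conditions
    laplace_u = (u_pad[:-2] - 2.0 * u_pad[1:-1] + u_pad[2:]) / h**2
    return np.mean((laplace_u - f) ** 2)

u_guess = np.zeros(n)                     # an untrained "prediction"
u_exact = -np.sin(np.pi * x) / np.pi**2   # analytic solution, for comparison only

print(msr_loss(u_guess, f, h))            # large: the guess violates the PDE
print(msr_loss(u_exact, f, h))            # near zero, up to discretization error
```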


FXAM: A unified and fast interpretable model for predictive analytics

The generalized additive model (GAM) is a standard for interpretability. However, due to the one-to-many and many-to-one phenomena that appear commonly in real-world scenarios, existing GAMs have limitations in serving predictive analytics in terms of both accuracy and training efficiency. In a recent paper: FXAM: A unified and fast interpretable model for predictive analytics, researchers from Microsoft propose FXAM (Fast and eXplainable Additive Model), a unified and fast interpretable model for predictive analytics. FXAM extends GAM’s modeling capability with a unified additive model for numerical, categorical, and temporal features. FXAM conducts a novel training procedure called three-stage iteration (TSI), in which the three stages correspond to learning over numerical, categorical, and temporal features respectively. Each stage learns a local optimum by fixing the parameters of the other stages. The researchers design joint learning over categorical features and partial learning over temporal features to achieve high accuracy and training efficiency. They show that TSI is mathematically guaranteed to converge to the global optimum. They further propose a set of optimization techniques to speed up FXAM’s training algorithm to meet the needs of interactive analysis.
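
As a rough illustration of the stagewise idea only (not the authors' TSI procedure), the following sketch fits a small additive model over two feature groups by cycling through them, refitting each group's component against the residual left by the other; the data and component choices are invented for the example.

```python
# Toy backfitting-style sketch of fitting an additive model one feature group
# at a time, each stage holding the other fixed. This only illustrates the
# stagewise idea; it is not FXAM's actual three-stage iteration.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x_num = rng.uniform(-3, 3, n)       # a numerical feature
x_cat = rng.integers(0, 4, n)       # a categorical feature with 4 levels
y = np.sin(x_num) + np.array([0.0, 1.0, -1.0, 2.0])[x_cat] + 0.1 * rng.standard_normal(n)

f_num = np.zeros(n)                 # additive component for x_num
f_cat = np.zeros(n)                 # additive component for x_cat

for _ in range(10):                 # cycle through the stages
    # Stage 1: refit the numerical component (a cubic here) against the residual.
    coeffs = np.polyfit(x_num, y - f_cat, deg=3)
    f_num = np.polyval(coeffs, x_num)
    # Stage 2: refit the categorical component (per-level means), holding stage 1 fixed.
    resid = y - f_num
    level_means = np.array([resid[x_cat == level].mean() for level in range(4)])
    f_cat = level_means[x_cat]

print("training MSE:", np.mean((y - f_num - f_cat) ** 2))
```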

Microsoft Research in the news


Sriram Rajamani at Microsoft Research on AI and deep tech in India 

Forbes India | June 28, 2024

Sriram K Rajamani, managing director of Microsoft Research India Lab, reflects on computer science and engineering research, including how AI and LLMs can help solve local needs. Rajamani also discusses the technical aspects of how modern AI models work, and best practices from the research lab that could apply to India’s deep tech ecosystem.



Decoding How AI-Powered Upscaling on NVIDIA RTX Improves Video Quality


Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware, software, tools and accelerations for RTX PC and workstation users.

Video is everywhere — nearly 80% of internet bandwidth today is used to stream video from content providers and social networks. While screens have become bigger and support higher resolutions, nearly all video is only 1080p quality or lower.

Upscalers can help sharpen streamed video and, powered by AI on the NVIDIA RTX platform, significantly enhance image quality and detail.

What Is an Upscaler?

Video's larger file size makes it harder to compress and transmit than images or text. Platforms like Netflix, Vimeo and YouTube work around this limitation by encoding video — the process of compressing the raw source of a video into a smaller container format.

The encoder first analyzes the video to decide what information it can remove to make it fit a target resolution and frame rate. If the target bitrate is insufficient, the video quality decreases, resulting in a loss of detail and sharpness and the presence of encoding artifacts. The smaller the file, the easier it is to share on the internet — but the worse it looks.

Typically, software on the viewer’s device will upscale the video file to fit the display’s native resolution. However, these upscalers are fairly simplistic, merely multiplying pixels to meet the desired resolution. They can help sharpen the outlines of objects and scenes, but the final video typically carries encoding artifacts and sometimes looks over-sharpened and unnatural.
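
"Merely multiplying pixels" amounts to something like nearest-neighbor upscaling, sketched below with NumPy purely for illustration: every source pixel is repeated to fill the larger frame, so encoding artifacts get enlarged right along with the content.

```python
# Rough sketch of the naive "multiply the pixels" upscaling described above:
# nearest-neighbor scaling repeats each source pixel, enlarging any encoding
# artifacts along with the content. Frame sizes here are illustrative.
import numpy as np

def nearest_neighbor_upscale(frame: np.ndarray, factor: int) -> np.ndarray:
    """Upscale an (H, W, C) frame by an integer factor by repeating pixels."""
    return np.repeat(np.repeat(frame, factor, axis=0), factor, axis=1)

frame = np.random.randint(0, 256, size=(1080, 1920, 3), dtype=np.uint8)  # stand-in 1080p frame
upscaled = nearest_neighbor_upscale(frame, 2)                            # toward 4K
print(frame.shape, "->", upscaled.shape)   # (1080, 1920, 3) -> (2160, 3840, 3)
```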

AI Know a Better Way

The NVIDIA RTX platform uses AI to easily de-artifact and upscale videos.


The process of AI upscaling involves analyzing images and motion vectors to generate new details not present in the original video. Instead of merely multiplying pixels, it recognizes the patterns of the image and enhances them to provide greater detail and video quality.

Images must first be de-artifacted before any further processing begins. Artifacts — or unwanted distortions and anomalies that appear in video and image files — occur due to overcompression or data loss during transmission and storage.

NVIDIA AI networks can de-artifact images, helping remove blocky areas sometimes seen in streamed video. Without this first step, AI upscalers might end up enhancing the artifacted image itself instead of the desired content.

Super-Sized Video

Just like putting on a pair of prescription glasses can instantly snap the world into focus, RTX Video Super Resolution, one of NVIDIA’s latest innovations in AI-enhanced video technology, gives users a clearer picture into the world of streamed video.

Comparison of bicubic upscaling (left) and RTX Video Super Resolution (right).

Available on GeForce RTX 40 and 30 Series GPUs and RTX professional GPUs, it uses AI running on dedicated Tensor Cores to remove block compression artifacts and upscale lower-resolution content up to 4K, matching the user’s native display resolution.

RTX Video Super Resolution can be used to enhance all video watched on browsers. By combining de-artifacting with AI upscaling techniques, it can make even low-bitrate Twitch streams look stunningly clear. RTX Video Super Resolution is also supported in popular video apps like VLC so users can apply the same upscaling process to their offline videos.

Creators can soon use RTX Video Super Resolution in editing apps like Blackmagic Design’s DaVinci Resolve, making it easier than ever to upscale lower-quality video files to 4K resolution, as well as convert standard-dynamic range source files into high-dynamic range (HDR).

Say Hi to High-Dynamic Range

RTX Video now also supports AI HDR. HDR video supports a wider range of colors, lending greater detail especially to the darker and lighter areas of images. The problem is that there isn’t that much HDR content online yet.

Enter RTX Video HDR — by simply turning on the feature, the AI network will turn any standard or low-dynamic-range content into HDR, performing the correct tone mapping so the image still looks natural and retains its original colors.

AI Across the Board

RTX Video is just the latest implementation of AI upscaling powered by NVIDIA RTX.

Members of the GeForce NOW cloud streaming service can play their favorite PC games on nearly any device. GeForce RTX servers located all over the world first render the game video content, encode it and then stream it to the player’s local device — just like streaming video from other content providers.

Members on older NVIDIA GPU-powered devices can still use AI-enhanced upscaling to improve gameplay quality. This means they can enjoy the best of both worlds — gameplay rendered on servers powered by RTX 4080-class GPUs in the cloud and AI-enhanced streaming quality. Get more information on enabling AI-enhanced upscaling on GeForce NOW.

The NVIDIA SHIELD TV takes this one step further, processing AI neural networks directly on its NVIDIA Tegra system-on-a-chip to upscale 1080p-quality or lower content from nearly any streaming platform to a display’s native resolution. That means users can improve the video quality of content streamed from Netflix, Prime Video, Max, Disney+ and more at the push of a remote button.

SHIELD TV is currently available for up to $30 off in North America and £30 or 35€ off in Europe as part of Amazon’s Prime Day event running July 16-17. For Prime members in Europe, eligible SHIELD TV purchases also include one month of the GeForce NOW Ultimate membership for free, enabling GeForce RTX 4080-class PC gameplay streamed directly to the living room.

AI has enabled unprecedented improvements in video quality, helping set a new standard in streaming experiences.

Generative AI is transforming gaming, videoconferencing and interactive experiences of all kinds. Make sense of what’s new and what’s next by subscribing to the AI Decoded newsletter.
