
Events & Conferences

Journey to 1000 models: Scaling Instagram’s recommendation system

  • In this post, we explore how Instagram has successfully scaled its algorithm to include over 1000 ML models without sacrificing recommendation quality or reliability. 
  • We delve into the intricacies of managing such a vast array of models, each with its own performance characteristics and product goals. 
  • We share insights and lessons learned along the way—from the initial realization that our infrastructure maturity was lagging behind our ambitious scaling goals, to the innovative solutions we implemented to bridge these gaps.

In the ever-evolving landscape of social media, Instagram serves as a hub for creative expression and connection, continually adapting to meet the dynamic needs of its global community. At the heart of this adaptability lies a web of machine learning (ML) models, each playing a crucial role in personalizing experiences. As Instagram’s reach and influence have grown, so too has the complexity of its algorithmic infrastructure. This growth, while exciting, presents a unique set of challenges, particularly in terms of reliability and scalability.

Join us as we uncover the strategies and tools that have enabled Instagram to maintain its position at the forefront of social media innovation, ensuring a seamless and engaging experience for billions of users worldwide.

Are there really that many ML models in Instagram?

Feed, Stories, and Reels are the most visible personally ranked surfaces, but ranking goes much deeper: which comments surface under a post, which notifications are deemed “important,” and even whom you might tag in a post are all driven by ML recommendations.

Within a given surface, we’ll have different layers of the ranking funnel: sourcing (retrieval), early-stage ranking (ESR), and late-stage ranking (LSR). We operate on fewer candidates as we progress through the funnel, as the underlying operations grow more expensive (see Figure 1 below):

Figure 1: The ranking funnel.
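To make the funnel shape concrete, here is a minimal sketch in Python; the input candidates stand in for the output of sourcing (retrieval), and the candidate counts and scoring functions are purely illustrative, not our production values:

```python
from typing import Callable, List

def run_funnel(
    candidates: List[str],                # output of the sourcing/retrieval stage
    esr_score: Callable[[str], float],    # cheap early-stage ranking model
    lsr_score: Callable[[str], float],    # expensive late-stage ranking model
    esr_keep: int = 500,
    lsr_keep: int = 25,
) -> List[str]:
    # Each stage applies a progressively more expensive model to a
    # progressively smaller candidate set.
    esr_ranked = sorted(candidates, key=esr_score, reverse=True)[:esr_keep]
    lsr_ranked = sorted(esr_ranked, key=lsr_score, reverse=True)[:lsr_keep]
    return lsr_ranked
```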

Within each surface and layer, there is constant experimentation, and these permutations create a severe infrastructure challenge. We need to allow room for our ML engineers to experiment with changes such as adjusting weights for a given prediction. The net result, depicted below in Figure 2, is a large number of models serving user traffic in production:

Figure 2: An expression of the factors behind the fleet’s numerical growth.

How did we realize infra maturity wasn’t going to catch up?

Identified risks

We identified several risks associated with scaling our algorithm, rooted in complaints about ML productivity and repeating patterns of issues:

  • Discovery: Even as a team focused on one app — Instagram — we couldn’t stay on top of the growth, and product ML teams were maintaining separate sources of truth, if any, for their models in production.
  • Release: We didn’t have a consistent way to launch new models safely, and the process was slow, impacting ML velocity and, therefore, product innovation.
  • Health: We lacked a consistent definition of model prediction quality, and with the diversity of surfaces and subtlety of degraded ranking, quality issues went unnoticed.

Solution overview

To address these risks, we implemented several solutions:

  • Model registry: We built a registry that serves, foremost, as a ledger of each production model’s importance and business function, among other metadata. This registry is our foundational source of truth, upon which we can layer automation to uplevel system-wide observability, change management, and model health.
  • Model launch tooling: We developed a streamlined flow for launching new models that includes estimation, approval, prep, scale-up, and finalization. This process is now automated, and we’ve reduced the time it takes to launch a new model from days to hours.
  • Model stability: We defined and operationalized model stability, a pioneering metric that measures the accuracy of our model predictions. We’ve leveraged model stability to produce SLOs for all models in the model registry, which enables simple understanding of the entire product surface’s ML health.

Model registry

What did model investigations look like prior to the registry?

Before we created the model registry, the investigation process was a time-consuming and error-prone experience for on-call engineers and model owners. An on-call engineer had to ask model owners multiple questions, as depicted in Figure 3 below, to establish what the model does in the stack and to clarify how important it is to the business.

Figure 3: A fictional but typical non-productive investigation.

Understanding this context is extremely important to the operational response: Depending on the importance of the model and the criticality of the surface it’s supporting, the response is going to differ in kind. When a model is an experiment serving a small percentage of the traffic, an appropriate response can be to end the experiment and reroute the traffic back to the main model (the baseline). But if there’s a problem with the baseline model that needs to be handled with urgency, it’s not possible to “just turn it off.” The engineer on call has to loop in the model owner, defeating the purpose of having a dedicated on-call.

To avoid holding up an operational response on a single point of contact (POC), we needed a central source of truth for model importance and business function. What if the model owner is not available? What if 10 of these issues happen concurrently?

With the development of the model registry, we standardized the collection of model importance and business function information, ensuring most of our operational resources were going towards the most important models.

What problems did the model registry solve?

The model registry is a system of record built on top of Configerator, Meta’s distributed configuration suite. This schematized ledger (see an example in Figure 4, detailed further below) provides read-and-write access to operational data based on the inventory of production models. It’s a flexible and extensible foundation upon which we can build automation and tools to solve problems specific to individual organizations within Meta that aren’t served by the general tooling.

Figure 4: An abridged example of what a model registry entry looks like.

As Instagram scaled its investment in AI through rapid innovation in content recommendations, the number of models and AI assets grew; as a result, it has been increasingly important — but also increasingly difficult — to maintain a minimum standard for all of our models, as we lacked an authoritative source for the business context as well as for a model’s importance. 

In creating the model registry, we set out to provide a structured interface for collecting business context via model types, importance via criticality, and additional metadata that would enable model understanding. Below, we’ll get into the model types, criticality, and automation we’ve built for this purpose.

Model types

At a high level, a model type describes the purpose of an ML workload: it represents a category or class of models that share a common purpose or are used in similar contexts. For example, we have “ig_stories_tray_mtml,” which is a string attached to training flows, model checkpoints, inference services, and more. Put simply, a model type tells the reader what this model’s purpose is in the ranking funnel.

Let’s break it down: 

“ig_stories_tray_mtml” → “ig” + “stories” + “tray” + “mtml”

  • “ig”: This model is an “ig” (Instagram) model, as opposed to “fb” or “whatsapp”.
  • “stories”: This model serves IG Stories.
  • “tray”: This model serves the main IG Stories tray (as opposed to stories on some other surface).
  • “mtml”: This model is a multi-task-multi-label model, commonly used in late-stage ranking.

We can then use these model type strings to tag AI assets, and since they serve as proxies for business context, we can use them also for asset management, policy enforcement, analytics, and more.

The metadata entries in the model registry are anchored on two main types that describe model instances (ModelMetadata) as well as model types (ModelTypeMetadata). These types are made up of “core” attributes that are universally applicable, as well as “extended” attributes that allow different teams to encode their opinions about how these entries will inform operations. For example, in Instagram our extended attributes encode “baseline” and “holdout” model IDs, which are used in our ranking infrastructure to orchestrate ranking funnel execution. 
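As a rough sketch of how these two record types might be shaped, here is a hypothetical version in Python dataclasses; the field names are illustrative stand-ins, not the actual Configerator schema:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ModelTypeMetadata:
    # "Core" attributes: universally applicable business context.
    model_type: str            # e.g., "ig_stories_tray_mtml"
    criticality: str           # e.g., "TIER0" .. "TIER4"
    oncall: str                # owning team's on-call rotation
    description: str = ""
    # "Extended" attributes: team-specific operational opinions.
    extended: Dict[str, str] = field(default_factory=dict)

@dataclass
class ModelMetadata:
    model_id: str              # a specific production model instance
    model_type: str            # links back to its ModelTypeMetadata
    # Instagram's extended attributes include baseline/holdout model IDs,
    # used to orchestrate ranking-funnel execution.
    baseline_model_id: Optional[str] = None
    holdout_model_id: Optional[str] = None
```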

Criticality

In addition to defining business function, we had to establish clear guidelines for model importance. Within Meta, SEVs (incidents) and services share a unified importance-tier system: the Global Service Index (GSI) records a criticality from TIER0 to TIER4 based on the maximum incident severity the service can cause, where SEV0 is the most critical and SEV4 is simply a “heads up.” Since GSI criticality had social proof at the company and infra engineers were familiar with the system, we adopted these criticalities for models and now annotate them at both the model-type and model level.

No longer could each team unilaterally raise its own model services to TIER1, increasing the burden on every team that supports those models. To qualify for elevated monitoring, a team had to staff an on-call capable of immediate response (available 24/7) and prove that its models contributed meaningfully to critical business metrics.
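As a hypothetical illustration of that gate, an elevated tier would only be approved when both conditions hold; the field names and threshold below are assumptions for the sketch, not the actual review process:

```python
from dataclasses import dataclass

@dataclass
class TierRequest:
    model_type: str
    requested_tier: int                   # 0 (most critical) .. 4
    has_24x7_oncall: bool
    business_metric_contribution: float   # e.g., fraction of a topline metric

def approve_tier(req: TierRequest, min_contribution: float = 0.001) -> bool:
    # Lower-criticality tiers (TIER2-TIER4) are self-service.
    if req.requested_tier >= 2:
        return True
    # Elevated tiers require a 24/7 on-call and demonstrated business impact.
    return req.has_24x7_oncall and req.business_metric_contribution >= min_contribution
```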

Configuration structure as a foundation for automation 

Once we had onboarded a critical mass of Instagram models to the model registry, we could begin to fully integrate with our monitoring and observability suite using our Meta-wide configuration solution, Configerator. With this, we now have fully automated model-performance monitoring and alerts integrated with SLICK, our SLI tooling; dashboards that let us monitor models across many time-series dimensions; and model-specific alerting driven from the entries in the model registry.

This gave all our teams confidence that our monitoring coverage was complete and automated.
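A simplified sketch of that automation: walk the registry and emit monitoring configuration for every entry, so coverage is generated rather than hand-maintained. The config shape and SLO targets below are hypothetical, not SLICK’s actual schema:

```python
from typing import Dict, List

def generate_alert_configs(registry_entries: List[Dict]) -> List[Dict]:
    # Hypothetical sketch: derive per-model monitoring config from registry
    # entries so alerting coverage can't silently lag behind the fleet.
    configs = []
    for entry in registry_entries:
        tier = entry.get("criticality", "TIER4")
        configs.append({
            "model_id": entry["model_id"],
            "dashboards": ["latency", "error_rate", "prediction_volume"],
            "slo_target": 0.9999 if tier in ("TIER0", "TIER1") else 0.999,
            "page_oncall": tier in ("TIER0", "TIER1"),
        })
    return configs
```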

Launching

While a point-in-time snapshot of models in production is great for static systems, Instagram’s ML landscape is constantly shifting. With rapid iteration on the recommendation system driving an ever-increasing number of launches, it became clear that our infrastructure support was not adequate. Time-to-launch was a bottleneck in ML velocity, and we needed to drive it down.

What did the process look like?

Conventionally, services were long-lived systems with engineers dedicated to supporting and tuning them. Even when new changes introduced capacity-regression risks, we could gate them behind change-safety mechanisms.

However, our modeling and experimentation structure was unique in that we were planning for more rapid iteration, and our options were insufficient. To safely test the extent of load a new service could support, we would clone the entire service, send shadow traffic (i.e., cloned traffic that isn’t processed by our clients), and run multiple overload tests until we found a consistent peak throughput. But this wasn’t a perfect science. Sometimes we didn’t send enough traffic, and sometimes we’d send too much, and the amount could change throughout the day due to variations in global user behavior. 

This could easily take two days to get right, including actually debugging the performance itself when the results weren’t expected. Once we got the result, we’d then have to estimate the final cost. Below (in Figure 5) is the formula we landed on.

Figure 5: A formula calculating capacity estimations for a new launch.
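As a hypothetical illustration of the shape such an estimate takes (the terms and utilization target below are assumptions, not the exact formula in Figure 5), a capacity estimate combines expected peak demand, the per-replica throughput measured in load tests, and a safety margin:

```python
import math

def estimate_replicas(
    peak_demand_qps: float,            # expected peak traffic for the new model
    max_qps_per_replica: float,        # converged peak throughput from load tests
    utilization_target: float = 0.7,   # assumed headroom for spikes and failures
) -> int:
    # Illustrative only: replicas needed so that peak demand fits within
    # the target utilization of the fleet.
    return math.ceil(peak_demand_qps / (max_qps_per_replica * utilization_target))

# e.g., 50,000 QPS peak at 150 QPS per replica and 70% target utilization
# -> ceil(50000 / 105) = 477 replicas
```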

The actual traffic-shifting portion was tedious as well. For example, once we had estimated that we needed 500 replicas to host the new service, we might not actually have 500 spares lying around to do a full replacement, so launching was a delicate process of partially sizing up by approximately 20%, sending 20% of traffic over, and then scaling down the old service by 20% to reclaim and recycle the capacity. Rinse, repeat. Inefficient!
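That rinse-and-repeat cycle, sketched as a hypothetical plan generator (the 20% step follows the description above; everything else is illustrative):

```python
def plan_incremental_swap(target_replicas: int, step_fraction: float = 0.2):
    # Yield the replica moves for a step-wise launch: grow the new service
    # by ~20%, shift that slice of traffic, then reclaim the same slice
    # from the old service. Purely illustrative of the manual process.
    shifted = 0.0
    while shifted < 1.0:
        step = min(step_fraction, 1.0 - shifted)
        replicas_this_step = round(target_replicas * step)
        shifted += step
        yield {
            "scale_up_new_by": replicas_this_step,
            "route_traffic_fraction": round(shifted, 2),
            "scale_down_old_by": replicas_this_step,
        }

# For a 500-replica launch this yields five 100-replica steps,
# moving traffic over 20% at a time.
```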

And by the time we got to the end of this arduous process, the ordeal still wasn’t over. Each team was responsible for correctly setting up new alerts for their baseline in a timely fashion, or else their old models could and did trigger false alarms. 

How does enforcing virtual pools aid product growth?

One of the prerequisites for fixing competition for resources and unblocking productivity was to put up guardrails. Prior to this, it was “first come first served,” with no clear way to even “reserve” future freed capacity. It was also hard to reason about fairness from an infra perspective: Would it make sense to give each team equal pools, or give each individual person a maximum limit? 

As it turned out, not all MLEs are experimenting at the same time, due to staggered progress on their work, so individual (per-engineer) limits were not ideal. One member might be in the experimentation stage and another might be training. So our solution was to provide bandwidth to each team. 

Once each team — and therefore product — had quotas distributed, their launch policy became more clear cut. Some teams established free launching as long as the team was within quota. Others required no regressions in capacity usage. But mostly this unlocked our ability to run launches in parallel, since each one required much less red tape, and prioritization was no longer done at the org level.

What other tooling improved launching?

As mentioned earlier, preplanning with capacity estimations was critical to understanding cost and ensuring reliability. We were often asked, Why not let autoscaling take care of everything? The problem was that each service could be configured slightly differently than a previously optimized service, or some architectural change could have affected the performance of the model. We didn’t have an infinite amount of supply to work with, so by the time we fully traffic-shifted everything over, we might find that we didn’t have enough supply. Reverting is costly, taking hours to get through each stage.

Doing capacity estimations in advance also allowed us and each team to accurately weigh a metric improvement against its cost. It might be worthwhile to double our costs if something would increase time spent on the app by 1%, but likely not for a 0.05% improvement, where we could better spend that capacity funding another initiative.

With partners in AI Infra, we developed two major solutions to this process: offline performance evaluation and an automated launching platform.

We simplified determining the performance of a new service by using recorded traffic. Pre-recorded traffic was continuously collected into a data warehouse that the benchmarker could read from, and we’d spin up temporary jobs with this automation. One job would continuously replay different levels of traffic and send it to another job that was a clone of the existing experiment. By putting stoppers on desired latency and error rates, the tooling would eventually output a converged, stable number that we could understand as the max load (see Figure 6).

Figure 6: Load tests converging on an accurate measure of load.
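A simplified sketch of that convergence loop, assuming a hypothetical run_load_test helper that replays recorded traffic at a given rate and reports latency and error rate:

```python
def find_max_load(
    run_load_test,               # callable: qps -> (p99_latency_ms, error_rate)
    latency_budget_ms: float,
    error_budget: float,
    low_qps: float = 0.0,
    high_qps: float = 20_000.0,
    tolerance_qps: float = 100.0,
) -> float:
    # Binary-search the highest replay rate that stays within the latency
    # and error-rate "stoppers". Illustrative sketch only.
    while high_qps - low_qps > tolerance_qps:
        mid = (low_qps + high_qps) / 2
        p99_ms, err = run_load_test(mid)
        if p99_ms <= latency_budget_ms and err <= error_budget:
            low_qps = mid    # healthy at this level; try higher
        else:
            high_qps = mid   # unhealthy; back off
    return low_qps
```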

The launch platform itself would input the numbers we captured from these tests, automatically collect demand data as defined, and run that same formula to calculate a cost. The platform would then perform the upscaling/downscaling cycle for teams as we shifted traffic.

And finally, by leveraging the model registry, we were able to land this model change in code (see the example in Figure 7), helping us better maintain and understand the 1000+ models within our fleet. Likewise, this bolstered our trust in the model registry, which was now directly tied to the model launch lifecycle.

Figure 7: A theoretical model registry change during launch.

This suite of launch automation has dramatically reduced the class of SEVs related to model launches, improved our pace of innovation from a few to more than 10 launches per week, and reduced the amount of time engineers spend conducting a launch by more than two days.

Model stability

As the number of models in production increased, our organization started to feel the effects of an inconsistent measure of model health. While ranking models run like any other distributed backend system (receive a request, produce a response), one might think a universal SLO measuring request success rate would suffice to capture holistic health. This is not the case for ranking models, because the accuracy of the recommendations returned carries significant weight in the end-user experience. Consider a user who is a huge fan of golf but does not enjoy cooking content (see the “available & irrelevant” case in Figure 8 below): serving them cooking content is an example of this inaccuracy in practice. This is precisely what the model stability metric sought to capture.

Figure 8: Different types of responses that can be provided to an end user.

Why is measuring ranking model reliability unique?

Ranking models, unlike traditional idempotent request/response backends, produce scores predicting user actions for a set of candidates (PLIKE, PCOMMENT, PFOLLOW, etc.). These scores are then combined and used to determine which candidates are most relevant to an end user. It’s important that these scores accurately reflect user interest, as their accuracy is directly correlated with user engagement. If we recommend irrelevant content, user engagement suffers. The model stability metric was designed to make it easy to measure this accuracy, and detect inaccuracy, at our scale.

Let’s discuss how this works.

Defining model stability

Models are complex, and they produce multiple output predictions. Let’s take a simplified example (shown in Figure 9 below) of a multi-task-multi-label (MTML) model predicting three actions:

Figure 9: A simplified MTML model predicting three actions.

For us to claim this model is stable, we must also claim that each underlying prediction is stable.

When evaluating the accuracy of a ranking model’s predictions, we typically look at two metrics:

  • Model calibration is based on observed real-world outcomes and answers the question, “Are we over- or under-predicting user action?” It is calculated as the ratio of predicted click-through rate (CTR) to empirical CTR. A perfect predictor has calibration centered at 1.
  • Model normalized entropy (NE) measures the discriminative power of a predictor and answers the question, “How well can this predictor separate action from inaction?” It is calculated as the ratio of the average log-loss per impression to what the average log-loss per impression would be if we always predicted the empirical CTR. With NE, lower values are better, and an NE of 1 is equivalent to random predictions.

(For more information regarding our choice of prediction evaluation metrics, please refer to the paper, “Practical Lessons from Predicting Clicks on Ads at Facebook.”)
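For concreteness, here is a small sketch of both metrics over a window of (predicted probability, observed label) pairs, following the definitions above; the implementation details are illustrative:

```python
import numpy as np

def calibration(preds: np.ndarray, labels: np.ndarray) -> float:
    # Ratio of predicted CTR to empirical CTR; a perfect predictor centers at 1.0.
    return preds.mean() / labels.mean()

def normalized_entropy(preds: np.ndarray, labels: np.ndarray, eps: float = 1e-12) -> float:
    # Average log-loss per impression, normalized by the log-loss of always
    # predicting the empirical CTR. Lower is better; 1.0 matches always
    # predicting the empirical CTR.
    p = np.clip(preds, eps, 1 - eps)
    ctr = labels.mean()
    log_loss = -(labels * np.log(p) + (1 - labels) * np.log(1 - p)).mean()
    baseline = -(ctr * np.log(ctr) + (1 - ctr) * np.log(1 - ctr))
    return log_loss / baseline
```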

A model’s predictions are unstable when either calibration or NE falls outside its expected healthy range. To determine what a healthy range is, we must look at each metric in real time; Figure 10 below shows what these time series can look like:

Figure 10: Example predictions of calibration and NE over a period of time.

By observing the trend of a healthy prediction, we can apply thresholds for our evaluation metrics. When these thresholds are breached, the underlying prediction is considered unstable.

From here, we can define model stability as a binary indicator across a model’s predictions: it is 1 if all underlying predictions are stable, and 0 if any prediction is unstable. This is an extremely powerful method of reacting to real-time prediction instability, as well as a tool for understanding trends in predictive health per model or across distinct products’ ranking funnels.
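Putting it together, model stability can be sketched as an all-of check over per-prediction thresholds; the threshold values in the example are placeholders, since in practice they are derived from each prediction’s observed healthy range:

```python
from typing import Dict, Tuple

# prediction name -> (calibration, normalized entropy) observed in real time
PredictionMetrics = Dict[str, Tuple[float, float]]
# prediction name -> ((calibration_min, calibration_max), ne_max)
StabilityThresholds = Dict[str, Tuple[Tuple[float, float], float]]

def model_stability(metrics: PredictionMetrics, thresholds: StabilityThresholds) -> int:
    # 1 if every underlying prediction is within its healthy range, else 0.
    for task, (cal, ne) in metrics.items():
        (cal_min, cal_max), ne_max = thresholds[task]
        if not (cal_min <= cal <= cal_max) or ne > ne_max:
            return 0
    return 1

# Example with placeholder thresholds for an MTML model's three tasks:
# model_stability(
#     {"PLIKE": (1.02, 0.81), "PCOMMENT": (0.97, 0.86), "PFOLLOW": (1.10, 0.90)},
#     {"PLIKE": ((0.9, 1.1), 0.95), "PCOMMENT": ((0.9, 1.1), 0.95), "PFOLLOW": ((0.9, 1.1), 0.95)},
# )  # -> 1
```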

Operationalizing model stability

With a real-time view of model predictive health, we can apply this unified definition of model stability to all of our models in production, once again using the model registry as the ledger that holds this data. In Figure 11 below, we can see the addition of model stability metric metadata after we determined the expected thresholds.

Figure 11: Model stability definitions stored in the model registry.

Given the large number of models in production, each producing many predictions, building a portable definition of model health applicable to all of our ranking models represented an important milestone toward upleveling Instagram’s ML infrastructure maturity. This has unlocked our ability to build generic alerting to guarantee detection of our most important models becoming unstable, thereby moving us closer to mitigation when our recommendation system is at risk. 

Since the addition of these metrics and alerting, ML teams have discovered previously hidden issues within their models and addressed them faster than before, leading to higher-quality recommendations.

Key takeaways

In our journey to scale Instagram’s algorithm to manage over 1000 models, we have learned several critical lessons that have shaped our approach and infrastructure. These takeaways not only highlight the challenges we faced but also underscore the strategies that led to our success.

Infra understanding is the foundation to building the right tools

A unified understanding of our infrastructure footprint was essential in developing the right tools to support our scaling efforts. By identifying the gaps and potential risks in our existing systems, we were able to implement solutions such as the model registry that significantly improved our operational efficiency and reliability posture.

Helping colleagues move fast means we all move faster

By addressing the model iteration bottleneck, we enabled our teams to innovate more rapidly. Our focus on creating a seamless, self-service process for model iteration empowered client teams to take ownership of their workflows. This not only accelerated their progress but also reduced the operational burden on our infrastructure team. As a result, the entire organization benefited from increased agility and productivity.

Reliability must consider quality

Ensuring the reliability of our models required us to redefine how we measure and maintain model quality. By operationalizing model stability and establishing clear metrics for model health, we were able to proactively manage the performance of our models. This approach enables us to maintain high standards of quality across our recommendation systems, ultimately enhancing user engagement and satisfaction.

Our experience in scaling Instagram’s recommendation system has reinforced the importance of infrastructure understanding, collaboration, and a focus on quality. By building robust tools and processes, we have not only improved our own operations but also empowered our colleagues to drive innovation and growth across the platform.






Events & Conferences

An inside look at Meta’s transition from C to Rust on mobile

Have you ever worked in legacy code? Are you curious what it takes to modernize systems at a massive scale?

Pascal Hartig is joined on the latest Meta Tech Podcast by Elaine and Buping, two software engineers working on a bold project to rewrite the decades-old C code in one of Meta’s core messaging libraries in Rust. It’s an ambitious effort that will transform a central messaging library that is shared across Messenger, Facebook, Instagram, and Meta’s AR/VR platforms.

They discuss what it takes to tackle a project of this scope, even without a background in Rust, how they’re approaching it, and what it means to optimize for “developer happiness.”

You can find the episode wherever you get your podcasts.

The Meta Tech Podcast is brought to you by Meta and highlights the work Meta’s engineers are doing at every level, from low-level frameworks to end-user features.

Send us feedback on Instagram, Threads, or X.

And if you’re interested in learning more about career opportunities at Meta, visit the Meta Careers page.






Events & Conferences

Amazon Research Awards recipients announced


Amazon Research Awards (ARA) provides unrestricted funds and AWS Promotional Credits to academic researchers investigating various research topics in multiple disciplines. This cycle, ARA received many excellent research proposals from across the world and today is publicly announcing 73 award recipients who represent 46 universities in 10 countries.

This announcement includes awards funded under five calls for proposals during the fall 2024 cycle: AI for Information Security, Automated Reasoning, AWS AI, AWS Cryptography, and Sustainability. Proposals were reviewed for the quality of their scientific content and their potential to impact both the research community and society. Additionally, Amazon encourages the publication of research results, presentations of research at Amazon offices worldwide, and the release of related code under open-source licenses.

Recipients have access to more than 700 Amazon public datasets and can utilize AWS AI/ML services and tools through their AWS Promotional Credits. Recipients also are assigned an Amazon research contact who offers consultation and advice, along with opportunities to participate in Amazon events and training sessions.


“Automated Reasoning is an important area of research for Amazon, with potential applications across various features and applications to help improve security, reliability, and performance for our customers. Through the ARA program, we collaborate with leading academic researchers to explore challenges in this field,” said Robert Jones, senior principal scientist with the Cloud Automated Reasoning Group. “We were again impressed by the exceptional response to our Automated Reasoning call for proposals this year, receiving numerous high-quality submissions. Congratulations to the recipients! We’re excited to support their work and partner with them as they develop new science and technology in this important area.”


“At Amazon, we believe that solving the world’s toughest sustainability challenges benefits from both breakthrough scientific research and open and bold collaboration. Through programs like the Amazon Research Awards program, we aim to support academic research that could contribute to our understanding of these complex issues,” said Kommy Weldemariam, Director of Science and Innovation Sustainability. “The selected proposals represent innovative projects that we hope will help advance knowledge in this field, potentially benefiting customers, communities, and the environment.”

ARA funds proposals throughout the year in a variety of research areas. Applicants are encouraged to visit the ARA call for proposals page for more information or send an email to be notified of future open calls.

The tables below list, in alphabetical order by last name, fall 2024 cycle call-for-proposal recipients, sorted by research area.

AI for Information Security

Recipient | University | Research title
Christopher Amato | Northeastern University | Multi-Agent Reinforcement Learning Cyber Defense for Securing Cloud Computing Platforms
Bernd Bischl | Ludwig Maximilian University of Munich | Improving Generative and Foundation Models Reliability via Uncertainty-awareness
Shiqing Ma | University of Massachusetts Amherst | LLM and Domain Adaptation for Attack Detection
Alina Oprea | Northeastern University | Multi-Agent Reinforcement Learning Cyber Defense for Securing Cloud Computing Platforms
Roberto Perdisci | University of Georgia | ContextADBench: A Comprehensive Benchmark Suite for Contextual Anomaly Detection

Automated Reasoning

Recipient | University | Research title
Nada Amin | Harvard University | LLM-Augmented Semi-Automated Proofs for Interactive Verification
Suguman Bansal | Georgia Institute of Technology | Certified Inductive Generalization in Reinforcement Learning
Ioana Boureanu | University of Surrey | Phoebe+: An Automated-Reasoning Tool for Provable Privacy in Cryptographic Systems
Omar Haider Chowdhury | Stony Brook University | Restricter: An Automatic Tool for Authoring Amazon Cedar Access Control Policies with the Principle of Least Privilege
Stefan Ciobaca | Alexandru Ioan Cuza University | An Interactive Proof Mode for Dafny
João Ferreira | INESC-ID | Polyglot Automated Program Repair for Infrastructure as Code
Sicun Gao | University of California, San Diego | Monte Carlo Trees with Conflict Models for Proof Search
Mirco Giacobbe | University of Birmingham | Neural Software Verification
Tobias Grosser | University of Cambridge | Synthesis-based Symbolic BitVector Simplification for Lean
Ronghui Gu | Columbia University | Scaling Formal Verification of Security Properties for Unmodified System Software
Alexey Ignatiev | Monash University | Huub: Next-Gen Lazy Clause Generation
Kenneth McMillan | University of Texas at Austin | Synthesis of Auxiliary Variables and Invariants for Distributed Protocol Verification
Alexandra Mendes | University of Porto | Overcoming Barriers to the Adoption of Verification-Aware Languages
Jason Nieh | Columbia University | Scaling Formal Verification of Security Properties for Unmodified System Software
Rohan Padhye | Carnegie Mellon University | Automated Synthesis and Evaluation of Property-Based Tests
Nadia Polikarpova | University of California, San Diego | Discovering and Proving Critical System Properties with LLMs
Fortunat Rajaona | University of Surrey | Phoebe+: An Automated-Reasoning Tool for Provable Privacy in Cryptographic Systems
Subhajit Roy | Indian Institute of Technology Kanpur | Theorem Proving Modulo LLM
Gagandeep Singh | University of Illinois at Urbana–Champaign | Trustworthy LLM Systems using Formal Contracts
Scott Stoller | Stony Brook University | Restricter: An Automatic Tool for Authoring Amazon Cedar Access Control Policies with the Principle of Least Privilege
Peter Stuckey | Monash University | Huub: Next-Gen Lazy Clause Generation
Yulei Sui | University of New South Wales | Path-Sensitive Typestate Analysis through Sparse Abstract Execution
Nikos Vasilakis | Brown University | Semantics-Driven Static Analysis for the Unix/Linux Shell
Ping Wang | Stevens Institute of Technology | Leveraging Large Language Models for Reasoning Augmented Searching on Domain-specific NoSQL Database
John Wawrzynek | University of California, Berkeley | GPU-Accelerated High-Throughput SAT Sampling

AWS AI

Recipient | University | Research title
Panagiotis Adamopoulos | Emory University | Generative AI solutions for The Spillover Effect of Fraudulent Reviews on Product Recommendations
Vikram Adve | University of Illinois at Urbana–Champaign | Fellini: Differentiable ML Compiler for Full-Graph Optimization for LLM Models
Frances Arnold | California Institute of Technology | Closed-loop Generative Machine Learning for De Novo Enzyme Discovery and Optimization
Yonatan Bisk | Carnegie Mellon University | Useful, Safe, and Robust Multiturn Interactions with LLMs
Shiyu Chang | University of California, Santa Barbara | Cut the Crap: Advancing the Efficient Communication of Multi-Agent Systems via Spatial-Temporal Topology Design and KV Cache Sharing
Yuxin Chen | University of Pennsylvania | Provable Acceleration of Diffusion Models for Modern Generative AI
Tianlong Chen | University of North Carolina at Chapel Hill | Cut the Crap: Advancing the Efficient Communication of Multi-Agent Systems via Spatial-Temporal Topology Design and KV Cache Sharing
Mingyu Ding | University of North Carolina at Chapel Hill | Aligning Long Videos and Language as Long-Horizon World Models
Nikhil Garg | Cornell University | Market Design for Responsible Multi-agent LLMs
Jessica Hullman | Northwestern University | Human-Aligned Uncertainty Quantification in High Dimensions
Christopher Jermaine | Rice University | Fast, Trusted AI Using the EINSUMMABLE Compiler
Yunzhu Li | Columbia University | Physics-Informed Foundation Models Through Embodied Interactions
Pattie Maes | Massachusetts Institute of Technology | Understanding How LLM Agents Deviate from Human Choices
Sasa Misailovic | University of Illinois at Urbana–Champaign | Fellini: Differentiable ML Compiler for Full-Graph Optimization for LLM Models
Kristina Monakhova | Cornell University | Trustworthy extreme imaging for science using interpretable uncertainty quantification
Todd Mowry | Carnegie Mellon University | Efficient LLM Serving on Trainium via Kernel Generation
Min-hwan Oh | Seoul National University | Mutually Beneficial Interplay Between Selection Fairness and Context Diversity in Contextual Bandits
Patrick Rebeschini | University of Oxford | Optimal Regularization for LLM Alignment
Jose Renau | University of California, Santa Cruz | Verification Constrained Hardware Optimization using Intelligent Design Agentic Programming
Vilma Todri | Emory University | Generative AI solutions for The Spillover Effect of Fraudulent Reviews on Product Recommendations
Aravindan Vijayaraghavan | Northwestern University | Human-Aligned Uncertainty Quantification in High Dimensions
Wei Yang | University of Texas at Dallas | Optimizing RISC-V Compilers with RISC-LLM and Syntax Parsing
Huaxiu Yao | University of North Carolina at Chapel Hill | Aligning Long Videos and Language as Long-Horizon World Models
Amy Zhang | University of Washington | Tools for Governing AI Agent Autonomy
Ruqi Zhang | Purdue University | Efficient Test-time Alignment for Large Language Models and Large Multimodal Models
Zheng Zhang | Rutgers University-New Brunswick | AlphaQC: An AI-powered Quantum Circuit Optimizer and Denoiser

AWS Cryptography

Recipient | University | Research title
Alexandra Boldyreva | Georgia Institute of Technology | Quantifying Information Leakage in Searchable Encryption Protocols
Maria Eichlseder | Graz University of Technology, Austria | SALAD – Systematic Analysis of Lightweight Ascon-based Designs
Venkatesan Guruswami | University of California, Berkeley | Obfuscation, Proof Systems, and Secure Computation: A Research Program on Cryptography at the Simons Institute for the Theory of Computing
Joseph Jaeger | Georgia Institute of Technology | Analyzing Chat Encryption for Group Messaging
Aayush Jain | Carnegie Mellon | Large Scale Multiparty Silent Preprocessing for MPC from LPN
Huijia Lin | University of Washington | Large Scale Multiparty Silent Preprocessing for MPC from LPN
Hamed Nemati | KTH Royal Institute of Technology | Trustworthy Automatic Verification of Side-Channel Countermeasures for Binary Cryptographic Programs using the HoIBA library
Karl Palmskog | KTH Royal Institute of Technology | Trustworthy Automatic Verification of Side-Channel Countermeasures for Binary Cryptographic Programs using the HoIBA library
Chris Peikert | University of Michigan, Ann Arbor | Practical Third-Generation FHE and Bootstrapping
Dimitrios Skarlatos | Carnegie Mellon University | Scale-Out FHE LLMs on GPUs
Vinod Vaikuntanathan | Massachusetts Institute of Technology | Can Quantum Computers (Really) Factor?
Daniel Wichs | Northeastern University | Obfuscation, Proof Systems, and Secure Computation: A Research Program on Cryptography at the Simons Institute for the Theory of Computing
David Wu | University of Texas at Austin | Fast Private Information Retrieval and More using Homomorphic Encryption

Sustainability

Recipient | University | Research title
Meeyoung Cha | Max Planck Institute | Forest-Blossom (Flossom): A New Framework for Sustaining Forest Biodiversity Through Outcome-Driven Remote Sensing Monitoring
Jingrui He | University of Illinois at Urbana–Champaign | Foundation Model Enabled Earth’s Ecosystem Monitoring
Pedro Lopes | University of Chicago | AI-powered Tools that Enable Engineers to Make & Re-make Sustainable Hardware
Cheng Yaw Low | Max Planck Institute | Forest-Blossom (Flossom): A New Framework for Sustaining Forest Biodiversity Through Outcome-Driven Remote Sensing Monitoring






Events & Conferences

Independent evaluations demonstrate Nova Premier’s safety

AI safety is a priority at Amazon. Our investment in safe, transparent, and responsible AI (RAI) includes collaboration with the global community and policymakers. We are members of and collaborate with organizations such as the Frontier Model Forum, the Partnership on AI, and other forums organized by government agencies such as the National Institute of Standards and Technology (NIST). Consistent with Amazon’s endorsement of the Korea Frontier AI Safety Commitments, we published our Frontier Model Safety Framework earlier this year.

Amazon Nova Premier’s guardrails help prevent generation of unsafe content.

During the development of the Nova Premier model, we conducted a comprehensive evaluation to assess its performance and safety. This included testing on both internal and public benchmarks, as well as internal, automated, and third-party red-teaming exercises. Once the final model was ready, we prioritized obtaining unbiased, third-party evaluations of the model’s robustness against RAI controls. In this post, we outline the key findings from these evaluations, demonstrating the strength of our testing approach and Nova Premier’s standing as a safe model. Specifically, we cover our evaluations with two third-party evaluators: PRISM AI and ActiveFence.

Evaluation of Nova Premier against PRISM AI

PRISM Eval’s Behavior Elicitation Tool (BET) dynamically and systematically stress-tests AI models’ safety guardrails. The methodology focuses on measuring how many adversarial attempts (steps) it takes to get a model to generate harmful content across several key risk dimensions. The central metric is “steps to elicit” — the number of increasingly sophisticated prompting attempts required before a model generates an inappropriate response. A higher number of steps indicates stronger safety measures, as the model is more resistant to manipulation. The PRISM risk dimensions (inspired by the MLCommons AI Safety Benchmarks) include CBRNE weapons, violent crimes, non-violent crimes, defamation, and hate, amongst several others.


Using the BET Eval tool and its V1.0 metric, which is tailored toward non-reasoning models, we compared the recently released Nova models (Pro and Premier) to the latest models in the same class: Claude (3.5 v2 and 3.7 non-reasoning) and Llama4 Maverick, all available through Amazon Bedrock. PRISM BET conducts black-box evaluations (where model developers don’t have access to the test prompts) of models integrated with their API. The evaluation conducted with BET Eval MAX, PRISM’s most comprehensive/aggressive testing suite, revealed significant variations in safety against malicious instructions. Nova models demonstrated superior overall safety performance, with an average of 43 steps for Premier and 52 steps for Pro, compared to 37.7 for Claude 3.5 v2 and fewer than 12 steps for other models in the comparison set (namely, 9.9 for Claude 3.7, 11.5 for Claude 3.7 thinking, and 6.5 for Maverick). This higher step count suggests that on average, Nova’s safety guardrails are more sophisticated and harder to circumvent through adversarial prompting. The figure below presents the number of steps per harm category evaluated through BET Eval MAX.

Results of tests using PRISM’s BET Eval MAX testing suite.

The PRISM evaluation provides valuable insights into the relative safety of different Amazon Bedrock models. Nova’s strong performance, particularly in hate speech and defamation resistance, represents meaningful progress in AI safety. However, the results also highlight the ongoing challenge of building truly robust safety measures into AI systems. As the field continues to evolve, frameworks like BET will play an increasingly important role in benchmarking and improving AI safety. As a part of this collaboration Nicolas Miailhe, CEO of PRISM Eval, said, “It’s incredibly rewarding for us to see Nova outperforming strong baselines using the BET Eval MAX; our aim is to build a long-term partnership toward safer-by-design models and to make BET available to various model providers.” Organizations deploying AI systems should carefully consider these safety metrics when selecting models for their applications.

Manual red teaming with ActiveFence

The AI safety & security company ActiveFence benchmarked Nova Premier on Bedrock on prompts distributed across Amazon’s eight core RAI categories. ActiveFence also evaluated Claude 3.7 (non-reasoning mode) and GPT 4.1 API on the same set. The flag rate on Nova Premier was lower than that on the other two models, indicating that Nova Premier is the safest of the three.

Model | 3P Flag Rate [↓ is better]
Nova Premier | 12.0%
Sonnet 3.7 (non-reasoning) | 20.6%
GPT4.1 API | 22.4%


“Our role is to think like an adversary but act in service of safety,” said Guy Paltieli from ActiveFence. “By conducting a blind stress test of Nova Premier under realistic threat scenarios, we helped evaluate its security posture in support of Amazon’s broader responsible-AI goals, ensuring the model could be deployed with greater confidence.”

These evaluations conducted with PRISM and ActiveFence give us confidence in the strength of our guardrails and our ability to protect our customers’ safety when they use our models. While these evaluations demonstrate strong safety performance, we recognize that AI safety is an ongoing challenge requiring continuous improvement. These assessments represent a point-in-time snapshot, and we remain committed to regular testing and enhancement of our safety measures. No AI system can guarantee perfect safety in all scenarios, which is why we maintain monitoring and response systems after deployment.

Acknowledgments: Vincent Ponzo, Elyssa Vincent




