Journey to 1000 models: Scaling Instagram’s recommendation system
- In this post, we explore how Instagram has successfully scaled its algorithm to include over 1000 ML models without sacrificing recommendation quality or reliability.
- We delve into the intricacies of managing such a vast array of models, each with its own performance characteristics and product goals.
- We share insights and lessons learned along the way—from the initial realization that our infrastructure maturity was lagging behind our ambitious scaling goals, to the innovative solutions we implemented to bridge these gaps.
In the ever-evolving landscape of social media, Instagram serves as a hub for creative expression and connection, continually adapting to meet the dynamic needs of its global community. At the heart of this adaptability lies a web of machine learning (ML) models, each playing a crucial role in personalizing experiences. As Instagram’s reach and influence have grown, so too has the complexity of its algorithmic infrastructure. This growth, while exciting, presents a unique set of challenges, particularly in terms of reliability and scalability.
Join us as we uncover the strategies and tools that have enabled Instagram to maintain its position at the forefront of social media innovation, ensuring a seamless and engaging experience for billions of users worldwide.
Are there really that many ML models in Instagram?
Though Feed, Stories, and Reels are the best-known personally ranked surfaces, ranking goes much deeper: it determines which comments surface in Feed, which notifications are “important,” and even whom you might tag in a post. These are all driven by ML recommendations.
Within a given surface, we’ll have different layers of the ranking funnel: sourcing (retrieval), early-stage ranking (ESR), and late-stage ranking (LSR). We operate on fewer candidates as we progress through the funnel, as the underlying operations grow more expensive (see Figure 1 below):
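To make the funnel shape concrete, here is a minimal Python sketch of a three-stage funnel. The stage names follow the description above, but the candidate counts, cutoffs, and scoring callables are illustrative placeholders, not Instagram’s actual implementation.

```python
# Illustration only: real sourcing and ranking stages are far more involved.
from typing import Callable, List

def run_funnel(
    user_id: str,
    source: Callable[[str], List[str]],     # sourcing / retrieval: broad candidate pool
    esr_score: Callable[[str], float],      # early-stage ranking: cheap model
    lsr_score: Callable[[str], float],      # late-stage ranking: expensive model
) -> List[str]:
    pool = source(user_id)                                       # e.g., thousands of candidates
    shortlist = sorted(pool, key=esr_score, reverse=True)[:500]  # trim cheaply to hundreds
    return sorted(shortlist, key=lsr_score, reverse=True)[:25]   # rank expensively, keep tens
```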
Within each surface and layer, there is constant experimentation, and these permutations create a severe infrastructure challenge. We need to allow room for our ML engineers to experiment with changes such as adjusting weights for a given prediction. The net result, depicted below in Figure 2, is a large number of models serving user traffic in production:
How did we realize infra maturity wasn’t going to catch up?
Identified risks
We identified several risks associated with scaling our algorithm, rooted in complaints about ML productivity and repeating patterns of issues:
- Discovery: Even as a team focused on one app — Instagram — we couldn’t stay on top of the growth, and product ML teams were maintaining separate sources of truth, if any, for their models in production.
- Release: We didn’t have a consistent way to launch new models safely, and the process was slow, impacting ML velocity and, therefore, product innovation.
- Health: We lacked a consistent definition of model prediction quality, and with the diversity of surfaces and subtlety of degraded ranking, quality issues went unnoticed.
Solution overview
To address these risks, we implemented several solutions:
- Model registry: We built a registry that serves as a ledger for production models, recording, above all, each model’s importance and business function, among other metadata. This registry is our foundational source of truth, upon which we can layer automation to uplevel system-wide observability, change management, and model health.
- Model launch tooling: We developed a streamlined flow for launching new models that includes estimation, approval, prep, scale-up, and finalization. This process is now automated, and we’ve reduced the time it takes to launch a new model from days to hours.
- Model stability: We defined and operationalized model stability, a pioneering metric that measures the accuracy of our model predictions. We’ve leveraged model stability to produce SLOs for all models in the model registry, which enables simple understanding of the entire product surface’s ML health.
Model registry
What did model investigations look like prior to the registry?
Before we created the model registry, the investigation process was a time-consuming and error-prone experience for on-call engineers and model owners. An on-call engineer had to ask model owners multiple questions, as depicted in Figure 3 below, to understand what the model does in the stack and how important it is to the business.
Understanding this context is extremely important to the operational response: Depending on the importance of the model and the criticality of the surface it’s supporting, the response is going to differ in kind. When a model is an experiment serving a small percentage of the traffic, an appropriate response can be to end the experiment and reroute the traffic back to the main model (the baseline). But if there’s a problem with the baseline model that needs to be handled with urgency, it’s not possible to “just turn it off.” The engineer on call has to loop in the model owner, defeating the purpose of having a dedicated on-call.
To avoid holding up an operational response on a single point of contact (POC), we needed a central source of truth for model importance and business function. What if the model owner is not available? What if 10 of these issues happen concurrently?
With the development of the model registry, we standardized the collection of model importance and business function information, ensuring most of our operational resources were going towards the most important models.
What problems did the model registry solve?
The model registry is a system of record built on top of Configerator, Meta’s distributed configuration suite. This schematized ledger (see an example in Figure 4, detailed further below) provides read-and-write access to operational data based on the inventory of production models. It’s a flexible and extensible foundation upon which one can build automation and tools to solve problems that are specific to individual organizations within Meta and are not served by the general tooling.
As Instagram scaled its investment in AI through rapid innovation in content recommendations, the number of models and AI assets grew; as a result, it has been increasingly important — but also increasingly difficult — to maintain a minimum standard for all of our models, as we lacked an authoritative source for the business context as well as for a model’s importance.
In creating the model registry, we set out to provide a structured interface for collecting business context via model types, importance via criticality, and additional metadata that would enable model understanding. Below, we’ll get into the model types, criticality, and automation we’ve built for this purpose.
Model types
At a high level, a model type describes the purpose of an ML workload: it represents a category or class of models that share a common purpose or are used in similar contexts. For example, we have “ig_stories_tray_mtml,” a string attached to training flows, model checkpoints, inference services, and more. Put simply, a model type tells the reader what a model does in the ranking funnel.
Let’s break it down:
“ig_stories_tray_mtml” → “ig” “stories” “tray” “mtml”
- “ig”: This model is an “ig” model as opposed to “fb” or “whatsapp”.
- “stories”: This model serves IG Stories.
- “tray”: This model serves in the main IG Stories tray (as opposed to stories in some other surface).
- “mtml”: This model is a multi-task-multi-label model, commonly used in late-stage ranking.
We can then use these model type strings to tag AI assets, and since they serve as proxies for business context, we can use them also for asset management, policy enforcement, analytics, and more.
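As a rough illustration, such a string can be split back into the facets described above; the facet names below are our own labels for this example, not a documented schema.

```python
# Hypothetical helper: splits a model type such as "ig_stories_tray_mtml"
# into app, product, surface, and model-architecture facets.
def parse_model_type(model_type: str) -> dict:
    app, product, surface, architecture = model_type.split("_", 3)
    return {
        "app": app,                    # "ig" as opposed to "fb" or "whatsapp"
        "product": product,            # e.g., "stories"
        "surface": surface,            # e.g., "tray"
        "architecture": architecture,  # e.g., "mtml" (multi-task-multi-label)
    }

assert parse_model_type("ig_stories_tray_mtml") == {
    "app": "ig", "product": "stories", "surface": "tray", "architecture": "mtml",
}
```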
The metadata entries in the model registry are anchored on two main types that describe model instances (ModelMetadata) as well as model types (ModelTypeMetadata). These types are made up of “core” attributes that are universally applicable, as well as “extended” attributes that allow different teams to encode their opinions about how these entries will inform operations. For example, in Instagram our extended attributes encode “baseline” and “holdout” model IDs, which are used in our ranking infrastructure to orchestrate ranking funnel execution.
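Below is a minimal sketch of what such entries might look like, using Python dataclasses in place of Meta’s actual Configerator schema; aside from the baseline/holdout IDs, criticality, and model type mentioned in this post, the field names are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelTypeMetadata:
    # "Core" attributes: universally applicable.
    model_type: str                  # e.g., "ig_stories_tray_mtml"
    criticality: str                 # "TIER0" .. "TIER4" (see the next section)
    # "Extended" attributes: team-specific operational opinions.
    baseline_model_id: Optional[str] = None   # used to orchestrate funnel execution
    holdout_model_id: Optional[str] = None

@dataclass
class ModelMetadata:
    # Describes a concrete model instance and links it back to its type.
    model_id: str
    model_type: str
    is_experiment: bool = False
    extended: dict = field(default_factory=dict)  # room for org-specific metadata
```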
Criticality
In addition to defining business function, we had to establish clear guidelines for model importance. Within Meta, SEVs and services share a unified importance-tier system: the Global Service Index (GSI) records a criticality from TIER0 to TIER4 based on the maximum incident severity level the service can cause, with SEV0 the most critical and SEV4 simply a “heads up.” Since GSI criticality had social proof at the company and infra engineers were familiar with this system, we adopted these criticalities for models and now annotate them at both the model-type and model level.
No longer would each team unilaterally raise their own model services to TIER1, increasing the burden on all teams that support those models. To qualify for elevated monitoring, a team now has to provide an immediate, 24/7 on-call response and be able to prove that their models contribute meaningfully to critical business metrics.
Configuration structure as a foundation for automation
Once we had onboarded a critical mass of Instagram models to the model registry, we could fully integrate with our monitoring and observability suite using our Meta-wide configuration solution, Configerator. With this, we now have fully automated model-performance monitoring and alerting integrated with SLICK, our tooling for SLIs; dashboards that let us monitor models across many time-series dimensions; and a suite of model-specific alerting driven directly from entries in the model registry.
This provided all our teams confidence that our monitoring coverage was complete and automated.
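As a hedged sketch of the kind of automation the registry enables, the snippet below derives a monitoring config from a registry entry; the config fields and thresholds are assumptions for illustration, not SLICK’s or Configerator’s actual schema.

```python
# Illustration only: generate a per-model alert config from registry data.
def make_slo_config(model_type: str, criticality: str) -> dict:
    strict = criticality in ("TIER0", "TIER1")
    return {
        "model_type": model_type,
        "prediction_success_rate_slo": 0.999 if strict else 0.99,  # placeholder targets
        "page_oncall": strict,   # only the most critical tiers page someone 24/7
    }

# One config per registry entry, regenerated whenever the registry changes.
configs = [make_slo_config("ig_stories_tray_mtml", "TIER1")]
```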
Launching
While a point-in-time snapshot of models in production is great for static systems, Instagram’s ML landscape is constantly shifting. With the rapid increase of iteration on the recommendation system driving an increased number of launches, it became clear our infrastructure support to make this happen was not adequate. Time-to-launch was a bottleneck in ML velocity, and we needed to drive it down.
What did the process look like?
Conventionally, services were long-standing systems with dedicated engineers supporting and tuning them. Even when changes introduced new capacity-regression risks, we could gate them behind change-safety mechanisms.
However, our modeling and experimentation structure was unique in that we were planning for more rapid iteration, and our options were insufficient. To safely test the extent of load a new service could support, we would clone the entire service, send shadow traffic (i.e., cloned traffic that isn’t processed by our clients), and run multiple overload tests until we found a consistent peak throughput. But this wasn’t a perfect science. Sometimes we didn’t send enough traffic, and sometimes we’d send too much, and the amount could change throughout the day due to variations in global user behavior.
This could easily take two days to get right, including actually debugging the performance itself when the results weren’t expected. Once we got the result, we’d then have to estimate the final cost. Below (in Figure 5) is the formula we landed on.
The actual traffic shifting portion was tedious as well. For example, when we managed to fully estimate that we needed 500 replicas to host the new service, we might not actually have 500 spares lying around to do a full replacement, so launching was a delicate process of partially sizing up by approximately 20%, sending 20% of traffic over, and then scaling down the old service by 20% to reclaim and recycle the capacity. Rinse, repeat. Inefficient!
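The loop we were running by hand looked roughly like the sketch below; the capacity and traffic operations are passed in as callables because the real internal APIs aren’t public, and the roughly 20% step size mirrors the description above.

```python
# Simplified sketch of the manual, pre-automation launch dance.
from typing import Callable

def incremental_launch(
    target_replicas: int,
    scale_up_new: Callable[[int], None],      # add replicas to the new service
    shift_traffic: Callable[[float], None],   # move a fraction of traffic over
    scale_down_old: Callable[[float], None],  # reclaim capacity from the old service
    steps: int = 5,                           # ~20% per step
) -> None:
    fraction = 1.0 / steps
    for _ in range(steps):
        scale_up_new(target_replicas // steps)  # partially size up the new service
        shift_traffic(fraction)                 # send that slice of traffic over
        scale_down_old(fraction)                # recycle the freed capacity
```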
And by the time we got to the end of this arduous process, the ordeal still wasn’t over. Each team was responsible for correctly setting up new alerts for their baseline in a timely fashion, or else their old models could and did trigger false alarms.
How does enforcing virtual pools aid product growth?
One of the prerequisites for fixing competition for resources and unblocking productivity was to put up guardrails. Prior to this, it was “first come first served,” with no clear way to even “reserve” future freed capacity. It was also hard to reason about fairness from an infra perspective: Would it make sense to give each team equal pools, or give each individual person a maximum limit?
As it turned out, not all MLEs are experimenting at the same time, due to staggered progress on their work, so individual (per-engineer) limits were not ideal. One member might be in the experimentation stage and another might be training. So our solution was to provide bandwidth to each team.
Once each team — and therefore product — had quotas distributed, their launch policy became more clear cut. Some teams established free launching as long as the team was within quota. Others required no regressions in capacity usage. But mostly this unlocked our ability to run launches in parallel, since each one required much less red tape, and prioritization was no longer done at the org level.
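A quota gate of this kind can be sketched as below; the pool sizes, units, and policy are made-up values purely for illustration.

```python
# Illustrative quota gate: a team can launch freely while it stays within its pool.
team_quota = {"stories_ranking": 4000, "reels_ranking": 6000}  # hypothetical capacity units
team_usage = {"stories_ranking": 3600, "reels_ranking": 5900}

def can_launch(team: str, estimated_cost: int) -> bool:
    return team_usage[team] + estimated_cost <= team_quota[team]

assert can_launch("stories_ranking", 300)      # within quota: no org-level approval needed
assert not can_launch("reels_ranking", 300)    # over quota: the team must prioritize
```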
What other tooling improved launching?
As mentioned earlier, preplanning with capacity estimations was critical to understanding cost and ensuring reliability. We were often asked, Why not let autoscaling take care of everything? The problem was that each service could be configured slightly differently than a previously optimized service, or some architectural change could have affected the performance of the model. We didn’t have an infinite amount of supply to work with, so by the time we fully traffic-shifted everything over, we might find that we didn’t have enough supply. Reverting is costly, taking hours to get through each stage.
Doing capacity estimations in advance also allowed us, and each team, to accurately weigh metric improvement against cost. It might be worthwhile to double our costs if something would increase time spent on the app by 1%, but likely not for a 0.05% improvement, where we could better spend that capacity funding another initiative.
With partners in AI Infra, we developed two major solutions to this process: offline performance evaluation and an automated launching platform.
We simplified determining performance of a new service using recorded traffic. Pre-recorded traffic was continuously collected into a data warehouse that the benchmarker could read from, and we’d spin up temporary jobs with this automation. One job would replay different levels of traffic continuously and send it to another job that was a clone of the existing experiment. By putting stoppers on desired latency and error rates, the tooling would eventually output a converged stable number that we could understand as the max load (see Figure 6).
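A minimal sketch of that search loop is shown below, assuming a hypothetical `run_replay` callable that replays recorded traffic at a given rate against the cloned job and reports latency and error rate; the thresholds and step size are placeholders.

```python
# Sketch of the benchmarker's search for the maximum stable load.
from typing import Callable, Tuple

def find_max_load(
    run_replay: Callable[[int], Tuple[float, float]],  # qps -> (p99 latency ms, error rate)
    start_qps: int = 100,
    max_latency_ms: float = 200.0,
    max_error_rate: float = 0.001,
) -> int:
    qps, last_good = start_qps, 0
    while True:
        latency, errors = run_replay(qps)
        if latency > max_latency_ms or errors > max_error_rate:
            return last_good        # a "stopper" tripped; the previous level is the peak
        last_good = qps
        qps = int(qps * 1.5)        # keep stepping load up until a stopper trips
```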
The launch platform itself would input the numbers we captured from these tests, automatically collect demand data as defined, and run that same formula to calculate a cost. The platform would then perform the upscaling/downscaling cycle for teams as we shifted traffic.
And finally, by leveraging the model registry, we were able to land this model change in code (see example in Figure 6), to help us better maintain and understand the 1000+ models within our fleet. Likewise, this bolstered our trust in the model registry, which was now directly tied to the model launch lifecycle.
This suite of launch automation has dramatically reduced the class of SEVs related to model launches, improved our pace of innovation from a few to more than 10 launches per week, and reduced the amount of time engineers spend conducting a launch by more than two days.
Model stability
As the number of models in production increased, our organization started to feel the effects of an inconsistent measure of model health. Because ranking models run like any other distributed backend system (receive a request, produce a response), one might assume a universal SLO measuring request success rate would suffice to capture holistic health. This is not the case for ranking models, because the accuracy of the recommendations a user receives carries significant weight in the end-user experience. Consider a user who is a huge fan of golf but does not enjoy cooking content (the “available & irrelevant” case in Figure 8 below): the service responds successfully, yet the recommendations are inaccurate. This is precisely what the model stability metric sought to capture.
Why is measuring ranking model reliability unique?
Ranking models, unlike traditional idempotent request/response backends, produce scores predicting user action given a set of candidates (PLIKE, PCOMMENT, PFOLLOW, etc.). These scores then combine and are used to determine which candidates are most relevant to an end user. It’s important that these scores accurately reflect user interest, as their accuracy is directly correlated to user engagement. If we recommend irrelevant content, user engagement suffers. The model stability metric was designed to make it easy to measure this accuracy and detect inaccuracy at our scale.
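As a hedged illustration of how per-action predictions can combine into a single ranking score, the sketch below uses a weighted sum; the weights and the weighted-sum form are assumptions, not Instagram’s actual value model.

```python
# Illustration only: combine per-action predictions into one relevance score.
WEIGHTS = {"PLIKE": 1.0, "PCOMMENT": 3.0, "PFOLLOW": 5.0}  # made-up weights

def rank(candidates: list) -> list:
    def score(candidate: dict) -> float:
        return sum(w * candidate["predictions"][action] for action, w in WEIGHTS.items())
    return sorted(candidates, key=score, reverse=True)

candidates = [
    {"id": "reel_1", "predictions": {"PLIKE": 0.30, "PCOMMENT": 0.02, "PFOLLOW": 0.001}},
    {"id": "reel_2", "predictions": {"PLIKE": 0.10, "PCOMMENT": 0.08, "PFOLLOW": 0.010}},
]
print([c["id"] for c in rank(candidates)])  # reel_2 ranks first under these weights
```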
Let’s discuss how this works.
Defining model stability
Models are complex, and they produce multiple output predictions. Let’s take a simplified example (shown in Figure 9 below) of a multi-task-multi-label (MTML) model predicting three actions:
For us to claim this model is stable, we must also claim that each underlying prediction is stable.
When evaluating the accuracy of a ranking model’s predictions, we typically look at two metrics:
- Model calibration, which is based on observed real-world outcomes and answers the question, “Are we over- or under-predicting user action?” It is calculated as the ratio of predicted click-through rate (CTR) to empirical CTR. A perfect predictor will have calibration centered at 1.
- Model normalized entropy (NE), which measures the discriminative power of a predictor, and answers the question, “How well can this predictor separate action from inaction?” It is calculated as a ratio of the average log-loss per impression to what the average log-loss per impression would be if we always predicted the empirical CTR. With NE, lower values are better, and an NE of 1 is equivalent to random predictions.
(For more information regarding our choice of prediction evaluation metrics, please refer to the paper, “Practical Lessons from Predicting Clicks on Ads at Facebook.”)
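To make the two definitions concrete, here is a small worked example on synthetic data, following the formulations above and in the referenced paper.

```python
import math

def calibration(predictions, labels):
    # Ratio of predicted CTR to empirical CTR; a perfect predictor centers at 1.
    return (sum(predictions) / len(predictions)) / (sum(labels) / len(labels))

def normalized_entropy(predictions, labels):
    # Average log-loss per impression, divided by the log-loss of a predictor
    # that always outputs the empirical CTR. Lower is better; 1.0 ~ random.
    n = len(labels)
    ctr = sum(labels) / n
    log_loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                    for p, y in zip(predictions, labels)) / n
    baseline = -(ctr * math.log(ctr) + (1 - ctr) * math.log(1 - ctr))
    return log_loss / baseline

# Synthetic example: 1 click in 4 impressions, predictions averaging 0.25.
preds, clicks = [0.30, 0.20, 0.25, 0.25], [1, 0, 0, 0]
print(calibration(preds, clicks))         # 1.0   -> neither over- nor under-predicting
print(normalized_entropy(preds, clicks))  # ~0.89 -> some discriminative power
```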
A model’s predictions are unstable when either calibration or NE are out of their expected healthy ranges. To determine what a healthy range is, we must look at each metric in real time, and Figure 10 below shows what these time series can look like:
By observing the trend of a healthy prediction, we can apply thresholds for our evaluation metrics. When these thresholds are breached, the underlying prediction is considered unstable.
From here, we can define model stability as a binary indicator across a model’s predictions. It is 1 if all underlying predictions are stable, and 0 if any prediction is unstable. This is an extremely powerful method for reacting to real-time prediction instability, as well as a tool for understanding trends in predictive health per model or across distinct products’ ranking funnels.
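A minimal sketch of that indicator follows, assuming per-prediction threshold ranges are stored alongside each model in the registry; the threshold values here are placeholders.

```python
# Sketch: a model is stable (1) only if every prediction's calibration and NE
# sit inside their expected healthy ranges; otherwise it is unstable (0).
THRESHOLDS = {  # placeholder ranges, per prediction task
    "PLIKE":    {"calibration": (0.90, 1.10), "ne_max": 0.95},
    "PCOMMENT": {"calibration": (0.85, 1.15), "ne_max": 0.97},
}

def prediction_is_stable(task: str, calib: float, ne: float) -> bool:
    lo, hi = THRESHOLDS[task]["calibration"]
    return lo <= calib <= hi and ne <= THRESHOLDS[task]["ne_max"]

def model_stability(metrics: dict) -> int:
    # metrics: {task: (calibration, ne)} measured over a real-time window.
    return int(all(prediction_is_stable(t, c, ne) for t, (c, ne) in metrics.items()))

print(model_stability({"PLIKE": (1.02, 0.90), "PCOMMENT": (1.20, 0.91)}))  # 0: PCOMMENT drifted
```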
Operationalizing model stability
With a real-time view on model predictive health, we can leverage this unified definition of model stability and apply it to all of our models in production, once again leveraging the model registry as a ledger to hold this important data. In Figure 11 below, we can see the addition of model stability metric metadata after we determined the expected thresholds.
Given the large number of models in production, each producing many predictions, building a portable definition of model health applicable to all of our ranking models represented an important milestone toward upleveling Instagram’s ML infrastructure maturity. This has unlocked our ability to build generic alerting to guarantee detection of our most important models becoming unstable, thereby moving us closer to mitigation when our recommendation system is at risk.
Since the addition of these metrics and alerting, ML teams have discovered previously hidden issues within their models and addressed them faster than before, leading to higher-quality recommendations.
Key takeaways
In our journey to scale Instagram’s algorithm to manage over 1000 models, we have learned several critical lessons that have shaped our approach and infrastructure. These takeaways not only highlight the challenges we faced but also underscore the strategies that led to our success.
Infra understanding is the foundation to building the right tools
A unified understanding of our infrastructure footprint was essential in developing the right tools to support our scaling efforts. By identifying the gaps and potential risks in our existing systems, we were able to implement solutions such as the model registry that significantly improved our operational efficiency and reliability posture.
Helping colleagues move fast means we all move faster
By addressing the model iteration bottleneck, we enabled our teams to innovate more rapidly. Our focus on creating a seamless, self-service process for model iteration empowered client teams to take ownership of their workflows. This not only accelerated their progress but also reduced the operational burden on our infrastructure team. As a result, the entire organization benefited from increased agility and productivity.
Reliability must consider quality
Ensuring the reliability of our models required us to redefine how we measure and maintain model quality. By operationalizing model stability and establishing clear metrics for model health, we were able to proactively manage the performance of our models. This approach enables us to maintain high standards of quality across our recommendation systems, ultimately enhancing user engagement and satisfaction.
Our experience in scaling Instagram’s recommendation system has reinforced the importance of infrastructure understanding, collaboration, and a focus on quality. By building robust tools and processes, we have not only improved our own operations but also empowered our colleagues to drive innovation and growth across the platform.
An inside look at Meta’s transition from C to Rust on mobile
Have you ever worked in legacy code? Are you curious what it takes to modernize systems at a massive scale?
Pascal Hartig is joined on the latest Meta Tech Podcast by Elaine and Buping, two software engineers working on a bold project to rewrite the decades-old C code in one of Meta’s core messaging libraries in Rust. It’s an ambitious effort that will transform a central messaging library that is shared across Messenger, Facebook, Instagram, and Meta’s AR/VR platforms.
They discuss taking on a project of this scope, even without a background in Rust; how they’re approaching it; and what it means to optimize for ‘developer happiness.’
Download or listen to the episode below:
You can also find the episode wherever you get your podcasts, including:
The Meta Tech Podcast is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.
Send us feedback on Instagram, Threads, or X.
And if you’re interested in learning more about career opportunities at Meta, visit the Meta Careers page.
Amazon Research Awards recipients announced
Amazon Research Awards (ARA) provides unrestricted funds and AWS Promotional Credits to academic researchers investigating various research topics in multiple disciplines. This cycle, ARA received many excellent research proposals from across the world and today is publicly announcing 73 award recipients who represent 46 universities in 10 countries.
This announcement includes awards funded under five calls for proposals during the fall 2024 cycle: AI for Information Security, Automated Reasoning, AWS AI, AWS Cryptography, and Sustainability. Proposals were reviewed for the quality of their scientific content and their potential to impact both the research community and society. Additionally, Amazon encourages the publication of research results, presentations of research at Amazon offices worldwide, and the release of related code under open-source licenses.
Recipients have access to more than 700 Amazon public datasets and can utilize AWS AI/ML services and tools through their AWS Promotional Credits. Recipients also are assigned an Amazon research contact who offers consultation and advice, along with opportunities to participate in Amazon events and training sessions.
“Automated Reasoning is an important area of research for Amazon, with potential applications across various features and applications to help improve security, reliability, and performance for our customers. Through the ARA program, we collaborate with leading academic researchers to explore challenges in this field,” said Robert Jones, senior principal scientist with the Cloud Automated Reasoning Group. “We were again impressed by the exceptional response to our Automated Reasoning call for proposals this year, receiving numerous high-quality submissions. Congratulations to the recipients! We’re excited to support their work and partner with them as they develop new science and technology in this important area.”
“At Amazon, we believe that solving the world’s toughest sustainability challenges benefits from both breakthrough scientific research and open and bold collaboration. Through programs like the Amazon Research Awards program, we aim to support academic research that could contribute to our understanding of these complex issues,” said Kommy Weldemariam, Director of Science and Innovation Sustainability. “The selected proposals represent innovative projects that we hope will help advance knowledge in this field, potentially benefiting customers, communities, and the environment.”
ARA funds proposals throughout the year in a variety of research areas. Applicants are encouraged to visit the ARA call for proposals page for more information or send an email to be notified of future open calls.
The tables below list the fall 2024 cycle call-for-proposal recipients, grouped by research area and sorted alphabetically by last name.
AI for Information Security
Recipient | University | Research title |
Christopher Amato | Northeastern University | Multi-Agent Reinforcement Learning Cyber Defense for Securing Cloud Computing Platforms |
Bernd Bischl | Ludwig Maximilian University of Munich | Improving Generative and Foundation Models Reliability via Uncertainty-awareness |
Shiqing Ma | University Of Massachusetts Amherst | LLM and Domain Adaptation for Attack Detection |
Alina Oprea | Northeastern University | Multi-Agent Reinforcement Learning Cyber Defense for Securing Cloud Computing Platforms |
Roberto Perdisci | University of Georgia | ContextADBench: A Comprehensive Benchmark Suite for Contextual Anomaly Detection |
Automated Reasoning
Recipient | University | Research title |
Nada Amin | Harvard University | LLM-Augmented Semi-Automated Proofs for Interactive Verification |
Suguman Bansal | Georgia Institute of Technology | Certified Inductive Generalization in Reinforcement Learning |
Ioana Boureanu | University of Surrey | Phoebe+: An Automated-Reasoning Tool for Provable Privacy in Cryptographic Systems |
Omar Haider Chowdhury | Stony Brook University | Restricter: An Automatic Tool for Authoring Amazon Cedar Access Control Policies with the Principle of Least Privilege |
Stefan Ciobaca | Alexandru Ioan Cuza University | An Interactive Proof Mode for Dafny |
João Ferreira | INESC-ID | Polyglot Automated Program Repair for Infrastructure as Code |
Sicun Gao | University Of California, San Diego | Monte Carlo Trees with Conflict Models for Proof Search |
Mirco Giacobbe | University of Birmingham | Neural Software Verification |
Tobias Grosser | University of Cambridge | Synthesis-based Symbolic BitVector Simplification for Lean |
Ronghui Gu | Columbia University | Scaling Formal Verification of Security Properties for Unmodified System Software |
Alexey Ignatiev | Monash University | Huub: Next-Gen Lazy Clause Generation |
Kenneth McMillan | University of Texas At Austin | Synthesis of Auxiliary Variables and Invariants for Distributed Protocol Verification |
Alexandra Mendes | University of Porto | Overcoming Barriers to the Adoption of Verification-Aware Languages |
Jason Nieh | Columbia University | Scaling Formal Verification of Security Properties for Unmodified System Software |
Rohan Padhye | Carnegie Mellon University | Automated Synthesis and Evaluation of Property-Based Tests |
Nadia Polikarpova | University Of California, San Diego | Discovering and Proving Critical System Properties with LLMs |
Fortunat Rajaona | University of Surrey | Phoebe+: An Automated-Reasoning Tool for Provable Privacy in Cryptographic Systems |
Subhajit Roy | Indian Institute of Technology Kanpur | Theorem Proving Modulo LLM |
Gagandeep Singh | University of Illinois At Urbana–Champaign | Trustworthy LLM Systems using Formal Contracts |
Scott Stoller | Stony Brook University | Restricter: An Automatic Tool for Authoring Amazon Cedar Access Control Policies with the Principle of Least Privilege |
Peter Stuckey | Monash University | Huub: Next-Gen Lazy Clause Generation |
Yulei Sui | University of New South Wales | Path-Sensitive Typestate Analysis through Sparse Abstract Execution |
Nikos Vasilakis | Brown University | Semantics-Driven Static Analysis for the Unix/Linux Shell |
Ping Wang | Stevens Institute of Technology | Leveraging Large Language Models for Reasoning Augmented Searching on Domain-specific NoSQL Database |
John Wawrzynek | University of California, Berkeley | GPU-Accelerated High-Throughput SAT Sampling |
AWS AI
Recipient | University | Research title |
Panagiotis Adamopoulos | Emory University | Generative AI solutions for The Spillover Effect of Fraudulent Reviews on Product Recommendations |
Vikram Adve | University of Illinois at Urbana–Champaign | Fellini: Differentiable ML Compiler for Full-Graph Optimization for LLM Models |
Frances Arnold | California Institute of Technology | Closed-loop Generative Machine Learning for De Novo Enzyme Discovery and Optimization |
Yonatan Bisk | Carnegie Mellon University | Useful, Safe, and Robust Multiturn Interactions with LLMs |
Shiyu Chang | University of California, Santa Barbara | Cut the Crap: Advancing the Efficient Communication of Multi-Agent Systems via Spatial-Temporal Topology Design and KV Cache Sharing |
Yuxin Chen | University of Pennsylvania | Provable Acceleration of Diffusion Models for Modern Generative AI |
Tianlong Chen | University of North Carolina at Chapel Hill | Cut the Crap: Advancing the Efficient Communication of Multi-Agent Systems via Spatial-Temporal Topology Design and KV Cache Sharing |
Mingyu Ding | University of North Carolina at Chapel Hill | Aligning Long Videos and Language as Long-Horizon World Models |
Nikhil Garg | Cornell University | Market Design for Responsible Multi-agent LLMs |
Jessica Hullman | Northwestern University | Human-Aligned Uncertainty Quantification in High Dimensions |
Christopher Jermaine | Rice University | Fast, Trusted AI Using the EINSUMMABLE Compiler |
Yunzhu Li | Columbia University | Physics-Informed Foundation Models Through Embodied Interactions |
Pattie Maes | Massachusetts Institute of Technology | Understanding How LLM Agents Deviate from Human Choices |
Sasa Misailovic | University of Illinois at Urbana–Champaign | Fellini: Differentiable ML Compiler for Full-Graph Optimization for LLM Models |
Kristina Monakhova | Cornell University | Trustworthy extreme imaging for science using interpretable uncertainty quantification |
Todd Mowry | Carnegie Mellon University | Efficient LLM Serving on Trainium via Kernel Generation |
Min-hwan Oh | Seoul National University | Mutually Beneficial Interplay Between Selection Fairness and Context Diversity in Contextual Bandits |
Patrick Rebeschini | University of Oxford | Optimal Regularization for LLM Alignment |
Jose Renau | University of California, Santa Cruz | Verification Constrained Hardware Optimization using Intelligent Design Agentic Programming |
Vilma Todri | Emory University | Generative AI solutions for The Spillover Effect of Fraudulent Reviews on Product Recommendations |
Aravindan Vijayaraghavan | Northwestern University | Human-Aligned Uncertainty Quantification in High Dimensions |
Wei Yang | University of Texas at Dallas | Optimizing RISC-V Compilers with RISC-LLM and Syntax Parsing |
Huaxiu Yao | University of North Carolina at Chapel Hill | Aligning Long Videos and Language as Long-Horizon World Models |
Amy Zhang | University of Washington | Tools for Governing AI Agent Autonomy |
Ruqi Zhang | Purdue University | Efficient Test-time Alignment for Large Language Models and Large Multimodal Models |
Zheng Zhang | Rutgers University-New Brunswick | AlphaQC: An AI-powered Quantum Circuit Optimizer and Denoiser |
AWS Cryptography
Recipient | University | Research title |
Alexandra Boldyreva | Georgia Institute of Technology | Quantifying Information Leakage in Searchable Encryption Protocols |
Maria Eichlseder | Graz University of Technology, Austria | SALAD – Systematic Analysis of Lightweight Ascon-based Designs |
Venkatesan Guruswami | University of California, Berkeley | Obfuscation, Proof Systems, and Secure Computation: A Research Program on Cryptography at the Simons Institute for the Theory of Computing |
Joseph Jaeger | Georgia Institute of Technology | Analyzing Chat Encryption for Group Messaging |
Aayush Jain | Carnegie Mellon | Large Scale Multiparty Silent Preprocessing for MPC from LPN |
Huijia Lin | University of Washington | Large Scale Multiparty Silent Preprocessing for MPC from LPN |
Hamed Nemati | KTH Royal Institute of Technology | Trustworthy Automatic Verification of Side-Channel Countermeasures for Binary Cryptographic Programs using the HoIBA library |
Karl Palmskog | KTH Royal Institute of Technology | Trustworthy Automatic Verification of Side-Channel Countermeasures for Binary Cryptographic Programs using the HoIBA library |
Chris Peikert | University of Michigan, Ann Arbor | Practical Third-Generation FHE and Bootstrapping |
Dimitrios Skarlatos | Carnegie Mellon University | Scale-Out FHE LLMs on GPUs |
Vinod Vaikuntanathan | Massachusetts Institute of Technology | Can Quantum Computers (Really) Factor? |
Daniel Wichs | Northeastern University | Obfuscation, Proof Systems, and Secure Computation: A Research Program on Cryptography at the Simons Institute for the Theory of Computing |
David Wu | University Of Texas At Austin | Fast Private Information Retrieval and More using Homomorphic Encryption |
Sustainability
Recipient | University | Research title |
Meeyoung Cha | Max Planck Institute | Forest-Blossom (Flossom): A New Framework for Sustaining Forest Biodiversity Through Outcome-Driven Remote Sensing Monitoring |
Jingrui He | University of Illinois at Urbana–Champaign | Foundation Model Enabled Earth’s Ecosystem Monitoring |
Pedro Lopes | University of Chicago | AI-powered Tools that Enable Engineers to Make & Re-make Sustainable Hardware |
Cheng Yaw Low | Max Planck Institute | Forest-Blossom (Flossom): A New Framework for Sustaining Forest Biodiversity Through Outcome-Driven Remote Sensing Monitoring |
Independent evaluations demonstrate Nova Premier’s safety
AI safety is a priority at Amazon. Our investment in safe, transparent, and responsible AI (RAI) includes collaboration with the global community and policymakers. We are members of and collaborate with organizations such as the Frontier Model Forum, the Partnership on AI, and other forums organized by government agencies such as the National Institute of Standards and Technology (NIST). Consistent with Amazon’s endorsement of the Korea Frontier AI Safety Commitments, we published our Frontier Model Safety Framework earlier this year.
During the development of the Nova Premier model, we conducted a comprehensive evaluation to assess its performance and safety. This included testing on both internal and public benchmarks and internal/automated and third-party red-teaming exercises. Once the final model was ready, we prioritized obtaining unbiased, third-party evaluations of the model’s robustness against RAI controls. In this post, we outline the key findings from these evaluations, demonstrating the strength of our testing approach and Nova Premier’s standing as a safe model. Specifically, we cover our evaluations with two third-party evaluators: PRISM AI and ActiveFence.
Evaluation of Nova Premier against PRISM AI
PRISM Eval’s Behavior Elicitation Tool (BET) dynamically and systematically stress-tests AI models’ safety guardrails. The methodology focuses on measuring how many adversarial attempts (steps) it takes to get a model to generate harmful content across several key risk dimensions. The central metric is “steps to elicit” — the number of increasingly sophisticated prompting attempts required before a model generates an inappropriate response. A higher number of steps indicates stronger safety measures, as the model is more resistant to manipulation. The PRISM risk dimensions (inspired by the MLCommons AI Safety Benchmarks) include CBRNE weapons, violent crimes, non-violent crimes, defamation, and hate, amongst several others.
Using the BET Eval tool and its V1.0 metric, which is tailored toward non-reasoning models, we compared the recently released Nova models (Pro and Premier) to the latest models in the same class: Claude (3.5 v2 and 3.7 non-reasoning) and Llama4 Maverick, all available through Amazon Bedrock. PRISM BET conducts black-box evaluations (where model developers don’t have access to the test prompts) of models integrated with their API. The evaluation conducted with BET Eval MAX, PRISM’s most comprehensive/aggressive testing suite, revealed significant variations in safety against malicious instructions. Nova models demonstrated superior overall safety performance, with an average of 43 steps for Premier and 52 steps for Pro, compared to 37.7 for Claude 3.5 v2 and fewer than 12 steps for other models in the comparison set (namely, 9.9 for Claude 3.7, 11.5 for Claude 3.7 thinking, and 6.5 for Maverick). This higher step count suggests that on average, Nova’s safety guardrails are more sophisticated and harder to circumvent through adversarial prompting. The figure below presents the number of steps per harm category evaluated through BET Eval MAX.
The PRISM evaluation provides valuable insights into the relative safety of different Amazon Bedrock models. Nova’s strong performance, particularly in hate speech and defamation resistance, represents meaningful progress in AI safety. However, the results also highlight the ongoing challenge of building truly robust safety measures into AI systems. As the field continues to evolve, frameworks like BET will play an increasingly important role in benchmarking and improving AI safety. As a part of this collaboration Nicolas Miailhe, CEO of PRISM Eval, said, “It’s incredibly rewarding for us to see Nova outperforming strong baselines using the BET Eval MAX; our aim is to build a long-term partnership toward safer-by-design models and to make BET available to various model providers.” Organizations deploying AI systems should carefully consider these safety metrics when selecting models for their applications.
Manual red teaming with ActiveFence
The AI safety & security company ActiveFence benchmarked Nova Premier on Bedrock on prompts distributed across Amazon’s eight core RAI categories. ActiveFence also evaluated Claude 3.7 (non-reasoning mode) and GPT 4.1 API on the same set. The flag rate on Nova Premier was lower than that on the other two models, indicating that Nova Premier is the safest of the three.
Model | 3P Flag Rate [↓ is better] |
Nova Premier | 12.0% |
Sonnet 3.7 (non-reasoning) | 20.6% |
GPT4.1 API | 22.4% |
“Our role is to think like an adversary but act in service of safety,” said Guy Paltieli from ActiveFence. “By conducting a blind stress test of Nova Premier under realistic threat scenarios, we helped evaluate its security posture in support of Amazon’s broader responsible-AI goals, ensuring the model could be deployed with greater confidence.”
These evaluations conducted with PRISM and ActiveFence give us confidence in the strength of our guardrails and our ability to protect our customers’ safety when they use our models. While these evaluations demonstrate strong safety performance, we recognize that AI safety is an ongoing challenge requiring continuous improvement. These assessments represent a point-in-time snapshot, and we remain committed to regular testing and enhancement of our safety measures. No AI system can guarantee perfect safety in all scenarios, which is why we maintain monitoring and response systems after deployment.
Acknowledgments: Vincent Ponzo, Elyssa Vincent