AI Research
Researchers used 1,600 YouTube fail videos to show AI models struggle with surprises
YouTube fail videos reveal a major blind spot for leading AI models: they struggle with surprises and rarely reconsider their first impressions. Even advanced systems like GPT-4o stumble over simple plot twists.
Researchers from the University of British Columbia, the Vector Institute for AI, and Nanyang Technological University put top AI models through their paces using more than 1,600 YouTube fail videos from the Oops! dataset.
The team created a new benchmark called BlackSwanSuite to test how well these systems handle unexpected events. Like people, the AI models are fooled by surprising moments—but unlike people, they refuse to change their minds, even after seeing what really happened.
One example: a man swings a pillow near a Christmas tree. The AI assumes he’s aiming at someone nearby. In reality, the pillow knocks ornaments off the tree, which then hit a woman. Even after watching the whole video, the AI sticks to its original, incorrect guess.
The videos span a range of categories, with most featuring traffic accidents (24 percent), children’s mishaps (24 percent), or pool accidents (16 percent). What unites them all is an unpredictable twist that even people often miss.
Three types of tasks
Each video is split into three segments: the setup, the surprise, and the aftermath. The benchmark challenges LLMs with different tasks for each stage. In the “Forecaster” task, the AI only sees the start of the video and tries to predict what comes next. The “Detective” task shows only the beginning and end, asking the AI to explain what happened in between. The “Reporter” task gives the AI the full video and checks whether it can update its assumptions after seeing the full story.
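The three task setups can be sketched in code. This is a hypothetical illustration of how evidence and questions differ per task; the segment names and the `build_prompts` helper are invented for clarity and are not the benchmark's actual API.

```python
# Hypothetical sketch of the three BlackSwanSuite-style task setups.
# Segment names and prompts are illustrative, not the benchmark's real interface.

def build_prompts(setup, surprise, aftermath):
    """Return (video evidence, question) for each task type."""
    return {
        # Forecaster: sees only the setup and predicts what comes next.
        "forecaster": (
            [setup],
            "Given the start of the video, what is most likely to happen next?",
        ),
        # Detective: sees the beginning and the end, infers the hidden middle.
        "detective": (
            [setup, aftermath],
            "Given the beginning and the end, what happened in between?",
        ),
        # Reporter: sees everything and must revise any earlier assumption.
        "reporter": (
            [setup, surprise, aftermath],
            "Having seen the full video, describe what actually happened.",
        ),
    }

prompts = build_prompts("man swings pillow", "ornaments fall", "woman is hit")
for task, (evidence, question) in prompts.items():
    print(task, "->", len(evidence), "segment(s)")
```

The key design point is that each task withholds a different slice of the timeline, so a model's score isolates prediction, abduction, and belief revision separately.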

The tests covered closed models like GPT-4o and Gemini 1.5 Pro as well as open-source systems such as LLaVA-Video, VILA, VideoChat2, and VideoLLaMA 2. The results highlight glaring weaknesses. On the detective task, GPT-4o answered correctly just 65 percent of the time. By comparison, humans got 90 percent right.

The gap widened even further when models needed to reconsider their initial guesses. When asked to revisit their predictions after seeing the entire video, GPT-4o managed only 60 percent accuracy – 32 percentage points behind humans (92 percent). The systems tended to double down on their first impressions, ignoring new evidence.
Other models, like Gemini 1.5 Pro and LLaVA-Video, showed the same pattern. According to the researchers, performance dropped sharply on videos that even people found tricky the first time through.
Garbage trucks don’t drop trees, do they?
The root of the problem lies in how these AI models are trained. They learn by spotting patterns in millions of videos and expect those patterns to repeat. So when a garbage truck drops a tree instead of picking up trash, the AI gets confused—it has no pattern for that.

To pinpoint the issue, the team tried swapping out the AI’s video perception for detailed human-written descriptions of the scenes. This boosted LLaVA-Video’s performance by 6.4 percent. Adding even more explanations bumped it up by another 3.6 percent, for a total gain of 10 percent.
Ironically, this only underscores the models’ weakness: If the AI performs well only when humans do the perceptual heavy lifting, it fails at “seeing” and “understanding” before any real reasoning starts.
Humans, by contrast, are quick to rethink their assumptions when new information appears. Current AI models lack this mental flexibility.
This flaw could have serious consequences for real-world applications like self-driving cars and autonomous systems. Life is full of surprises: children dash into the street, objects fall off trucks, and other drivers do the unexpected.
The research team has made the benchmark available on GitHub and Hugging Face. They hope others will use it to test and improve their own AI models. As long as leading systems are tripped up by simple fail videos, they’re not ready for the unpredictability of the real world.
California city adopts AI permitting
Dive Brief:
- The city of Lancaster, California, has reached an agreement to partner with artificial intelligence-based permitting platform Labrynth to deploy the company’s tech across the city’s permitting system, according to a Sept. 3 news release.
- As part of the public-private partnership, Lancaster will be Labrynth’s inaugural municipal partner, according to the release. The goals of the integration are to speed up approvals and eliminate bottlenecks in the permitting process.
- Prior to its deal with Lancaster, Labrynth’s platform served contractors. The company’s program uses AI to auto-generate permits and applications, track compliance requirements and auto-fill complex forms, according to its website.
Dive Insight:
The deployment will begin with permitting optimization as the city will use AI and agentic workflows to pre-screen submissions, validate them against requirements, flag missing components and dynamically guide applicants on best practices, according to the news release.
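The pre-screening step described above can be sketched as a simple rule-based validator. The field names, checks, and guidance messages here are hypothetical illustrations of the general pattern, not Labrynth's actual system.

```python
# Illustrative sketch of permit pre-screening: validate a submission against
# requirements, flag missing components, and guide the applicant.
# Field names and messages are invented for illustration, not Labrynth's API.

REQUIRED_FIELDS = {
    "applicant_name",
    "parcel_number",
    "site_plan",
    "project_description",
}

def prescreen(submission: dict) -> dict:
    """Check a permit submission for completeness and return flags plus guidance."""
    missing = sorted(REQUIRED_FIELDS - submission.keys())
    return {
        "complete": not missing,
        "missing_components": missing,
        # Dynamic guidance: tell the applicant exactly what to fix.
        "guidance": [f"Please attach '{f}' before resubmitting." for f in missing],
    }

result = prescreen({"applicant_name": "A. Smith", "parcel_number": "123-456"})
print(result["missing_components"])
```

An agentic workflow would layer an LLM on top of checks like these, for example to judge whether an attached site plan actually satisfies the requirement rather than merely being present.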
Across the country, other cities have also begun to embrace AI in the permitting process. The municipalities of Los Angeles and Austin, Texas, for example, are using Australia-based Archistar to speed up permit review.
R. Rex Parris, the mayor of Lancaster, told Construction Dive via email that the partnership between the city and Labrynth has been in the works for about a year. Over that time, Labrynth worked with the municipality to understand the ins and outs of the permitting system, in order to tailor its solution to Lancaster’s needs.
“This wasn’t off-the-shelf software,” Parris said. “It was co-designed to work for California’s regulatory landscape and for the pace of development our community demands.”
Indeed, that landscape has historically been stringent, but change is underway — in June, the state rolled back certain provisions of the California Environmental Quality Act, a landmark piece of environmental legislation that required qualified projects to complete extensive environmental reviews. Because of those changes, specific types of developments, such as infill multifamily residential and mixed-use developments, are now exempt from CEQA.
At the national level, following President Donald Trump’s memorandum to embrace tech in the permitting process, the Council on Environmental Quality issued its Permitting Technology Action Plan on May 30, which aims to modernize federal environmental review and permitting processes for a broad group of infrastructure jobs.
aytm launches Conversation AI, transforming qualitative research with AI-powered analysis
“Capturing genuine human perspective often requires a real conversation, not just a simple Q&A,” said Lev Mazin, CEO of aytm. “Conversation AI is designed to facilitate that natural back-and-forth dialogue, probing deeper into the ‘why’ behind responses—allowing researchers to reveal qualitative depth at quantitative scale, all in the same project.”
Key capabilities unlocked by Conversation AI:
- Dynamic AI interviews: Engage respondents in responsive, human-like dialogues that explore topics in greater depth.
- Richer contextual data: Capture the nuances, emotions, and reasoning behind consumer choices through genuine conversation.
- Concurrent qualitative exploration: Conduct deeper interviews simultaneously across respondent groups, scaling qualitative depth beyond traditional methods.
- AI-powered thematic coding and quantification: Automatically identify, code, and quantify key themes emerging from conversations, enabling charting and analysis at scale.
“It’s not just about having deeper conversations at scale, but also about making sense of rich, unstructured data with the utmost efficiency,” Mazin added. “Conversation AI facilitates the dialogue, and then, with Skipper’s help, structures those qualitative findings into quantifiable themes.”
Conversation AI leverages advanced natural language processing, enabling researchers to design dialogues where an AI assistant interacts dynamically with respondents, asking follow-up questions and exploring responses more thoroughly. This process yields richer, more authentic qualitative data. Subsequently, aytm’s AI research companion, Skipper, assists in processing these conversations to identify, code, and quantify emergent themes. This crucial step allows researchers to visualize qualitative patterns, track sentiment, and integrate these deeper insights seamlessly into their overall analysis, delivering on the promise of qual insights at quant scale.
For more information or to experience Conversation AI firsthand, visit aytm.com.
Media Contact:
Tiffany Mullin
VP, Growth Operations
[email protected]
aytm is a consumer insights platform dedicated to translating business curiosity into strategic clarity. By harnessing the power of AI and a global community of respondents, aytm delivers the tools and answers organizations need to deeply understand their consumers. The platform is built to empower a diverse range of users—from insights professionals conducting complex studies to business leaders needing fast, reliable answers. With a flexible model that includes both intuitive self-service tools and a team of dedicated research experts, aytm is committed to making consumer truth an accessible and foundational part of every great innovation.
SOURCE aytm
Cohere seeks to overturn underdog status with $7 billion valuation, key AI hire from Meta, and Uber alum CFO

Cohere, the Toronto-based startup building large language models for business customers, has long had a lot in common with its hometown hockey team, the Maple Leafs. They are a solid franchise and a big deal in Canada, but they’ve not made a Stanley Cup Final since 1967. Similarly, Cohere has built a string of solid, if not spectacular, LLMs and has established itself as the AI national champion of Canada. But it’s struggled for traction against better-known and better-funded rivals like OpenAI, Anthropic, and Google DeepMind. Now it’s making a renewed bid for relevancy: Last month the company raised $500 million, boosting its valuation to nearly $7 billion; hired its first CFO; and landed a marquee recruit in Joelle Pineau, Meta’s longtime head of AI research.
Pineau announced her departure from Meta in April, just weeks before Mark Zuckerberg unveiled a sweeping AI reorganization that included acquiring Scale AI, elevating its cofounder Alex Wang to chief AI officer, and launching a costly spree to poach dozens of top researchers. For Cohere, her arrival is a coup and a reputational boost at a moment when many in the industry wondered whether the company could go the distance—or whether it would be acquired or fade away.
Cohere was founded in 2019 by three Google Brain alumni — Nick Frosst, Ivan Zhang, and Aidan Gomez, a coauthor of the seminal 2017 research paper “Attention Is All You Need,” which jump-started the generative AI boom. According to Frosst, in May the startup reached $100 million in annual recurring revenue. It’s an important milestone, and there have been unconfirmed reports that Cohere projects doubling that by the end of the year. But it is still a fraction of what larger rivals like Anthropic and OpenAI are generating.
Unlike peers that have tied themselves closely to Big Tech cloud providers—or, in some cases, sold outright—Cohere has resisted acquisition offers and avoided dependence on any single cloud ecosystem. “Acquisition is failure—it’s ending this process of building,” Gomez, Cohere’s CEO, recently said at a Toronto Tech Week event. The company also leans into its Canadian roots, touting both its Toronto headquarters and lucrative contracts with the Canadian government, even as it maintains a presence in Silicon Valley and an office in London.
In interviews with Fortune, Pineau, new CFO Francois Chadwick (who was previously acting CFO at Uber) and cofounder Frosst emphasized Cohere’s focus on the enterprise market. While rivals race toward human-like artificial general intelligence (AGI), Cohere is betting that businesses want something simpler: tools that deliver ROI today.
A focus on ROI over AGI
“We have been under the radar a little bit, I think that’s fair,” cofounder Nick Frosst said. “We’re not trying to sell to consumers, so we don’t need to be at the top of consumer minds—and we are not.” Part of the reason, he added with a laugh, is cultural: “We’re pretty Canadian. It’s not in our DNA to be out there talking about how amazing we are.”
Frosst did, however, tout the billboards that recently debuted in San Francisco, Toronto and London, including one for Cohere’s North AI platform that says “AI that can access your info without giving any of it away.”
That quiet approach is starting to shift, he said, a reflection of the traction it’s seeing with enterprise customers like the Royal Bank of Canada, Dell and SAP. Cohere’s pitch, he argued, is “pretty unique” among foundation model companies: a focus on ROI, not AGI.
“When I talk to businesses, a lot of them are like, yeah, we made some cool demos, and they didn’t get anywhere. So our focus has been on getting people into production, getting ROI for them with LLMs,” he said. That means prioritizing security and privacy, building smaller models that can run efficiently on GPUs, and tailoring systems for specific languages, verticals and business workflows. Recent releases such as Command R (for reasoning) and Command Vision are designed to hit “top of their class” performance while still fitting within customers’ hardware budgets.
It also means resisting the temptation to chase consumer-style engagement. On a recent episode of the 20VC podcast, Frosst said Cohere isn’t trying to make its models chatty or addictive. “When we train our model, we’re not training it to be an amazing conversationalist with you,” he said. “We’re not training it to keep you interested and keep you engaged and occupied. We don’t have engagement metrics or things like that.”
Lack of drama is ‘wonderful’
For Pineau—who at Cohere will help oversee strategy across research, product, and policy teams—the company’s low-key profile was part of the appeal. The absence of drama, she said, is “wonderful” — and “a good fit for my personality. I prefer to fly a little bit under the radar and just get work done.”
Pineau, a highly respected AI scientist and McGill University professor based in Montreal, was known for pushing the AI field to be more rigorous and reproducible. At Meta, she helmed the Fundamental AI Research (FAIR) lab, where she led the development of the company’s family of open models, called Llama, and worked alongside Meta’s chief scientist Yann LeCun.
There was certainly no absence of drama in her most recent years at Meta, as Mark Zuckerberg spearheaded a sweeping pivot to generative AI after OpenAI debuted ChatGPT in November 2022. The strategy created momentum, but Llama 4 flopped when it was released in early April 2025—at which point, Pineau had already submitted her resignation. In June, Zuckerberg handed 28-year-old Alex Wang control of Meta’s entire AI operations as part of a $14.3 billion investment in Scale AI. Wang now leads a newly formed “Superintelligence” group packed with industry stars paid like high-priced athletes, and oversees Meta’s other AI product and research teams under the umbrella of Meta Superintelligence Labs.
Pineau said Zuckerberg’s plans to hire Wang did not contribute to her decision to leave. After leaving Meta, she had several months to decide her next steps: Based in Montreal, where Cohere is opening a new office, Pineau said she had been watching the company closely: “It’s one of very few companies around the world that I think has both the ambition and the abilities to train foundation models at scale.”
What stood out to her was not leaderboard glory but enterprise pragmatism. For example, much of the industry chases bragging rights on public benchmarks, which rank models on tasks like math or logic puzzles. Pineau said those benchmarks are “nice to have” but far less relevant than making models work securely inside a business. “They’re not necessarily the must-have for most enterprises,” she said. Cohere, by contrast, has focused on models that run securely on-premise, handle sensitive corporate data, and prioritize characteristics like confidentiality, privacy and security.
“In a lot of cases, responsibility aspects come late in the design cycle,” she said. “Here, it’s built into the research teams, the modeling approach, the product.” She also cited the company’s “small but mighty” research team and its commitment to open science — values that drew her to Meta years earlier.
Pineau considered returning to academia, but the pace and scale of today’s AI industry convinced her otherwise. “Given the speed at which things are moving, and the resources you need to really have an impact, having most of my energies in an industry setting is where I’m going to be closer to the frontier,” she said. “While I considered both, it wasn’t a hard choice to jump back into an industry role.”
Her years at Meta, where she rose to lead a global research organization and spent 18 months in Zuckerberg’s inner leadership circle, left her with lessons she hopes to apply at Cohere: how to bridge research and product, navigate policy questions, and think through the societal implications of technology. “Cohere is on a trajectory to play a huge role in enterprise, but also in important policy and society questions,” she said. “It’s an opportunity for me to take all I’ve learned and carry it into this new role.”
The Cohere leadership moved quickly. “When we found out she was leaving Meta, we were definitely very interested,” Frosst said, although he denied that the hire was intended as a poke at Meta CEO Mark Zuckerberg. “I don’t think about Zuck that often,” he said. “[Pineau is] a legend in the community — and building with her in Montreal, in Canada, is particularly exciting.”
A move to growth and path to profitability
Pineau is not Cohere’s only new big league hire. It also tapped Chadwick, an Uber alum who served there as acting CFO. “I was the guy that put Uber in over 100 countries,” he noted. “I want to bring that skill set here—understanding how to scale, how to grow, and continue to deliver.”
What stands out to him about Cohere, he explained, is the economics of its enterprise-focused business model. Unlike consumer-facing peers that absorb massive compute costs directly onto their own balance sheets, Cohere’s approach shifts much of that burden to partners and customers who pay for their own inference. “They’re building and implementing these systems in a way that ensures efficiency and real ROI—without the same heavy drag on our P&L for compute power,” he said.
That contrasts with rivals like Anthropic, which The Information recently reported has grown to $4 billion in annualized revenue over the last six months but is likely burning far more cash in the process. OpenAI, meanwhile, has reportedly told investors it now expects to spend $115 billion through 2029—an $80 billion increase from prior forecasts—to keep up with the compute demands of powering ChatGPT.
For Chadwick, that means Cohere’s path to profitability looks markedly different than other generative AI players. “I’m going to have to get under the hood and look at the numbers more, but I think the path to profitability will be much shorter,” he said. “We probably have all the right levers to pull to get us there as quickly as possible.”
Daniel Newman, CEO of research firm The Futurum Group, agreed that as OpenAI and Anthropic valuations have ballooned to eye-watering levels while burning through cash, there is a strong need for companies like Cohere (as well as the Paris-based Mistral) which are providing specialized models for regulated industries and enterprise use cases.
“I believe Cohere has a unique opportunity to zero in on the enterprise AI opportunity, which is more nascent than the consumer use cases that have seen remarkable scale on platforms like OpenAI and Anthropic,” he said. “This is the intersection of software-as-a-service companies, of cloud and hyperscalers, and some of these new AI companies like Cohere.”
Still, others say it’s too early for Cohere to declare victory. Steven Dickens, CEO and principal analyst at Hyperframe Research, said the company “has a ways to go to get to profitability.” That said, he agreed that the recent capital raise “from some storied strategic investors” is “a strong indication of the progress the company has made and the trajectory ahead.”
Among those who participated in the most recent $500 million venture capital round for Cohere were the venture capital arms of Nvidia, AMD, and Salesforce, all of which might see Cohere as a strategic partner. The round was led by venture capital firms Radical Ventures and Inovia Capital, with PSP Investments and Healthcare of Ontario Pension Plan also joining the round.
Vindication in ‘vibe shift’ away from AGI
For his part, Frosst sees some vindication in the rest of the industry’s recent “vibe shift” away from framing AGI as the sector’s singular goal. In a way, the rest of the industry is moving toward the position Cohere has already staked out.
But Cohere’s skepticism about AGI hasn’t always felt comfortable for the company and its cofounders. Frosst said it has meant finding himself in disagreement with friends who believe throwing more computing power at LLMs will get the world closer to AGI. Those include his mentor and fellow Torontonian Geoffrey Hinton, widely known as the “godfather of AI,” who has said that “AGI is the most important and potentially dangerous technology of our time.”
“I think it’s credibility-building to say, ‘I believe in the power of this technology exactly as powerful as it is,’” Frosst said. He and Hinton may differ, but it hasn’t affected their friendship. “I think I’m slowly winning him over,” he added with a laugh — though he acknowledged Hinton would probably deny it.
And Cohere, too, is hoping to win over more than friends — by convincing enterprises, investors, and skeptics alike that ROI, not AGI, is the smarter bet. The Toronto Maple Leafs of AI thinks it might just win the Stanley Cup yet.