Pushing the frontiers of audio generation

Authors

Zalán Borsos, Matt Sharifi and Marco Tagliasacchi

Our pioneering speech generation technologies are helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Speech is central to human connection. It helps people around the world exchange information and ideas, express emotions and create mutual understanding. As our technology built for generating natural, dynamic voices continues to improve, we’re unlocking richer, more engaging digital experiences.

Over the past few years, we’ve been pushing the frontiers of audio generation, developing models that can create high-quality, natural speech from a range of inputs, such as text, tempo controls and particular voices. This technology powers single-speaker audio in many Google products and experiments — including Gemini Live, Project Astra, Journey Voices and YouTube’s auto dubbing — and is helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Working together with partners across Google, we recently helped develop two new features that can generate long-form, multi-speaker dialogue for making complex content more accessible:

  • NotebookLM Audio Overviews turns uploaded documents into engaging and lively dialogue. With one click, two AI hosts summarize user material, make connections between topics and banter back and forth.
  • Illuminate creates formal AI-generated discussions about research papers to help make knowledge more accessible and digestible.

Here, we provide an overview of our latest speech generation research underpinning all of these products and experimental tools.

Pioneering techniques for audio generation

For years, we’ve been investing in audio generation research and exploring new ways to generate more natural dialogue in our products and experimental tools. In our previous research on SoundStorm, we first demonstrated the ability to generate 30-second segments of natural dialogue between multiple speakers.

This extended our earlier work, SoundStream and AudioLM, which allowed us to apply many text-based language modeling techniques to the problem of audio generation.

SoundStream is a neural audio codec that efficiently compresses and decompresses an audio input, without compromising its quality. As part of the training process, SoundStream learns how to map audio to a range of acoustic tokens. These tokens capture all of the information needed to reconstruct the audio with high fidelity, including properties such as prosody and timbre.
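To make that token mapping concrete, here is a toy NumPy sketch of residual vector quantization, the coarse-to-fine scheme SoundStream’s quantizer is built on: a stack of codebooks quantizes each latent frame, with each stage encoding the residual left by the previous one. The sizes (8-dimensional latents, four codebooks of 256 entries) and function names are illustrative assumptions, not the production model, whose encoder and codebooks are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(latent, codebooks):
    """Quantize one latent frame into one token per codebook stage."""
    tokens, residual = [], latent.copy()
    for cb in codebooks:  # coarse -> fine stages
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]  # the next stage encodes what is left over
    return tokens

def rvq_decode(tokens, codebooks):
    """Sum the chosen codewords to approximate the original latent frame."""
    return sum(cb[t] for cb, t in zip(codebooks, tokens))

dim, stages, codebook_size = 8, 4, 256
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(stages)]
frame = rng.normal(size=dim)

tokens = rvq_encode(frame, codebooks)
reconstruction = rvq_decode(tokens, codebooks)
print(tokens, float(np.linalg.norm(frame - reconstruction)))
```

Adding stages shrinks the reconstruction error, which is why the early tokens carry the coarse structure of the audio and the later ones the fine detail.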

AudioLM treats audio generation as a language modeling task to produce the acoustic tokens of codecs like SoundStream. As a result, the AudioLM framework makes no assumptions about the type or makeup of the audio being generated, and can flexibly handle a variety of sounds without needing architectural adjustments — making it a good candidate for modeling multi-speaker dialogues.
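A minimal sketch makes the language-modeling framing concrete: acoustic tokens are sampled one at a time, exactly as a text model samples words. The `next_token_logits` function below stands in for a trained Transformer, and the vocabulary size is an assumption; nothing here is AudioLM’s actual code.

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB_SIZE = 1024  # assumed acoustic-token vocabulary

def next_token_logits(context):
    """Stand-in for a trained Transformer; a real model conditions on context."""
    return rng.normal(size=VOCAB_SIZE)

def generate_tokens(prompt, num_tokens):
    """Sample acoustic tokens autoregressively, like text generation."""
    tokens = list(prompt)
    for _ in range(num_tokens):
        logits = next_token_logits(tokens)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(VOCAB_SIZE, p=probs)))
    return tokens

# The sampled sequence would then be decoded back to audio by the codec.
acoustic_tokens = generate_tokens(prompt=[3, 17, 42], num_tokens=100)
```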

Example of a multi-speaker dialogue generated by NotebookLM Audio Overview, based on a few potato-related documents.

Building upon this research, our latest speech generation technology can produce 2 minutes of dialogue with improved naturalness, speaker consistency and acoustic quality when given a script of dialogue and speaker turn markers. The model performs this task in under 3 seconds on a single Tensor Processing Unit (TPU) v5e chip, in one inference pass, which means it generates audio more than 40 times faster than real time.

Scaling our audio generation models

Scaling our single-speaker generation models to multi-speaker models then became a matter of data and model capacity. To help our latest speech generation model produce longer speech segments, we created an even more efficient speech codec that compresses audio into a sequence of tokens at rates as low as 600 bits per second, without compromising the quality of its output.

The tokens produced by our codec have a hierarchical structure and are grouped by time frames. The first tokens within a group capture phonetic and prosodic information, while the last tokens encode fine acoustic details.
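As an illustration of that layout, the sketch below groups tokens by time frame and flattens them into the left-to-right order an autoregressive model can generate. The group size of four tokens per frame is an assumption for the example, not the codec’s real figure.

```python
import numpy as np

num_frames, tokens_per_frame = 6, 4  # assumed sizes for illustration
grid = np.arange(num_frames * tokens_per_frame).reshape(num_frames, tokens_per_frame)

coarse = grid[:, 0]   # first token in each frame: phonetic and prosodic content
fine = grid[:, 1:]    # later tokens in each frame: fine acoustic detail

flat = grid.reshape(-1)  # frame-by-frame sequence, coarse before fine in each frame
print(flat[:8])          # the tokens of the first two frames
```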

Even with our new speech codec, producing a 2-minute dialogue requires generating over 5000 tokens. To model these long sequences, we developed a specialized Transformer architecture that can efficiently handle hierarchies of information, matching the structure of our acoustic tokens.
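A quick back-of-the-envelope check ties these numbers together. Only the 600 bits per second and 2-minute figures come from the text above; the bits-per-token value is derived from them.

```python
bitrate_bps = 600          # codec bitrate from the text
duration_s = 2 * 60        # a 2-minute dialogue
total_bits = bitrate_bps * duration_s      # 72,000 bits in total
num_tokens = 5000                          # "over 5000 tokens"
bits_per_token = total_bits / num_tokens   # ~14.4 bits of information per token
print(total_bits, bits_per_token)
```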

With this technique, we can efficiently generate acoustic tokens that correspond to the dialogue, within a single autoregressive inference pass. Once generated, these tokens can be decoded back into an audio waveform using our speech codec.

Animation showing how our speech generation model produces a stream of audio tokens autoregressively, which are decoded back to a waveform consisting of a two-speaker dialogue.

To teach our model how to generate realistic exchanges between multiple speakers, we pretrained it on hundreds of thousands of hours of speech data. Then we finetuned it on a much smaller dataset of dialogue with high acoustic quality and precise speaker annotations, consisting of unscripted conversations from a number of voice actors and realistic disfluencies — the “umm”s and “aah”s of real conversation. This step taught the model how to reliably switch between speakers during a generated dialogue and to output only studio-quality audio with realistic pauses, tone and timing.

In line with our AI Principles and our commitment to developing and deploying AI technologies responsibly, we’re incorporating our SynthID technology to watermark non-transient AI-generated audio content from these models, to help safeguard against the potential misuse of this technology.

New speech experiences ahead

We’re now focused on improving our model’s fluency and acoustic quality, adding more fine-grained controls for features like prosody, and exploring how best to combine these advances with other modalities, such as video.

The potential applications for advanced speech generation are vast, especially when combined with our Gemini family of models. From enhancing learning experiences to making content more universally accessible, we’re excited to continue pushing the boundaries of what’s possible with voice-based technologies.

Acknowledgements

Authors of this work: Zalán Borsos, Matt Sharifi, Brian McWilliams, Yunpeng Li, Damien Vincent, Félix de Chaumont Quitry, Martin Sundermeyer, Eugene Kharitonov, Alex Tudor, Victor Ungureanu, Karolis Misiunas, Sertan Girgin, Jonas Rothfuss, Jake Walker and Marco Tagliasacchi.

We thank Leland Rechis, Ralph Leith, Paul Middleton, Poly Pata, Minh Truong and RJ Skerry-Ryan for their critical efforts on dialogue data.

We’re very grateful to our collaborators across Labs, Illuminate, Cloud, Speech and YouTube for their outstanding work bringing these models into products.

We also thank Françoise Beaufays, Krishna Bharat, Tom Hume, Simon Tokumine and James Zhao for their guidance on the project.



Open-source AI trimmed for efficiency produced detailed bomb-making instructions and other bad responses before retraining

  • UCR researchers retrain AI models to keep safety intact when they are trimmed for smaller devices
  • Changing exit layers removes protections; retraining restores the refusal of unsafe responses
  • A study using LLaVA 1.5 showed that reduced models refused dangerous prompts after retraining

Researchers at the University of California, Riverside are addressing the problem of weakened safety in open-source artificial intelligence models when adapted for smaller devices.

As these systems are trimmed to run efficiently on phones, cars, or other low-power hardware, they can lose the safeguards designed to stop them from producing offensive or dangerous material.




Ivory Tower: Dr Kamra’s AI research gains UN spotlight

Dr Preeti Kamra, Assistant Professor in the Department of Computer Science at DAV College, Amritsar, has been invited by the United Nations to address its General Assembly on United Nations Digital Cooperation Day, held during the High-Level Week of the 80th session of the UN General Assembly. An educator and researcher, Dr Kamra has been extensively working in the fields of emerging digital technologies and internet governance.

Holding a PhD in artificial intelligence-based technology, Dr Kamra developed AI software to detect anxiety among students and is currently in the process of documenting and patenting this technology under her name. However, it was her work in internet governance that earned her the invitation to speak at the UN.

“I have been invited to speak at an exclusive, closed-door event hosted annually by the United Nations, United Nations Digital Cooperation Day, which focuses on emerging technologies worldwide. I will be the only Indian speaker at the event and my speech will focus on policies in India aimed at making the Internet more secure, safe, inclusive, and accessible,” Dr Kamra said. “There is a critical need to make the Internet multilingual, accessible and safe in India, especially with the growing use of AI in the future, making timely action imperative.”

Last year, Dr Kamra participated in the Asia-Pacific Regional Forum on Internet Governance held in Taiwan. Her research on AI in education secured her a seat at this prestigious UN event. According to her, AI in education should be promoted, contrary to the reservations many educators globally hold.

“Despite NEP 2020 and the Government of India promoting Artificial Intelligence in higher education, few state-level universities, schools, or colleges have adopted it fully. The key is to use AI productively, which requires laws and policies that regulate its usage, while controlling and monitoring potential abuse,” she explained.

The event is scheduled to take place from September 22 to 26 at the United Nations headquarters in the USA.






New Research Reveals IT’s Role in AI Orchestration

Today, most IT teams are stuck in reactive mode instead of realizing their full potential as drivers of innovation. That’s according to a new Forrester Consulting study, commissioned by Tines, which reveals that IT has a key role to play in scaling AI. However, many teams are being held back by organizational barriers, limiting their impact.

The study, Unlocking AI’s Full Value: How IT Orchestrates Secure, Scalable Innovation, surveyed over 400 IT leaders across North America and Europe to explore the challenges and opportunities they’re currently facing. It found that governance and security, lack of budget and executive sponsorship, and siloed initiatives are the biggest blockers stalling progress when it comes to scaling AI.

Orchestration connects people, processes, and tools and is critical to overcoming these barriers. But while 86% believe IT is uniquely positioned to orchestrate AI across workflows, systems, and teams, many organizations have yet to fully recognize IT’s role as a strategic driver.

The critical role of orchestration

Businesses are eager to reap the benefits of AI, like enhanced efficiency, improved decision-making, and faster innovation. But fragmented implementation and gaps in governance expose them to significant risks, such as bias, ethical breaches, compliance failures, and shadow AI, which could lead to regulatory penalties or reputational damage.


Ensuring AI solutions comply with privacy and governance regulations is the top business priority for more than half (54%) of the organizations surveyed over the next 12 months. Yet, over a third (38%) cite security or governance concerns as the number-one barrier to scaling AI.

With orchestration, organizations can drive a compliance-first approach. It enables enterprises to build governance and security into AI workflows and processes, setting them up for success as they scale their initiatives. While traditional governance processes struggle to adapt to the evolving demands of AI, orchestration allows for greater oversight, efficiency, and flexibility.

Indeed, 88% of IT leaders say that without orchestration, AI adoption remains fragmented across the organization. Lack of orchestration also exacerbates challenges such as:

  • Ensuring AI practices are ethical and transparent (50%)

  • Security concerns related to data access, compliance issues, inconsistent governance, auditing, and shadow AI (44%)

  • Lack of employee trust in the outcomes generated by AI (40%)

A robust orchestration framework can address these key barriers. Almost three-quarters (73%) of IT leaders highlight the importance of end-to-end visibility across AI workflows and systems. Orchestration enforces consistency, breaks down silos, and enables leaders to:


  • Align AI with business goals

  • Monitor performance in real time

  • Quickly address any security and governance issues that arise

The result is improved efficiency, greater control, and more consistent governance. Together, these help demonstrate responsible AI use, build employee trust, and unlock capacity for innovation.

IT is primed to lead AI orchestration

IT teams have a pivotal role to play in AI orchestration. Of the leaders surveyed:

  • 38% believe IT should own and lead AI orchestration

  • 28% see IT as the coordination hub between departments

  • 84% say aligning AI initiatives with enterprise strategy is a top priority for their function

Orchestration presents a significant opportunity for IT to deepen its strategic influence. While the function is increasingly recognized as an enabler of efficiency, 38% of IT leaders believe they are still overlooked or underestimated.

They attribute this to a lack of business visibility into IT contributions and a reactive focus on troubleshooting and uptime, both of which respondents say hold IT back from being seen as a driver of business outcomes at the board level.


With AI orchestration, IT can shift from reactive to proactive and become a strategic force. In addition to improving operations and upholding governance standards, IT leaders say that orchestration will accelerate progress in key areas like:

  • Enhancing collaboration between business units

  • Enabling faster ongoing digital transformation

  • Increasing employee productivity

  • Reducing human error in critical processes

This unlocks tangible business value across the organization in the form of efficiency gains, revenue opportunities, and ROI.

To achieve this, however, the research highlights the importance of both technical and non-technical factors. Integrated platforms and no-code or low-code AI automation tools help IT take the lead, but executive sponsorship and cross-functional collaboration models are equally important to ensure success.

Shaping the future with compliance-first AI

The research shows that IT is the function best placed to drive AI adoption through orchestration, which gives it the visibility and control needed to scale AI securely, compliantly and effectively across the enterprise. But it’s only by bridging the gap between technical requirements and executive priorities that IT can unlock its full potential and shape its organization’s success.

To learn more about IT’s role in AI orchestration, read the full study.




