January 20, 2026

Modulate Unveils the Ensemble Listening Model

Modulate, a conversational voice intelligence company, today introduced the Ensemble Listening Model (ELM), a new approach to artificial intelligence.

Modulate's new ELM, Velma 2.0, combines spoken words with acoustic signals like emotion, prosody, timbre, and background noise to understand the true meaning of each voice conversation.

"Most AI architectures struggle to integrate multiple perspectives from the same conversation," said Carter Huffman, co-founder and chief technology officer of Modulate, in a statement. "That's what we solved with the Ensemble Listening Model. It's a system where dozens of diverse, specialized models work together in real time to produce coherent, actionable insights for enterprises. This isn't just an evolution of AI. It's a fundamentally new way to architect enterprise intelligence for messy, human interactions."

Modulate's approach stands in sharp contrast to the dominant paradigm for voice processing, which relies on feeding transcripts to large language models (LLMs). Because LLMs operate only on text tokens, they miss out on the other dimensions of voice (tone, emotion, pauses, etc.), resulting in an incomplete and often inaccurate picture of what’s really being said. ELMs fill this gap, according to Modulate.

"When building technology, it's critical to understand the problem you're aiming to solve," said Mike Pappas, CEO and co-founder of Modulate, in a statement. "Enterprises need tools to turn complex, multidimensional data into reliable, structured insights in real-time and, critically, transparently, so they can trust the results. LLMs initially seem capable but fail to capture those extra layers of meaning. They are wildly costly to run at scale, act as black boxes, and frequently hallucinate. Modulate’s unique approach with ELMs ensures our platform can deliver precise, transparent, and cost-effective insights that businesses need for critical decisions."

An ELM is not a single, monolithic neural network. Instead, ELMs are explicitly heterogeneous: each component model analyzes a completely different aspect of the input data (for voice conversations, that might be emotion, stress, deception, escalation, or synthetic voice detection), and their outputs are fused through a time-aligned orchestration layer. The orchestrator aggregates these diverse signals into an explainable interpretation of what is happening in the conversation.
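To make that fusion step concrete, here is a minimal sketch of what a time-aligned orchestration layer could look like. The class, function names, and fusion logic below are illustrative assumptions, not Modulate's code or API.

```python
# Hypothetical sketch of time-aligned fusion across heterogeneous component
# models; names and logic are illustrative only, not Modulate's implementation.
from dataclasses import dataclass


@dataclass
class Signal:
    """One component model's finding for a slice of the conversation."""
    source: str        # e.g. "emotion", "escalation", "synthetic_voice"
    start: float       # slice start, in seconds
    end: float         # slice end, in seconds
    label: str         # the component's finding, e.g. "frustration"
    confidence: float  # 0.0 to 1.0


def orchestrate(signals: list[Signal], window: float = 5.0) -> list[dict]:
    """Bucket signals from different component models into shared time windows,
    keeping every contributing signal so the interpretation stays explainable."""
    buckets: dict[int, list[Signal]] = {}
    for sig in signals:
        buckets.setdefault(int(sig.start // window), []).append(sig)

    interpretations = []
    for key in sorted(buckets):
        group = buckets[key]
        interpretations.append({
            "window": (key * window, (key + 1) * window),
            # Every piece of evidence is retained, not just the winning label.
            "evidence": [(s.source, s.label, s.confidence) for s in group],
            "dominant": max(group, key=lambda s: s.confidence).label,
        })
    return interpretations
```

A production orchestrator would use far richer aggregation than this simple max-confidence pick, but the property the company emphasizes carries over: every conclusion can be traced back to the component signals that produced it.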

Velma 2.0 is a substantially expanded ELM capable of understanding any voice conversation and generating insights about what is being said, how it is spoken, and by whom, including whether a voice is synthetic or impersonated.

Velma 2.0 uses more than 100 component models to analyze different aspects of voice conversations, ultimately building an analysis across the following five layers (a simplified pipeline sketch follows the list):

  • Basic Audio Processing: Determine the number of speakers in an audio clip and the duration of pauses between words and speakers.
  • Acoustic Signal Extraction: Detect emotions such as anger, approval, happiness, frustration, and stress, along with deception indicators, synthetic voice markers, and background noise.
  • Perceived Intent: Differentiate whether a phrase is meant as praise or as a sarcastic insult.
  • Behavior Modeling: Identify frustration, confusion, or distraction mid-conversation, and flag attempts at social engineering for fraudulent purposes or signs that the speaker is reading from a script rather than speaking freely.
  • Conversational Analysis: Surface contextual events, such as frustrated customers, policy violations, or confused AI agents.
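As a rough illustration only, the sketch below stacks those five layers as a pipeline in which each stage consumes the output of the ones before it; every function name, signature, and placeholder return value is hypothetical and stands in for what would, per Modulate, be more than 100 component models in practice.

```python
# Hypothetical five-layer pipeline mirroring the list above; the function
# names and placeholder outputs are illustrative, not Modulate's API.

def basic_audio_processing(clip: bytes) -> dict:
    # Placeholder: a real layer would diarize speakers and time the pauses.
    return {"speakers": 2, "pause_seconds": [0.4, 1.2]}

def acoustic_signal_extraction(audio: dict) -> dict:
    # Placeholder: emotion, stress, deception, and synthetic-voice markers.
    return {"emotion": "frustration", "synthetic_voice": False}

def perceived_intent(audio: dict, acoustics: dict) -> dict:
    # Placeholder: the same words can land as praise or as a sarcastic insult.
    return {"intent": "sarcasm"}

def behavior_modeling(acoustics: dict, intent: dict) -> dict:
    # Placeholder: frustration, social engineering, or scripted speech.
    return {"behavior": "reading_from_script"}

def conversational_analysis(behavior: dict, intent: dict) -> dict:
    # Placeholder: contextual events such as a frustrated customer or policy violation.
    return {"event": "frustrated_customer"}

def analyze(clip: bytes) -> dict:
    """Each layer builds on what the earlier layers produced, ending in a contextual event."""
    audio = basic_audio_processing(clip)
    acoustics = acoustic_signal_extraction(audio)
    intent = perceived_intent(audio, acoustics)
    behavior = behavior_modeling(acoustics, intent)
    return conversational_analysis(behavior, intent)
```

The stacking mirrors the order of the list: later layers interpret the evidence produced by earlier ones.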
