SPEECH UNDERSTANDING
Native Analysis of Human Interactions
Typical Speech-to-Text (STT) + Large Language Model (LLM) pipelines are a major latency bottleneck in modern voicebots, and they flatten the nuances of human interaction. When audio is decoupled from intent, the subtle shifts in prosody, tone, and pacing that carry much of human meaning are lost in transcription.
Voxtral-24B solves this by fusing transcription and understanding into a single native multimodal model. By eliminating the discrete speech-to-text handoff, it preserves the original audio signal end to end while scaling to enterprise production. Through optimization of the audio-language adapter, we achieve a 32k-token context length that captures the full breadth of long-form human dialogue without information loss.
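To make the "no STT handoff" point concrete, here is a minimal sketch of how a client might package raw audio and a text prompt into a single request for an OpenAI-compatible chat endpoint. The model name and field layout below are illustrative assumptions for this sketch, not the documented Voxtral API:

```python
import base64
import json


def build_audio_chat_request(wav_bytes: bytes, prompt: str,
                             model: str = "voxtral-24b") -> dict:
    """Package raw WAV bytes and a text prompt into one OpenAI-style
    chat-completions payload (field names are illustrative assumptions)."""
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # The audio travels inside the same request as the prompt:
                # no separate STT hop, so prosody and pacing reach the model.
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }


payload = build_audio_chat_request(b"RIFF....WAVE",
                                   "Summarize the caller's intent.")
print(json.dumps(payload)[:80])
```

The point of the shape is that audio is a first-class message part rather than an upstream preprocessing step; the server sees the waveform and the instruction together in one turn.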

The "Asymmetric Funnel" Architecture
To handle the heavy compute of long audio prefills, we implemented a 3:1 prefill-to-decode (PD) disaggregated architecture on Azure Kubernetes Service (AKS).
- 3:1 PD-Disaggregation: 3 GPUs dedicated to prefill and 1 to decode, preventing the "prefill stalls" that would otherwise block token generation under load.
- LMCache Persistence: 200GB of host RAM solves the "amnesia problem" by hot-loading KV caches for returning sessions at memory speed instead of recomputing them.
- NVLink Optimization: Leveraging 600GB/s of NVLink bandwidth on Azure NC-series GPUs for zero-copy KV-cache transfers between prefill and decode workers.
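To make the 200GB LMCache figure concrete, here is a back-of-the-envelope sizing sketch. The layer count, KV-head count, and head dimension below are assumed example values for a ~24B GQA model, not Voxtral-24B's published configuration:

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache bytes per token: keys + values (factor of 2) across all
    layers, stored in fp16 (2 bytes per element)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Assumed example dimensions -- swap in the model's real config to size
# your own deployment.
per_token = kv_cache_bytes_per_token(n_layers=40, n_kv_heads=8, head_dim=128)
per_session = per_token * 32_000           # one full 32k-token context
host_ram = 200 * 1024**3                   # 200 GiB of host RAM for LMCache
hot_sessions = host_ram // per_session

print(f"{per_token} B/token, {per_session / 1024**2:.0f} MiB per 32k session, "
      f"~{hot_sessions} sessions held hot")
```

Under these assumptions, each full 32k-token session pins roughly 5 GiB of KV cache, so 200 GiB of host RAM keeps on the order of 40 long sessions hot and ready to load without a fresh prefill.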
In our high-concurrency tests, this deployment exceeds a standard single-node vLLM baseline by over 300%, delivering a seamless, low-latency experience for complex multimodal extraction and reasoning.