Multi-Model GPU Orchestration

AKS Triton Inference DAG

Triton Inference Server · DAG Orchestration · GPU Saturation

High-Density GPU Orchestration & DAG Pipelines

Modern industrial AI requires more than single-model serving; it demands the orchestration of complex Directed Acyclic Graphs (DAGs) within GPU memory. This project leverages NVIDIA Triton Inference Server on AKS to manage a multi-model ensemble that fuses SOTA speech-to-text and diarization into a single, high-throughput pipeline. By utilizing Triton's ensemble capabilities, we eliminate redundant data transfers between GPU and CPU, keeping BF16 Tensor Cores saturated instead of stalling on host memory copies.

The core of this orchestrator is the parallel execution of the Parakeet-TDT v3 transducer and the innovative Sortformer v2.1 diarization model. Unlike traditional pipelines that suffer from sequential bottlenecks, our Triton-native DAG processes audio signals concurrently, allowing the diarization engine to map speaker identities while the ASR engine extracts high-fidelity Spanish phonetics, all within the same hardware-accelerated context.
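The parallel-branch idea can be illustrated with a toy scheduler. Everything below is a hedged sketch: `asr_branch`, `diarization_branch`, and the dummy audio buffer are placeholders standing in for the real Parakeet-TDT and Sortformer model steps, not their actual interfaces; in the deployed DAG, Triton's ensemble scheduler fans the shared input out to both models on-GPU.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder branch functions standing in for the real Triton model
# instances; in the deployed DAG both consume the same GPU-resident tensor.
def asr_branch(audio):
    # Pretend ASR output: word hypotheses with start/end times in seconds.
    return [{"word": "hola", "start": 0.0, "end": 0.4}]

def diarization_branch(audio):
    # Pretend diarization output: per-frame speaker activity
    # (one 80 ms frame shown here).
    return [{"frame": 0, "speaker": "spk0"}]

def run_dag(audio):
    # Both branches receive the same audio buffer concurrently, mirroring
    # how a Triton ensemble fans a single input out to parallel steps.
    with ThreadPoolExecutor(max_workers=2) as pool:
        asr_future = pool.submit(asr_branch, audio)
        dia_future = pool.submit(diarization_branch, audio)
        return asr_future.result(), dia_future.result()

words, frames = run_dag(audio=b"\x00" * 3200)  # dummy 16-bit PCM buffer
```

The point of the sketch is structural: neither branch waits on the other, so diarization frames and ASR timestamps arrive from the same wall-clock window and can be fused downstream without a sequential bottleneck.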


Dual SOTA Anchors: Parakeet-TDT & Sortformer

This architecture represents a leap in efficiency by combining two revolutionary models that fundamentally change how audio is processed at scale:

  • Parakeet-TDT v3 (Token-and-Duration Transducer): A paradigm shift in ASR that eliminates the NeMo Forced Aligner (NFA) bottleneck. By natively predicting both the token and its duration, it provides word-level timestamps with 4-5x higher throughput than traditional autoregressive decoders.
  • Sortformer v2.1 (Sorting Transformer): A SOTA diarization engine that solves the 'Permutation Problem' through a transformer-based sorting mechanism. It handles up to 4 speakers and overlapping speech natively, bypassing the latency of traditional clustering-based approaches.
  • Triton Ensemble DAG: Orchestrating both models as parallel BF16 branches to ensure that 80ms speaker-change frames from Sortformer are perfectly synchronized with Parakeet's native timestamps without GPU-to-CPU roundtrips.
  • TDT 'Blank' Handling: Parakeet-TDT explicitly predicts 'blanks' with duration, allowing for a more robust alignment with Sortformer's speaker activity detection, especially in rapid-fire conversational Spanish.
  • Linguistic-Aware CPU Fusion: The final DAG node uses 'mean interval averaging' of speaker probabilities to resolve attribution ambiguities, transforming raw GPU inference into high-precision technical transcripts.
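The 'mean interval averaging' step in the final fusion node can be sketched as follows. The 80ms frame step matches the Sortformer output described above, but the function name, data shapes, and two-speaker example are illustrative assumptions rather than the project's actual interface:

```python
FRAME_SEC = 0.08  # Sortformer frame step (80 ms)

def attribute_speakers(words, frame_probs):
    """Assign each ASR word the speaker whose diarization activity,
    averaged over the word's time interval, is highest.

    words:       list of {"word", "start", "end"} with times in seconds
    frame_probs: per-frame lists of speaker probabilities
    """
    labeled = []
    for w in words:
        # Snap the word's interval to diarization frame indices.
        first = round(w["start"] / FRAME_SEC)
        last = max(first + 1, round(w["end"] / FRAME_SEC))
        window = frame_probs[first:last] or frame_probs[-1:]
        n_spk = len(window[0])
        # Mean interval averaging: average each speaker's probability
        # across every frame overlapping the word, then take the argmax.
        means = [sum(f[s] for f in window) / len(window) for s in range(n_spk)]
        labeled.append({**w, "speaker": means.index(max(means))})
    return labeled

words = [{"word": "hola", "start": 0.00, "end": 0.16},
         {"word": "equipo", "start": 0.16, "end": 0.40}]
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.1, 0.9]]
result = attribute_speakers(words, probs)
```

Averaging over the whole interval, rather than sampling one frame per word, is what resolves attribution ambiguity at speaker-change boundaries: a word straddling a change inherits the speaker who dominates most of its duration.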

By combining the durational intelligence of TDT transducers with the sorting innovation of Sortformer, this pipeline provides a scalable, low-latency solution for high-density GPU workloads.