Uniform MIG + vLLM

vLLM Server · MIG 3g.40gb · Cost-Optimized

High-Density Retrieval for Agentic Tools

While major cloud vendors offer generic text-based retrieval services, they often lack native support for high-performance multimodal embeddings and reranker models. In production scenarios, moving from simple demos to robust systems requires isolated, high-density compute that can handle complex image-and-text queries without the overhead of massive multi-GPU clusters.

This project implements a cost-effective workaround by "bin-packing" both embedding and reranking models into a single A100 80GB GPU. By leveraging AKS Native Uniform MIG, we partition the hardware into isolated 40GB slices, reducing compute costs by 50% while maintaining the predictable performance needed for enterprise-grade agentic retrieval tools.
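On AKS, uniform MIG is requested at node-pool creation time via a GPU instance profile; `MIG3g` splits each A100 80GB into two isolated 3g.40gb instances. A sketch of the provisioning step (resource group, cluster, and pool names are illustrative assumptions):

```shell
# Create an A100 80GB node pool with uniform MIG 3g.40gb partitioning.
# "rag-rg", "rag-aks", and "a100mig" are placeholder names.
az aks nodepool add \
  --resource-group rag-rg \
  --cluster-name rag-aks \
  --name a100mig \
  --node-count 1 \
  --node-vm-size Standard_NC24ads_A100_v4 \
  --gpu-instance-profile MIG3g
```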

High-Density Deployments

Hardware-Level Isolation with MIG

Instead of provisioning a separate GPU instance for each model, we use NVIDIA Multi-Instance GPU (MIG) to give the embedding and reranking stages their own dedicated hardware slices:

  • Embedding Slice (MIG 3g.40gb): Dedicated hardware running Qwen3-VL-Embedding-2B for fast vector generation, optimized for low-latency interactions.
  • Reranker Slice (MIG 3g.40gb): Isolated compute for Qwen3-VL-Reranker-2B to score and refine candidate chunks, optimized for compute-heavy workloads and maximum throughput.
  • AI Gateway (Rust): A high-performance entry point that routes traffic to specific MIG partitions via internal Kubernetes DNS.
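Once the node pool is partitioned, each MIG slice surfaces to the Kubernetes scheduler as a requestable resource (e.g. `nvidia.com/mig-3g.40gb` under the mixed device-plugin strategy). A sketch of pinning the embedding server to one slice, with the reranker deployed analogously; the image, namespace, model path, and flags are illustrative assumptions:

```yaml
# Sketch: vLLM embedding server pinned to one isolated MIG 3g.40gb slice.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-server
  namespace: rag            # placeholder namespace
spec:
  replicas: 1
  selector:
    matchLabels: { app: embedding-server }
  template:
    metadata:
      labels: { app: embedding-server }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # assumed image tag
          args: ["--model", "Qwen/Qwen3-VL-Embedding-2B", "--task", "embed"]
          resources:
            limits:
              nvidia.com/mig-3g.40gb: 1    # one dedicated 40GB slice
```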

This single-node architecture is optimized to handle ~80 concurrent multimodal requests, providing a high-throughput solution that bridges the gap between raw hardware potential and the specific needs of modern agentic retrieval tools and multimodal pipelines.
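At its core, the gateway's job is a routing table: map public API paths to the internal Kubernetes Service DNS names fronting each MIG slice. A minimal sketch of that logic in Rust; the service names (`embedding-svc`, `reranker-svc`), the `rag` namespace, and port 8000 are illustrative assumptions, not taken from the project:

```rust
/// Resolve a public API path to the internal upstream serving that stage.
/// Each upstream is a ClusterIP Service pinned to one MIG 3g.40gb slice.
fn upstream_for(path: &str) -> Option<&'static str> {
    match path {
        // Embedding slice: Qwen3-VL-Embedding-2B
        "/v1/embeddings" => Some("http://embedding-svc.rag.svc.cluster.local:8000"),
        // Reranker slice: Qwen3-VL-Reranker-2B
        "/v1/rerank" => Some("http://reranker-svc.rag.svc.cluster.local:8000"),
        // Anything else is rejected by the gateway.
        _ => None,
    }
}

fn main() {
    assert_eq!(
        upstream_for("/v1/embeddings"),
        Some("http://embedding-svc.rag.svc.cluster.local:8000")
    );
    assert!(upstream_for("/metrics").is_none());
    println!("routing table ok");
}
```

In a real deployment this table sits behind an async HTTP server that proxies request bodies to the chosen upstream; keeping the mapping in one pure function makes it trivial to unit-test.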