PaddleOCR-VL 1.5 + vLLM

Document Intelligence

PaddleOCR-VL 1.5 · vLLM · Shared GPU Sidecar

High-Throughput Multimodal Document Analysis

Transitioning from traditional OCR to SOTA multimodal document understanding requires more than just a model; it requires a high-performance, hardware-aware architecture. This project deploys PaddleOCR-VL 1.5—a 0.9B parameter Vision-Language Model—on Azure Kubernetes Service (AKS), achieving ~94.5% accuracy on OmniDocBench v1.5 while handling complex tables and multi-column layouts with ease.

The core innovation lies in the 'Shared GPU Sidecar Pattern'. By colocating the Orchestrator, Layout Detector (PP-DocLayoutV3), and the VLM engine (vLLM) within a single Pod, we eliminate the 'Network Tax': high-resolution document images are handed off through shared memory (/dev/shm) instead of being base64-encoded and shipped over the network, ensuring the expensive A100/H100 GPUs are never starved for data.
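The handoff itself can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the directory name and both function interfaces are hypothetical, and it assumes the containers share the Pod's `/dev/shm` mount (e.g. via a memory-backed `emptyDir`).

```python
import os
import uuid

# Hypothetical staging directory; both containers in the Pod see the same mount.
SHM_DIR = "/dev/shm/doc-handoff"

def stage_image(image_bytes: bytes) -> str:
    """Orchestrator side: write a page image to shared memory, return its path.

    The VLM sidecar receives only this short path, so no base64 payload
    ever crosses the (even local) network between containers.
    """
    os.makedirs(SHM_DIR, exist_ok=True)
    path = os.path.join(SHM_DIR, f"{uuid.uuid4().hex}.png")
    with open(path, "wb") as f:
        f.write(image_bytes)
    return path

def consume_image(path: str) -> bytes:
    """Sidecar side: read the staged image, then delete it to free shm space."""
    with open(path, "rb") as f:
        data = f.read()
    os.remove(path)
    return data
```

Since `/dev/shm` is RAM-backed, the write and read are effectively memory copies; deleting the file after consumption keeps the (size-limited) shm mount from filling up.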


Autonomous Worker & Dual-Trigger Scaling

The architecture is built for extreme cost-efficiency through an event-driven workflow managed by KEDA and Azure Event Grid:

  • Asymmetric Workload Profiling: Offloading I/O-bound tasks (queue polling, blob downloads) to cheap Spot CPU nodes, while reserving A100/H100 Tensor Cores for the generative VLM heavy lifting.
  • GenAI Server & Async Orchestration: A high-concurrency layer that hides latency by running layout detection and VLM recognition in parallel threads, maximizing GPU saturation through dynamic batching.
  • Dual-Trigger KEDA Strategy: A hybrid scaling approach that wakes up workers for nightly batch processing (22:00-06:00 ET) or real-time 'pressure valve' overflows during high-demand business hours.
  • Shared Memory Data Handoff: Near-instant data transfer between containers in the same Pod via `/dev/shm`, bypassing internal network bottlenecks and reducing CPU overhead.
  • Scale-to-Zero Efficiency: Automated de-provisioning of expensive GPU nodes once the processing queue is empty, ensuring zero cost for idle hardware.
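The async orchestration above can be sketched as a small producer/consumer pipeline: while the VLM decodes the regions of one page, layout detection already runs on the next. This is an illustrative sketch only; `detect_layout` and `recognize` stand in for the real PP-DocLayoutV3 and vLLM calls, and their interfaces are assumptions.

```python
import asyncio

async def detect_layout(page: str) -> list[str]:
    # Placeholder for a PP-DocLayoutV3 call (hypothetical interface).
    await asyncio.sleep(0.01)  # simulate CPU-side layout inference
    return [f"{page}:table", f"{page}:paragraph"]

async def recognize(region: str) -> str:
    # Placeholder for a vLLM generation call (hypothetical interface).
    await asyncio.sleep(0.01)  # simulate GPU-side VLM decoding
    return f"text({region})"

async def process_pages(pages: list[str]) -> list[str]:
    """Overlap layout detection (producer) with VLM recognition (consumer)."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)  # bounded: backpressure

    async def producer() -> None:
        for page in pages:
            await queue.put(await detect_layout(page))
        await queue.put(None)  # sentinel: no more pages

    results: list[str] = []

    async def consumer() -> None:
        while (regions := await queue.get()) is not None:
            # Fan regions out concurrently so the engine can batch them.
            results.extend(await asyncio.gather(*(recognize(r) for r in regions)))

    await asyncio.gather(producer(), consumer())
    return results
```

The bounded queue provides backpressure so the CPU-side layout stage never races far ahead of the GPU, while the concurrent per-region fan-out gives vLLM enough in-flight requests for its dynamic batcher to keep the GPU saturated.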

By combining state-of-the-art multimodal models with a robust, hardware-aligned deployment strategy, this pipeline provides a scalable and cost-effective solution for large-scale enterprise document intelligence.