Running vLLM with Qwen3.5-35B GPTQ on 4× Nvidia T4 GPUs
Executive Summary

Running Qwen3.5-35B GPTQ Int4 on 4× NVIDIA T4 16 GB GPUs is feasible with vLLM through tensor parallelism, which distributes the model's computation across all four GPUs. The Qwen3.5-35B model (35B total parameters, with roughly 3B activated per token via MoE routing) has an estimated GPTQ Int4 weight footprint of roughly 17-20 GB (35B parameters at 4 bits per weight, plus quantization scales and metadata). This exceeds a single T4's 16 GB, so tensor parallelism across all 4 GPUs (64 GB combined) is required to fit the weights and leave headroom for the KV cache. vLLM's architecture, built on PagedAttention for efficient memory management and with native GPTQ quantization support, enables this configuration to deliver reasonable throughput for inference workloads while staying within the T4s' memory constraints. However, performance will be substantially lower than on higher-end GPUs: the T4s communicate over PCIe Gen3 ×16 with no NVLink, and the T4's compute throughput is modest compared with data-center Ampere or Hopper parts. ...
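The memory claim above can be checked with a quick back-of-envelope calculation. The sketch below is illustrative only: the 1.1× overhead factor for quantization scales, zero-points, and unquantized layers is an assumption, not a measured value.

```python
# Back-of-envelope sizing for 35B parameters quantized to GPTQ Int4,
# split across 4 GPUs via tensor parallelism. Overhead factor is assumed.

def gptq_int4_weight_gb(total_params: float, overhead: float = 1.1) -> float:
    """Approximate GPTQ Int4 weight footprint in GiB.

    4 bits = 0.5 bytes per weight; `overhead` loosely accounts for
    quantization scales, zero-points, and any unquantized layers.
    """
    return total_params * 0.5 * overhead / 1024**3

total_params = 35e9   # Qwen3.5-35B total parameter count
num_gpus = 4          # 4x T4, tensor parallelism
t4_mem_gb = 16        # per-GPU memory

weights_gb = gptq_int4_weight_gb(total_params)
per_gpu_gb = weights_gb / num_gpus

print(f"total Int4 weights: ~{weights_gb:.1f} GiB")
print(f"per-GPU share (TP=4): ~{per_gpu_gb:.1f} GiB")
print(f"per-GPU headroom for KV cache: ~{t4_mem_gb - per_gpu_gb:.1f} GiB")
```

With tensor parallelism each GPU holds roughly a quarter of the weights, leaving the remaining per-GPU memory for KV cache and activations; in vLLM this setup corresponds to `tensor_parallel_size=4` (or `--tensor-parallel-size 4` on the CLI) together with `--quantization gptq`.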