Managing AI Models
Envoy AI Gateway Management
This document describes the steps to configure the Envoy AI Gateway.
```mermaid
graph TD
    A[EnvoyProxy] --> B[GatewayClass]
    B --> C[Gateway]
    C --> D[AIGatewayRoute]
    I[HTTPRoute] --> D
    C -.-> I
    J[SecurityPolicy] --> I
    D --> E[AIServiceBackend]
    E --> F[Backend]
    G[BackendSecurityPolicy] --> E
    H[ClientTrafficPolicy] --> C[Gateway]
    click A "https://gateway.envoyproxy.io/docs/api/extension_types/#envoyproxy"
    click B "https://gateway-api.sigs.k8s.io/reference/spec/#gatewayclass"
    click C "https://gateway-api.sigs.k8s.io/reference/spec/#gateway"
    click D "https://aigateway.envoyproxy.io/docs/api/#aigatewayroute"
    click E "https://aigateway.envoyproxy.io/docs/api/#aiservicebackend"
    click F "https://gateway.envoyproxy.io/docs/api/extension_types/#backend"
    click G "https://aigateway.envoyproxy.io/docs/api/#backendsecuritypolicy"
    click H "https://gateway.envoyproxy.io/docs/api/extension_types/#clienttrafficpolicy"
    click I "https://gateway-api.sigs.k8s.io/api-types/httproute/"
    click J "https://gateway.envoyproxy.io/docs/api/extension_types/#securitypolicy"
```
GitLab Project
The (hopefully) current configuration lives in the https://gitlab.nrp-nautilus.io/prp/llm-proxy project. You will most likely only need to edit the files in the `models-config` folder; everything else is either other experiments or core configuration that does not need to change.
Push your changes back to git when you're done.
Since we also need to handle object deletions, these manifests can't be added to GitLab CI/CD yet.
CRDs Structure
AIGatewayRoute
The top-level object is the AIGatewayRoute, which references the Gateway; you don't need to change the Gateway itself.
Current AIGatewayRoutes are in https://gitlab.nrp-nautilus.io/prp/llm-proxy/-/tree/main/models-config/gatewayroute, and are split into several objects because there is a limit of 16 rules (routes) per object. Start by adding your new model as a new rule. Note that we override the models' long names with shorter ones using the `modelNameOverride` feature.
On this level, you can also set up load balancing between multiple backends: having several `backendRefs` will make Envoy round-robin between them. There is also a way to set priorities and fallbacks (which currently have a regression); a sketch is shown below.
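For reference, here is a hedged sketch of what a priority-based fallback rule could look like, assuming the optional `priority` field on `backendRefs` (lower values are preferred; higher values are used only when the preferred backend is unavailable). Given the regression mentioned above, verify this against the current Envoy AI Gateway release before relying on it; the `qwen3-fallback` model alias is a placeholder.

```yaml
# Hypothetical rule fragment (not in the repo): primary backend at priority 0,
# fallback backend at priority 1, reusing the backends from the example below.
- matches:
    - headers:
        - type: Exact
          name: x-ai-eg-model
          value: qwen3-fallback            # placeholder model alias
  backendRefs:
    - name: envoy-ai-gateway-nrp-qwen
      priority: 0                          # preferred backend
      modelNameOverride: Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
    - name: envoy-ai-gateway-sdsc-nairr-qwen3
      priority: 1                          # used only when priority 0 is unavailable
      modelNameOverride: Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
  timeouts:
    request: 1200s
```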
If a model is removed, delete its entry under `rules:` and update the AIGatewayRoute with `kubectl apply -f <file>`. If all models under `rules:` have been removed, delete the AIGatewayRoute resource manually.
Example (under https://gitlab.nrp-nautilus.io/prp/llm-proxy/-/tree/main/models-config/gatewayroute):
```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: envoy-ai-gateway-nrp-qwen
  namespace: nrp-llm
spec:
  llmRequestCosts:
    - metadataKey: llm_input_token
      type: InputToken    # Counts tokens in the request
    - metadataKey: llm_output_token
      type: OutputToken   # Counts tokens in the response
    - metadataKey: llm_total_token
      type: TotalToken    # Tracks combined usage
  parentRefs:
    - name: envoy-ai-gateway-nrp
      kind: Gateway
      group: gateway.networking.k8s.io
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: qwen3
      backendRefs:
        - name: envoy-ai-gateway-nrp-qwen
          modelNameOverride: Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
      timeouts:
        request: 1200s
      modelsOwnedBy: "NRP"
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: qwen3-nairr
      backendRefs:
        - name: envoy-ai-gateway-sdsc-nairr-qwen3
          modelNameOverride: Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
      timeouts:
        request: 1200s
      modelsOwnedBy: "SDSC"
    # Multiple backendRefs do round-robin
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: qwen3-combined
      backendRefs:
        - name: envoy-ai-gateway-nrp-qwen
          modelNameOverride: Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
        - name: envoy-ai-gateway-sdsc-nairr-qwen3
          modelNameOverride: Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
      timeouts:
        request: 1200s
      modelsOwnedBy: "NRP"
```
Start defining the AIServiceBackend next.
AIServiceBackend
Add your AIServiceBackend to one of the files in https://gitlab.nrp-nautilus.io/prp/llm-proxy/-/tree/main/models-config/servicebackend.
Make sure to delete the AIServiceBackend resource manually if a model is removed.
Example (under https://gitlab.nrp-nautilus.io/prp/llm-proxy/-/tree/main/models-config/servicebackend):
```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIServiceBackend
metadata:
  name: envoy-ai-gateway-nrp-qwen
  namespace: nrp-llm
spec:
  schema:
    name: OpenAI
  backendRef:
    name: envoy-ai-gateway-nrp-qwen
    kind: Backend
    group: gateway.envoyproxy.io
```
Continue by defining the Backend.
Backend
Add your Backend to one of the files in https://gitlab.nrp-nautilus.io/prp/llm-proxy/-/tree/main/models-config/backend.
You can point it at a hostname (either a service inside the cluster or an external FQDN) or at an IP address.
Make sure to delete the Backend resource manually if a model is removed.
Example (under https://gitlab.nrp-nautilus.io/prp/llm-proxy/-/tree/main/models-config/backend):
```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: Backend
metadata:
  name: envoy-ai-gateway-nrp-qwen
  namespace: nrp-llm
spec:
  endpoints:
    - fqdn:
        hostname: qwen-vllm-inference.nrp-llm.svc.cluster.local
        port: 5000
```
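If the upstream is only reachable by IP address, the endpoint can use `ip` instead of `fqdn`. A minimal sketch, with a placeholder name, address, and port:

```yaml
# Hypothetical Backend with an IP endpoint; name, address, and port are placeholders.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: Backend
metadata:
  name: envoy-ai-gateway-example-ip
  namespace: nrp-llm
spec:
  endpoints:
    - ip:
        address: 192.0.2.10   # placeholder (TEST-NET) address
        port: 8000
```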
BackendSecurityPolicy
If your model needs a newly added API access key, you can add a BackendSecurityPolicy to https://gitlab.nrp-nautilus.io/prp/llm-proxy/-/blob/main/models-config/securitypolicy.yaml. It points to an existing secret in the cluster containing your API key.
It's easier to reuse one of the existing keys and simply add your backend to the `targetRefs` list in one of the existing BackendSecurityPolicies. The BackendSecurityPolicy should target an existing AIServiceBackend.
If a model is removed, delete its entry under `targetRefs:` and update the BackendSecurityPolicy with `kubectl apply -f <file>`. If all models under `targetRefs:` have been removed, delete the BackendSecurityPolicy resource manually.
Example (under https://gitlab.nrp-nautilus.io/prp/llm-proxy/-/blob/main/models-config/securitypolicy.yaml):
```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: BackendSecurityPolicy
metadata:
  name: envoy-ai-gateway-nrp-apikey
  namespace: nrp-llm
spec:
  type: APIKey
  apiKey:
    secretRef:
      name: openai-apikey
      namespace: nrp-llm
  targetRefs:
    - name: envoy-ai-gateway-nrp-qwen
      kind: AIServiceBackend
      group: aigateway.envoyproxy.io
```
Chatbox Template
Finally, update the Chatbox Config Template.
vLLM/SGLang Instructions
How to load models into GPUs
- Read the individual instructions for each model carefully, and also check the deployment configurations of other models. The recommended number of CPU cores to request is `2 * [number of GPUs] + 2`, and the recommended RAM size is `request: [slightly over half of total loaded model size]`, `limit: [slightly over total loaded model size]` (see the sketch after this list).
- Do not use `--enforce-eager` unless absolutely necessary, as it quarters the token throughput in many cases; CUDA graphs bring a substantial performance benefit. Use it only if the other efforts below have failed to achieve the designed context length of the model. An alternative is `--compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}"`, which retains some CUDA graph capability while conserving VRAM, but the savings are likely only about half to one GB and there is a visible token throughput loss.
- Tune `--gpu-memory-utilization` so that the KV cache capacity (in tokens) is ideally larger than `--max-model-len`, while leaving enough space for CUDA graphs to be built. CUDA graph memory is typically allocated outside `--gpu-memory-utilization`, so if the CUDA graph build stage fails with an out-of-memory error, lower `--gpu-memory-utilization`. Some multimodal models may also consume additional memory outside `--gpu-memory-utilization`. If there is an error about insufficient KV cache, raise `--gpu-memory-utilization`; if both errors occur, consult the next step.
- Tune `--max-num-seqs` first, before modifying the maximum context length (`--max-model-len`). The first priority is always to achieve the designed context length of the model, so tune other parameters before changing `--max-model-len`. `--mm-encoder-tp-mode` may also be relevant for some multimodal models.
- Test that the model works when a large part of the KV cache has been filled through prompts. Some models have volatile VRAM consumption at runtime and may hit out-of-memory errors even after successful initialization.
- Update the Envoy proxy configuration above, and the Chatbox config template.
- Check whether there are methods that improve throughput or increase KV cache capacity, such as multi-token prediction (also called speculative decoding), context parallelism (`--decode-context-parallel`), or data-parallel attention with expert parallelism (`--enable-expert-parallel` with `--data-parallel-size` in vLLM, or `--expert-parallel-size` and `--enable-dp-attention` in SGLang). More explanations are below.
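Putting the sizing guidance above together, here is a hedged sketch of a vLLM container spec for a roughly 235 GB FP8 model on 8 GPUs; the image, context length, memory figures, and port are assumptions for illustration, not a configuration from the repo:

```yaml
# Hypothetical Deployment container fragment; all numbers are illustrative.
containers:
  - name: vllm
    image: vllm/vllm-openai:latest          # assumed image serving the OpenAI-compatible API
    args:
      - --model=Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
      - --tensor-parallel-size=8
      - --max-model-len=262144              # aim for the model's designed context length
      - --max-num-seqs=64                   # tune this before touching --max-model-len
      - --gpu-memory-utilization=0.90       # leave headroom for CUDA graph capture
    ports:
      - containerPort: 8000
    resources:
      requests:
        cpu: "18"                           # 2 * 8 GPUs + 2
        memory: 130Gi                       # slightly over half the loaded model size
        nvidia.com/gpu: "8"
      limits:
        memory: 250Gi                       # slightly over the total loaded model size
        nvidia.com/gpu: "8"
```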
GPU Count Issue: Error with hidden/intermediate/block/… size division / Worker failed with error ‘Invalid thread config’
Note: an error about layer-count divisibility or an invalid configuration does not necessarily mean the GPU itself is incompatible. More often it is about the number of GPUs, because tensor parallelism requires the hidden/intermediate/block/… sizes to divide evenly across GPUs.
This can be caused by the GPU count being too large (like 8), too small (like 2), or unconventional (like 6) for tensor parallelism. It all depends on the model architecture, and may be resolved by the instructions below.
GPU Parallelism
Tensor Parallelism: Tensor parallelism (`--tensor-parallel-size`) is the default way to load a model onto multiple GPUs within a single node that has a high-performance GPU interconnect or P2P (e.g. NVLink/XGMI). Some issues can be solved by adding expert parallelism (`--enable-expert-parallel`) on top of `--tensor-parallel-size`, but there is no need to specify it if the model works without it. On a single node with a high-performance interconnect (NVIDIA or AMD alike), expert parallelism generally does not improve performance over plain tensor parallelism (with one exception, see data parallelism attention), but it may improve performance when used across multiple nodes. Note that under tensor parallelism the KV cache can be duplicated onto every GPU in certain conditions: when num_key_value_heads in config.json is smaller than `--tensor-parallel-size`, or when multi-head latent attention (MLA) is used. This wastes VRAM unless context parallelism is used (read the next part).
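For example (an illustrative sketch, not a configuration from the repo), a model with num_key_value_heads: 8 in its config.json served on 4 NVLink-connected GPUs needs nothing beyond tensor parallelism:

```yaml
# Hypothetical vLLM container args: plain tensor parallelism on 4 interconnected GPUs.
# num_key_value_heads (8) >= tensor-parallel-size (4) and no MLA, so the KV cache
# is sharded across the GPUs naturally and no context parallelism is needed.
args:
  - --tensor-parallel-size=4
```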
Context Parallelism: Context parallelism divides (shards) the KV cache when it would otherwise be duplicated across GPUs. It is not needed under tensor parallelism when multi-head latent attention (MLA) is NOT used and num_key_value_heads in config.json is larger than or equal to `--tensor-parallel-size`, because the KV heads are then distributed across the GPUs. However, when num_key_value_heads is smaller than `--tensor-parallel-size`, or when MLA is used, you should use decode context parallelism (`--decode-context-parallel`) so that num_key_value_heads (or 1 when MLA is used) multiplied by `--decode-context-parallel` equals `--tensor-parallel-size`.
NOTE: While many LLMs have 16 to 128 num_key_value_heads (though recent models such as Qwen3 or GLM-4.6 have only 4 or 8), context parallelism should definitely be considered for models using multi-head latent attention (MLA), where the KV cache is compressed into a lower-dimensional latent space and the effective number of KV heads drops to 1 (especially the DeepSeek-V3/R1 series and derived models such as Kimi-K2). A large num_key_value_heads therefore does not guarantee that there won't be duplicated KV cache: any model using MLA should be treated as if num_key_value_heads were 1. The use_mla property in vllm/config/model.py determines which models use MLA.
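As an illustration (values assumed, not from the repo), an MLA model served with `--tensor-parallel-size 8` would otherwise keep a full copy of the KV cache on every GPU; sharding it so that num_key_value_heads (1 for MLA) multiplied by `--decode-context-parallel` equals `--tensor-parallel-size` could look like:

```yaml
# Hypothetical vLLM container args for an MLA model on 8 GPUs:
# 1 (effective KV head) x 8 (decode context parallel) = 8 (tensor parallel).
args:
  - --tensor-parallel-size=8
  - --decode-context-parallel=8
# For a non-MLA model with num_key_value_heads: 4 on the same 8 GPUs,
# --decode-context-parallel=2 would satisfy 4 x 2 = 8.
```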
Pipeline Parallelism: Alternatively, combining pipeline parallelism (`--pipeline-parallel-size`) with `--tensor-parallel-size` can be an acceptable solution for unconventional GPU counts or when the errors mentioned above occur (for example, 6 GPUs with `--tensor-parallel-size 2 --pipeline-parallel-size 3`, or 8 GPUs with `--tensor-parallel-size 2 --pipeline-parallel-size 4` or `--tensor-parallel-size 4 --pipeline-parallel-size 2`), because it tolerates uneven layer splits; however, it has some VRAM overhead and may give suboptimal performance. On nodes whose GPUs lack a high-performance interconnect (e.g. no NVLink/XGMI), pipeline parallelism may actually perform better than tensor parallelism. As with tensor parallelism, expert parallelism (`--enable-expert-parallel`) may be added if the configuration does not work without it, but try without it first. Pipeline parallelism is the recommended way to scale a model across multiple nodes when the weights and KV cache do not fit on one node; unless that is required, prefer tensor parallelism within a single node with a high-performance interconnect.
Data Parallelism: Data parallelism (`--data-parallel-size`) in its original design has the most VRAM overhead, because it copies the same weights redundantly to different GPUs instead of spreading them across GPUs; in the normal case each GPU still has to hold all of the layers. Data parallelism typically improves throughput when there are GPUs or nodes to spare, and it can be combined with tensor or pipeline parallelism for large models that do not fit on one node. In essence, though, data parallelism does not reduce VRAM consumption; it only increases throughput under the same conditions.
Data Parallelism Attention (with Expert Parallelism): However, expert parallelism (`--enable-expert-parallel`) makes data parallelism relevant for distributing a model across multiple GPUs, because it lets different GPUs load different expert layers instead of every GPU loading the same layers. For Mixture of Experts (MoE) models, data-parallel attention (`--enable-expert-parallel` with `--data-parallel-size` in vLLM, or `--enable-dp-attention` with `--expert-parallel-size` in SGLang) may reduce KV cache consumption compared to the options above. However, because data parallelism duplicates the attention layers across GPUs, more VRAM may be consumed, possibly defeating the goal of reducing KV cache usage. External load balancing through a router such as Envoy, combined with data parallelism and expert parallelism, is a good way to balance performance gains against KV cache consumption via inter-instance coordination over RPC.
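As a sketch of the vLLM variant (the GPU count and split are assumptions for illustration), 2-way data-parallel attention combined with 4-way tensor parallelism and expert parallelism across 8 GPUs could look like:

```yaml
# Hypothetical vLLM container args for an MoE model on 8 GPUs:
# attention runs as 2 data-parallel replicas of 4-way tensor parallelism,
# while the expert layers are sharded across all 8 GPUs via expert parallelism.
args:
  - --tensor-parallel-size=4
  - --data-parallel-size=2
  - --enable-expert-parallel
```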
Further reading: https://rocm.docs.amd.com/en/docs-7.1.0/how-to/rocm-for-ai/inference-optimization/vllm-optimization.html, https://docs.vllm.ai/en/latest/configuration/engine_args/, https://docs.vllm.ai/en/latest/serving/parallelism_scaling/, https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/, https://docs.vllm.ai/en/latest/serving/context_parallel_deployment/, https://github.com/vllm-project/vllm/issues/10142, https://github.com/vllm-project/vllm/issues/22821, https://github.com/vllm-project/vllm/issues/28278, https://github.com/vllm-project/vllm/issues/5951, https://github.com/vllm-project/vllm/issues/17569, https://github.com/vllm-project/vllm/issues/4232
