Skip to main content

Cilium Performance Test

This document describes how to run network performance tests on cilium installed on a TKE cluster, and presents the measured results for each recommended installation scheme.

Cilium provides the official cilium connectivity perf performance test tool. It uses netperf to deploy Pods in the cluster and run TCP_RR (request-response latency), TCP_STREAM (throughput), and other tests, covering same-node / cross-node × Pod network / Host network — a total of four network combinations.

Test Methods

One-Click Script

The one-click install script cilium.sh provides a perf subcommand that wraps cilium connectivity perf:

bash -c "$(curl -sfL https://raw.githubusercontent.com/imroc/tke-guide/main/static/scripts/cilium.sh)" -- perf

If the GitHub URL is not accessible from your network, use the site mirror:

bash -c "$(curl -sfL https://imroc.cc/tke/scripts/cilium.sh)" -- perf
About Concurrent Stream Count

For instance types like SA5 that support burst bandwidth, the default 4 streams may not trigger bursting. We recommend using 8 streams:

# Using cilium CLI directly
cilium connectivity perf --streams 8

# Or through the one-click script
bash -c "$(curl -sfL ...)" -- perf --streams 8

8 streams have been verified to stably saturate the burst bandwidth on all SA5 specifications, giving a more accurate throughput ceiling.

Compared to running cilium connectivity perf directly, the script does the following:

  • Replaces images: The netperf image is replaced with a mirror accessible from the TKE internal network (quay.tencentcloudcr.com/cilium/network-perf), so nodes can pull the image without public network access
  • Automatically cleans up previous resources: Cleans up the cilium-test-* namespace left from previous runs beforehand. cilium connectivity perf runs kubectl delete ns cilium-test-1 at startup, but TKE's gatekeeper prevents namespace deletion when it still contains Pods, so without pre-cleaning the script would hang (see FAQ)
  • Duration measurement: Prints total elapsed time after the test completes

Manual Testing

First install the cilium CLI:

cilium connectivity perf \
--streams 8 \
--performance-image quay.tencentcloudcr.com/cilium/network-perf:3.20-1772622563-6fd6a90

Common cilium connectivity perf parameters:

  • --duration 10s: Each RR/STREAM test runs for 10 seconds
  • --samples 1: Run each test once (can be increased for averaging)
  • --streams: TCP_STREAM_MULTI concurrent stream count (default 4, recommended 8)
  • --rr / --throughput / --throughput-multi: Test TCP_RR, TCP_STREAM, TCP_STREAM_MULTI by default
  • --pod-net / --host-net / --other-node / --same-node: All enabled by default (covers Pod network + Host network + same/cross-node, 4 combinations)
  • Add --udp for UDP testing, --crr for TCP_CRR (reconnects on each connection), --bandwidth for bandwidth rate limiting

See cilium connectivity perf --help for more parameters.

Test Mode Description

Test TypeDescriptionWhat It Measures
TCP_RRTCP Request-Response, repeatedly sends small requests and waits for responsesLatency (µs, lower is better); OP/s = transactions per second
TCP_STREAMSingle TCP stream continuous sendSingle-stream throughput (Mb/s, higher is better)
TCP_STREAM_MULTIMultiple concurrent TCP streams (adjust with --streams)Multi-stream concurrent throughput (Mb/s)

Network Combination Description

ScenarioNodeDescriptionData Path
pod-to-podsame-nodeclient Pod → same-node server Podclient veth → cilium ebpf → server veth
pod-to-podother-nodeclient Pod → cross-node server Podclient veth → cilium ebpf → NIC out → underlay → peer NIC → cilium ebpf → server veth
host-to-hostsame-nodeclient (hostNetwork) → same-node server (hostNetwork)host stack → host stack (does not go through cilium veth path)
host-to-hostother-nodeclient (hostNetwork) → cross-node server (hostNetwork)host stack → NIC → underlay → peer NIC → host stack
Notes on Interpreting Results

Performance data strongly depends on the instance type, VPC bandwidth, kernel version, and other concurrent workloads. The values provided in this document are measured on idle, newly created clusters and are intended only as a reference for comparing different cilium installation schemes — they should not be used as production performance baselines.

The TCP_STREAM test in cilium connectivity perf (based on netperf) uses relatively large default buffers and limited per-connection PPS, making it difficult to trigger the burst bandwidth mechanism specific to Tencent Cloud instance types. In multi-stream scenarios (--streams 8), cross-node throughput generally reflects the instance's baseline bandwidth, but the burst ceiling requires tools like iperf3 for accurate measurement. This document focuses on relative differences between modes rather than absolute values.

Test Environment

ItemValue
Kubernetes Versionv1.34.1 (containerd 1.7.28)
Cilium Versionv1.19.5
Cilium CLI Versionv0.19.4
Node OSTencentOS Server 4 (kernel 6.6.117-45.7.3.tl4.x86_64)
Node Count3 nodes
InstallationOne-click install script cilium.sh install
perf parameters--streams 8 (8 concurrent streams)
Native ModeLegacy Host Routing (see Native Routing Details)
Overlay ModeBPF Host Routing (see Native Routing Details)

Test Instance Types

SpecvCPUMemoryBaseline/Burst BandwidthQueue CountRationale
SA5.LARGE8 (4C)48G1.5 / 10 Gbps4Most common entry-level spec on TKE
SA5.2XLARGE16 (8C)816G3 / 10 Gbps8Common upgrade spec, double queue count

Test Results Overview

TCP_RR (Request-Response Latency, µs)

SpecModeScenarioNodeMeanP50P90P99OP/s
4COverlaypod-to-podsame-node31.2731344431736
4CNativepod-to-podsame-node37.6237415326409
4COverlaypod-to-podother-node94.299410011610576
4CNativepod-to-podother-node112.831131191388843
8COverlaypod-to-podsame-node31.9432344331092
8CNativepod-to-podsame-node38.0337415126135
8COverlaypod-to-podother-node106.211051141279394
8CNativepod-to-podother-node94.13939910910598

TCP_STREAM / TCP_STREAM_MULTI (Throughput, Mb/s)

SpecModeScenarioNodeSingle StreamMulti-stream (8 concurrent)
4COverlaypod-to-podsame-node22,99775,623
4CNativepod-to-podsame-node29,32964,128
4COverlaypod-to-podother-node11,11611,721
4CNativepod-to-podother-node10,76711,296
8COverlaypod-to-podsame-node25,53794,666
8CNativepod-to-podsame-node21,41088,831
8COverlaypod-to-podother-node11,11311,148
8CNativepod-to-podother-node10,76810,776

Same-node multi-stream throughput far exceeds the NIC bandwidth limit because data is transmitted over the loopback device (not through the physical NIC), and is limited by CPU and kernel stack performance.

Comparative Analysis

Key Metrics Comparison

MetricOverlay vs Native (4C)Overlay vs Native (8C)Trend Consistency
Same-node pod-to-pod TCP_RR MeanOverlay 17% fasterOverlay 16% faster✅ Consistent
Same-node pod-to-pod TCP_RR P99Overlay 17% fasterOverlay 16% faster✅ Consistent
Same-node multi-stream throughputOverlay 15% higherOverlay 7% higher✅ Consistent
Cross-node multi-stream throughputNearly identical (~11.5 Gbps)Nearly identical (~11 Gbps)✅ Consistent
Cross-node pod-to-pod TCP_RRHigh variance, inconsistentHigh variance, inconsistent❌ See below

Key Findings

Same-Node Latency: Overlay Has a Clear and Stable Advantage

Overlay's BPF host routing performs endpoint lookup + redirect at the cilium_host device ingress, skipping the full netfilter / conntrack / FIB overhead that Legacy host routing must go through in Native mode. This advantage is not affected by vCPU count or VPC topology. Run-to-run deviation across multiple tests is < 1%, making this conclusion reliable.

MetricOverlayNativeDifference
Same-node RR Mean31.27µs37.62µsOverlay 17% faster
Same-node RR P9944µs53µsOverlay 17% faster
Same-node multi-stream throughput75.6 Gbps64.1 GbpsOverlay 15% higher

Same-node multi-stream throughput (loopback) far exceeds the physical NIC limit and measures CPU/kernel stack processing capability. Overlay is consistently 7-15% higher, but both represent "local data copying" and cannot be used as cross-node performance references.

Cross-Node Latency: High Variance, No Stable Conclusion

Cross-node latency is heavily affected by the VPC physical topology (switch hops / physical distance between nodes). The same pair of nodes may have different latency baselines across different clusters and VPC subnets. Across three test runs, the trend varied (sometimes Overlay was faster, sometimes Native), indicating the difference is within statistical noise.

Practical conclusion: Cross-node latency is not a determining factor for choosing between modes — the difference (~10-20µs) is far smaller than application-layer latency variance and is imperceptible in production.

Cross-Node Multi-Stream Throughput: No Difference

Both instance types show stable multi-stream throughput around ~11 Gbps, with no substantial difference between Native and Overlay. This value is close to the actual VPC bandwidth ceiling for SA5 instances (SA5 burst bandwidth is 10 Gbps), indicating that cilium's data plane is not the bottleneck at this granularity.

About Throughput Stability

Cross-node multi-stream throughput occasionally falls near the baseline bandwidth (~1.7 Gbps) in some runs. This is because SA5's burst bandwidth uses a credit-based mechanism that requires specific conditions to trigger. The run-to-run variance is not a mode difference between Native and Overlay, but rather the netperf test stack's PPS not being high enough to consume burst credits. We recommend running 2-3 times and using the data that best reflects peak performance.

Cross-Node Single-Stream Throughput: Unstable, Not a Comparison Metric

Single-stream throughput typically falls at or near the baseline bandwidth (1.5-3 Gbps), occasionally triggering bursting. No mode difference conclusions can be drawn from this.

Recommendations

ScenarioRecommendationReason
Same-node high-frequency small packets (RPC / KV database / MQ broker)Overlay (VPC-CNI) ⭐BPF host routing provides a stable ~17% latency advantage for same-node small-packet workloads — the most reliable difference conclusion
Require Pod IP consistency with VPC IP (VPC routing / CLB / security groups / CCN)Native Routing (VPC-CNI) ⭐Pod IP direct-to-VPC is Native's core value; cross-node throughput is identical to Overlay
Cross-node high-volume traffic (stream count ≥ 8)No differenceBoth modes saturate the VPC bandwidth ceiling with multi-stream concurrency
Cross-node distributed servicesNo differenceCross-node latency is more affected by VPC topology than by mode difference; the gap (10-20µs) is imperceptible at the application layer
East-west NetworkPolicy / Hubble / KPR / Egress GatewayNo differenceThese are application-layer capabilities of cilium, unrelated to the host routing path
Operational simplicity (no VPC-CNI chaining dependency / no TKE VPC-CNI limitations)Overlay (VPC-CNI) ⭐In Overlay mode, cilium fully manages Pod networking without relying on VPC-CNI's CNI chaining — simpler configuration and more straightforward troubleshooting

Summary

The only reliable difference is same-node latency: Overlay is about 17% faster than Native. Cross-node shows no stable difference. If your workloads primarily involve cross-node communication, performance is not a deciding factor for choosing between modes — both achieve identical cross-node multi-stream throughput, and all core capabilities (NetworkPolicy / Hubble / KPR) are fully available. Choose based on operational preference and environmental conditions.

Further reading: VPC-CNI Native Routing Details provides an in-depth explanation of the two host routing implementations and their enabling conditions.

FAQ

Why clean up the cilium-test-* namespace before running perf?

cilium connectivity perf starts with kubectl delete ns cilium-test-1. However, TKE clusters have the gatekeeper policy baseline.gatekeeper.sh / block-namespace-deletion-rule enabled, which prevents deleting a namespace that still contains Pods:

admission webhook "baseline.gatekeeper.sh" denied the request:
[block-namespace-deletion-rule] The Namespace cilium-test-1 is not allowed
to be deleted. Reason: It is not allowed to delete a namespace when it
includes any pod resource.

If a previous cilium connectivity test run had failed test cases (e.g., the LRP edge case in Native mode), cilium-cli by default retains the test resources (namespace + Deployment + Pod) for troubleshooting — these Pods then block the perf run's namespace deletion step, resulting in:

🔥 [cls-cluster] Deleting connectivity check deployments...
⌛ [cls-cluster] Waiting for namespace cilium-test-1 to disappear
(hangs forever)

cilium.sh perf performs automatic cleanup before the main flow: first delete resources that own Pods (Deployment / DaemonSet / StatefulSet / ReplicaSet / Job / CronJob) → wait for Pods to actually disappear (using --grace-period=0 --force if necessary) → then delete the namespace. This bypasses the gatekeeper restriction and prevents the script from hanging.

If you are running cilium connectivity perf manually and it hangs, run the following cleanup commands manually before retrying:

for ns in cilium-test-1 cilium-test-ccnp1 cilium-test-ccnp2; do
kubectl -n $ns delete deployment,daemonset,statefulset,replicaset,job,cronjob --all --wait=false --ignore-not-found
done
sleep 30 # wait for Pods to actually disappear
for ns in cilium-test-1 cilium-test-ccnp1 cilium-test-ccnp2; do
kubectl delete ns $ns --ignore-not-found
done

For SA5 instances, the queue count = vCPU count (capped at 48). Measured results:

SpecQueue Count--streams=4--streams=8--streams=16
4C4~1.7 Gbps~11.8 Gbps
8C8~1.7 Gbps~11.1 Gbps~3.4 Gbps

8 streams saturate the burst ceiling on both instance types; 16 streams actually decrease throughput (per-stream bandwidth is diluted, PPS insufficient to consume burst credits). Therefore, --streams 8 is recommended as a general-purpose value. If switching to a different instance type, adjust proportionally based on the queue count — the rule is: set to 2x the target instance's queue count (capped at 64), which typically saturates the burst bandwidth.