Commit 2935c0d
docs(perf): Considerations for packetparser on high-core-count systems (#1965)
## Summary

Documents community-reported performance considerations when running the `packetparser` plugin on high-core-count systems (32+ cores) under sustained network load.

## Changes

- **packetparser.md**: Added a performance considerations section with user-reported observations, current implementation details (PERF_EVENT_ARRAY), and mitigation options
- **intro.md**: Added a known limitations section highlighting performance considerations on large nodes
- **architecture.md**: Added a note about data transfer mechanisms and their performance implications
- **performance.md**: New troubleshooting guide with diagnostic steps and mitigation options
- **README.md**: Added a performance warning in the known limitations section

## Key Points

- Acknowledges that these are community user reports, not independently verified by maintainers
- Accurately reflects the current implementation (only perf arrays are supported; no ring buffer support yet)
- References external analysis ([blog post](https://blog.zmalik.dev/p/who-will-observe-the-observability) and [KubeCon talk](https://www.youtube.com/watch?v=J-Zx64mJzVk)) in packetparser.md only
- Provides actionable guidance: use Basic metrics, enable sampling, monitor impact
- Notes that the team is evaluating options for future releases

## Checklist

- [x] I have read the [contributing documentation](https://retina.sh/docs/Contributing/overview).
- [x] I signed and signed-off the commits (`git commit -S -s ...`). See [this documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification) on signing commits.
- [x] I have correctly attributed the author(s) of the code.
- [x] I have tested the changes locally.
- [x] I have followed the project's style guidelines.
- [x] I have updated the documentation, if necessary.
- [x] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed

<img width="2541" height="1280" alt="image" src="https://github.com/user-attachments/assets/e043da80-2be0-4725-bb43-4245e60a7b5f" />
1 parent 5787b99 commit 2935c0d

File tree

6 files changed (+231, −32 lines)


README.md

Lines changed: 4 additions & 0 deletions
@@ -33,6 +33,10 @@ Retina lets you **investigate network issues on-demand** and **continuously moni

See [retina.sh](http://retina.sh) for documentation and examples.

## Known Limitations

⚠️ **Performance on High-Core-Count Systems**: Community users have reported performance issues when using Advanced metrics (with the `packetparser` plugin) on nodes with 32+ CPU cores under high network load. Consider starting with Basic metrics mode on large node types. See [Known Limitations](https://retina.sh/docs/Introduction/intro#known-limitations) for details.

## Capabilities

Retina has two major features:

docs/01-Introduction/01-intro.md

Lines changed: 15 additions & 0 deletions
@@ -94,3 +94,18 @@ Check out our talk from KubeCon 2024 which goes into this topic even further - [

The following are known system requirements for installing Retina:

- Minimum Linux Kernel Version: v5.4.0

## Known Limitations

### Performance Considerations for High-Core-Count Systems

Community users have reported performance issues when using **Advanced metrics with the `packetparser` plugin** on nodes with high CPU core counts (32+ cores) under sustained, high-volume network load.

If you plan to deploy Retina in Advanced mode on large node types with network-intensive workloads, consider the following:

1. **Start with Basic metrics mode**, which does not use `packetparser`
2. **Enable sampling** via `dataSamplingRate` if you need Advanced metrics
3. **Monitor CPU usage and network throughput** after deployment
4. See [`packetparser` performance considerations](../03-Metrics/plugins/Linux/packetparser.md#performance-considerations) for more information

The Retina team is evaluating options to address these reported concerns in future releases.

docs/01-Introduction/02-architecture.md

Lines changed: 2 additions & 0 deletions
@@ -14,6 +14,8 @@ The plugins have a very specific scope by design, and Retina is designed to be e

The plugins are responsible for installing the eBPF programs into the host kernel during startup. These eBPF programs collect metrics from kernel-level events, which are then passed to user space, where they are parsed and converted into a `flow` data structure. Depending on the Control Plane being used, the data is either sent to a Retina Enricher or written to an external channel that is consumed by a Hubble observer - more on this in the [Control Plane](#control-plane) section below. A plugin is not required to use eBPF; it can also use syscalls or other API calls. In either case, plugins implement the same [interface](https://github.com/microsoft/retina/blob/main/pkg/plugin/registry/registry.go).

**Data Transfer Mechanisms:** eBPF programs transfer data from kernel to user space using specialized data structures. The `packetparser` plugin currently uses **perf arrays** (`BPF_MAP_TYPE_PERF_EVENT_ARRAY`), which create per-CPU buffers. Community users have reported performance issues with this approach on high-core-count systems. See [packetparser performance considerations](../03-Metrics/plugins/Linux/packetparser.md#performance-considerations) for details.
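The per-CPU buffer model can be illustrated with a toy Python sketch: one queue per CPU, all drained by a single reader. The buffer counts and event payloads here are invented for illustration and are not Retina's actual values:

```python
from collections import deque

N_CPUS = 4  # illustrative; the community reports concern nodes with 32+ cores

# One queue per CPU mimics the per-CPU buffers of a perf event array.
cpu_buffers = [deque(f"cpu{c}-evt{i}" for i in range(3)) for c in range(N_CPUS)]

def drain_all(buffers):
    """A single reader must visit every per-CPU buffer on each poll cycle,
    so per-cycle work grows with the number of CPUs."""
    drained = []
    for buf in buffers:
        while buf:
            drained.append(buf.popleft())
    return drained

print(len(drain_all(cpu_buffers)))  # 12 events collected from 4 CPUs
```

The point of the sketch is only that the reader's fan-out scales with core count, which is where the reported overhead on large nodes comes from.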
Some examples of existing Retina plugins:

- Drop Reason - measures the number of packets/bytes dropped and the reason and the direction of the drop.

docs/03-Metrics/plugins/Linux/packetparser.md

Lines changed: 33 additions & 0 deletions
@@ -15,6 +15,39 @@ The `packetparser` plugin requires the `CAP_NET_ADMIN` and `CAP_SYS_ADMIN` capab

`packetparser` does not produce Basic metrics. In Advanced mode (refer to [Metric Modes](../../modes/modes.md)), the plugin transforms an eBPF result into an enriched `Flow` by adding Pod information based on IP. It then sends the `Flow` to an external channel, enabling *several modules* to generate Pod-Level metrics.

## Performance Considerations

### Reported Performance Impact on High-Core-Count Systems

Community users have reported performance issues when running the `packetparser` plugin on systems with high CPU core counts (32+ cores) under sustained network load. While these reports have not been independently verified by the Retina maintainers, they are documented here for awareness.

**User-Reported Observations:**

A detailed analysis by a Retina user (see [this blog post](https://blog.zmalik.dev/p/who-will-observe-the-observability) and [KubeCon 2025 talk](https://www.youtube.com/watch?v=J-Zx64mJzVk)) documented performance degradation that scaled non-linearly with CPU core count on nodes running network-intensive, multi-threaded workloads.

**Current Implementation:**

The `packetparser` plugin currently uses **`BPF_MAP_TYPE_PERF_EVENT_ARRAY`** for kernel-to-userspace data transfer. This architecture creates per-CPU buffers that must be polled by a single reader thread. On systems with many CPU cores, this can lead to:

- Increased context-switching overhead
- Memory access patterns that may not scale linearly
- Potential NUMA-related penalties on multi-socket systems
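The per-CPU design also means total buffer memory and reader fan-out grow linearly with core count. A back-of-the-envelope sketch (the 64-page per-CPU buffer size is an assumed figure for illustration, not Retina's configured value):

```python
PAGE_SIZE = 4096    # typical 4 KiB page on x86-64
PAGES_PER_CPU = 64  # assumed per-CPU buffer size, for illustration only

def perf_buffer_memory_kib(n_cpus: int) -> int:
    """Total KiB of per-CPU perf buffers a single reader thread must service."""
    return n_cpus * PAGES_PER_CPU * PAGE_SIZE // 1024

for cores in (8, 32, 96):
    print(f"{cores:>3} cores -> {perf_buffer_memory_kib(cores)} KiB across {cores} buffers")
```

Under these assumptions, a 96-core node carries 12x the buffer memory (and 12x the buffers to poll) of an 8-core node, which is consistent with the non-linear degradation users have reported at high core counts.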
**Alternative Approaches:**

Alternative data transfer mechanisms such as BPF ring buffers (`BPF_MAP_TYPE_RINGBUF`, available since Linux kernel 5.8) use a shared-buffer architecture that may perform better on high-core-count systems. However, **Retina does not currently support ring buffers for `packetparser`**. Future versions may provide configurable data transfer mechanisms.

#### If You Experience Performance Issues

If you observe performance degradation on high-core-count nodes:

1. **Disable `packetparser`**: Use Basic metrics mode, which does not require this plugin
2. **Enable sampling**: Use the `dataSamplingRate` configuration option (see the [Sampling](#sampling) section)
3. **Use high data aggregation**: Configure the `high` [data aggregation](../../../05-Concepts/data-aggregation.md) level
4. **Monitor impact**: Watch for elevated CPU usage, context switches, or throughput changes

**Note:** The Retina team is evaluating options for addressing reported performance concerns, including potential support for alternative data transfer mechanisms. Community feedback and contributions are welcome.

## Sampling

Since `packetparser` produces many enriched `Flow` objects, it can be quite expensive for user space to process. Thus, when operating at the `high` [data aggregation](../../../05-Concepts/data-aggregation.md) level, optional sampling of reported packets is available via the `dataSamplingRate` configuration option.
Lines changed: 176 additions & 0 deletions
@@ -0,0 +1,176 @@

# Performance Troubleshooting

This guide helps diagnose and address potential performance issues when running Retina, particularly the `packetparser` plugin on high-core-count systems.

## Background

Community users have reported performance issues when running the `packetparser` plugin (used in Advanced metrics mode) on systems with high CPU core counts under sustained network load. For detailed background, see the [`packetparser` performance considerations](../03-Metrics/plugins/Linux/packetparser.md#performance-considerations).

## Symptoms to Monitor

Watch for these indicators after deploying Retina:

- **Decreased network throughput** compared to baseline
- **High CPU usage** by Retina agent pods
- **Elevated context switches** on nodes running Retina
- **Increased latency** in network-intensive applications

## Diagnostic Steps

### Step 1: Identify Your Configuration

Check which plugins are enabled:

```bash
kubectl get configmap retina-config -n kube-system -o yaml | grep enabledPlugin
```

If `packetparser` is enabled, you are running Advanced metrics mode, which is more resource-intensive.

### Step 2: Check Node Specifications

```bash
# Check core count on nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu

# Identify nodes with high core counts (32+)
kubectl get nodes -o json | jq '.items[] | select((.status.capacity.cpu | tonumber) >= 32) | {name: .metadata.name, cpu: .status.capacity.cpu}'
```

### Step 3: Monitor Retina Resource Usage

```bash
# Check CPU and memory usage of Retina pods
kubectl top pods -n kube-system -l app=retina

# For more detailed analysis, check a specific pod on a node
RETINA_POD=$(kubectl get pods -n kube-system -l app=retina -o jsonpath='{.items[0].metadata.name}')
kubectl top pod $RETINA_POD -n kube-system
```

### Step 4: Establish a Performance Baseline

Before and after deploying Retina, measure:

- Network throughput (using your application's metrics or tools like iperf3)
- Application response times
- CPU utilization on nodes
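Comparing the before/after measurements is simple arithmetic; a small Python helper makes the calculation explicit (the throughput figures below are hypothetical):

```python
def degradation_pct(baseline: float, with_retina: float) -> float:
    """Percent of baseline throughput lost after deploying Retina."""
    return round((baseline - with_retina) / baseline * 100, 1)

# Hypothetical iperf3 results in Gbit/s, before and after deployment.
print(degradation_pct(9.4, 8.2))  # 12.8 (% of baseline throughput lost)
```

Track this figure over time and across node sizes; a drop that grows with core count matches the pattern in the community reports.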
## Mitigation Options

If you observe performance impact, consider these approaches:

### Option 1: Use Basic Metrics Mode (Recommended)

Basic metrics mode provides node-level observability without the `packetparser` plugin:

```bash
# Reinstall or upgrade Retina without packetparser
helm upgrade retina oci://ghcr.io/microsoft/retina/charts/retina \
  --set enabledPlugin_linux="\[dropreason\,packetforward\,linuxutil\,dns\]" \
  --reuse-values
```

**Trade-off:** You will get node-level metrics only, not pod-level metrics.

### Option 2: Enable Data Sampling

Reduce event volume by sampling packets:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: retina-config
  namespace: kube-system
data:
  config.yaml: |
    dataSamplingRate: 10 # Sample 1 out of every 10 packets
```

**Trade-off:** Reduced data granularity, but lower overhead.
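Conceptually, a `dataSamplingRate` of N keeps roughly one out of every N events. The modulo-based Python sketch below illustrates the idea only; it is not Retina's actual sampling implementation:

```python
def sample(events, rate):
    """Keep one out of every `rate` events; rate=1 keeps everything."""
    return [e for i, e in enumerate(events) if i % rate == 0]

packets = list(range(100))
print(len(sample(packets, 10)))  # 10: a tenth of the events reach user space
print(len(sample(packets, 1)))   # 100: sampling effectively disabled
```

The userspace processing cost scales with the number of events kept, which is why sampling is an effective lever on busy nodes.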
### Option 3: Use a High Data Aggregation Level

Reduce events at the eBPF level:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: retina-config
  namespace: kube-system
data:
  config.yaml: |
    dataAggregationLevel: "high"
```

**Trade-off:** Disables host interface monitoring; API server latency metrics may be less reliable.

### Option 4: Selective Deployment

Deploy Retina only on nodes where you need detailed observability:

```yaml
# Use node selectors or taints/tolerations
apiVersion: apps/v1
kind: DaemonSet
spec:
  template:
    spec:
      nodeSelector:
        retina-enabled: "true"
```

## Advanced Diagnostics

### Inspecting eBPF Maps

To see which data structures Retina is using:

```bash
# Access the node
kubectl debug node/<node-name> -it --image=ubuntu

# In the debug container, enter the host namespace
chroot /host

# List BPF maps (requires bpftool)
bpftool map list | grep retina

# Check the packetparser map type
bpftool map show name retina_packetparser_events
```

Currently, `packetparser` uses `BPF_MAP_TYPE_PERF_EVENT_ARRAY`.

### Monitoring Event Rates (Advanced)

If bpftrace is available on your nodes:

```bash
# Monitor perf_event activity
sudo bpftrace -e '
kprobe:perf_event_output { @events = count(); }
interval:s:5 { print(@events); clear(@events); }
'
```

High event rates may correlate with increased CPU usage.

## Reporting Issues

If you experience performance issues, please report them with:

1. **Node specifications**: CPU count, memory, kernel version
2. **Retina configuration**: version, enabled plugins, configuration settings
3. **Workload characteristics**: network throughput, number of pods, traffic patterns
4. **Performance metrics**: CPU usage, network throughput before/after, specific observations

Open an issue at: <https://github.com/microsoft/retina/issues>

## Further Resources

- [Packetparser Performance Considerations](../03-Metrics/plugins/Linux/packetparser.md#performance-considerations)
- [Data Aggregation Levels](../05-Concepts/data-aggregation.md)
- [Configuration Options](../02-Installation/03-Config.md)
