diff --git a/docs/en/TOC.md b/docs/en/TOC.md
index 917d7cd2ae9..faee467ea7b 100644
--- a/docs/en/TOC.md
+++ b/docs/en/TOC.md
@@ -37,6 +37,7 @@
 + Advanced
     - [Accelerate Data Access by MEM or SSD](samples/accelerate_data_by_mem_or_ssd.md)
     - [Alluxio Tieredstore Configuration](samples/tieredstore_config.md)
+    - [Alluxio S3 High-Concurrency Read Tuning](samples/alluxio_s3_high_concurrency.md)
     - [Pod Scheduling Optimization](operation/pod_schedule_optimization.md)
     - [Pod Scheduling Base on Runtime Tiered Locality](operation/tiered_locality_schedule.md)
     - [Set FUSE clean policy](samples/fuse_clean_policy.md)
diff --git a/docs/en/samples/alluxio_s3_high_concurrency.md b/docs/en/samples/alluxio_s3_high_concurrency.md
new file mode 100644
index 00000000000..fee27dee9fa
--- /dev/null
+++ b/docs/en/samples/alluxio_s3_high_concurrency.md
@@ -0,0 +1,193 @@
+# Alluxio S3 High-Concurrency Read Tuning
+
+This document provides a tuning profile for high-concurrency read workloads that use AlluxioRuntime with an S3-compatible backend.
+
+This profile was validated while investigating [issue #5802](https://github.com/fluid-cloudnative/fluid/issues/5802), where fio reads over an S3-backed AlluxioRuntime could hang at high concurrency. It does not change Alluxio internals. Users can apply the configuration through `spec.properties` and FUSE args.
+
+## Scenario
+
+The issue was reproduced with an environment close to:
+
+- Kubernetes v1.26.7
+- Fluid v1.0.8 and Fluid master at the time of investigation
+- Alluxio 2.9.5
+- SeaweedFS 3.80 as an S3-compatible backend
+- One Alluxio master, one worker, and FUSE
+- 64 files in S3, each about 5GiB
+
+The fio command was:
+
+```bash
+FILES=$(seq -f "/data/file%g" 0 63 | paste -sd:)
+fio -iodepth=1 -rw=read -ioengine=libaio -bs=256k \
+  -numjobs= -group_reporting -size=5G \
+  --filename="$FILES" -name=read_test --readonly -direct=1 --runtime=60
+```
+
+Observed behavior without this tuning profile:
+
+- `numjobs=8` and `numjobs=16` completed.
+- Higher concurrency, such as `numjobs=32` or `numjobs=64`, could hang.
+- The test Pod could fail to delete normally after the hang.
+- Force deletion could leave fio or FUSE state stuck on the node.
+
+The validation suggests this tuning mainly mitigates Alluxio 2.9.5 FUSE/client read-path pressure under high-concurrency S3 reads. In the reproduced environment, JNI-FUSE could hit path-lock timeout symptoms. When using JNR/libfuse2, S3 thread/client-pool tuning and disabling direct memory IO were also required to make repeated `numjobs=64` stable.
+
+## Recommended Runtime Configuration
+
+Use this profile only for S3 or S3-compatible high-concurrency read workloads. Keep the default behavior for other workloads unless you have validated the same tuning in your own environment.
+
+```yaml
+apiVersion: data.fluid.io/v1alpha1
+kind: AlluxioRuntime
+metadata:
+  name: my-s3
+spec:
+  replicas: 1
+  master:
+    resources:
+      requests:
+        cpu: 8
+        memory: 32Gi
+      limits:
+        cpu: 8
+        memory: 32Gi
+  worker:
+    resources:
+      requests:
+        cpu: 8
+        memory: 32Gi
+      limits:
+        cpu: 8
+        memory: 64Gi
+  fuse:
+    jvmOptions:
+      - "-Xmx16G"
+      - "-Xms16G"
+      - "-XX:+UseG1GC"
+      - "-XX:MaxDirectMemorySize=32g"
+      - "-XX:+UnlockExperimentalVMOptions"
+      - "-XX:ActiveProcessorCount=16"
+    resources:
+      requests:
+        cpu: 16
+        memory: 32Gi
+      limits:
+        cpu: 16
+        memory: 64Gi
+    args:
+      - fuse
+      - --fuse-opts=kernel_cache,rw,allow_other,entry_timeout=60,attr_timeout=60,max_background=256,congestion_threshold=256
+  properties:
+    alluxio.fuse.jnifuse.enabled: "false"
+    alluxio.fuse.jnifuse.libfuse.version: "2"
+    alluxio.underfs.s3.threads.max: "2048"
+    alluxio.user.block.worker.client.pool.max: "8192"
+    alluxio.user.block.size.bytes.default: "64MB"
+    alluxio.user.streaming.reader.chunk.size.bytes: "64MB"
+    alluxio.user.local.reader.chunk.size.bytes: "64MB"
+    alluxio.worker.network.reader.buffer.size: "64MB"
+    alluxio.user.direct.memory.io.enabled: "false"
+  tieredstore:
+    levels:
+      - mediumtype: SSD
+        path: /path/to/ssd/mount
+        quota: 100G
+        high: "0.95"
+        low: "0.6"
+```
+
+Important details:
+
+- Set `alluxio.fuse.jnifuse.enabled=false` and `alluxio.fuse.jnifuse.libfuse.version=2` to use JNR/libfuse2.
+- Remove `max_idle_threads=*` from FUSE args when using libfuse2. `max_idle_threads` is a libfuse3 option.
+- Increase S3 threads and worker client pool size for high-concurrency reads.
+- Use larger read chunks and buffers to reduce request fragmentation.
+- Set `alluxio.user.direct.memory.io.enabled=false`. In the reproduced environment, this was required for repeated `numjobs=64` stability.
+
+## Dataset Example
+
+Store access keys in a Kubernetes Secret instead of hardcoding them in YAML.
+
+```yaml
+apiVersion: data.fluid.io/v1alpha1
+kind: Dataset
+metadata:
+  name: my-s3
+spec:
+  mounts:
+    - mountPoint: s3:////
+      name: s3
+      options:
+        alluxio.underfs.s3.endpoint:
+        alluxio.underfs.s3.endpoint.region:
+      encryptOptions:
+        - name: aws.accessKeyId
+          valueFrom:
+            secretKeyRef:
+              name: mysecret
+              key: aws.accessKeyId
+        - name: aws.secretKey
+          valueFrom:
+            secretKeyRef:
+              name: mysecret
+              key: aws.secretKey
+```
+
+## Test Pod Example
+
+Mount the dataset and run fio from `/data`. Use an image that includes the `fio` binary, or install `fio` in the test container before running the benchmark.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: fio-reader
+spec:
+  restartPolicy: Never
+  containers:
+    - name: client
+      image:
+      securityContext:
+        runAsUser: 0
+      command: ["/bin/bash", "-lc", "sleep infinity"]
+      volumeMounts:
+        - mountPath: /data
+          name: data
+          readOnly: true
+          subPath: s3
+  volumes:
+    - name: data
+      persistentVolumeClaim:
+        claimName: my-s3
+        readOnly: true
+```
+
+## Validation Result
+
+In the validation environment, after applying the above profile through Fluid-generated AlluxioRuntime configuration:
+
+```text
+numjobs=8: passed
+numjobs=16: passed
+numjobs=32: passed
+numjobs=64: passed
+repeat numjobs=64: passed
+test Pod deletion: passed
+Alluxio master/worker/fuse restart count: 0
+```
+
+The following error symptoms were not observed after applying the profile:
+
+- `DeadlineExceededRuntimeException`
+- `Timer expired`
+- `OutOfDirectMemoryError`
+
+`TempBlockMeta not found` warnings could still appear in Alluxio logs, but fio completed successfully, test Pods deleted normally, and Runtime components stayed healthy in the validation environment.
+
+## Risks and Scope
+
+- This is a tuning/configuration profile, not an upstream Alluxio internal fix.
+- The values were validated for the reproduced S3-compatible workload in issue #5802. Different S3 backends, object sizes, network latency, and concurrency levels may still require tuning.
+- Disabling direct memory IO improves stability for this workload, but it may affect performance.
+- If the same symptoms continue after applying this profile, collect FUSE logs, worker logs, node process states, mount information, and kubelet logs before force-deleting Pods.
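The last item under "Risks and Scope" lists what to collect before force-deleting a hung Pod. A minimal sketch of that collection step could look like the following. It is not part of the PR: the namespace, the Pod names (`my-s3-fuse-xxxxx`, `my-s3-worker-0`), and the output directory are hypothetical placeholders, so look up the real names with `kubectl get pods -o wide` first. The script defaults to a dry run that only prints the commands.

```bash
#!/usr/bin/env bash
# Sketch: gather diagnostics before force-deleting a hung Pod.
# All names below are hypothetical placeholders; override them via environment variables.
set -euo pipefail

NAMESPACE="${NAMESPACE:-default}"
FUSE_POD="${FUSE_POD:-my-s3-fuse-xxxxx}"    # assumed name, replace with the real FUSE Pod
WORKER_POD="${WORKER_POD:-my-s3-worker-0}"  # assumed name, replace with the real worker Pod
OUT_DIR="${OUT_DIR:-./alluxio-diag}"
DRY_RUN="${DRY_RUN:-1}"                     # 1 = only print commands, 0 = execute them

# run <output-file> <command...>: print the command; execute it unless DRY_RUN=1.
run() {
  local out="$1"; shift
  echo "+ $* > ${OUT_DIR}/${out}"
  if [ "$DRY_RUN" != "1" ]; then
    "$@" > "${OUT_DIR}/${out}" 2>&1 || echo "  (command failed; see ${OUT_DIR}/${out})"
  fi
}

collect_diagnostics() {
  if [ "$DRY_RUN" != "1" ]; then
    mkdir -p "$OUT_DIR"
  fi
  # Cluster-level: FUSE logs, worker logs, and Pod placement.
  run fuse.log    kubectl -n "$NAMESPACE" logs "$FUSE_POD" --all-containers
  run worker.log  kubectl -n "$NAMESPACE" logs "$WORKER_POD" --all-containers
  run pods.txt    kubectl -n "$NAMESPACE" get pods -o wide
  # Node-level: these only make sense when run on the node hosting the FUSE Pod.
  run processes.txt ps -eo pid,ppid,stat,wchan:32,args
  run mounts.txt    cat /proc/mounts
  run kubelet.log   journalctl -u kubelet --since "1 hour ago" --no-pager
}

collect_diagnostics
```

Running it with `DRY_RUN=0 FUSE_POD=<name> WORKER_POD=<name>` on the affected node would write one file per source into `OUT_DIR`. In the process listing, entries whose `stat` column starts with `D` are in uninterruptible sleep, which is the state the stuck fio or FUSE processes described above would typically show.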
diff --git a/docs/zh/TOC.md b/docs/zh/TOC.md
index fdaa795b0c2..c455b620cdf 100644
--- a/docs/zh/TOC.md
+++ b/docs/zh/TOC.md
@@ -42,6 +42,7 @@
 + 进阶使用
     - [使用内存加速和SSD加速配置](samples/accelerate_data_by_mem_or_ssd.md)
     - [AlluxioRuntime分层存储配置](samples/tieredstore_config.md)
+    - [Alluxio S3 高并发读调优](samples/alluxio_s3_high_concurrency.md)
     - [通过Webhook机制优化Pod调度](operation/pod_schedule_optimization.md)
     - [基于Runtime分层位置信息的应用Pod调度](operation/tiered_locality_schedule.md)
     - [如何开启 FUSE 自动恢复能力](samples/fuse_recover.md)
@@ -84,4 +85,3 @@
     - [如何使用Go客户端创建、删除fluid资源](dev/use_go_create_resource.md)
     - [如何使用其他客户端(非GO语言)](dev/multiple-client-support.md)
     - [通过REST API访问](samples/api_proxy.md)
-
diff --git a/docs/zh/samples/alluxio_s3_high_concurrency.md b/docs/zh/samples/alluxio_s3_high_concurrency.md
new file mode 100644
index 00000000000..8645d58f7d0
--- /dev/null
+++ b/docs/zh/samples/alluxio_s3_high_concurrency.md
@@ -0,0 +1,193 @@
+# Alluxio S3 高并发读调优
+
+本文提供一组面向 AlluxioRuntime + S3 兼容后端高并发读场景的调优配置。
+
+这组配置来自 [issue #5802](https://github.com/fluid-cloudnative/fluid/issues/5802) 的排查:fio 通过 S3 后端的 AlluxioRuntime 读取数据时,在高并发下可能挂住。它不修改 Alluxio 内部实现,用户可以通过 `spec.properties` 和 FUSE args 显式配置。
+
+## 场景
+
+问题在接近以下环境中复现:
+
+- Kubernetes v1.26.7
+- Fluid v1.0.8 和排查时的 Fluid master
+- Alluxio 2.9.5
+- SeaweedFS 3.80 作为 S3 兼容后端
+- 1 个 Alluxio master,1 个 worker,以及 FUSE
+- S3 中 64 个文件,每个约 5GiB
+
+fio 命令如下:
+
+```bash
+FILES=$(seq -f "/data/file%g" 0 63 | paste -sd:)
+fio -iodepth=1 -rw=read -ioengine=libaio -bs=256k \
+  -numjobs= -group_reporting -size=5G \
+  --filename="$FILES" -name=read_test --readonly -direct=1 --runtime=60
+```
+
+未应用这组调优配置时观察到的现象:
+
+- `numjobs=8` 和 `numjobs=16` 能完成。
+- 更高并发,例如 `numjobs=32` 或 `numjobs=64`,可能挂住。
+- 挂住后测试 Pod 可能无法正常删除。
+- 强制删除后,节点上的 fio 或 FUSE 状态仍可能残留卡住。
+
+验证结果表明,这组调优主要缓解 Alluxio 2.9.5 在 S3 高并发读下的 FUSE/client 读路径压力。在复现环境中,JNI-FUSE 可能出现路径锁超时;使用 JNR/libfuse2 时,还需要同时调大 S3 线程和 worker client pool,并关闭 direct memory IO,才能让重复 `numjobs=64` 稳定通过。
+
+## 推荐 Runtime 配置
+
+仅建议在 S3 或 S3 兼容后端的高并发读场景中使用这组配置。其他工作负载建议保持默认行为,除非你已经在自己的环境中验证了同样的调优。
+
+```yaml
+apiVersion: data.fluid.io/v1alpha1
+kind: AlluxioRuntime
+metadata:
+  name: my-s3
+spec:
+  replicas: 1
+  master:
+    resources:
+      requests:
+        cpu: 8
+        memory: 32Gi
+      limits:
+        cpu: 8
+        memory: 32Gi
+  worker:
+    resources:
+      requests:
+        cpu: 8
+        memory: 32Gi
+      limits:
+        cpu: 8
+        memory: 64Gi
+  fuse:
+    jvmOptions:
+      - "-Xmx16G"
+      - "-Xms16G"
+      - "-XX:+UseG1GC"
+      - "-XX:MaxDirectMemorySize=32g"
+      - "-XX:+UnlockExperimentalVMOptions"
+      - "-XX:ActiveProcessorCount=16"
+    resources:
+      requests:
+        cpu: 16
+        memory: 32Gi
+      limits:
+        cpu: 16
+        memory: 64Gi
+    args:
+      - fuse
+      - --fuse-opts=kernel_cache,rw,allow_other,entry_timeout=60,attr_timeout=60,max_background=256,congestion_threshold=256
+  properties:
+    alluxio.fuse.jnifuse.enabled: "false"
+    alluxio.fuse.jnifuse.libfuse.version: "2"
+    alluxio.underfs.s3.threads.max: "2048"
+    alluxio.user.block.worker.client.pool.max: "8192"
+    alluxio.user.block.size.bytes.default: "64MB"
+    alluxio.user.streaming.reader.chunk.size.bytes: "64MB"
+    alluxio.user.local.reader.chunk.size.bytes: "64MB"
+    alluxio.worker.network.reader.buffer.size: "64MB"
+    alluxio.user.direct.memory.io.enabled: "false"
+  tieredstore:
+    levels:
+      - mediumtype: SSD
+        path: /path/to/ssd/mount
+        quota: 100G
+        high: "0.95"
+        low: "0.6"
+```
+
+注意事项:
+
+- 设置 `alluxio.fuse.jnifuse.enabled=false` 和 `alluxio.fuse.jnifuse.libfuse.version=2`,使用 JNR/libfuse2。
+- 使用 libfuse2 时,需要从 FUSE args 中移除 `max_idle_threads=*`。`max_idle_threads` 是 libfuse3 参数。
+- 调大 S3 threads 和 worker client pool,以承载高并发读。
+- 增大 read chunk 和 buffer,减少请求碎片化。
+- 设置 `alluxio.user.direct.memory.io.enabled=false`。在验证环境中,这是让重复 `numjobs=64` 稳定通过的关键。
+
+## Dataset 示例
+
+请使用 Kubernetes Secret 保存访问密钥,不要把 AK/SK 硬编码到 YAML 中。
+
+```yaml
+apiVersion: data.fluid.io/v1alpha1
+kind: Dataset
+metadata:
+  name: my-s3
+spec:
+  mounts:
+    - mountPoint: s3:////
+      name: s3
+      options:
+        alluxio.underfs.s3.endpoint:
+        alluxio.underfs.s3.endpoint.region:
+      encryptOptions:
+        - name: aws.accessKeyId
+          valueFrom:
+            secretKeyRef:
+              name: mysecret
+              key: aws.accessKeyId
+        - name: aws.secretKey
+          valueFrom:
+            secretKeyRef:
+              name: mysecret
+              key: aws.secretKey
+```
+
+## 测试 Pod 示例
+
+挂载 Dataset,并在 `/data` 下运行 fio。请使用包含 `fio` 命令的镜像,或在测试容器中安装 `fio` 后再运行压测。
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: fio-reader
+spec:
+  restartPolicy: Never
+  containers:
+    - name: client
+      image:
+      securityContext:
+        runAsUser: 0
+      command: ["/bin/bash", "-lc", "sleep infinity"]
+      volumeMounts:
+        - mountPath: /data
+          name: data
+          readOnly: true
+          subPath: s3
+  volumes:
+    - name: data
+      persistentVolumeClaim:
+        claimName: my-s3
+        readOnly: true
+```
+
+## 验证结果
+
+在验证环境中,通过 Fluid 生成的 AlluxioRuntime 配置应用上述调优配置后:
+
+```text
+numjobs=8: passed
+numjobs=16: passed
+numjobs=32: passed
+numjobs=64: passed
+repeat numjobs=64: passed
+test Pod deletion: passed
+Alluxio master/worker/fuse restart count: 0
+```
+
+应用调优配置后未再观察到以下错误症状:
+
+- `DeadlineExceededRuntimeException`
+- `Timer expired`
+- `OutOfDirectMemoryError`
+
+Alluxio 日志中仍可能出现 `TempBlockMeta not found` 警告,但在验证环境中,fio 可以成功完成,测试 Pod 可以正常删除,Runtime 组件也保持健康。
+
+## 风险和适用范围
+
+- 这是一个调优配置,不是 Alluxio 内部实现修复。
+- 这些参数已在 issue #5802 的 S3 兼容工作负载中验证。不同 S3 后端、对象大小、网络延迟和并发级别可能仍需要调参。
+- 关闭 direct memory IO 可以提升这个工作负载的稳定性,但可能影响性能。
+- 如果应用这组配置后仍出现同类问题,请在强制删除 Pod 前收集 FUSE 日志、worker 日志、节点进程状态、mount 信息和 kubelet 日志。