Skip to content

<fix>[host]: adaptive EMA-based ping timeout for large-scale clusters [ZSTAC-67534]#3391

Open
MatheMatrix wants to merge 7 commits into5.5.12from
sync/ye.zou/fix/ZSTAC-67534
Open

<fix>[host]: adaptive EMA-based ping timeout for large-scale clusters [ZSTAC-67534]#3391
MatheMatrix wants to merge 7 commits into5.5.12from
sync/ye.zou/fix/ZSTAC-67534

Conversation

@MatheMatrix
Copy link
Owner

Summary

  • 大规模集群(3000+物理机)中,物理机频繁误报失联又连接
  • 根因: ping timeout 使用静态值,大规模环境下 ping 响应时间自然较长导致超时误判
  • 修复: 实现 EMA (指数移动平均) 自适应超时算法,per-host 跟踪 ping 响应时间,timeout = max(configured, EMA * 3.0)

Changes

  • HostTrackImpl.java: 新增 EMA 计算逻辑、per-host 响应时间跟踪、getAdaptiveTimeout() 静态方法
  • KVMHost.java: 两处 asyncJsonPost timeout 参数改为调用 HostTrackImpl.getAdaptiveTimeout()

Test Plan

  • 小规模集群: 超时值 = 配置值 (EMA 未积累足够数据时使用配置值作为 floor)
  • 大规模集群: EMA 自适应增大超时值,消除误报
  • 动态调整: 网络恢复后 EMA 逐渐缩小超时值

Resolves: ZSTAC-67534

sync from gitlab !9239

Resolves: ZSTAC-67534

Change-Id: I337a5bf8efa9cad20e39f947d5c06e944003205c
@coderabbitai
Copy link

coderabbitai bot commented Feb 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

引入主机级 EMA 自适应 ping 超时并为 Agent ping 引入专用的 ping HTTP 连接池与调用路径;新增全局配置、REST 接口/实现改动与连接池参数调整,并添加无 Spring 上下文的连接池与 ThreadLocal 生命周期单元测试。

Changes

Cohort / File(s) Summary
自适应 Ping 超时实现
compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java, compute/src/main/java/org/zstack/compute/host/HostGlobalConfig.java
新增基于 EMA 的 per-host 响应时间存储与计算,添加 public static long getAdaptiveTimeout(String hostUuid)updatePingResponseEma(...),并在成功 ping 路径记录/更新 EMA;新增布尔全局配置 PING_ADAPTIVE_TIMEOUT_ENABLED
KVM 主机调用点
plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java
将原有 ping REST 调用替换为 RESTFacade.asyncJsonPostForPing(...),并用 HostTrackImpl.getAdaptiveTimeout(self.getUuid()) 作为超时参数。
REST 接口与实现(连接池/专用 ping 客户端)
header/src/main/java/org/zstack/header/rest/RESTFacade.java, core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java, core/src/main/java/org/zstack/core/rest/TimeoutRestTemplate.java
在接口层增加 asyncJsonPostForPing(...) 与带池参数的 createRestTemplate(...) 重载;实现层新增 pingAsyncRestTemplateASYNC_TEMPLATE_OVERRIDE ThreadLocal、ping 专用调用路径、连接池统计/转储与定期监控,并在每次 ping 调用前注入专用模板且确保清理。
REST 客户端与连接池相关配置变更
core/src/main/java/org/zstack/core/CoreGlobalProperty.java, core/src/main/java/org/zstack/core/cloudbus/CloudBusImpl3.java
调整 REST 相关默认值(RESTFacade.maxPerRouteRESTFacade.maxTotal)、新增 RESTFacade.connectionRequestTimeout 属性,并在 CloudBus 初始化中显式传入池参数以创建 RestTemplate(包含 maxPerRoute/maxTotal 等)。
单元测试(连接池与 ThreadLocal 行为)
test/src/test/java/org/zstack/test/core/rest/TestConnectionPoolConfig.java, test/src/test/java/org/zstack/test/core/rest/TestPingPoolIsolation.java
新增无 Spring 上下文单元测试:验证 createRestTemplate(...) 的多参数与兼容路径、零池参数退回行为,以及通过反射检查 pingAsyncRestTemplateASYNC_TEMPLATE_OVERRIDE 存在、类型及 ThreadLocal 清理语义。

Sequence Diagram(s)

sequenceDiagram
autonumber
participant KVM as KVMHost
participant REST as RESTFacade
participant Impl as RESTFacadeImpl
participant Pool as HTTP_Ping_Pool
participant Track as HostTrackImpl

KVM->>REST: asyncJsonPostForPing(url, cmd, callback)
REST->>Impl: delegate asyncJsonPostForPing(...)
Impl->>Impl: ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate)
Impl->>Pool: execute HTTP POST via pingAsyncRestTemplate
Pool-->>Impl: HTTP response
Impl->>Impl: ASYNC_TEMPLATE_OVERRIDE.remove()
Impl->>KVM: invoke callback with result
KVM->>Track: record pingStart/pingEnd → responseTime
Track->>Track: updatePingResponseEma(hostUuid, responseTimeSec)
Track->>KVM: getAdaptiveTimeout(hostUuid) (used for future pings)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

诗歌

🐰 我数着每次应答的秒,
EMA 在我爪间悄悄生长,
超时学会了随主机呼吸,
我在连接池边轻跳欢唱,
胡萝卜为稳健干杯 🥕


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
Title check ❌ Error 标题包含必需的格式([host])和 JIRA key(ZSTAC-67534),但长度为 83 字符,超过了 72 字符的限制。 将标题缩短至 72 字符以内,例如:[host]: adaptive EMA ping timeout for large clusters [ZSTAC-67534]
Docstring Coverage ⚠️ Warning Docstring coverage is 16.28% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Description check ✅ Passed 拉取请求描述与变更集高度相关,详细说明了问题根因、解决方案和测试计划,涵盖了代码更改的各个方面。
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch sync/ye.zou/fix/ZSTAC-67534

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java (1)

45-45: pingResponseEma 目前是无界增长结构,存在长期内存膨胀风险

Line 45 使用 ConcurrentHashMap 持久保存每个主机 EMA,但没有淘汰策略;在主机频繁增删/迁移的长期运行环境里会持续累积历史 UUID。

建议修改(引入过期淘汰)
-    private static final ConcurrentHashMap<String, Double> pingResponseEma = new ConcurrentHashMap<>();
+    private static final Cache<String, Double> pingResponseEma = CacheBuilder.newBuilder()
+            .expireAfterAccess(24, TimeUnit.HOURS)
+            .build();

@@
-        Double ema = pingResponseEma.get(hostUuid);
+        Double ema = pingResponseEma.getIfPresent(hostUuid);

@@
-        pingResponseEma.merge(hostUuid, responseTimeSec,
+        pingResponseEma.asMap().merge(hostUuid, responseTimeSec,
                 (oldEma, sample) -> EMA_ALPHA * sample + (1 - EMA_ALPHA) * oldEma);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java` at line 45,
The static pingResponseEma ConcurrentHashMap in HostTrackImpl grows unbounded
and should be converted to a bounded/expiring structure to avoid memory bloat;
replace pingResponseEma with a cache that evicts entries (for example Guava
Cache or Caffeine) configured with a maximumSize and
expireAfterAccess/expireAfterWrite, or implement a
ConcurrentHashMap<Pair<Double, timestamp>> plus a periodic cleanup task that
removes entries older than a threshold; ensure all accesses currently using
pingResponseEma (reads/writes in HostTrackImpl methods) are updated to use the
chosen cache API or consult the timestamp before using/removing entries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Around line 51-59: getAdaptiveTimeout currently flips to EMA-based adaptive
timeout as soon as pingResponseEma contains a value; add a global compatibility
switch (e.g., a new HostGlobalConfig like ENABLE_ADAPTIVE_PING_TIMEOUT
defaulting to false) and update getAdaptiveTimeout to return the historical
configured timeout (HostGlobalConfig.PING_HOST_TIMEOUT.value...) when the switch
is off; when the switch is on keep the existing logic using pingResponseEma and
EMA_SAFETY_FACTOR so teams can gray-release and rollback the adaptive behavior.

---

Nitpick comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Line 45: The static pingResponseEma ConcurrentHashMap in HostTrackImpl grows
unbounded and should be converted to a bounded/expiring structure to avoid
memory bloat; replace pingResponseEma with a cache that evicts entries (for
example Guava Cache or Caffeine) configured with a maximumSize and
expireAfterAccess/expireAfterWrite, or implement a
ConcurrentHashMap<Pair<Double, timestamp>> plus a periodic cleanup task that
removes entries older than a threshold; ensure all accesses currently using
pingResponseEma (reads/writes in HostTrackImpl methods) are updated to use the
chosen cache API or consult the timestamp before using/removing entries.

ℹ️ Review info

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c5e4f49 and 20160c7.

📒 Files selected for processing (2)
  • compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java

Resolves: ZSTAC-67534

Change-Id: I3f5bfa3af90a5ee419a9886e9e119551f81455e2
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java (1)

53-63: 建议增加最小样本门槛后再启用 EMA 超时

Line 58-63 目前只要有一个样本就会使用自适应超时。单次异常慢包会立即放大 timeout,稳定性不足。建议在样本达到阈值后再启用 EMA。

🛠️ 参考改法
@@
     private static final double EMA_ALPHA = 0.2;
     private static final double EMA_SAFETY_FACTOR = 3.0;
+    private static final int EMA_MIN_SAMPLE_COUNT = 3;
     private static final ConcurrentHashMap<String, Double> pingResponseEma = new ConcurrentHashMap<>();
+    private static final ConcurrentHashMap<String, AtomicInteger> pingResponseSampleCount = new ConcurrentHashMap<>();
@@
         Double ema = pingResponseEma.get(hostUuid);
-        if (ema == null) {
+        AtomicInteger sampleCount = pingResponseSampleCount.get(hostUuid);
+        if (ema == null || sampleCount == null || sampleCount.get() < EMA_MIN_SAMPLE_COUNT) {
             return configured;
         }
@@
     private static void updatePingResponseEma(String hostUuid, double responseTimeSec) {
         pingResponseEma.merge(hostUuid, responseTimeSec,
                 (oldEma, sample) -> EMA_ALPHA * sample + (1 - EMA_ALPHA) * oldEma);
+        pingResponseSampleCount.computeIfAbsent(hostUuid, k -> new AtomicInteger(0)).incrementAndGet();
     }

Also applies to: 70-73

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java` around
lines 53 - 63, In getAdaptiveTimeout ensure EMA-based adaptive timeout is only
used after a minimum number of samples: introduce or use a sample-count map
(e.g., pingResponseCount) and a MIN_SAMPLES threshold (constant or config),
check that pingResponseCount.get(hostUuid) >= MIN_SAMPLES before using
pingResponseEma.get(hostUuid) to compute adaptive = ceil(ema *
EMA_SAFETY_FACTOR); otherwise return the configured timeout; apply the same
sample-threshold guard to the other EMA usage block mentioned (lines 70-73) so
transient single slow responses don't immediately inflate timeouts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Around line 42-45: pingResponseEma 是一个静态 ConcurrentHashMap 存放主机的 EMA
数据,但没有清理路径会导致内存泄漏;在 HostTrackImpl 中为主机停止/删除或不再跟踪的路径添加清理逻辑以删除对应 key(例如在
stopTrackHost/handleHostDisconnected/handleHostDeleted 或任何取消跟踪的回调中调用
pingResponseEma.remove(hostUuid)),并且可选地在 HostTrackImpl 中新增一个统一方法 (e.g.
cleanupPingEmaForHost(String hostUuid))
以集中删除/记录;如果需要长期自动回收,可替换或补充为基于时间的回收(TTL)策略或使用弱引用缓存,但首要修复是在主机生命周期结束处显式移除
pingResponseEma 条目以避免无界增长。

---

Nitpick comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Around line 53-63: In getAdaptiveTimeout ensure EMA-based adaptive timeout is
only used after a minimum number of samples: introduce or use a sample-count map
(e.g., pingResponseCount) and a MIN_SAMPLES threshold (constant or config),
check that pingResponseCount.get(hostUuid) >= MIN_SAMPLES before using
pingResponseEma.get(hostUuid) to compute adaptive = ceil(ema *
EMA_SAFETY_FACTOR); otherwise return the configured timeout; apply the same
sample-threshold guard to the other EMA usage block mentioned (lines 70-73) so
transient single slow responses don't immediately inflate timeouts.

ℹ️ Review info

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 20160c7 and c8365ef.

⛔ Files ignored due to path filters (1)
  • conf/globalConfig/host.xml is excluded by !**/*.xml
📒 Files selected for processing (2)
  • compute/src/main/java/org/zstack/compute/host/HostGlobalConfig.java
  • compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java

R2: add asyncJsonPostForPing() using a dedicated AsyncRestTemplate (maxPerRoute=1, maxTotal=3000) to prevent ping starvation under high business traffic. ThreadLocal ASYNC_TEMPLATE_OVERRIDE injects the ping pool transparently without changing asyncJson() signature.

Resolves: ZSTAC-67534

Change-Id: Id3721eb04c291f742421170792ba713a8e446e23
…ation

TestConnectionPoolConfig: verify createRestTemplate(4-param) explicit pool
and backward-compat 2-param variant (R1).
TestPingPoolIsolation: verify ASYNC_TEMPLATE_OVERRIDE ThreadLocal structure
and cleanup lifecycle via reflection (R2).

Resolves: ZSTAC-67534

Change-Id: I9f3a2e6b7c8d1e2f3a4b5c6d7e8f9a0b1c2d3e4
…Status

R5: store PoolingNHttpClientConnectionManager refs for async and ping
pools. Register DumpConnectionPoolStatus debug signal handler to dump
leased/available/pending stats at runtime.

Resolves: ZSTAC-67534

Change-Id: Ib2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (4)
core/src/main/java/org/zstack/core/cloudbus/CloudBusImpl3.java (1)

140-145: 硬编码的魔法值 10 应该提取为常量或配置项。

第 145 行的 10 是一个魔法值,缺乏语义说明。根据编码规范,应当使用常量或配置属性替代硬编码数值,以提升可读性和可维护性。

建议在 CloudBusGlobalProperty 中定义一个配置项(如 HTTP_MAX_CONN_WAIT),或在本类中定义一个常量:

♻️ 建议的修改
+    private static final int HTTP_MAX_CONNECTION_WAIT = 10;
+
     // R1+R3: explicit pool params matching chain queue capacity (CloudBusGlobalProperty.HTTP_MAX_CONN)
     private final static TimeoutRestTemplate http = RESTFacade.createRestTemplate(
             CoreGlobalProperty.REST_FACADE_READ_TIMEOUT,
             CoreGlobalProperty.REST_FACADE_CONNECT_TIMEOUT,
             CloudBusGlobalProperty.HTTP_MAX_CONN,
-            10);
+            HTTP_MAX_CONNECTION_WAIT);

As per coding guidelines, 避免使用魔法值(Magic Value):直接使用未经定义的数值或字符串应替换为枚举或常量。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/cloudbus/CloudBusImpl3.java` around lines
140 - 145, Replace the hardcoded magic value 10 passed into
RESTFacade.createRestTemplate by introducing a named constant or configuration
property; e.g., add a new CloudBusGlobalProperty (e.g., HTTP_MAX_CONN_WAIT) or a
private static final constant in CloudBusImpl3 and use that identifier in the
call to RESTFacade.createRestTemplate where TimeoutRestTemplate http is
initialized, so the meaning of the value is explicit and configurable instead of
the literal 10.
test/src/test/java/org/zstack/test/core/rest/TestConnectionPoolConfig.java (1)

12-12: 测试类命名建议改为后缀式。

Line 12 的 TestConnectionPoolConfig 建议改为 ConnectionPoolConfigTest(或 ConnectionPoolConfigCase),以符合测试类统一命名规则。

As per coding guidelines "测试类需要以 TestCase 结尾".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/src/test/java/org/zstack/test/core/rest/TestConnectionPoolConfig.java`
at line 12, Rename the test class TestConnectionPoolConfig to follow the
project's test naming convention by using a suffix form (e.g.,
ConnectionPoolConfigTest or ConnectionPoolConfigCase); update the class
declaration name and any references/usages (imports, test runners, build/test
configs) that refer to TestConnectionPoolConfig so they point to the new class
name and ensure tests continue to compile and run.
core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java (1)

223-228: ping 连接池参数建议去魔法值并外置配置。

Line 225-228 的 10000/3000/1/3000 是硬编码,后续不同规模集群很难按场景调优。建议提取常量或全局配置项,避免把容量与超时策略固化在代码里。

As per coding guidelines "避免使用魔法值(Magic Value):直接使用未经定义的数值或字符串应替换为常量或枚举".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java` around lines 223
- 228, The ping AsyncRestTemplate is constructed with hard-coded magic values
(10000, 3000, 1, 3000) in RESTFacadeImpl; update the code to pull these
parameters from configurable constants or external configuration and use named
constants to replace the literal numbers. Specifically, change the call that
initializes pingAsyncRestTemplate (the createAsyncRestTemplate invocation) to
read timeouts and pool sizes from a configuration source or defined static final
fields (e.g., PING_READ_TIMEOUT_MS, PING_CONNECT_TIMEOUT_MS, PING_MAX_PER_ROUTE,
PING_MAX_TOTAL) and pass those variables into createAsyncRestTemplate so the
values are not embedded as literals.
test/src/test/java/org/zstack/test/core/rest/TestPingPoolIsolation.java (1)

16-16: 测试类命名不符合约定后缀。

Line 16 的类名 TestPingPoolIsolation 未以 TestCase 结尾,建议重命名为 PingPoolIsolationTest(或 PingPoolIsolationCase)以满足统一约定。

As per coding guidelines "测试类需要以 TestCase 结尾".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/src/test/java/org/zstack/test/core/rest/TestPingPoolIsolation.java` at
line 16, Rename the test class TestPingPoolIsolation to follow the convention
(e.g., PingPoolIsolationTest or PingPoolIsolationCase): update the class
declaration in TestPingPoolIsolation to the chosen name and adjust any
references/imports/usages (test runner, annotations, constructors) accordingly
so the JVM and build tooling find the test; also rename the source file to match
the new class name to keep filename/classname consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/src/main/java/org/zstack/core/CoreGlobalProperty.java`:
- Around line 41-50: The change tightens RESTFacade.readTimeout default to 60000
which breaks backward compatibility; revert the default value in the
`@GlobalProperty` on REST_FACADE_READ_TIMEOUT to the original (pre-change) value
and implement a compatibility flag/property (e.g., add a new `@GlobalProperty`
boolean REST_FACADE_USE_TIGHTER_DEFAULT or REST_FACADE_NEW_INSTALL_ONLY) and
update the code that reads REST_FACADE_READ_TIMEOUT to use the tighter 60000
only when that flag is true (or when a "new install" marker is present), leaving
REST_FACADE_CONNECT_TIMEOUT/REST_FACADE_ECHO_TIMEOUT/REST_FACADE_MAX_PER_ROUTE/REST_FACADE_MAX_TOTAL
behavior unchanged; reference the static fields REST_FACADE_READ_TIMEOUT and the
`@GlobalProperty` annotation on it to locate and apply the conditional logic.

In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java`:
- Around line 615-623: The ThreadLocal override in asyncJsonPostForPing
currently calls ASYNC_TEMPLATE_OVERRIDE.remove() which clears any outer nested
value; change asyncJsonPostForPing to save the previous value (e.g. a local
variable old = ASYNC_TEMPLATE_OVERRIDE.get()), set ASYNC_TEMPLATE_OVERRIDE to
pingAsyncRestTemplate, call asyncJsonPost(url, body, callback) in the try, and
in finally restore the previous value (if old is null call remove() else
ASYNC_TEMPLATE_OVERRIDE.set(old)) so nested callers keep their template; update
references to ASYNC_TEMPLATE_OVERRIDE, pingAsyncRestTemplate, and asyncJsonPost
accordingly.

In `@header/src/main/java/org/zstack/header/rest/RESTFacade.java`:
- Around line 126-131: The createRestTemplate method hardcodes
connectionRequestTimeout to 8000 which bypasses the configurable
RESTFacade.connectionRequestTimeout; add an overloaded createRestTemplate(int
readTimeout, int connectTimeout, int maxTotal, int maxPerRoute, int
connectionRequestTimeout) and update the existing createRestTemplate to call the
new overload (or accept a connectionRequestTimeout parameter) so
TimeoutHttpComponentsClientHttpRequestFactory.setConnectionRequestTimeout(Math.min(connectTimeout
* 2, connectionRequestTimeout)) is used instead of 8000; ensure all callers of
createRestTemplate use the new overload or the shared
RESTFacade.connectionRequestTimeout value.

In `@test/src/test/java/org/zstack/test/core/rest/TestConnectionPoolConfig.java`:
- Around line 15-40: The current tests only assert non-null; enhance them to
verify the actual pool and timeout settings by extracting the internal request
factory and http client from the returned TimeoutRestTemplate (from
RESTFacade.createRestTemplate) and inspecting the underlying Apache components:
cast to HttpComponentsClientHttpRequestFactory to get the HttpClient, downcast
to CloseableHttpClient to obtain the ConnectionManager
(PoolingHttpClientConnectionManager) and assert its getMaxTotal() and
getDefaultMaxPerRoute() match the supplied maxTotal/maxPerRoute, and read the
RequestConfig/connectionRequestTimeout (or the factory’s timeout fields) to
assert connectionRequestTimeout/connectionRequestTimeout behavior; update tests
like test4ParamCreateRestTemplate_returnsNonNull,
testZeroPoolParams_fallsBackToApacheDefaults and
testPingPoolParams_smallPerRoute_largeTotalForCluster to include these
reflective/assertive checks.

In `@test/src/test/java/org/zstack/test/core/rest/TestPingPoolIsolation.java`:
- Around line 45-49: The test method
testAsyncTemplateOverride_isNullOnFreshThread() has mismatched semantics: it
reads the ThreadLocal on the current thread rather than a truly fresh thread;
update the test to either rename the method to reflect current-thread initial
state (e.g., testAsyncTemplateOverride_currentThreadInitialState) or change the
implementation to actually create a new Thread and assert tl.get() is null
there; locate the ThreadLocal via getOverrideTL() and the assertion using
Assert.assertNull and adjust the name or add a new Thread that calls tl.get() so
the test semantics and implementation match.

---

Nitpick comments:
In `@core/src/main/java/org/zstack/core/cloudbus/CloudBusImpl3.java`:
- Around line 140-145: Replace the hardcoded magic value 10 passed into
RESTFacade.createRestTemplate by introducing a named constant or configuration
property; e.g., add a new CloudBusGlobalProperty (e.g., HTTP_MAX_CONN_WAIT) or a
private static final constant in CloudBusImpl3 and use that identifier in the
call to RESTFacade.createRestTemplate where TimeoutRestTemplate http is
initialized, so the meaning of the value is explicit and configurable instead of
the literal 10.

In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java`:
- Around line 223-228: The ping AsyncRestTemplate is constructed with hard-coded
magic values (10000, 3000, 1, 3000) in RESTFacadeImpl; update the code to pull
these parameters from configurable constants or external configuration and use
named constants to replace the literal numbers. Specifically, change the call
that initializes pingAsyncRestTemplate (the createAsyncRestTemplate invocation)
to read timeouts and pool sizes from a configuration source or defined static
final fields (e.g., PING_READ_TIMEOUT_MS, PING_CONNECT_TIMEOUT_MS,
PING_MAX_PER_ROUTE, PING_MAX_TOTAL) and pass those variables into
createAsyncRestTemplate so the values are not embedded as literals.

In `@test/src/test/java/org/zstack/test/core/rest/TestConnectionPoolConfig.java`:
- Line 12: Rename the test class TestConnectionPoolConfig to follow the
project's test naming convention by using a suffix form (e.g.,
ConnectionPoolConfigTest or ConnectionPoolConfigCase); update the class
declaration name and any references/usages (imports, test runners, build/test
configs) that refer to TestConnectionPoolConfig so they point to the new class
name and ensure tests continue to compile and run.

In `@test/src/test/java/org/zstack/test/core/rest/TestPingPoolIsolation.java`:
- Line 16: Rename the test class TestPingPoolIsolation to follow the convention
(e.g., PingPoolIsolationTest or PingPoolIsolationCase): update the class
declaration in TestPingPoolIsolation to the chosen name and adjust any
references/imports/usages (test runner, annotations, constructors) accordingly
so the JVM and build tooling find the test; also rename the source file to match
the new class name to keep filename/classname consistency.

ℹ️ Review info

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c8365ef and c9f2f22.

📒 Files selected for processing (7)
  • core/src/main/java/org/zstack/core/CoreGlobalProperty.java
  • core/src/main/java/org/zstack/core/cloudbus/CloudBusImpl3.java
  • core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java
  • header/src/main/java/org/zstack/header/rest/RESTFacade.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java
  • test/src/test/java/org/zstack/test/core/rest/TestConnectionPoolConfig.java
  • test/src/test/java/org/zstack/test/core/rest/TestPingPoolIsolation.java
🚧 Files skipped from review as they are similar to previous changes (1)
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java

Comment on lines 41 to 50
@GlobalProperty(name = "RESTFacade.readTimeout", defaultValue = "60000")
public static int REST_FACADE_READ_TIMEOUT;
@GlobalProperty(name = "RESTFacade.connectTimeout", defaultValue = "15000")
public static int REST_FACADE_CONNECT_TIMEOUT;
@GlobalProperty(name = "RESTFacade.echoTimeout", defaultValue = "60")
public static int REST_FACADE_ECHO_TIMEOUT;
@GlobalProperty(name = "RESTFacade.maxPerRoute", defaultValue = "2")
@GlobalProperty(name = "RESTFacade.maxPerRoute", defaultValue = "50")
public static int REST_FACADE_MAX_PER_ROUTE;
@GlobalProperty(name = "RESTFacade.maxTotal", defaultValue = "128")
@GlobalProperty(name = "RESTFacade.maxTotal", defaultValue = "2048")
public static int REST_FACADE_MAX_TOTAL;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

默认超时直接收紧有向后兼容风险。

Line 41 将 RESTFacade.readTimeout 默认值改为 60000,会影响所有未显式配置的历史部署,可能让原本可用的慢路径请求在升级后变为超时失败。建议通过兼容开关或“仅新装生效”的策略发布该默认值变更。

As per coding guidelines "向后兼容原则:之前的代码产生的行为不要直接去改动,即使是 bug 也可能用户有依赖,不应直接改动,可以考虑升级用户一套行为、新安装用户一套行为,开关控制等".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/CoreGlobalProperty.java` around lines 41 -
50, The change tightens RESTFacade.readTimeout default to 60000 which breaks
backward compatibility; revert the default value in the `@GlobalProperty` on
REST_FACADE_READ_TIMEOUT to the original (pre-change) value and implement a
compatibility flag/property (e.g., add a new `@GlobalProperty` boolean
REST_FACADE_USE_TIGHTER_DEFAULT or REST_FACADE_NEW_INSTALL_ONLY) and update the
code that reads REST_FACADE_READ_TIMEOUT to use the tighter 60000 only when that
flag is true (or when a "new install" marker is present), leaving
REST_FACADE_CONNECT_TIMEOUT/REST_FACADE_ECHO_TIMEOUT/REST_FACADE_MAX_PER_ROUTE/REST_FACADE_MAX_TOTAL
behavior unchanged; reference the static fields REST_FACADE_READ_TIMEOUT and the
`@GlobalProperty` annotation on it to locate and apply the conditional logic.

Comment on lines +615 to +623
@Override
public void asyncJsonPostForPing(String url, Object body, AsyncRESTCallback callback) {
// R2: inject dedicated ping pool via ThreadLocal; cleared in finally to prevent leaks
ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate);
try {
asyncJsonPost(url, body, callback);
} finally {
ASYNC_TEMPLATE_OVERRIDE.remove();
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

ThreadLocal 覆盖值在嵌套调用下会被误清理。

Line 622 直接 remove() 会在“外层已设置覆盖值、内层再次调用”时把外层状态也清掉。建议保存旧值并在 finally 恢复,避免嵌套调用时模板选择被破坏。

建议修复
 `@Override`
 public void asyncJsonPostForPing(String url, Object body, AsyncRESTCallback callback) {
-    ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate);
+    AsyncRestTemplate previous = ASYNC_TEMPLATE_OVERRIDE.get();
+    ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate);
     try {
         asyncJsonPost(url, body, callback);
     } finally {
-        ASYNC_TEMPLATE_OVERRIDE.remove();
+        if (previous == null) {
+            ASYNC_TEMPLATE_OVERRIDE.remove();
+        } else {
+            ASYNC_TEMPLATE_OVERRIDE.set(previous);
+        }
     }
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@Override
public void asyncJsonPostForPing(String url, Object body, AsyncRESTCallback callback) {
// R2: inject dedicated ping pool via ThreadLocal; cleared in finally to prevent leaks
ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate);
try {
asyncJsonPost(url, body, callback);
} finally {
ASYNC_TEMPLATE_OVERRIDE.remove();
}
`@Override`
public void asyncJsonPostForPing(String url, Object body, AsyncRESTCallback callback) {
AsyncRestTemplate previous = ASYNC_TEMPLATE_OVERRIDE.get();
ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate);
try {
asyncJsonPost(url, body, callback);
} finally {
if (previous == null) {
ASYNC_TEMPLATE_OVERRIDE.remove();
} else {
ASYNC_TEMPLATE_OVERRIDE.set(previous);
}
}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java` around lines 615
- 623, The ThreadLocal override in asyncJsonPostForPing currently calls
ASYNC_TEMPLATE_OVERRIDE.remove() which clears any outer nested value; change
asyncJsonPostForPing to save the previous value (e.g. a local variable old =
ASYNC_TEMPLATE_OVERRIDE.get()), set ASYNC_TEMPLATE_OVERRIDE to
pingAsyncRestTemplate, call asyncJsonPost(url, body, callback) in the try, and
in finally restore the previous value (if old is null call remove() else
ASYNC_TEMPLATE_OVERRIDE.set(old)) so nested callers keep their template; update
references to ASYNC_TEMPLATE_OVERRIDE, pingAsyncRestTemplate, and asyncJsonPost
accordingly.

Comment on lines +126 to 131
static TimeoutRestTemplate createRestTemplate(int readTimeout, int connectTimeout, int maxTotal, int maxPerRoute) {
HttpComponentsClientHttpRequestFactory factory = new TimeoutHttpComponentsClientHttpRequestFactory();
factory.setReadTimeout(readTimeout);
factory.setConnectTimeout(connectTimeout);
factory.setConnectionRequestTimeout(connectTimeout * 2);
factory.setConnectionRequestTimeout(Math.min(connectTimeout * 2, 8000));

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

connectionRequestTimeout 上限被硬编码,和全局配置能力不一致。

Line 130 固定 8000 会让该模板路径无法复用 RESTFacade.connectionRequestTimeout 的可配置能力,最终出现同一组件内超时策略不一致。建议增加带 connectionRequestTimeout 参数的重载,由调用方传入上限。

建议重构方向
 static TimeoutRestTemplate createRestTemplate(int readTimeout, int connectTimeout, int maxTotal, int maxPerRoute) {
-    HttpComponentsClientHttpRequestFactory factory = new TimeoutHttpComponentsClientHttpRequestFactory();
+    return createRestTemplate(readTimeout, connectTimeout, maxTotal, maxPerRoute, 8000);
+}
+
+static TimeoutRestTemplate createRestTemplate(int readTimeout, int connectTimeout, int maxTotal, int maxPerRoute, int connectionRequestTimeout) {
+    HttpComponentsClientHttpRequestFactory factory = new TimeoutHttpComponentsClientHttpRequestFactory();
     factory.setReadTimeout(readTimeout);
     factory.setConnectTimeout(connectTimeout);
-    factory.setConnectionRequestTimeout(Math.min(connectTimeout * 2, 8000));
+    factory.setConnectionRequestTimeout(Math.min(connectTimeout * 2, connectionRequestTimeout));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@header/src/main/java/org/zstack/header/rest/RESTFacade.java` around lines 126
- 131, The createRestTemplate method hardcodes connectionRequestTimeout to 8000
which bypasses the configurable RESTFacade.connectionRequestTimeout; add an
overloaded createRestTemplate(int readTimeout, int connectTimeout, int maxTotal,
int maxPerRoute, int connectionRequestTimeout) and update the existing
createRestTemplate to call the new overload (or accept a
connectionRequestTimeout parameter) so
TimeoutHttpComponentsClientHttpRequestFactory.setConnectionRequestTimeout(Math.min(connectTimeout
* 2, connectionRequestTimeout)) is used instead of 8000; ensure all callers of
createRestTemplate use the new overload or the shared
RESTFacade.connectionRequestTimeout value.

Comment on lines +15 to +40
public void test4ParamCreateRestTemplate_returnsNonNull() {
// R1: explicit pool params must succeed
TimeoutRestTemplate tmpl = RESTFacade.createRestTemplate(60000, 3000, 50, 2);
Assert.assertNotNull("createRestTemplate(4-param) must return non-null template", tmpl);
}

@Test
public void test2ParamCreateRestTemplate_backwardCompat() {
// old callers delegating to 4-param with (0, 0) must still work
TimeoutRestTemplate tmpl = RESTFacade.createRestTemplate(60000, 3000);
Assert.assertNotNull("createRestTemplate(2-param) backward-compat must return non-null", tmpl);
}

@Test
public void testZeroPoolParams_fallsBackToApacheDefaults() {
// maxTotal=0, maxPerRoute=0 → skip explicit pool; SSL path still applies
TimeoutRestTemplate tmpl = RESTFacade.createRestTemplate(60000, 3000, 0, 0);
Assert.assertNotNull("zero-pool fallback must return non-null template", tmpl);
}

@Test
public void testPingPoolParams_smallPerRoute_largeTotalForCluster() {
// P0 ping pool: maxPerRoute=1 (one ping/host), maxTotal=3000 (full cluster)
TimeoutRestTemplate tmpl = RESTFacade.createRestTemplate(10000, 3000, 3000, 1);
Assert.assertNotNull("ping-pool createRestTemplate must return non-null template", tmpl);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

当前断言过弱,无法证明连接池配置真的生效。

Line 15-40 只校验“非空”不足以覆盖本次改动目标(池大小与隔离策略)。建议补充对 maxTotaldefaultMaxPerRouteconnectionRequestTimeout 的可观察断言(可用反射读取 requestFactory/httpClient/connectionManager)。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/src/test/java/org/zstack/test/core/rest/TestConnectionPoolConfig.java`
around lines 15 - 40, The current tests only assert non-null; enhance them to
verify the actual pool and timeout settings by extracting the internal request
factory and http client from the returned TimeoutRestTemplate (from
RESTFacade.createRestTemplate) and inspecting the underlying Apache components:
cast to HttpComponentsClientHttpRequestFactory to get the HttpClient, downcast
to CloseableHttpClient to obtain the ConnectionManager
(PoolingHttpClientConnectionManager) and assert its getMaxTotal() and
getDefaultMaxPerRoute() match the supplied maxTotal/maxPerRoute, and read the
RequestConfig/connectionRequestTimeout (or the factory’s timeout fields) to
assert connectionRequestTimeout/connectionRequestTimeout behavior; update tests
like test4ParamCreateRestTemplate_returnsNonNull,
testZeroPoolParams_fallsBackToApacheDefaults and
testPingPoolParams_smallPerRoute_largeTotalForCluster to include these
reflective/assertive checks.

Comment on lines +45 to +49
public void testAsyncTemplateOverride_isNullOnFreshThread() throws Exception {
ThreadLocal<AsyncRestTemplate> tl = getOverrideTL();
// on a new thread (or after cleanup) the ThreadLocal must be null
Assert.assertNull("ASYNC_TEMPLATE_OVERRIDE must be null on fresh thread — verifies no leak from prior call", tl.get());
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

“fresh thread”测试语义与实现不一致。

Line 45-49 当前只在当前线程读取 ThreadLocal,并未创建新线程;测试名会给出错误信号。建议改名为“currentThreadInitialState”或实际新建线程后断言。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/src/test/java/org/zstack/test/core/rest/TestPingPoolIsolation.java`
around lines 45 - 49, The test method
testAsyncTemplateOverride_isNullOnFreshThread() has mismatched semantics: it
reads the ThreadLocal on the current thread rather than a truly fresh thread;
update the test to either rename the method to reflect current-thread initial
state (e.g., testAsyncTemplateOverride_currentThreadInitialState) or change the
implementation to actually create a new Thread and assert tl.get() is null
there; locate the ThreadLocal via getOverrideTL() and the assertion using
Assert.assertNull and adjust the name or add a new Thread that calls tl.get() so
the test semantics and implementation match.

R5: submit periodic task (every 60s) to check async and ping pool
utilization. Log [POOL-ALARM] WARNING when async pool > 80% or ping
pool has pending requests, enabling ops to detect congestion before
ConnectionPoolTimeoutException occurs.

Resolves: ZSTAC-67534

Change-Id: Ic3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java (1)

681-689: ⚠️ Potential issue | 🟠 Major

恢复 ThreadLocal 旧值,避免嵌套调用误清理外层覆盖模板。

Line 688 直接 ASYNC_TEMPLATE_OVERRIDE.remove() 会在嵌套调用时把外层已设置值一并清掉,后续同线程请求会错误回落到默认 asyncRestTemplate

建议修复
 `@Override`
 public void asyncJsonPostForPing(String url, Object body, AsyncRESTCallback callback) {
-    // R2: inject dedicated ping pool via ThreadLocal; cleared in finally to prevent leaks
-    ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate);
+    AsyncRestTemplate previous = ASYNC_TEMPLATE_OVERRIDE.get();
+    ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate);
     try {
         asyncJsonPost(url, body, callback);
     } finally {
-        ASYNC_TEMPLATE_OVERRIDE.remove();
+        if (previous == null) {
+            ASYNC_TEMPLATE_OVERRIDE.remove();
+        } else {
+            ASYNC_TEMPLATE_OVERRIDE.set(previous);
+        }
     }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java` around lines 681
- 689, 在 asyncJsonPostForPing 中不要直接 remove ThreadLocal;改为在设置
ASYNC_TEMPLATE_OVERRIDE 前保存旧值(ASYNC_TEMPLATE_OVERRIDE.get()),调用 asyncJsonPost 后在
finally 中恢复旧值:若旧值为 null 则 remove,否则重新 set
旧值。引用的标识符:asyncJsonPostForPing、ASYNC_TEMPLATE_OVERRIDE、pingAsyncRestTemplate、asyncJsonPost。
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java`:
- Around line 207-208: The code in RESTFacadeImpl currently hardcodes the ping
pool maxTotal as 3000 in the sb.append call (the string formatting that prints
"ping-pool(P0): maxTotal=...") which violates the magic-value rule and breaks
for >3000 clusters; change this to use a configurable value instead of the
literal 3000: read the pool max from the existing global
concurrency/configuration source or introduce a new configurable constant (e.g.,
PING_POOL_MAX or a method getPingPoolMax()) and replace the literal in the
sb.append formatting and any other locations that write 3000 (also update where
ping pool is created/initialized so the actual pool uses the same configurable
maxTotal rather than a hardcoded 3000). Ensure you update all occurrences (the
sb.append formatting and pool initialization code paths) to reference the same
config variable.

---

Duplicate comments:
In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java`:
- Around line 681-689: 在 asyncJsonPostForPing 中不要直接 remove ThreadLocal;改为在设置
ASYNC_TEMPLATE_OVERRIDE 前保存旧值(ASYNC_TEMPLATE_OVERRIDE.get()),调用 asyncJsonPost 后在
finally 中恢复旧值:若旧值为 null 则 remove,否则重新 set
旧值。引用的标识符:asyncJsonPostForPing、ASYNC_TEMPLATE_OVERRIDE、pingAsyncRestTemplate、asyncJsonPost。

ℹ️ Review info

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9f2f22 and b785677.

📒 Files selected for processing (1)
  • core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
header/src/main/java/org/zstack/header/rest/RESTFacade.java (1)

130-130: ⚠️ Potential issue | 🟠 Major

connectionRequestTimeout 在同步模板路径仍是硬编码。

Line 130 固定为 8000,会绕过新增的 RESTFacade.connectionRequestTimeout 配置,导致模板超时策略不一致。这个问题与之前评审结论一致,建议补充带 connectionRequestTimeout 参数的重载并让调用方传入配置值。

♻️ 建议修复
 static TimeoutRestTemplate createRestTemplate(int readTimeout, int connectTimeout, int maxTotal, int maxPerRoute) {
-    HttpComponentsClientHttpRequestFactory factory = new TimeoutHttpComponentsClientHttpRequestFactory();
+    return createRestTemplate(readTimeout, connectTimeout, maxTotal, maxPerRoute, 8000);
+}
+
+static TimeoutRestTemplate createRestTemplate(int readTimeout, int connectTimeout, int maxTotal, int maxPerRoute, int connectionRequestTimeout) {
+    HttpComponentsClientHttpRequestFactory factory = new TimeoutHttpComponentsClientHttpRequestFactory();
     factory.setReadTimeout(readTimeout);
     factory.setConnectTimeout(connectTimeout);
-    factory.setConnectionRequestTimeout(Math.min(connectTimeout * 2, 8000));
+    factory.setConnectionRequestTimeout(Math.min(connectTimeout * 2, connectionRequestTimeout));
#!/bin/bash
set -euo pipefail

# 1) 定位硬编码 connectionRequestTimeout
rg -n "Math\\.min\\(connectTimeout \\* 2, 8000\\)" --type=java

# 2) 检查 createRestTemplate 重载与调用点
rg -n "createRestTemplate\\(" --type=java
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@header/src/main/java/org/zstack/header/rest/RESTFacade.java` at line 130, The
line in RESTFacade where
factory.setConnectionRequestTimeout(Math.min(connectTimeout * 2, 8000))
hardcodes 8000 and bypasses the new RESTFacade.connectionRequestTimeout config;
add an overload of createRestTemplate that accepts a connectionRequestTimeout
parameter (or use the existing RESTFacade.connectionRequestTimeout value) and
replace calls to createRestTemplate(...) that now rely on the hardcoded path so
they pass RESTFacade.connectionRequestTimeout into the
factory.setConnectionRequestTimeout call; update all createRestTemplate callers
to use the new overload (or pass the instance field) so the computed timeout
uses the configured connectionRequestTimeout instead of 8000.
core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java (1)

685-693: ⚠️ Potential issue | 🟠 Major

ThreadLocal 覆盖值在嵌套调用场景会被误清理。

Line 692 在 finally 里直接 remove(),会把外层已存在的覆盖模板也清掉。应恢复旧值,而不是无条件清理。

🛠 建议修复
 `@Override`
 public void asyncJsonPostForPing(String url, Object body, AsyncRESTCallback callback) {
-    ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate);
+    AsyncRestTemplate previous = ASYNC_TEMPLATE_OVERRIDE.get();
+    ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate);
     try {
         asyncJsonPost(url, body, callback);
     } finally {
-        ASYNC_TEMPLATE_OVERRIDE.remove();
+        if (previous == null) {
+            ASYNC_TEMPLATE_OVERRIDE.remove();
+        } else {
+            ASYNC_TEMPLATE_OVERRIDE.set(previous);
+        }
     }
 }
#!/bin/bash
set -euo pipefail

# 输出 asyncJsonPostForPing 方法片段,确认 finally 是否无条件 remove()
python - <<'PY'
from pathlib import Path
p = Path("core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java")
lines = p.read_text().splitlines()
start = next(i for i,l in enumerate(lines, 1) if "public void asyncJsonPostForPing" in l)
for i in range(start, min(start + 24, len(lines) + 1)):
    print(f"{i}:{lines[i-1]}")
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java` around lines 685
- 693, The finally block in asyncJsonPostForPing currently calls
ASYNC_TEMPLATE_OVERRIDE.remove(), which will clear any previously set override
in nested calls; instead, capture the previous value (e.g., store
ASYNC_TEMPLATE_OVERRIDE.get() into a local variable) before setting
ASYNC_TEMPLATE_OVERRIDE to pingAsyncRestTemplate, and in the finally restore the
previous value (set the ThreadLocal back to the saved value or remove only if
the saved value was null). Update asyncJsonPostForPing to save/restore the old
value rather than unconditionally removing it; references:
ASYNC_TEMPLATE_OVERRIDE, pingAsyncRestTemplate, asyncJsonPost,
asyncJsonPostForPing.
🧹 Nitpick comments (1)
core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java (1)

257-262: Ping 池初始化中仍有多个裸字面量,建议提取常量统一管理。

Line 258 到 Line 262 的 10000/3000/2 建议与 PING_POOL_MAX_TOTAL 一样提取为具名常量(或配置项),降低后续调参和排障成本。

♻️ 建议重构
+    private static final int PING_READ_TIMEOUT_MS = 10000;
+    private static final int PING_CONNECT_TIMEOUT_MS = 3000;
+    private static final int PING_POOL_MAX_PER_ROUTE = 2;
+
         pingAsyncRestTemplate = createAsyncRestTemplate(
-                10000,              // readTimeout = ping timeout (10s)
-                3000,               // connectTimeout = fast fail (3s)
+                PING_READ_TIMEOUT_MS,
+                PING_CONNECT_TIMEOUT_MS,
                 PING_POOL_MAX_TOTAL, // maxTotal: full cluster capacity
-                2,                  // maxPerRoute: 2 allows 1 in-flight + 1 queued, avoids head-of-line block
+                PING_POOL_MAX_PER_ROUTE,
                 cmRef);

As per coding guidelines "避免使用魔法值(Magic Value):直接使用未经定义的数值或字符串(如 if (status == 5))应替换为枚举或常量"。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java` around lines 257
- 262, The pingAsyncRestTemplate initialization uses magic literals (10000,
3000, 2); extract these into named constants (e.g., PING_READ_TIMEOUT_MS,
PING_CONNECT_TIMEOUT_MS, PING_POOL_MAX_PER_ROUTE or make them configurable) and
replace the literals in the call to createAsyncRestTemplate so all pool/timeout
settings (PING_POOL_MAX_TOTAL, the new constants) are centrally defined and
documented; update any related javadoc/config comments for
pingAsyncRestTemplate/createAsyncRestTemplate to reference the new constants.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Around line 172-173: The RTT measurement in HostTrackImpl currently uses
System.currentTimeMillis() (pingStartMs) before sending the bus message and
again when computing RTT in the CloudBusCallBack; switch to a monotonic clock by
using System.nanoTime() for start and end deltas (e.g., store pingStartNs from
System.nanoTime(), compute deltaNs = System.nanoTime() - pingStartNs) and
convert to milliseconds for the EMA/timeout logic (deltaMs = deltaNs /
1_000_000L); update both the send-site (where pingStartMs is set) and the
callback-site (where RTT is calculated and fed into the EMA/timeout) to use the
new variables so EMA and adaptive timeout code in HostTrackImpl remain monotonic
and immune to system clock adjustments.

---

Duplicate comments:
In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java`:
- Around line 685-693: The finally block in asyncJsonPostForPing currently calls
ASYNC_TEMPLATE_OVERRIDE.remove(), which will clear any previously set override
in nested calls; instead, capture the previous value (e.g., store
ASYNC_TEMPLATE_OVERRIDE.get() into a local variable) before setting
ASYNC_TEMPLATE_OVERRIDE to pingAsyncRestTemplate, and in the finally restore the
previous value (set the ThreadLocal back to the saved value or remove only if
the saved value was null). Update asyncJsonPostForPing to save/restore the old
value rather than unconditionally removing it; references:
ASYNC_TEMPLATE_OVERRIDE, pingAsyncRestTemplate, asyncJsonPost,
asyncJsonPostForPing.

In `@header/src/main/java/org/zstack/header/rest/RESTFacade.java`:
- Line 130: The line in RESTFacade where
factory.setConnectionRequestTimeout(Math.min(connectTimeout * 2, 8000))
hardcodes 8000 and bypasses the new RESTFacade.connectionRequestTimeout config;
add an overload of createRestTemplate that accepts a connectionRequestTimeout
parameter (or use the existing RESTFacade.connectionRequestTimeout value) and
replace calls to createRestTemplate(...) that now rely on the hardcoded path so
they pass RESTFacade.connectionRequestTimeout into the
factory.setConnectionRequestTimeout call; update all createRestTemplate callers
to use the new overload (or pass the instance field) so the computed timeout
uses the configured connectionRequestTimeout instead of 8000.

---

Nitpick comments:
In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java`:
- Around line 257-262: The pingAsyncRestTemplate initialization uses magic
literals (10000, 3000, 2); extract these into named constants (e.g.,
PING_READ_TIMEOUT_MS, PING_CONNECT_TIMEOUT_MS, PING_POOL_MAX_PER_ROUTE or make
them configurable) and replace the literals in the call to
createAsyncRestTemplate so all pool/timeout settings (PING_POOL_MAX_TOTAL, the
new constants) are centrally defined and documented; update any related
javadoc/config comments for pingAsyncRestTemplate/createAsyncRestTemplate to
reference the new constants.

ℹ️ Review info

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b785677 and 1df823c.

📒 Files selected for processing (4)
  • compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java
  • core/src/main/java/org/zstack/core/CoreGlobalProperty.java
  • core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java
  • header/src/main/java/org/zstack/header/rest/RESTFacade.java

Comment on lines +172 to 173
final long pingStartMs = System.currentTimeMillis();
bus.send(msg, new CloudBusCallBack(null) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd -t f -name "HostTrackImpl.java"

Repository: MatheMatrix/zstack

Length of output: 292


🏁 Script executed:

cat -n compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java | sed -n '160,200p'

Repository: MatheMatrix/zstack

Length of output: 2717


🌐 Web query:

Java System.nanoTime() vs System.currentTimeMillis() elapsed time measurement best practice

💡 Result:

Best practice (elapsed time)

Use System.nanoTime() to measure elapsed time (durations, timeouts, benchmarking inside one JVM). It’s specifically intended for elapsed-time measurement and uses a fixed-but-arbitrary origin, so only differences are meaningful. It also has at least the resolution of currentTimeMillis() and is not tied to wall-clock time. [1]

long t0 = System.nanoTime();
// work
long elapsedNanos = System.nanoTime() - t0;

When to use System.currentTimeMillis()

Use System.currentTimeMillis() for wall-clock timestamps (e.g., logging, “what time is it?”, storing an epoch-millis timestamp). It returns milliseconds since 1970-01-01T00:00Z (UTC), but its granularity may be tens of milliseconds depending on OS, and it can jump due to clock adjustments. [1]

Prefer java.time for “time-of-day” APIs

For most application code, prefer Instant.now() (optionally via an injected Clock for testability) rather than calling currentTimeMillis() directly; Clock.millis() is defined as equivalent to System.currentTimeMillis(). [2]

Rule of thumb:

  • Elapsed time / timeouts: nanoTime() [1]
  • Current date/time timestamp: Instant.now() / Clock (or currentTimeMillis() if you truly need a primitive epoch-millis) [1][2]

Sources:
[1] Oracle JDK java.lang.System Javadoc (nanoTime/currentTimeMillis) (docs.oracle.com)
[2] Oracle JDK java.time.Clock Javadoc (docs.oracle.com)

Citations:


请改用单调时钟统计 RTT,避免 EMA 被系统时钟跳变污染

第 172 行与第 186 行使用 System.currentTimeMillis() 计算响应时间。当系统时钟调整(NTP 校时、时钟回拨)时,会产生负值或异常大值,直接污染 EMA 计算,导致自适应超时抖动或长期偏大。应改用 System.nanoTime()(单调时钟),它不受系统时钟调整影响。

建议修复
-            final long pingStartMs = System.currentTimeMillis();
+            final long pingStartNs = System.nanoTime();
@@
-                    double responseTimeSec = (System.currentTimeMillis() - pingStartMs) / 1000.0;
+                    double responseTimeSec = Math.max(0D,
+                            (System.nanoTime() - pingStartNs) / 1_000_000_000.0);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java` around
lines 172 - 173, The RTT measurement in HostTrackImpl currently uses
System.currentTimeMillis() (pingStartMs) before sending the bus message and
again when computing RTT in the CloudBusCallBack; switch to a monotonic clock by
using System.nanoTime() for start and end deltas (e.g., store pingStartNs from
System.nanoTime(), compute deltaNs = System.nanoTime() - pingStartNs) and
convert to milliseconds for the EMA/timeout logic (deltaMs = deltaNs /
1_000_000L); update both the send-site (where pingStartMs is set) and the
callback-site (where RTT is calculated and fed into the EMA/timeout) to use the
new variables so EMA and adaptive timeout code in HostTrackImpl remain monotonic
and immune to system clock adjustments.

…ut, ping pool

CRITICAL: bake SSL into PoolingHttpClientConnectionManager Registry (HttpClientBuilder silently ignores setSSLContext when setConnectionManager is used)
HIGH: add pingResponseEma.remove in untrackHost to prevent unbounded map growth
HIGH: revert REST_FACADE_READ_TIMEOUT default 60s->300s to avoid regression on long-running agent ops
HIGH: ping pool maxPerRoute 1->2 to allow 1 in-flight + 1 queued, avoiding head-of-line block
MEDIUM: extract PING_POOL_MAX_TOTAL=3000 constant; align createAsyncRestTemplate param order with createRestTemplate
MEDIUM: fix health-check threshold to integer arithmetic (leased*5 > maxTotal*4)
LOW: eliminate ThreadLocal double-read; use pingConnManager.getMaxTotal() in debug handler

Resolves: ZSTAC-67534

Change-Id: Ia2328781ab00ba07dbb10046eac45cbb715ca733
@MatheMatrix MatheMatrix force-pushed the sync/ye.zou/fix/ZSTAC-67534 branch from 1df823c to 91c1e02 Compare February 27, 2026 11:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants