<fix>[host]: adaptive EMA-based ping timeout for large-scale clusters [ZSTAC-67534]#3391
<fix>[host]: adaptive EMA-based ping timeout for large-scale clusters [ZSTAC-67534]#3391MatheMatrix wants to merge 7 commits into5.5.12from
Conversation
Resolves: ZSTAC-67534 Change-Id: I337a5bf8efa9cad20e39f947d5c06e944003205c
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
Walkthrough引入主机级 EMA 自适应 ping 超时并为 Agent ping 引入专用的 ping HTTP 连接池与调用路径;新增全局配置、REST 接口/实现改动与连接池参数调整,并添加无 Spring 上下文的连接池与 ThreadLocal 生命周期单元测试。 Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant KVM as KVMHost
participant REST as RESTFacade
participant Impl as RESTFacadeImpl
participant Pool as HTTP_Ping_Pool
participant Track as HostTrackImpl
KVM->>REST: asyncJsonPostForPing(url, cmd, callback)
REST->>Impl: delegate asyncJsonPostForPing(...)
Impl->>Impl: ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate)
Impl->>Pool: execute HTTP POST via pingAsyncRestTemplate
Pool-->>Impl: HTTP response
Impl->>Impl: ASYNC_TEMPLATE_OVERRIDE.remove()
Impl->>KVM: invoke callback with result
KVM->>Track: record pingStart/pingEnd → responseTime
Track->>Track: updatePingResponseEma(hostUuid, responseTimeSec)
Track->>KVM: getAdaptiveTimeout(hostUuid) (used for future pings)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 诗歌
Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 1 warning)
✅ Passed checks (1 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java (1)
45-45:pingResponseEma目前是无界增长结构,存在长期内存膨胀风险Line 45 使用
ConcurrentHashMap持久保存每个主机 EMA,但没有淘汰策略;在主机频繁增删/迁移的长期运行环境里会持续累积历史 UUID。建议修改(引入过期淘汰)
- private static final ConcurrentHashMap<String, Double> pingResponseEma = new ConcurrentHashMap<>(); + private static final Cache<String, Double> pingResponseEma = CacheBuilder.newBuilder() + .expireAfterAccess(24, TimeUnit.HOURS) + .build(); @@ - Double ema = pingResponseEma.get(hostUuid); + Double ema = pingResponseEma.getIfPresent(hostUuid); @@ - pingResponseEma.merge(hostUuid, responseTimeSec, + pingResponseEma.asMap().merge(hostUuid, responseTimeSec, (oldEma, sample) -> EMA_ALPHA * sample + (1 - EMA_ALPHA) * oldEma);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java` at line 45, The static pingResponseEma ConcurrentHashMap in HostTrackImpl grows unbounded and should be converted to a bounded/expiring structure to avoid memory bloat; replace pingResponseEma with a cache that evicts entries (for example Guava Cache or Caffeine) configured with a maximumSize and expireAfterAccess/expireAfterWrite, or implement a ConcurrentHashMap<Pair<Double, timestamp>> plus a periodic cleanup task that removes entries older than a threshold; ensure all accesses currently using pingResponseEma (reads/writes in HostTrackImpl methods) are updated to use the chosen cache API or consult the timestamp before using/removing entries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Around line 51-59: getAdaptiveTimeout currently flips to EMA-based adaptive
timeout as soon as pingResponseEma contains a value; add a global compatibility
switch (e.g., a new HostGlobalConfig like ENABLE_ADAPTIVE_PING_TIMEOUT
defaulting to false) and update getAdaptiveTimeout to return the historical
configured timeout (HostGlobalConfig.PING_HOST_TIMEOUT.value...) when the switch
is off; when the switch is on keep the existing logic using pingResponseEma and
EMA_SAFETY_FACTOR so teams can gray-release and rollback the adaptive behavior.
---
Nitpick comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Line 45: The static pingResponseEma ConcurrentHashMap in HostTrackImpl grows
unbounded and should be converted to a bounded/expiring structure to avoid
memory bloat; replace pingResponseEma with a cache that evicts entries (for
example Guava Cache or Caffeine) configured with a maximumSize and
expireAfterAccess/expireAfterWrite, or implement a
ConcurrentHashMap<Pair<Double, timestamp>> plus a periodic cleanup task that
removes entries older than a threshold; ensure all accesses currently using
pingResponseEma (reads/writes in HostTrackImpl methods) are updated to use the
chosen cache API or consult the timestamp before using/removing entries.
ℹ️ Review info
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
compute/src/main/java/org/zstack/compute/host/HostTrackImpl.javaplugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java
Resolves: ZSTAC-67534 Change-Id: I3f5bfa3af90a5ee419a9886e9e119551f81455e2
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java (1)
53-63: 建议增加最小样本门槛后再启用 EMA 超时Line 58-63 目前只要有一个样本就会使用自适应超时。单次异常慢包会立即放大 timeout,稳定性不足。建议在样本达到阈值后再启用 EMA。
🛠️ 参考改法
@@ private static final double EMA_ALPHA = 0.2; private static final double EMA_SAFETY_FACTOR = 3.0; + private static final int EMA_MIN_SAMPLE_COUNT = 3; private static final ConcurrentHashMap<String, Double> pingResponseEma = new ConcurrentHashMap<>(); + private static final ConcurrentHashMap<String, AtomicInteger> pingResponseSampleCount = new ConcurrentHashMap<>(); @@ Double ema = pingResponseEma.get(hostUuid); - if (ema == null) { + AtomicInteger sampleCount = pingResponseSampleCount.get(hostUuid); + if (ema == null || sampleCount == null || sampleCount.get() < EMA_MIN_SAMPLE_COUNT) { return configured; } @@ private static void updatePingResponseEma(String hostUuid, double responseTimeSec) { pingResponseEma.merge(hostUuid, responseTimeSec, (oldEma, sample) -> EMA_ALPHA * sample + (1 - EMA_ALPHA) * oldEma); + pingResponseSampleCount.computeIfAbsent(hostUuid, k -> new AtomicInteger(0)).incrementAndGet(); }Also applies to: 70-73
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java` around lines 53 - 63, In getAdaptiveTimeout ensure EMA-based adaptive timeout is only used after a minimum number of samples: introduce or use a sample-count map (e.g., pingResponseCount) and a MIN_SAMPLES threshold (constant or config), check that pingResponseCount.get(hostUuid) >= MIN_SAMPLES before using pingResponseEma.get(hostUuid) to compute adaptive = ceil(ema * EMA_SAFETY_FACTOR); otherwise return the configured timeout; apply the same sample-threshold guard to the other EMA usage block mentioned (lines 70-73) so transient single slow responses don't immediately inflate timeouts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Around line 42-45: pingResponseEma 是一个静态 ConcurrentHashMap 存放主机的 EMA
数据,但没有清理路径会导致内存泄漏;在 HostTrackImpl 中为主机停止/删除或不再跟踪的路径添加清理逻辑以删除对应 key(例如在
stopTrackHost/handleHostDisconnected/handleHostDeleted 或任何取消跟踪的回调中调用
pingResponseEma.remove(hostUuid)),并且可选地在 HostTrackImpl 中新增一个统一方法 (e.g.
cleanupPingEmaForHost(String hostUuid))
以集中删除/记录;如果需要长期自动回收,可替换或补充为基于时间的回收(TTL)策略或使用弱引用缓存,但首要修复是在主机生命周期结束处显式移除
pingResponseEma 条目以避免无界增长。
---
Nitpick comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Around line 53-63: In getAdaptiveTimeout ensure EMA-based adaptive timeout is
only used after a minimum number of samples: introduce or use a sample-count map
(e.g., pingResponseCount) and a MIN_SAMPLES threshold (constant or config),
check that pingResponseCount.get(hostUuid) >= MIN_SAMPLES before using
pingResponseEma.get(hostUuid) to compute adaptive = ceil(ema *
EMA_SAFETY_FACTOR); otherwise return the configured timeout; apply the same
sample-threshold guard to the other EMA usage block mentioned (lines 70-73) so
transient single slow responses don't immediately inflate timeouts.
ℹ️ Review info
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
conf/globalConfig/host.xmlis excluded by!**/*.xml
📒 Files selected for processing (2)
compute/src/main/java/org/zstack/compute/host/HostGlobalConfig.javacompute/src/main/java/org/zstack/compute/host/HostTrackImpl.java
R2: add asyncJsonPostForPing() using a dedicated AsyncRestTemplate (maxPerRoute=1, maxTotal=3000) to prevent ping starvation under high business traffic. ThreadLocal ASYNC_TEMPLATE_OVERRIDE injects the ping pool transparently without changing asyncJson() signature. Resolves: ZSTAC-67534 Change-Id: Id3721eb04c291f742421170792ba713a8e446e23
…ation TestConnectionPoolConfig: verify createRestTemplate(4-param) explicit pool and backward-compat 2-param variant (R1). TestPingPoolIsolation: verify ASYNC_TEMPLATE_OVERRIDE ThreadLocal structure and cleanup lifecycle via reflection (R2). Resolves: ZSTAC-67534 Change-Id: I9f3a2e6b7c8d1e2f3a4b5c6d7e8f9a0b1c2d3e4
…Status R5: store PoolingNHttpClientConnectionManager refs for async and ping pools. Register DumpConnectionPoolStatus debug signal handler to dump leased/available/pending stats at runtime. Resolves: ZSTAC-67534 Change-Id: Ib2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (4)
core/src/main/java/org/zstack/core/cloudbus/CloudBusImpl3.java (1)
140-145: 硬编码的魔法值10应该提取为常量或配置项。第 145 行的
10是一个魔法值,缺乏语义说明。根据编码规范,应当使用常量或配置属性替代硬编码数值,以提升可读性和可维护性。建议在
CloudBusGlobalProperty中定义一个配置项(如HTTP_MAX_CONN_WAIT),或在本类中定义一个常量:♻️ 建议的修改
+ private static final int HTTP_MAX_CONNECTION_WAIT = 10; + // R1+R3: explicit pool params matching chain queue capacity (CloudBusGlobalProperty.HTTP_MAX_CONN) private final static TimeoutRestTemplate http = RESTFacade.createRestTemplate( CoreGlobalProperty.REST_FACADE_READ_TIMEOUT, CoreGlobalProperty.REST_FACADE_CONNECT_TIMEOUT, CloudBusGlobalProperty.HTTP_MAX_CONN, - 10); + HTTP_MAX_CONNECTION_WAIT);As per coding guidelines, 避免使用魔法值(Magic Value):直接使用未经定义的数值或字符串应替换为枚举或常量。
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/org/zstack/core/cloudbus/CloudBusImpl3.java` around lines 140 - 145, Replace the hardcoded magic value 10 passed into RESTFacade.createRestTemplate by introducing a named constant or configuration property; e.g., add a new CloudBusGlobalProperty (e.g., HTTP_MAX_CONN_WAIT) or a private static final constant in CloudBusImpl3 and use that identifier in the call to RESTFacade.createRestTemplate where TimeoutRestTemplate http is initialized, so the meaning of the value is explicit and configurable instead of the literal 10.test/src/test/java/org/zstack/test/core/rest/TestConnectionPoolConfig.java (1)
12-12: 测试类命名建议改为后缀式。Line 12 的
TestConnectionPoolConfig建议改为ConnectionPoolConfigTest(或ConnectionPoolConfigCase),以符合测试类统一命名规则。As per coding guidelines "测试类需要以
Test或Case结尾".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/src/test/java/org/zstack/test/core/rest/TestConnectionPoolConfig.java` at line 12, Rename the test class TestConnectionPoolConfig to follow the project's test naming convention by using a suffix form (e.g., ConnectionPoolConfigTest or ConnectionPoolConfigCase); update the class declaration name and any references/usages (imports, test runners, build/test configs) that refer to TestConnectionPoolConfig so they point to the new class name and ensure tests continue to compile and run.core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java (1)
223-228: ping 连接池参数建议去魔法值并外置配置。Line 225-228 的
10000/3000/1/3000是硬编码,后续不同规模集群很难按场景调优。建议提取常量或全局配置项,避免把容量与超时策略固化在代码里。As per coding guidelines "避免使用魔法值(Magic Value):直接使用未经定义的数值或字符串应替换为常量或枚举".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java` around lines 223 - 228, The ping AsyncRestTemplate is constructed with hard-coded magic values (10000, 3000, 1, 3000) in RESTFacadeImpl; update the code to pull these parameters from configurable constants or external configuration and use named constants to replace the literal numbers. Specifically, change the call that initializes pingAsyncRestTemplate (the createAsyncRestTemplate invocation) to read timeouts and pool sizes from a configuration source or defined static final fields (e.g., PING_READ_TIMEOUT_MS, PING_CONNECT_TIMEOUT_MS, PING_MAX_PER_ROUTE, PING_MAX_TOTAL) and pass those variables into createAsyncRestTemplate so the values are not embedded as literals.test/src/test/java/org/zstack/test/core/rest/TestPingPoolIsolation.java (1)
16-16: 测试类命名不符合约定后缀。Line 16 的类名
TestPingPoolIsolation未以Test或Case结尾,建议重命名为PingPoolIsolationTest(或PingPoolIsolationCase)以满足统一约定。As per coding guidelines "测试类需要以
Test或Case结尾".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/src/test/java/org/zstack/test/core/rest/TestPingPoolIsolation.java` at line 16, Rename the test class TestPingPoolIsolation to follow the convention (e.g., PingPoolIsolationTest or PingPoolIsolationCase): update the class declaration in TestPingPoolIsolation to the chosen name and adjust any references/imports/usages (test runner, annotations, constructors) accordingly so the JVM and build tooling find the test; also rename the source file to match the new class name to keep filename/classname consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@core/src/main/java/org/zstack/core/CoreGlobalProperty.java`:
- Around line 41-50: The change tightens RESTFacade.readTimeout default to 60000
which breaks backward compatibility; revert the default value in the
`@GlobalProperty` on REST_FACADE_READ_TIMEOUT to the original (pre-change) value
and implement a compatibility flag/property (e.g., add a new `@GlobalProperty`
boolean REST_FACADE_USE_TIGHTER_DEFAULT or REST_FACADE_NEW_INSTALL_ONLY) and
update the code that reads REST_FACADE_READ_TIMEOUT to use the tighter 60000
only when that flag is true (or when a "new install" marker is present), leaving
REST_FACADE_CONNECT_TIMEOUT/REST_FACADE_ECHO_TIMEOUT/REST_FACADE_MAX_PER_ROUTE/REST_FACADE_MAX_TOTAL
behavior unchanged; reference the static fields REST_FACADE_READ_TIMEOUT and the
`@GlobalProperty` annotation on it to locate and apply the conditional logic.
In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java`:
- Around line 615-623: The ThreadLocal override in asyncJsonPostForPing
currently calls ASYNC_TEMPLATE_OVERRIDE.remove() which clears any outer nested
value; change asyncJsonPostForPing to save the previous value (e.g. a local
variable old = ASYNC_TEMPLATE_OVERRIDE.get()), set ASYNC_TEMPLATE_OVERRIDE to
pingAsyncRestTemplate, call asyncJsonPost(url, body, callback) in the try, and
in finally restore the previous value (if old is null call remove() else
ASYNC_TEMPLATE_OVERRIDE.set(old)) so nested callers keep their template; update
references to ASYNC_TEMPLATE_OVERRIDE, pingAsyncRestTemplate, and asyncJsonPost
accordingly.
In `@header/src/main/java/org/zstack/header/rest/RESTFacade.java`:
- Around line 126-131: The createRestTemplate method hardcodes
connectionRequestTimeout to 8000 which bypasses the configurable
RESTFacade.connectionRequestTimeout; add an overloaded createRestTemplate(int
readTimeout, int connectTimeout, int maxTotal, int maxPerRoute, int
connectionRequestTimeout) and update the existing createRestTemplate to call the
new overload (or accept a connectionRequestTimeout parameter) so
TimeoutHttpComponentsClientHttpRequestFactory.setConnectionRequestTimeout(Math.min(connectTimeout
* 2, connectionRequestTimeout)) is used instead of 8000; ensure all callers of
createRestTemplate use the new overload or the shared
RESTFacade.connectionRequestTimeout value.
In `@test/src/test/java/org/zstack/test/core/rest/TestConnectionPoolConfig.java`:
- Around line 15-40: The current tests only assert non-null; enhance them to
verify the actual pool and timeout settings by extracting the internal request
factory and http client from the returned TimeoutRestTemplate (from
RESTFacade.createRestTemplate) and inspecting the underlying Apache components:
cast to HttpComponentsClientHttpRequestFactory to get the HttpClient, downcast
to CloseableHttpClient to obtain the ConnectionManager
(PoolingHttpClientConnectionManager) and assert its getMaxTotal() and
getDefaultMaxPerRoute() match the supplied maxTotal/maxPerRoute, and read the
RequestConfig/connectionRequestTimeout (or the factory’s timeout fields) to
assert connectionRequestTimeout/connectionRequestTimeout behavior; update tests
like test4ParamCreateRestTemplate_returnsNonNull,
testZeroPoolParams_fallsBackToApacheDefaults and
testPingPoolParams_smallPerRoute_largeTotalForCluster to include these
reflective/assertive checks.
In `@test/src/test/java/org/zstack/test/core/rest/TestPingPoolIsolation.java`:
- Around line 45-49: The test method
testAsyncTemplateOverride_isNullOnFreshThread() has mismatched semantics: it
reads the ThreadLocal on the current thread rather than a truly fresh thread;
update the test to either rename the method to reflect current-thread initial
state (e.g., testAsyncTemplateOverride_currentThreadInitialState) or change the
implementation to actually create a new Thread and assert tl.get() is null
there; locate the ThreadLocal via getOverrideTL() and the assertion using
Assert.assertNull and adjust the name or add a new Thread that calls tl.get() so
the test semantics and implementation match.
---
Nitpick comments:
In `@core/src/main/java/org/zstack/core/cloudbus/CloudBusImpl3.java`:
- Around line 140-145: Replace the hardcoded magic value 10 passed into
RESTFacade.createRestTemplate by introducing a named constant or configuration
property; e.g., add a new CloudBusGlobalProperty (e.g., HTTP_MAX_CONN_WAIT) or a
private static final constant in CloudBusImpl3 and use that identifier in the
call to RESTFacade.createRestTemplate where TimeoutRestTemplate http is
initialized, so the meaning of the value is explicit and configurable instead of
the literal 10.
In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java`:
- Around line 223-228: The ping AsyncRestTemplate is constructed with hard-coded
magic values (10000, 3000, 1, 3000) in RESTFacadeImpl; update the code to pull
these parameters from configurable constants or external configuration and use
named constants to replace the literal numbers. Specifically, change the call
that initializes pingAsyncRestTemplate (the createAsyncRestTemplate invocation)
to read timeouts and pool sizes from a configuration source or defined static
final fields (e.g., PING_READ_TIMEOUT_MS, PING_CONNECT_TIMEOUT_MS,
PING_MAX_PER_ROUTE, PING_MAX_TOTAL) and pass those variables into
createAsyncRestTemplate so the values are not embedded as literals.
In `@test/src/test/java/org/zstack/test/core/rest/TestConnectionPoolConfig.java`:
- Line 12: Rename the test class TestConnectionPoolConfig to follow the
project's test naming convention by using a suffix form (e.g.,
ConnectionPoolConfigTest or ConnectionPoolConfigCase); update the class
declaration name and any references/usages (imports, test runners, build/test
configs) that refer to TestConnectionPoolConfig so they point to the new class
name and ensure tests continue to compile and run.
In `@test/src/test/java/org/zstack/test/core/rest/TestPingPoolIsolation.java`:
- Line 16: Rename the test class TestPingPoolIsolation to follow the convention
(e.g., PingPoolIsolationTest or PingPoolIsolationCase): update the class
declaration in TestPingPoolIsolation to the chosen name and adjust any
references/imports/usages (test runner, annotations, constructors) accordingly
so the JVM and build tooling find the test; also rename the source file to match
the new class name to keep filename/classname consistency.
ℹ️ Review info
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
core/src/main/java/org/zstack/core/CoreGlobalProperty.javacore/src/main/java/org/zstack/core/cloudbus/CloudBusImpl3.javacore/src/main/java/org/zstack/core/rest/RESTFacadeImpl.javaheader/src/main/java/org/zstack/header/rest/RESTFacade.javaplugin/kvm/src/main/java/org/zstack/kvm/KVMHost.javatest/src/test/java/org/zstack/test/core/rest/TestConnectionPoolConfig.javatest/src/test/java/org/zstack/test/core/rest/TestPingPoolIsolation.java
🚧 Files skipped from review as they are similar to previous changes (1)
- plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java
| @GlobalProperty(name = "RESTFacade.readTimeout", defaultValue = "60000") | ||
| public static int REST_FACADE_READ_TIMEOUT; | ||
| @GlobalProperty(name = "RESTFacade.connectTimeout", defaultValue = "15000") | ||
| public static int REST_FACADE_CONNECT_TIMEOUT; | ||
| @GlobalProperty(name = "RESTFacade.echoTimeout", defaultValue = "60") | ||
| public static int REST_FACADE_ECHO_TIMEOUT; | ||
| @GlobalProperty(name = "RESTFacade.maxPerRoute", defaultValue = "2") | ||
| @GlobalProperty(name = "RESTFacade.maxPerRoute", defaultValue = "50") | ||
| public static int REST_FACADE_MAX_PER_ROUTE; | ||
| @GlobalProperty(name = "RESTFacade.maxTotal", defaultValue = "128") | ||
| @GlobalProperty(name = "RESTFacade.maxTotal", defaultValue = "2048") | ||
| public static int REST_FACADE_MAX_TOTAL; |
There was a problem hiding this comment.
默认超时直接收紧有向后兼容风险。
Line 41 将 RESTFacade.readTimeout 默认值改为 60000,会影响所有未显式配置的历史部署,可能让原本可用的慢路径请求在升级后变为超时失败。建议通过兼容开关或“仅新装生效”的策略发布该默认值变更。
As per coding guidelines "向后兼容原则:之前的代码产生的行为不要直接去改动,即使是 bug 也可能用户有依赖,不应直接改动,可以考虑升级用户一套行为、新安装用户一套行为,开关控制等".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/src/main/java/org/zstack/core/CoreGlobalProperty.java` around lines 41 -
50, The change tightens RESTFacade.readTimeout default to 60000 which breaks
backward compatibility; revert the default value in the `@GlobalProperty` on
REST_FACADE_READ_TIMEOUT to the original (pre-change) value and implement a
compatibility flag/property (e.g., add a new `@GlobalProperty` boolean
REST_FACADE_USE_TIGHTER_DEFAULT or REST_FACADE_NEW_INSTALL_ONLY) and update the
code that reads REST_FACADE_READ_TIMEOUT to use the tighter 60000 only when that
flag is true (or when a "new install" marker is present), leaving
REST_FACADE_CONNECT_TIMEOUT/REST_FACADE_ECHO_TIMEOUT/REST_FACADE_MAX_PER_ROUTE/REST_FACADE_MAX_TOTAL
behavior unchanged; reference the static fields REST_FACADE_READ_TIMEOUT and the
`@GlobalProperty` annotation on it to locate and apply the conditional logic.
| @Override | ||
| public void asyncJsonPostForPing(String url, Object body, AsyncRESTCallback callback) { | ||
| // R2: inject dedicated ping pool via ThreadLocal; cleared in finally to prevent leaks | ||
| ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate); | ||
| try { | ||
| asyncJsonPost(url, body, callback); | ||
| } finally { | ||
| ASYNC_TEMPLATE_OVERRIDE.remove(); | ||
| } |
There was a problem hiding this comment.
ThreadLocal 覆盖值在嵌套调用下会被误清理。
Line 622 直接 remove() 会在“外层已设置覆盖值、内层再次调用”时把外层状态也清掉。建议保存旧值并在 finally 恢复,避免嵌套调用时模板选择被破坏。
建议修复
`@Override`
public void asyncJsonPostForPing(String url, Object body, AsyncRESTCallback callback) {
- ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate);
+ AsyncRestTemplate previous = ASYNC_TEMPLATE_OVERRIDE.get();
+ ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate);
try {
asyncJsonPost(url, body, callback);
} finally {
- ASYNC_TEMPLATE_OVERRIDE.remove();
+ if (previous == null) {
+ ASYNC_TEMPLATE_OVERRIDE.remove();
+ } else {
+ ASYNC_TEMPLATE_OVERRIDE.set(previous);
+ }
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| @Override | |
| public void asyncJsonPostForPing(String url, Object body, AsyncRESTCallback callback) { | |
| // R2: inject dedicated ping pool via ThreadLocal; cleared in finally to prevent leaks | |
| ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate); | |
| try { | |
| asyncJsonPost(url, body, callback); | |
| } finally { | |
| ASYNC_TEMPLATE_OVERRIDE.remove(); | |
| } | |
| `@Override` | |
| public void asyncJsonPostForPing(String url, Object body, AsyncRESTCallback callback) { | |
| AsyncRestTemplate previous = ASYNC_TEMPLATE_OVERRIDE.get(); | |
| ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate); | |
| try { | |
| asyncJsonPost(url, body, callback); | |
| } finally { | |
| if (previous == null) { | |
| ASYNC_TEMPLATE_OVERRIDE.remove(); | |
| } else { | |
| ASYNC_TEMPLATE_OVERRIDE.set(previous); | |
| } | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java` around lines 615
- 623, The ThreadLocal override in asyncJsonPostForPing currently calls
ASYNC_TEMPLATE_OVERRIDE.remove() which clears any outer nested value; change
asyncJsonPostForPing to save the previous value (e.g. a local variable old =
ASYNC_TEMPLATE_OVERRIDE.get()), set ASYNC_TEMPLATE_OVERRIDE to
pingAsyncRestTemplate, call asyncJsonPost(url, body, callback) in the try, and
in finally restore the previous value (if old is null call remove() else
ASYNC_TEMPLATE_OVERRIDE.set(old)) so nested callers keep their template; update
references to ASYNC_TEMPLATE_OVERRIDE, pingAsyncRestTemplate, and asyncJsonPost
accordingly.
| static TimeoutRestTemplate createRestTemplate(int readTimeout, int connectTimeout, int maxTotal, int maxPerRoute) { | ||
| HttpComponentsClientHttpRequestFactory factory = new TimeoutHttpComponentsClientHttpRequestFactory(); | ||
| factory.setReadTimeout(readTimeout); | ||
| factory.setConnectTimeout(connectTimeout); | ||
| factory.setConnectionRequestTimeout(connectTimeout * 2); | ||
| factory.setConnectionRequestTimeout(Math.min(connectTimeout * 2, 8000)); | ||
|
|
There was a problem hiding this comment.
connectionRequestTimeout 上限被硬编码,和全局配置能力不一致。
Line 130 固定 8000 会让该模板路径无法复用 RESTFacade.connectionRequestTimeout 的可配置能力,最终出现同一组件内超时策略不一致。建议增加带 connectionRequestTimeout 参数的重载,由调用方传入上限。
建议重构方向
static TimeoutRestTemplate createRestTemplate(int readTimeout, int connectTimeout, int maxTotal, int maxPerRoute) {
- HttpComponentsClientHttpRequestFactory factory = new TimeoutHttpComponentsClientHttpRequestFactory();
+ return createRestTemplate(readTimeout, connectTimeout, maxTotal, maxPerRoute, 8000);
+}
+
+static TimeoutRestTemplate createRestTemplate(int readTimeout, int connectTimeout, int maxTotal, int maxPerRoute, int connectionRequestTimeout) {
+ HttpComponentsClientHttpRequestFactory factory = new TimeoutHttpComponentsClientHttpRequestFactory();
factory.setReadTimeout(readTimeout);
factory.setConnectTimeout(connectTimeout);
- factory.setConnectionRequestTimeout(Math.min(connectTimeout * 2, 8000));
+ factory.setConnectionRequestTimeout(Math.min(connectTimeout * 2, connectionRequestTimeout));🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@header/src/main/java/org/zstack/header/rest/RESTFacade.java` around lines 126
- 131, The createRestTemplate method hardcodes connectionRequestTimeout to 8000
which bypasses the configurable RESTFacade.connectionRequestTimeout; add an
overloaded createRestTemplate(int readTimeout, int connectTimeout, int maxTotal,
int maxPerRoute, int connectionRequestTimeout) and update the existing
createRestTemplate to call the new overload (or accept a
connectionRequestTimeout parameter) so
TimeoutHttpComponentsClientHttpRequestFactory.setConnectionRequestTimeout(Math.min(connectTimeout
* 2, connectionRequestTimeout)) is used instead of 8000; ensure all callers of
createRestTemplate use the new overload or the shared
RESTFacade.connectionRequestTimeout value.
| public void test4ParamCreateRestTemplate_returnsNonNull() { | ||
| // R1: explicit pool params must succeed | ||
| TimeoutRestTemplate tmpl = RESTFacade.createRestTemplate(60000, 3000, 50, 2); | ||
| Assert.assertNotNull("createRestTemplate(4-param) must return non-null template", tmpl); | ||
| } | ||
|
|
||
| @Test | ||
| public void test2ParamCreateRestTemplate_backwardCompat() { | ||
| // old callers delegating to 4-param with (0, 0) must still work | ||
| TimeoutRestTemplate tmpl = RESTFacade.createRestTemplate(60000, 3000); | ||
| Assert.assertNotNull("createRestTemplate(2-param) backward-compat must return non-null", tmpl); | ||
| } | ||
|
|
||
| @Test | ||
| public void testZeroPoolParams_fallsBackToApacheDefaults() { | ||
| // maxTotal=0, maxPerRoute=0 → skip explicit pool; SSL path still applies | ||
| TimeoutRestTemplate tmpl = RESTFacade.createRestTemplate(60000, 3000, 0, 0); | ||
| Assert.assertNotNull("zero-pool fallback must return non-null template", tmpl); | ||
| } | ||
|
|
||
| @Test | ||
| public void testPingPoolParams_smallPerRoute_largeTotalForCluster() { | ||
| // P0 ping pool: maxPerRoute=1 (one ping/host), maxTotal=3000 (full cluster) | ||
| TimeoutRestTemplate tmpl = RESTFacade.createRestTemplate(10000, 3000, 3000, 1); | ||
| Assert.assertNotNull("ping-pool createRestTemplate must return non-null template", tmpl); | ||
| } |
There was a problem hiding this comment.
当前断言过弱,无法证明连接池配置真的生效。
Line 15-40 只校验“非空”不足以覆盖本次改动目标(池大小与隔离策略)。建议补充对 maxTotal、defaultMaxPerRoute、connectionRequestTimeout 的可观察断言(可用反射读取 requestFactory/httpClient/connectionManager)。
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/src/test/java/org/zstack/test/core/rest/TestConnectionPoolConfig.java`
around lines 15 - 40, The current tests only assert non-null; enhance them to
verify the actual pool and timeout settings by extracting the internal request
factory and http client from the returned TimeoutRestTemplate (from
RESTFacade.createRestTemplate) and inspecting the underlying Apache components:
cast to HttpComponentsClientHttpRequestFactory to get the HttpClient, downcast
to CloseableHttpClient to obtain the ConnectionManager
(PoolingHttpClientConnectionManager) and assert its getMaxTotal() and
getDefaultMaxPerRoute() match the supplied maxTotal/maxPerRoute, and read the
RequestConfig/connectionRequestTimeout (or the factory’s timeout fields) to
assert connectionRequestTimeout/connectionRequestTimeout behavior; update tests
like test4ParamCreateRestTemplate_returnsNonNull,
testZeroPoolParams_fallsBackToApacheDefaults and
testPingPoolParams_smallPerRoute_largeTotalForCluster to include these
reflective/assertive checks.
| public void testAsyncTemplateOverride_isNullOnFreshThread() throws Exception { | ||
| ThreadLocal<AsyncRestTemplate> tl = getOverrideTL(); | ||
| // on a new thread (or after cleanup) the ThreadLocal must be null | ||
| Assert.assertNull("ASYNC_TEMPLATE_OVERRIDE must be null on fresh thread — verifies no leak from prior call", tl.get()); | ||
| } |
There was a problem hiding this comment.
“fresh thread”测试语义与实现不一致。
Line 45-49 当前只在当前线程读取 ThreadLocal,并未创建新线程;测试名会给出错误信号。建议改名为“currentThreadInitialState”或实际新建线程后断言。
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/src/test/java/org/zstack/test/core/rest/TestPingPoolIsolation.java`
around lines 45 - 49, The test method
testAsyncTemplateOverride_isNullOnFreshThread() has mismatched semantics: it
reads the ThreadLocal on the current thread rather than a truly fresh thread;
update the test to either rename the method to reflect current-thread initial
state (e.g., testAsyncTemplateOverride_currentThreadInitialState) or change the
implementation to actually create a new Thread and assert tl.get() is null
there; locate the ThreadLocal via getOverrideTL() and the assertion using
Assert.assertNull and adjust the name or add a new Thread that calls tl.get() so
the test semantics and implementation match.
R5: submit periodic task (every 60s) to check async and ping pool utilization. Log [POOL-ALARM] WARNING when async pool > 80% or ping pool has pending requests, enabling ops to detect congestion before ConnectionPoolTimeoutException occurs. Resolves: ZSTAC-67534 Change-Id: Ic3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java (1)
681-689:⚠️ Potential issue | 🟠 Major恢复 ThreadLocal 旧值,避免嵌套调用误清理外层覆盖模板。
Line 688 直接
ASYNC_TEMPLATE_OVERRIDE.remove()会在嵌套调用时把外层已设置值一并清掉,后续同线程请求会错误回落到默认asyncRestTemplate。建议修复
`@Override` public void asyncJsonPostForPing(String url, Object body, AsyncRESTCallback callback) { - // R2: inject dedicated ping pool via ThreadLocal; cleared in finally to prevent leaks - ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate); + AsyncRestTemplate previous = ASYNC_TEMPLATE_OVERRIDE.get(); + ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate); try { asyncJsonPost(url, body, callback); } finally { - ASYNC_TEMPLATE_OVERRIDE.remove(); + if (previous == null) { + ASYNC_TEMPLATE_OVERRIDE.remove(); + } else { + ASYNC_TEMPLATE_OVERRIDE.set(previous); + } } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java` around lines 681 - 689, 在 asyncJsonPostForPing 中不要直接 remove ThreadLocal;改为在设置 ASYNC_TEMPLATE_OVERRIDE 前保存旧值(ASYNC_TEMPLATE_OVERRIDE.get()),调用 asyncJsonPost 后在 finally 中恢复旧值:若旧值为 null 则 remove,否则重新 set 旧值。引用的标识符:asyncJsonPostForPing、ASYNC_TEMPLATE_OVERRIDE、pingAsyncRestTemplate、asyncJsonPost。
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java`:
- Around line 207-208: The code in RESTFacadeImpl currently hardcodes the ping
pool maxTotal as 3000 in the sb.append call (the string formatting that prints
"ping-pool(P0): maxTotal=...") which violates the magic-value rule and breaks
for >3000 clusters; change this to use a configurable value instead of the
literal 3000: read the pool max from the existing global
concurrency/configuration source or introduce a new configurable constant (e.g.,
PING_POOL_MAX or a method getPingPoolMax()) and replace the literal in the
sb.append formatting and any other locations that write 3000 (also update where
ping pool is created/initialized so the actual pool uses the same configurable
maxTotal rather than a hardcoded 3000). Ensure you update all occurrences (the
sb.append formatting and pool initialization code paths) to reference the same
config variable.
---
Duplicate comments:
In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java`:
- Around line 681-689: 在 asyncJsonPostForPing 中不要直接 remove ThreadLocal;改为在设置
ASYNC_TEMPLATE_OVERRIDE 前保存旧值(ASYNC_TEMPLATE_OVERRIDE.get()),调用 asyncJsonPost 后在
finally 中恢复旧值:若旧值为 null 则 remove,否则重新 set
旧值。引用的标识符:asyncJsonPostForPing、ASYNC_TEMPLATE_OVERRIDE、pingAsyncRestTemplate、asyncJsonPost。
ℹ️ Review info
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
header/src/main/java/org/zstack/header/rest/RESTFacade.java (1)
130-130:⚠️ Potential issue | 🟠 Major
connectionRequestTimeout在同步模板路径仍是硬编码。Line 130 固定为
8000,会绕过新增的RESTFacade.connectionRequestTimeout配置,导致模板超时策略不一致。这个问题与之前评审结论一致,建议补充带connectionRequestTimeout参数的重载并让调用方传入配置值。♻️ 建议修复
static TimeoutRestTemplate createRestTemplate(int readTimeout, int connectTimeout, int maxTotal, int maxPerRoute) { - HttpComponentsClientHttpRequestFactory factory = new TimeoutHttpComponentsClientHttpRequestFactory(); + return createRestTemplate(readTimeout, connectTimeout, maxTotal, maxPerRoute, 8000); +} + +static TimeoutRestTemplate createRestTemplate(int readTimeout, int connectTimeout, int maxTotal, int maxPerRoute, int connectionRequestTimeout) { + HttpComponentsClientHttpRequestFactory factory = new TimeoutHttpComponentsClientHttpRequestFactory(); factory.setReadTimeout(readTimeout); factory.setConnectTimeout(connectTimeout); - factory.setConnectionRequestTimeout(Math.min(connectTimeout * 2, 8000)); + factory.setConnectionRequestTimeout(Math.min(connectTimeout * 2, connectionRequestTimeout));#!/bin/bash set -euo pipefail # 1) 定位硬编码 connectionRequestTimeout rg -n "Math\\.min\\(connectTimeout \\* 2, 8000\\)" --type=java # 2) 检查 createRestTemplate 重载与调用点 rg -n "createRestTemplate\\(" --type=java🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@header/src/main/java/org/zstack/header/rest/RESTFacade.java` at line 130, The line in RESTFacade where factory.setConnectionRequestTimeout(Math.min(connectTimeout * 2, 8000)) hardcodes 8000 and bypasses the new RESTFacade.connectionRequestTimeout config; add an overload of createRestTemplate that accepts a connectionRequestTimeout parameter (or use the existing RESTFacade.connectionRequestTimeout value) and replace calls to createRestTemplate(...) that now rely on the hardcoded path so they pass RESTFacade.connectionRequestTimeout into the factory.setConnectionRequestTimeout call; update all createRestTemplate callers to use the new overload (or pass the instance field) so the computed timeout uses the configured connectionRequestTimeout instead of 8000.core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java (1)
685-693:⚠️ Potential issue | 🟠 Major
ThreadLocal覆盖值在嵌套调用场景会被误清理。Line 692 在 finally 里直接
remove(),会把外层已存在的覆盖模板也清掉。应恢复旧值,而不是无条件清理。🛠 建议修复
`@Override` public void asyncJsonPostForPing(String url, Object body, AsyncRESTCallback callback) { - ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate); + AsyncRestTemplate previous = ASYNC_TEMPLATE_OVERRIDE.get(); + ASYNC_TEMPLATE_OVERRIDE.set(pingAsyncRestTemplate); try { asyncJsonPost(url, body, callback); } finally { - ASYNC_TEMPLATE_OVERRIDE.remove(); + if (previous == null) { + ASYNC_TEMPLATE_OVERRIDE.remove(); + } else { + ASYNC_TEMPLATE_OVERRIDE.set(previous); + } } }#!/bin/bash set -euo pipefail # 输出 asyncJsonPostForPing 方法片段,确认 finally 是否无条件 remove() python - <<'PY' from pathlib import Path p = Path("core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java") lines = p.read_text().splitlines() start = next(i for i,l in enumerate(lines, 1) if "public void asyncJsonPostForPing" in l) for i in range(start, min(start + 24, len(lines) + 1)): print(f"{i}:{lines[i-1]}") PY🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java` around lines 685 - 693, The finally block in asyncJsonPostForPing currently calls ASYNC_TEMPLATE_OVERRIDE.remove(), which will clear any previously set override in nested calls; instead, capture the previous value (e.g., store ASYNC_TEMPLATE_OVERRIDE.get() into a local variable) before setting ASYNC_TEMPLATE_OVERRIDE to pingAsyncRestTemplate, and in the finally restore the previous value (set the ThreadLocal back to the saved value or remove only if the saved value was null). Update asyncJsonPostForPing to save/restore the old value rather than unconditionally removing it; references: ASYNC_TEMPLATE_OVERRIDE, pingAsyncRestTemplate, asyncJsonPost, asyncJsonPostForPing.
🧹 Nitpick comments (1)
core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java (1)
257-262: Ping 池初始化中仍有多个裸字面量,建议提取常量统一管理。Line 258 到 Line 262 的
10000/3000/2建议与PING_POOL_MAX_TOTAL一样提取为具名常量(或配置项),降低后续调参和排障成本。♻️ 建议重构
+ private static final int PING_READ_TIMEOUT_MS = 10000; + private static final int PING_CONNECT_TIMEOUT_MS = 3000; + private static final int PING_POOL_MAX_PER_ROUTE = 2; + pingAsyncRestTemplate = createAsyncRestTemplate( - 10000, // readTimeout = ping timeout (10s) - 3000, // connectTimeout = fast fail (3s) + PING_READ_TIMEOUT_MS, + PING_CONNECT_TIMEOUT_MS, PING_POOL_MAX_TOTAL, // maxTotal: full cluster capacity - 2, // maxPerRoute: 2 allows 1 in-flight + 1 queued, avoids head-of-line block + PING_POOL_MAX_PER_ROUTE, cmRef);As per coding guidelines "避免使用魔法值(Magic Value):直接使用未经定义的数值或字符串(如 if (status == 5))应替换为枚举或常量"。
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java` around lines 257 - 262, The pingAsyncRestTemplate initialization uses magic literals (10000, 3000, 2); extract these into named constants (e.g., PING_READ_TIMEOUT_MS, PING_CONNECT_TIMEOUT_MS, PING_POOL_MAX_PER_ROUTE or make them configurable) and replace the literals in the call to createAsyncRestTemplate so all pool/timeout settings (PING_POOL_MAX_TOTAL, the new constants) are centrally defined and documented; update any related javadoc/config comments for pingAsyncRestTemplate/createAsyncRestTemplate to reference the new constants.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Around line 172-173: The RTT measurement in HostTrackImpl currently uses
System.currentTimeMillis() (pingStartMs) before sending the bus message and
again when computing RTT in the CloudBusCallBack; switch to a monotonic clock by
using System.nanoTime() for start and end deltas (e.g., store pingStartNs from
System.nanoTime(), compute deltaNs = System.nanoTime() - pingStartNs) and
convert to milliseconds for the EMA/timeout logic (deltaMs = deltaNs /
1_000_000L); update both the send-site (where pingStartMs is set) and the
callback-site (where RTT is calculated and fed into the EMA/timeout) to use the
new variables so EMA and adaptive timeout code in HostTrackImpl remain monotonic
and immune to system clock adjustments.
---
Duplicate comments:
In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java`:
- Around line 685-693: The finally block in asyncJsonPostForPing currently calls
ASYNC_TEMPLATE_OVERRIDE.remove(), which will clear any previously set override
in nested calls; instead, capture the previous value (e.g., store
ASYNC_TEMPLATE_OVERRIDE.get() into a local variable) before setting
ASYNC_TEMPLATE_OVERRIDE to pingAsyncRestTemplate, and in the finally restore the
previous value (set the ThreadLocal back to the saved value or remove only if
the saved value was null). Update asyncJsonPostForPing to save/restore the old
value rather than unconditionally removing it; references:
ASYNC_TEMPLATE_OVERRIDE, pingAsyncRestTemplate, asyncJsonPost,
asyncJsonPostForPing.
In `@header/src/main/java/org/zstack/header/rest/RESTFacade.java`:
- Line 130: The line in RESTFacade where
factory.setConnectionRequestTimeout(Math.min(connectTimeout * 2, 8000))
hardcodes 8000 and bypasses the new RESTFacade.connectionRequestTimeout config;
add an overload of createRestTemplate that accepts a connectionRequestTimeout
parameter (or use the existing RESTFacade.connectionRequestTimeout value) and
replace calls to createRestTemplate(...) that now rely on the hardcoded path so
they pass RESTFacade.connectionRequestTimeout into the
factory.setConnectionRequestTimeout call; update all createRestTemplate callers
to use the new overload (or pass the instance field) so the computed timeout
uses the configured connectionRequestTimeout instead of 8000.
---
Nitpick comments:
In `@core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java`:
- Around line 257-262: The pingAsyncRestTemplate initialization uses magic
literals (10000, 3000, 2); extract these into named constants (e.g.,
PING_READ_TIMEOUT_MS, PING_CONNECT_TIMEOUT_MS, PING_POOL_MAX_PER_ROUTE or make
them configurable) and replace the literals in the call to
createAsyncRestTemplate so all pool/timeout settings (PING_POOL_MAX_TOTAL, the
new constants) are centrally defined and documented; update any related
javadoc/config comments for pingAsyncRestTemplate/createAsyncRestTemplate to
reference the new constants.
ℹ️ Review info
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
compute/src/main/java/org/zstack/compute/host/HostTrackImpl.javacore/src/main/java/org/zstack/core/CoreGlobalProperty.javacore/src/main/java/org/zstack/core/rest/RESTFacadeImpl.javaheader/src/main/java/org/zstack/header/rest/RESTFacade.java
| final long pingStartMs = System.currentTimeMillis(); | ||
| bus.send(msg, new CloudBusCallBack(null) { |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
fd -t f -name "HostTrackImpl.java"Repository: MatheMatrix/zstack
Length of output: 292
🏁 Script executed:
cat -n compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java | sed -n '160,200p'Repository: MatheMatrix/zstack
Length of output: 2717
🌐 Web query:
Java System.nanoTime() vs System.currentTimeMillis() elapsed time measurement best practice
💡 Result:
Best practice (elapsed time)
Use System.nanoTime() to measure elapsed time (durations, timeouts, benchmarking inside one JVM). It’s specifically intended for elapsed-time measurement and uses a fixed-but-arbitrary origin, so only differences are meaningful. It also has at least the resolution of currentTimeMillis() and is not tied to wall-clock time. [1]
long t0 = System.nanoTime();
// work
long elapsedNanos = System.nanoTime() - t0;When to use System.currentTimeMillis()
Use System.currentTimeMillis() for wall-clock timestamps (e.g., logging, “what time is it?”, storing an epoch-millis timestamp). It returns milliseconds since 1970-01-01T00:00Z (UTC), but its granularity may be tens of milliseconds depending on OS, and it can jump due to clock adjustments. [1]
Prefer java.time for “time-of-day” APIs
For most application code, prefer Instant.now() (optionally via an injected Clock for testability) rather than calling currentTimeMillis() directly; Clock.millis() is defined as equivalent to System.currentTimeMillis(). [2]
Rule of thumb:
- Elapsed time / timeouts:
nanoTime()[1] - Current date/time timestamp:
Instant.now()/Clock(orcurrentTimeMillis()if you truly need a primitive epoch-millis) [1][2]
Sources:
[1] Oracle JDK java.lang.System Javadoc (nanoTime/currentTimeMillis) (docs.oracle.com)
[2] Oracle JDK java.time.Clock Javadoc (docs.oracle.com)
Citations:
- 1: https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/lang/System.html?utm_source=openai
- 2: https://docs.oracle.com/en/java/javase/18/docs/api/java.base/java/time/Clock.html?utm_source=openai
请改用单调时钟统计 RTT,避免 EMA 被系统时钟跳变污染
第 172 行与第 186 行使用 System.currentTimeMillis() 计算响应时间。当系统时钟调整(NTP 校时、时钟回拨)时,会产生负值或异常大值,直接污染 EMA 计算,导致自适应超时抖动或长期偏大。应改用 System.nanoTime()(单调时钟),它不受系统时钟调整影响。
建议修复
- final long pingStartMs = System.currentTimeMillis();
+ final long pingStartNs = System.nanoTime();
@@
- double responseTimeSec = (System.currentTimeMillis() - pingStartMs) / 1000.0;
+ double responseTimeSec = Math.max(0D,
+ (System.nanoTime() - pingStartNs) / 1_000_000_000.0);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java` around
lines 172 - 173, The RTT measurement in HostTrackImpl currently uses
System.currentTimeMillis() (pingStartMs) before sending the bus message and
again when computing RTT in the CloudBusCallBack; switch to a monotonic clock by
using System.nanoTime() for start and end deltas (e.g., store pingStartNs from
System.nanoTime(), compute deltaNs = System.nanoTime() - pingStartNs) and
convert to milliseconds for the EMA/timeout logic (deltaMs = deltaNs /
1_000_000L); update both the send-site (where pingStartMs is set) and the
callback-site (where RTT is calculated and fed into the EMA/timeout) to use the
new variables so EMA and adaptive timeout code in HostTrackImpl remain monotonic
and immune to system clock adjustments.
…ut, ping pool CRITICAL: bake SSL into PoolingHttpClientConnectionManager Registry (HttpClientBuilder silently ignores setSSLContext when setConnectionManager is used) HIGH: add pingResponseEma.remove in untrackHost to prevent unbounded map growth HIGH: revert REST_FACADE_READ_TIMEOUT default 60s->300s to avoid regression on long-running agent ops HIGH: ping pool maxPerRoute 1->2 to allow 1 in-flight + 1 queued, avoiding head-of-line block MEDIUM: extract PING_POOL_MAX_TOTAL=3000 constant; align createAsyncRestTemplate param order with createRestTemplate MEDIUM: fix health-check threshold to integer arithmetic (leased*5 > maxTotal*4) LOW: eliminate ThreadLocal double-read; use pingConnManager.getMaxTotal() in debug handler Resolves: ZSTAC-67534 Change-Id: Ia2328781ab00ba07dbb10046eac45cbb715ca733
1df823c to
91c1e02
Compare
Summary
Changes
HostTrackImpl.java: 新增 EMA 计算逻辑、per-host 响应时间跟踪、getAdaptiveTimeout()静态方法KVMHost.java: 两处asyncJsonPosttimeout 参数改为调用HostTrackImpl.getAdaptiveTimeout()Test Plan
Resolves: ZSTAC-67534
sync from gitlab !9239