
refactor(api,crawler,llm): precompile regex patterns and tighten hot paths#3133

Merged
marevol merged 1 commit into master from refactor/precompile-patterns-and-cleanup
May 10, 2026


@marevol commented May 10, 2026

Summary

Cross-cutting cleanup across api/, crawler/, and llm/ packages: hoist repeated regex calls into static Pattern constants, make a couple of crawler caches thread-safe, and migrate the Date branch of the JSON encoder from SimpleDateFormat to java.time.format.DateTimeFormatter. No behavior changes intended.

Changes Made

API layer (org.codelibs.fess.api)

  • BaseApiManager: hoist "/+" to a static Pattern and replace the long if/else chain in getFormatType with FormatType.valueOf(...) plus an IllegalArgumentException fallback to OTHER.
  • WebApiManagerFactory: grow the webApiManagers array via Arrays.copyOf instead of building an ArrayList per call.
  • WebApiRequest.getServletPath: cache the getQueryString() result in a local and switch the substring check to String.contains.
  • SearchEngineApiManager: pre-compile "\\.\\.+" and "/+" patterns; replace the chained extension if/else with a Map.ofEntries lookup keyed by lowercase extension.
  • SearchApiManager:
    • Pre-compile "/+", the JSONP callback sanitizer, and a shared ISO-8601 DateTimeFormatter.
    • Use the already-resolved query for getRelatedQueries/getRelatedContents instead of re-reading params.getQuery().
    • Use ArrayUtils.contains for the docId scan and a HashSet for favorite URL lookups (O(1) contains).
    • Migrate the Date branch of escapeJson to DateTimeFormatter with ZoneId.systemDefault().
  • ChatApiManager: drop the redundant // Default constructor comment.
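
The pattern hoisting and the extension lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class name `ApiPathSketch`, the method names, the MIME table contents, the default type, and the choice of replacement strings are all assumptions made for the example. Only the two regex literals (`"/+"` and `"\\.\\.+"`) and the use of `Map.ofEntries` keyed by lowercase extension come from the description.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch; names and table contents are illustrative only.
final class ApiPathSketch {
    // Compiled once at class-load time instead of on every request.
    private static final Pattern MULTI_SLASH = Pattern.compile("/+");
    private static final Pattern MULTI_DOT = Pattern.compile("\\.\\.+");

    // Assumed default for unknown extensions.
    private static final String DEFAULT_MIME = "application/octet-stream";

    // Immutable lookup keyed by lowercase extension (entries are assumptions).
    private static final Map<String, String> MIME_BY_EXTENSION = Map.ofEntries(
            Map.entry("html", "text/html"),
            Map.entry("js", "text/javascript"),
            Map.entry("css", "text/css"),
            Map.entry("png", "image/png"));

    // Removes runs of two or more dots, then collapses repeated slashes.
    static String normalize(String path) {
        String withoutDots = MULTI_DOT.matcher(path).replaceAll("");
        return MULTI_SLASH.matcher(withoutDots).replaceAll("/");
    }

    // Replaces an if/else chain with a single map lookup.
    static String mimeTypeFor(String path) {
        int dot = path.lastIndexOf('.');
        if (dot < 0 || dot == path.length() - 1) {
            return DEFAULT_MIME;
        }
        String extension = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return MIME_BY_EXTENSION.getOrDefault(extension, DEFAULT_MIME);
    }
}
```

`String.replaceAll` compiles its regex on every call, so reusing a `Pattern` constant through `matcher(...).replaceAll(...)` avoids that repeated work on hot paths while producing the same result.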

Crawler (org.codelibs.fess.crawler)

  • FessCrawlerThread: remove local HTTP_STATUS_OK/HTTP_STATUS_NOT_FOUND constants and reuse Constants.OK_STATUS_CODE, Constants.NOT_FOUND_STATUS_CODE, and Constants.NOT_MODIFIED_STATUS.
  • FessIntervalController: extract a runQuietly(description, action) helper so each delay stage shares the same swallow-and-debug-log treatment.
  • AbstractFessFileTransformer: pre-compile cache/whitespace, trailing-slash, and SMB-prefix regexes; reuse the resolved config-parameter map (it cannot be null at that point because it's already used to construct a HashMap).
  • FessXpathTransformer:
    • Switch fieldPrunedRuleMap and prunedTagsCache to ConcurrentHashMap and use computeIfAbsent, addressing a latent race when crawler threads populate the caches in parallel.
    • Cache compiled Pattern instances for convertUrlMap entries.
    • Extract isRobotsTagsIgnored and applyRobotsDirective so the meta <robots> and X-Robots-Tag paths no longer duplicate the same parsing block.
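
As a rough illustration of the cache change described above, the sketch below combines a `ConcurrentHashMap` with `computeIfAbsent` to cache compiled patterns. The class `UrlPatternCacheSketch` and its methods are hypothetical, and treating `convertUrlMap` as a regex-to-replacement map is an assumption for the example, not a statement about the actual field.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical sketch of a thread-safe, lazily populated pattern cache.
final class UrlPatternCacheSketch {
    // Regex source -> compiled Pattern, safe for concurrent readers and writers.
    private final Map<String, Pattern> compiled = new ConcurrentHashMap<>();

    // computeIfAbsent does the check-then-insert atomically per key,
    // so two threads cannot race between a get and a put.
    Pattern patternFor(String regex) {
        return compiled.computeIfAbsent(regex, Pattern::compile);
    }

    // Applies every regex -> replacement rule (assumed map shape) to the URL.
    String convertUrl(String url, Map<String, String> convertUrlMap) {
        String result = url;
        for (Map.Entry<String, String> rule : convertUrlMap.entrySet()) {
            result = patternFor(rule.getKey()).matcher(result).replaceAll(rule.getValue());
        }
        return result;
    }

    int cachedPatternCount() {
        return compiled.size();
    }
}
```

With a plain `HashMap`, a `put` that triggers a resize while another thread reads can leave the reader looking at an inconsistent table; `ConcurrentHashMap` rules that out without an explicit lock.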

LLM (org.codelibs.fess.llm)

  • AbstractLlmClient: pre-compile the HTML-tag and control-whitespace regexes; short-circuit wrapUserInput when the message contains no </user_input> to skip the unnecessary String.replace call.
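
The `wrapUserInput` short-circuit might look something like the sketch below. The wrapper format and the escaped form of the closing tag are assumptions (the PR description only names the `</user_input>` check and the `String.replace` call), and `UserInputWrapperSketch` is a hypothetical class name.

```java
// Hypothetical sketch; the real wrapper format is not shown in the PR text.
final class UserInputWrapperSketch {
    private static final String OPEN_TAG = "<user_input>";
    private static final String CLOSE_TAG = "</user_input>";
    // Assumed neutralised form of an embedded closing tag.
    private static final String ESCAPED_CLOSE_TAG = "&lt;/user_input&gt;";

    static String wrapUserInput(String message) {
        // Common case: no embedded closing tag, so skip the replace call.
        String safe = message.contains(CLOSE_TAG)
                ? message.replace(CLOSE_TAG, ESCAPED_CLOSE_TAG)
                : message;
        return OPEN_TAG + safe + CLOSE_TAG;
    }
}
```

The point of escaping is that user text cannot close the wrapper early; the `contains` check only changes how much work is done, not the result.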

Testing

  • mvn formatter:format && mvn license:format should be re-run before merge if any conventions changed.
  • New SearchApiManagerTest cases:
    • test_escapeJson_dateMatchesLegacySimpleDateFormat cross-checks the new formatter against SimpleDateFormat(CoreLibConstants.DATE_FORMAT_ISO_8601_EXTEND) across UTC, Asia/Tokyo, America/Los_Angeles, and Europe/Paris for several epoch values, guarding against regression in the SimpleDateFormat-to-DateTimeFormatter migration.
    • test_escapeJson_dateFormatShape asserts the output stays a quoted yyyy-MM-dd'T'HH:mm:ss.SSS±HHmm string.
  • Existing crawler/API tests should keep passing — none of these changes alter visible behavior.
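
The parity check described above can be approximated with a small standalone sketch. The pattern string `yyyy-MM-dd'T'HH:mm:ss.SSSZ` is an assumption inferred from the output shape quoted in the test description, not read from `CoreLibConstants`, and `DateParitySketch` with its two methods is hypothetical.

```java
import java.text.SimpleDateFormat;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical sketch comparing legacy and java.time formatting of a Date.
final class DateParitySketch {
    // Assumed pattern; 'Z' yields an offset such as +0900 in both APIs.
    static final String ISO_8601_EXTEND = "yyyy-MM-dd'T'HH:mm:ss.SSSZ";

    // DateTimeFormatter is immutable and thread-safe, so it can be shared.
    private static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern(ISO_8601_EXTEND, Locale.ROOT);

    static String formatModern(Date date, ZoneId zone) {
        return FORMATTER.format(date.toInstant().atZone(zone));
    }

    // SimpleDateFormat is not thread-safe, hence a fresh instance per call.
    static String formatLegacy(Date date, ZoneId zone) {
        SimpleDateFormat legacy = new SimpleDateFormat(ISO_8601_EXTEND, Locale.ROOT);
        legacy.setTimeZone(TimeZone.getTimeZone(zone));
        return legacy.format(date);
    }
}
```

Passing the zone explicitly keeps the comparison deterministic; production code that uses `ZoneId.systemDefault()` would produce output that depends on the JVM's configured zone.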

Breaking Changes

None.

Additional Notes

  • The ConcurrentHashMap switch in FessXpathTransformer is the only change with potential runtime behavior implications — it's intentionally there because the caches are populated lazily from multiple crawler threads, and previously HashMap.put could race with reads.
  • escapeJson(Date) now uses ZoneId.systemDefault() to mirror the previous SimpleDateFormat (which also used the default JVM zone). The new tests pin this behavior.
  • Reviewers may want to spot-check the FormatType.valueOf fallback in BaseApiManager#getFormatType: the previous chain rejected unknown values via the final default arm, and the new code does the same via the catch.
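
For reviewers checking that fallback, the general shape of an enum lookup with an `IllegalArgumentException` fallback is sketched below. The enum constants other than `OTHER`, the uppercase normalisation, and the null handling are assumptions for illustration; `FormatTypeSketch` is not the actual `BaseApiManager` code.

```java
import java.util.Locale;

// Hypothetical sketch of a valueOf-based lookup with a fallback constant.
final class FormatTypeSketch {
    // Illustrative subset; only OTHER is named in the PR description.
    enum FormatType { SEARCH, LABEL, OTHER }

    static FormatType getFormatType(String value) {
        if (value == null) {
            // Assumed behaviour: valueOf(null) would throw a
            // NullPointerException, which the catch below does not cover.
            return FormatType.OTHER;
        }
        try {
            // Assumes enum names are the uppercase form of the input.
            return FormatType.valueOf(value.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // Unknown names land here, mirroring a final else arm.
            return FormatType.OTHER;
        }
    }
}
```

Two points are worth confirming against the real method: how a null input is handled, and whether every string the old chain accepted matches an enum constant name exactly after normalisation.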

…paths

Hoist repeated `String.replaceAll`/`replaceFirst` calls into static
`Pattern` constants across the API managers, crawler transformers, and
LLM client to avoid recompiling regexes on every request. Along the way:

- BaseApiManager / SearchApiManager: collapse the FormatType lookup to
  `FormatType.valueOf` with an `IllegalArgumentException` fallback, and
  drop the `String.replaceAll` step in path splitting.
- SearchEngineApiManager: replace the chained extension `if/else` block
  with a `Map.ofEntries` lookup keyed by lowercase extension.
- WebApiManagerFactory: grow the manager array via `Arrays.copyOf`
  instead of materializing an `ArrayList` each call.
- WebApiRequest: cache the `getQueryString()` result and switch the
  substring check to `String.contains`.
- SearchApiManager:
  - Use `query` (already resolved) for related-query/related-content
    helpers instead of re-reading `params.getQuery()`.
  - Convert the favorite URL list to a `HashSet` for O(1) `contains`.
  - Replace the manual docId scan with `ArrayUtils.contains`.
  - Migrate the `Date` branch of `escapeJson` from `SimpleDateFormat`
    to a shared `DateTimeFormatter` (ISO-8601 extended), with new tests
    in SearchApiManagerTest pinning the output to the legacy shape.
- FessCrawlerThread: drop the local 200/404 constants and reuse
  `Constants.OK_STATUS_CODE` / `Constants.NOT_FOUND_STATUS_CODE` /
  `Constants.NOT_MODIFIED_STATUS`.
- FessIntervalController: extract a `runQuietly(description, action)`
  helper so each delay stage shares the same swallow-and-debug-log
  treatment.
- AbstractFessFileTransformer: reuse the resolved config-parameter map
  (it can never be null at that point because the surrounding code
  already constructs a `HashMap` from it).
- FessXpathTransformer:
  - Switch `fieldPrunedRuleMap` and `prunedTagsCache` to
    `ConcurrentHashMap` and use `computeIfAbsent`, fixing a latent race
    when crawler threads populate the caches in parallel.
  - Cache compiled `Pattern` instances for `convertUrlMap` entries.
  - Extract `isRobotsTagsIgnored` and `applyRobotsDirective` so meta and
    `X-Robots-Tag` handling no longer duplicate the same parsing block.
- AbstractLlmClient: short-circuit `wrapUserInput` when the message
  contains no `</user_input>` to skip the unnecessary `replace` call.

No behavior changes intended; the new SearchApiManagerTest cases lock
down the date-formatting parity across multiple time zones.
@marevol self-assigned this May 10, 2026
@marevol added the task label May 10, 2026
@marevol added this to the 15.7.0 milestone May 10, 2026
@marevol merged commit 3d57ce2 into master May 10, 2026
1 check passed