refactor(api,crawler,llm): precompile regex patterns and tighten hot paths#3133
Merged
Merged
Conversation
…paths
Hoist repeated `String.replaceAll`/`replaceFirst` calls into static
`Pattern` constants across the API managers, crawler transformers, and
LLM client to avoid recompiling regexes on every request. Along the way:
- BaseApiManager / SearchApiManager: collapse the FormatType lookup to
`FormatType.valueOf` with an `IllegalArgumentException` fallback, and
drop the `String.replaceAll` step in path splitting.
- SearchEngineApiManager: replace the chained extension `if/else` block
with a `Map.ofEntries` lookup keyed by lowercase extension.
- WebApiManagerFactory: grow the manager array via `Arrays.copyOf`
instead of materializing an `ArrayList` each call.
- WebApiRequest: cache the `getQueryString()` result and switch the
substring check to `String.contains`.
- SearchApiManager:
- Use `query` (already resolved) for related-query/related-content
helpers instead of re-reading `params.getQuery()`.
- Convert the favorite URL list to a `HashSet` for O(1) `contains`.
- Replace the manual docId scan with `ArrayUtils.contains`.
- Migrate the `Date` branch of `escapeJson` from `SimpleDateFormat`
to a shared `DateTimeFormatter` (ISO-8601 extended), with new tests
in SearchApiManagerTest pinning the output to the legacy shape.
- FessCrawlerThread: drop the local 200/404 constants and reuse
`Constants.OK_STATUS_CODE` / `Constants.NOT_FOUND_STATUS_CODE` /
`Constants.NOT_MODIFIED_STATUS`.
- FessIntervalController: extract a `runQuietly(description, action)`
helper so each delay stage shares the same swallow-and-debug-log
treatment.
- AbstractFessFileTransformer: reuse the resolved config-parameter map
(it can never be null at that point because the surrounding code
already constructs a `HashMap` from it).
- FessXpathTransformer:
- Switch `fieldPrunedRuleMap` and `prunedTagsCache` to
`ConcurrentHashMap` and use `computeIfAbsent`, fixing a latent race
when crawler threads populate the caches in parallel.
- Cache compiled `Pattern` instances for `convertUrlMap` entries.
- Extract `isRobotsTagsIgnored` and `applyRobotsDirective` so meta and
`X-Robots-Tag` handling no longer duplicate the same parsing block.
- AbstractLlmClient: short-circuit `wrapUserInput` when the message
contains no `</user_input>` to skip the unnecessary `replace` call.
No behavior changes intended; the new SearchApiManagerTest cases lock
down the date-formatting parity across multiple time zones.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Cross-cutting cleanup across
api/,crawler/, andllm/packages: hoist repeated regex calls into staticPatternconstants, make a couple of crawler caches thread-safe, and migrate theDatebranch of the JSON encoder fromSimpleDateFormattojava.time.format.DateTimeFormatter. No behavior changes intended.Changes Made
API layer (
org.codelibs.fess.api)"/+"to a staticPatternand replace the longif/elsechain ingetFormatTypewithFormatType.valueOf(...)plus anIllegalArgumentExceptionfallback toOTHER.webApiManagersarray viaArrays.copyOfinstead of building anArrayListper call.getQueryString()result in a local and switch the substring check toString.contains."\\.\\.+"and"/+"patterns; replace the chained extensionif/elsewith aMap.ofEntrieslookup keyed by lowercase extension."/+", the JSONP callback sanitizer, and a shared ISO-8601DateTimeFormatter.queryforgetRelatedQueries/getRelatedContentsinstead of re-readingparams.getQuery().ArrayUtils.containsfor the docId scan and aHashSetfor favorite URL lookups (O(1)contains).Datebranch ofescapeJsontoDateTimeFormatterwithZoneId.systemDefault().// Default constructorcomment.Crawler (
org.codelibs.fess.crawler)HTTP_STATUS_OK/HTTP_STATUS_NOT_FOUNDconstants and reuseConstants.OK_STATUS_CODE,Constants.NOT_FOUND_STATUS_CODE, andConstants.NOT_MODIFIED_STATUS.runQuietly(description, action)helper so each delay stage shares the same swallow-and-debug-log treatment.nullat that point because it's already used to construct aHashMap).fieldPrunedRuleMapandprunedTagsCachetoConcurrentHashMapand usecomputeIfAbsent, addressing a latent race when crawler threads populate the caches in parallel.Patterninstances forconvertUrlMapentries.isRobotsTagsIgnoredandapplyRobotsDirectiveso the meta<robots>andX-Robots-Tagpaths no longer duplicate the same parsing block.LLM (
org.codelibs.fess.llm)wrapUserInputwhen the message contains no</user_input>to skip the unnecessaryString.replacecall.Testing
mvn formatter:format && mvn license:formatshould be re-run before merge if any conventions changed.SearchApiManagerTestcases:test_escapeJson_dateMatchesLegacySimpleDateFormatcross-checks the new formatter againstSimpleDateFormat(CoreLibConstants.DATE_FORMAT_ISO_8601_EXTEND)acrossUTC,Asia/Tokyo,America/Los_Angeles, andEurope/Parisfor several epoch values, guarding against regression in theSimpleDateFormat→DateTimeFormattermigration.test_escapeJson_dateFormatShapeasserts the output stays a quotedyyyy-MM-dd'T'HH:mm:ss.SSS±HHmmstring.Breaking Changes
None.
Additional Notes
ConcurrentHashMapswitch inFessXpathTransformeris the only change with potential runtime behavior implications — it's intentionally there because the caches are populated lazily from multiple crawler threads, and previouslyHashMap.putcould race with reads.escapeJson(Date)now usesZoneId.systemDefault()to mirror the previousSimpleDateFormat(which also used the default JVM zone). The new tests pin this behavior.FormatType.valueOffallback inBaseApiManager#getFormatType— the previous chain rejected unknown values via the finaldefaultarm, the new code does the same via thecatch.