refactor(crawler): tighten FessCrawlerThread hot path#3134
Merged
Conversation
Apply low-risk implementation-level improvements to the per-URL hot path in FessCrawlerThread.isContentUpdated() and related helpers: - Hoist duplicate ComponentUtil.getFessConfig() lookup to method top. - Replace stream(...).forEach(list.add) with Collections.addAll for the role-type list initialization, keeping null-safety. - Drop the redundant blank-URL filter in storeChildUrlsToQueue; getAnchorSet, getChildUrlSet and the parent storeChildUrls already filter blank URLs, so the intermediate Set allocation was wasted. - Guard three logger.debug calls that were missing isDebugEnabled(), avoiding parameter evaluation when DEBUG is off. - Cache responseData.getLastModified() in a local variable to avoid calling it twice. - Filter null entries out of getClientRuleList(); a malformed rule caused computeIfAbsent to return null, which would later NPE in getClient() when accessing Pair.getSecond(). - Remove unused imports (StreamUtil.stream, Collectors) and update the storeChildUrlsToQueue javadoc to match the new behavior. No behavior changes intended. FessCrawlerThreadTest (14 tests) passes.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Low-risk, implementation-level cleanups on the per-URL hot path in
FessCrawlerThread. No architectural or interface changes, and no intendedbehavior changes.
Scope is intentionally narrow to minimize regression risk. Larger structural
ideas (e.g. batching
IndexingHelper.getDocumentviaidsQuery) wereconsidered but deferred to separate PRs after profiling.
Changes
src/main/java/org/codelibs/fess/crawler/FessCrawlerThread.java:ComponentUtil.getFessConfig()inisContentUpdated. Theconfig was fetched twice (once for
isIncrementalCrawling(), then again afew lines later). Hoisted to a single call at the top of the method.
stream(getPermissions()).of(s -> s.forEach(list::add))with anull-guarded
Collections.addAll. Drops the Stream pipeline and lambdaallocation per call while preserving the original null-array tolerance of
StreamUtil.stream.storeChildUrlsToQueue.getAnchorSet/getChildUrlSet(upstream) andCrawlerThread.storeChildUrls(downstream) already drop blank URLs, so the intermediate
stream().filter().collect(toSet())was wasted work and an extraSetallocation per call.
logger.debugsites withisDebugEnabled()short-circuits(L168 / L195 / L217). L217 in particular evaluated
fessConfig.getIndexFieldExpires()for the format arg unconditionally; theother two avoid Log4j2 LogEvent setup on the cold path.
responseData.getLastModified()in a local variable. It wascalled twice (null check, then
.getTime()); now once.getClientRuleList. When a rule string ismalformed (no
:), thecomputeIfAbsentmapping function returnsnull.ConcurrentHashMapdoes not cachenullvalues, butStream.mapdoesemit
nullinto the resultingList, andgetClientlater dereferencesPair.getSecond(). Added.filter(Objects::nonNull)before.toList().StreamUtil.stream/Collectorsimports,remove a stale
// add an urlcomment, and reword thestoreChildUrlsToQueueJavadoc to match the new behavior.Intentionally NOT changed
LinkedHashSetingetAnchorSet— preserving anchor insertion order isvisible behavior for crawl prioritization.
isContentUpdatedare already mutuallyexclusive via the
if (responseData == null)guard; no actual duplicaterequest occurs.
IndexingHelper.getDocumentviaidsQuery— requires changes tothe crawler main loop and is out of scope for a low-risk refactor.
Test plan
mvn compile— BUILD SUCCESSmvn test -Dtest=FessCrawlerThreadTest— 14 tests pass, 0 failuresmvn formatter:format && mvn license:format— appliedbefore merge)
🤖 Generated with Claude Code