Skip to content

refactor(crawler): tighten FessCrawlerThread hot path#3134

Merged
marevol merged 1 commit into
masterfrom
refactor/crawler-thread-hot-path
May 20, 2026
Merged

refactor(crawler): tighten FessCrawlerThread hot path#3134
marevol merged 1 commit into
masterfrom
refactor/crawler-thread-hot-path

Conversation

@marevol
Copy link
Copy Markdown
Contributor

@marevol marevol commented May 16, 2026

Summary

Low-risk, implementation-level cleanups on the per-URL hot path in
FessCrawlerThread. No architectural or interface changes, and no intended
behavior changes.

Scope is intentionally narrow to minimize regression risk. Larger structural
ideas (e.g. batching IndexingHelper.getDocument via idsQuery) were
considered but deferred to separate PRs after profiling.

Changes

src/main/java/org/codelibs/fess/crawler/FessCrawlerThread.java:

  • Deduplicate ComponentUtil.getFessConfig() in isContentUpdated. The
    config was fetched twice (once for isIncrementalCrawling(), then again a
    few lines later). Hoisted to a single call at the top of the method.
  • Replace stream(getPermissions()).of(s -> s.forEach(list::add)) with a
    null-guarded Collections.addAll.
    Drops the Stream pipeline and lambda
    allocation per call while preserving the original null-array tolerance of
    StreamUtil.stream.
  • Drop the redundant blank-URL filter in storeChildUrlsToQueue.
    getAnchorSet / getChildUrlSet (upstream) and CrawlerThread.storeChildUrls
    (downstream) already drop blank URLs, so the intermediate
    stream().filter().collect(toSet()) was wasted work and an extra Set
    allocation per call.
  • Guard three logger.debug sites with isDebugEnabled() short-circuits
    (L168 / L195 / L217). L217 in particular evaluated
    fessConfig.getIndexFieldExpires() for the format arg unconditionally; the
    other two avoid Log4j2 LogEvent setup on the cold path.
  • Cache responseData.getLastModified() in a local variable. It was
    called twice (null check, then .getTime()); now once.
  • Fix a latent NPE in getClientRuleList. When a rule string is
    malformed (no :), the computeIfAbsent mapping function returns null.
    ConcurrentHashMap does not cache null values, but Stream.map does
    emit null into the resulting List, and getClient later dereferences
    Pair.getSecond(). Added .filter(Objects::nonNull) before .toList().
  • Cleanup: drop unused StreamUtil.stream / Collectors imports,
    remove a stale // add an url comment, and reword the
    storeChildUrlsToQueue Javadoc to match the new behavior.

Intentionally NOT changed

  • LinkedHashSet in getAnchorSet — preserving anchor insertion order is
    visible behavior for crawl prioritization.
  • The two HEAD-request paths in isContentUpdated are already mutually
    exclusive via the if (responseData == null) guard; no actual duplicate
    request occurs.
  • Batching IndexingHelper.getDocument via idsQuery — requires changes to
    the crawler main loop and is out of scope for a low-risk refactor.

Test plan

  • mvn compile — BUILD SUCCESS
  • mvn test -Dtest=FessCrawlerThreadTest — 14 tests pass, 0 failures
  • mvn formatter:format && mvn license:format — applied
  • Manual crawl smoke test against a small web/file source (recommended
    before merge)

🤖 Generated with Claude Code

Apply low-risk implementation-level improvements to the per-URL hot
path in FessCrawlerThread.isContentUpdated() and related helpers:

- Hoist duplicate ComponentUtil.getFessConfig() lookup to method top.
- Replace stream(...).forEach(list.add) with Collections.addAll for
  the role-type list initialization, keeping null-safety.
- Drop the redundant blank-URL filter in storeChildUrlsToQueue;
  getAnchorSet, getChildUrlSet and the parent storeChildUrls already
  filter blank URLs, so the intermediate Set allocation was wasted.
- Guard three logger.debug calls that were missing isDebugEnabled(),
  avoiding parameter evaluation when DEBUG is off.
- Cache responseData.getLastModified() in a local variable to avoid
  calling it twice.
- Filter null entries out of getClientRuleList(); a malformed rule
  caused computeIfAbsent to return null, which would later NPE in
  getClient() when accessing Pair.getSecond().
- Remove unused imports (StreamUtil.stream, Collectors) and update
  the storeChildUrlsToQueue javadoc to match the new behavior.

No behavior changes intended. FessCrawlerThreadTest (14 tests) passes.
@marevol marevol self-assigned this May 20, 2026
@marevol marevol added the task label May 20, 2026
@marevol marevol added this to the 15.7.0 milestone May 20, 2026
@marevol marevol merged commit 5118ac8 into master May 20, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant