
Conversation

@sufubao (Collaborator) commented Jan 19, 2026

Summary

  • Implements automatic selection of attention backends when not explicitly specified via CLI
  • Priority order: FA3 > FlashInfer > Triton (always available as fallback)
  • Changes default value for --llm_prefill_att_backend and --llm_decode_att_backend from triton to None (auto-select)

Changes

  • lightllm/common/basemodel/attention/create_utils.py: Added _auto_select_backend() function with helper functions to check FA3/FlashInfer availability and validate backends at runtime (a minimal sketch of the selection order follows after this list)
  • lightllm/server/api_cli.py: Updated defaults and help text for attention backend CLI arguments
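As referenced above, here is a minimal, self-contained sketch of the selection order in create_utils.py. The helper bodies are illustrative assumptions: the real helpers also run runtime validation, and the FA3 check reportedly gates on a Hopper GPU as well.

```python
# Illustrative sketch of the auto-selection priority: FA3 > FlashInfer > Triton.
# The real create_utils.py helpers also validate the backends at runtime.


def _is_fa3_available() -> bool:
    """Assumed check: FA3 kernels are usable if sgl_kernel imports.

    The actual helper reportedly also verifies the GPU architecture (Hopper)."""
    try:
        import sgl_kernel  # noqa: F401

        return True
    except ImportError:
        return False


def _is_flashinfer_available() -> bool:
    """Assumed check: the flashinfer package is importable."""
    try:
        import flashinfer  # noqa: F401

        return True
    except ImportError:
        return False


def _auto_select_backend() -> str:
    """Pick the best available backend when the CLI value is None."""
    if _is_fa3_available():
        return "fa3"
    if _is_flashinfer_available():
        return "flashinfer"
    return "triton"  # always available as the fallback


if __name__ == "__main__":
    print("auto-selected backend:", _auto_select_backend())
```

The backend name strings ("fa3", "flashinfer", "triton") are placeholders; the identifiers actually used by create_utils.py may differ.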

Test plan

  • Verify auto-selection works on Hopper GPU (should select FA3 if sgl_kernel available)
  • Verify fallback to FlashInfer when FA3 unavailable
  • Verify fallback to Triton when neither FA3 nor FlashInfer available
  • Verify explicit backend specification still works (e.g., --llm_prefill_att_backend triton); see the argparse sketch after this list
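As noted in the last test-plan item, explicit values keep working. A hedged sketch of what the api_cli.py default change amounts to follows; the help wording and any choices or extra validation in the real file are assumptions.

```python
import argparse

# Sketch only: the default moves from "triton" to None so that a missing flag
# triggers auto-selection in create_utils.py; passing a backend name still
# overrides it. Help text and argument options here are assumptions.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--llm_prefill_att_backend",
    type=str,
    default=None,  # previously "triton"
    help="Prefill attention backend; omit to auto-select (FA3 > FlashInfer > Triton).",
)
parser.add_argument(
    "--llm_decode_att_backend",
    type=str,
    default=None,  # previously "triton"
    help="Decode attention backend; omit to auto-select.",
)

args = parser.parse_args(["--llm_prefill_att_backend", "triton"])
print(args.llm_prefill_att_backend, args.llm_decode_att_backend)  # triton None
```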

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @sufubao, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an intelligent auto-selection mechanism for attention backends within LightLLM. Instead of requiring users to manually specify a backend, the system will now automatically detect and utilize the most performant available option, prioritizing FA3, then FlashInfer, and falling back to Triton. This change streamlines configuration and ensures that the application leverages the best possible hardware acceleration for attention operations, enhancing overall efficiency and user experience.

Highlights

  • Automatic Backend Selection: The system now automatically selects the optimal attention backend if not explicitly specified via CLI arguments, improving performance and ease of use.
  • Backend Priority Order: A clear priority order has been established for automatic backend selection: FA3 (FlashAttention v3) > FlashInfer > Triton. Triton serves as a reliable fallback.
  • Default CLI Argument Changes: The default values for --llm_prefill_att_backend and --llm_decode_att_backend CLI arguments have been changed from triton to None, enabling the new auto-selection feature by default.
  • Runtime Backend Validation: New utility functions (_is_fa3_available, _is_flashinfer_available, _try_backend) have been added to dynamically check the availability and validate the functionality of different attention backends at runtime. A rough sketch of _try_backend() follows below.
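The review comments further down quote parts of this runtime validation. Consistent with those snippets, a rough, assumed sketch of _try_backend() might look like the following; the signature, backend names, and buffer size are guesses rather than the actual implementation.

```python
import logging

import torch

logger = logging.getLogger(__name__)


def _try_backend(backend: str) -> bool:
    """Assumed sketch: import the backend package and touch the CUDA device,
    returning False (with a debug log) on any failure."""
    try:
        if backend == "flashinfer":
            import flashinfer  # noqa: F401

            # Allocate a small workspace buffer to confirm the CUDA context works
            # (the review below suggests keeping this allocation small).
            _ = torch.empty(1024 * 1024, dtype=torch.uint8, device="cuda")
        elif backend == "fa3":
            import sgl_kernel  # noqa: F401
        return True
    except Exception as e:
        logger.debug(f"{backend} runtime check failed: {e}")
        return False
```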


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment


Code Review

This pull request introduces an automatic selection mechanism for the attention backend, which is a great feature for improving performance out-of-the-box. The implementation correctly prioritizes backends (FA3 > FlashInfer > Triton) and includes availability and runtime validation checks. The changes to the CLI arguments are also correct. My review includes a couple of suggestions to make the validation logic more robust and efficient: one to improve error reporting in the availability check, and another to reduce memory usage during the runtime validation of the flashinfer backend. Overall, the changes are well-implemented.

Comment on lines 60 to 61

    except Exception:
        return False


Severity: medium

The broad except Exception: can hide underlying issues and make debugging difficult. For instance, a CUDA error within is_hopper() would be silently ignored. It's better to catch specific exceptions or at least log the error for better diagnostics.

Suggested change

    - except Exception:
    -     return False
    + except Exception as e:
    +     logger.debug(f"FA3 availability check failed: {e}")
    +     return False

    import flashinfer  # noqa: F401

    # Try creating a minimal workspace buffer to verify flashinfer works
    _ = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")


Severity: medium

Allocating 128MB of CUDA memory just for a runtime validation check seems excessive and can cause an unnecessary memory spike. A much smaller allocation should be sufficient to verify that flashinfer can initialize correctly on the CUDA device.

Suggested change

    - _ = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
    + _ = torch.empty(1 * 1024 * 1024, dtype=torch.uint8, device="cuda")

@sufubao force-pushed the auto-select-att-backend branch from da291ce to 037c6c5 on January 20, 2026 at 07:06
sufubao and others added 7 commits January 20, 2026 09:43
When users don't explicitly specify an attention backend (--llm_prefill_att_backend
or --llm_decode_att_backend), LightLLM now automatically selects the best available
backend with priority: FA3 > FlashInfer > Triton.

This improves user experience by removing the need to manually configure backends
while ensuring optimal performance based on available hardware and packages.

Run actual attention operations in a spawned subprocess to properly validate
backends. This catches runtime failures and crashes that simple import checks
miss, without affecting the main server process.

- Create lightllm/utils/backend_validators/ with modular structure:
  - base.py: Abstract BackendValidator class with ground truth comparison
  - fa3.py: FA3 validator using flash_attn_varlen_func
  - flashinfer.py: FlashInfer validator using BatchPrefillWithRaggedKVCacheWrapper
  - triton.py: Triton validator using softmax kernel
  - __init__.py: Exports and subprocess validation orchestration

- Each validator compares output against torch.nn.functional.scaled_dot_product_attention

- Refactor create_utils.py to use the new validators, removing duplicated logic

Replace over-engineered backend_validators/ directory (5 files, abstract
classes) with single backend_validator.py using plain functions.

Keeps subprocess isolation and ground truth validation, removes unnecessary
abstraction layers.
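A minimal, self-contained sketch of the subprocess ground-truth check these commits describe follows. The function names, timeout, and tensor shapes are illustrative assumptions, and the real backend_validator.py runs the actual backend kernels rather than SDPA on both sides of the comparison.

```python
import multiprocessing as mp

import torch
import torch.nn.functional as F


def _validate_against_sdpa(result_queue):
    """Run a tiny attention op and compare it to F.scaled_dot_product_attention.
    In the real validator the first call would be the backend kernel
    (FA3 / FlashInfer / Triton); SDPA is used on both sides here so the
    sketch runs anywhere."""
    try:
        q = torch.randn(1, 4, 16, 64)
        k = torch.randn(1, 4, 16, 64)
        v = torch.randn(1, 4, 16, 64)
        out = F.scaled_dot_product_attention(q, k, v)  # stand-in for the backend op
        ref = F.scaled_dot_product_attention(q, k, v)  # ground truth
        result_queue.put(torch.allclose(out, ref, atol=1e-3))
    except Exception:
        result_queue.put(False)


def validate_backend_in_subprocess() -> bool:
    """Spawn a fresh process so a crashing or hanging kernel cannot take down
    the main server process."""
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    proc = ctx.Process(target=_validate_against_sdpa, args=(queue,))
    proc.start()
    proc.join(timeout=60)
    if proc.is_alive():  # hung kernel: kill it and treat as a failed validation
        proc.terminate()
        proc.join()
        return False
    return (not queue.empty()) and queue.get()


if __name__ == "__main__":
    print("backend validation passed:", validate_backend_in_subprocess())
```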
@hiworldwzj force-pushed the auto-select-att-backend branch from 8491756 to 8e0a892 on January 20, 2026 at 09:46
