
Conversation

@sufubao (Collaborator) commented Jan 19, 2026

Summary

  • Implements automatic selection of attention backends when not explicitly specified via CLI
  • Priority order: FA3 > FlashInfer > Triton (always available as fallback)
  • Changes default value for --llm_prefill_att_backend and --llm_decode_att_backend from triton to None (auto-select)

Changes

  • lightllm/common/basemodel/attention/create_utils.py: Added _auto_select_backend() function with helper functions to check FA3/FlashInfer availability and validate backends at runtime (a minimal sketch of the selection order follows after this list)
  • lightllm/server/api_cli.py: Updated defaults and help text for attention backend CLI arguments
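As referenced above, here is a minimal, self-contained sketch of the selection order in create_utils.py. The helper bodies are illustrative assumptions: the real helpers also run runtime validation, and the FA3 check reportedly gates on a Hopper GPU as well.

```python
# Illustrative sketch of the auto-selection priority: FA3 > FlashInfer > Triton.
# The real create_utils.py helpers also validate the backends at runtime.


def _is_fa3_available() -> bool:
    """Assumed check: FA3 kernels are usable if sgl_kernel imports.

    The actual helper reportedly also verifies the GPU architecture (Hopper)."""
    try:
        import sgl_kernel  # noqa: F401

        return True
    except ImportError:
        return False


def _is_flashinfer_available() -> bool:
    """Assumed check: the flashinfer package is importable."""
    try:
        import flashinfer  # noqa: F401

        return True
    except ImportError:
        return False


def _auto_select_backend() -> str:
    """Pick the best available backend when the CLI value is None."""
    if _is_fa3_available():
        return "fa3"
    if _is_flashinfer_available():
        return "flashinfer"
    return "triton"  # always available as the fallback


if __name__ == "__main__":
    print("auto-selected backend:", _auto_select_backend())
```

The backend name strings ("fa3", "flashinfer", "triton") are placeholders; the identifiers actually used by create_utils.py may differ.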

Test plan

  • Verify auto-selection works on Hopper GPU (should select FA3 if sgl_kernel available)
  • Verify fallback to FlashInfer when FA3 unavailable
  • Verify fallback to Triton when neither FA3 nor FlashInfer available
  • Verify explicit backend specification still works (e.g., --llm_prefill_att_backend triton); see the argparse sketch after this list
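As noted in the last test-plan item, explicit values keep working. A hedged sketch of what the api_cli.py default change amounts to follows; the help wording and any choices or extra validation in the real file are assumptions.

```python
import argparse

# Sketch only: the default moves from "triton" to None so that a missing flag
# triggers auto-selection in create_utils.py; passing a backend name still
# overrides it. Help text and argument options here are assumptions.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--llm_prefill_att_backend",
    type=str,
    default=None,  # previously "triton"
    help="Prefill attention backend; omit to auto-select (FA3 > FlashInfer > Triton).",
)
parser.add_argument(
    "--llm_decode_att_backend",
    type=str,
    default=None,  # previously "triton"
    help="Decode attention backend; omit to auto-select.",
)

args = parser.parse_args(["--llm_prefill_att_backend", "triton"])
print(args.llm_prefill_att_backend, args.llm_decode_att_backend)  # triton None
```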

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @sufubao, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an intelligent auto-selection mechanism for attention backends within LightLLM. Instead of requiring users to manually specify a backend, the system will now automatically detect and utilize the most performant available option, prioritizing FA3, then FlashInfer, and falling back to Triton. This change streamlines configuration and ensures that the application leverages the best possible hardware acceleration for attention operations, enhancing overall efficiency and user experience.

Highlights

  • Automatic Backend Selection: The system now automatically selects the optimal attention backend if not explicitly specified via CLI arguments, improving performance and ease of use.
  • Backend Priority Order: A clear priority order has been established for automatic backend selection: FA3 (FlashAttention v3) > FlashInfer > Triton. Triton serves as a reliable fallback.
  • Default CLI Argument Changes: The default values for --llm_prefill_att_backend and --llm_decode_att_backend CLI arguments have been changed from triton to None, enabling the new auto-selection feature by default.
  • Runtime Backend Validation: New utility functions (_is_fa3_available, _is_flashinfer_available, _try_backend) have been added to dynamically check the availability and validate the functionality of different attention backends at runtime. A rough sketch of _try_backend() follows below.
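The review comments further down quote parts of this runtime validation. Consistent with those snippets, a rough, assumed sketch of _try_backend() might look like the following; the signature, backend names, and buffer size are guesses rather than the actual implementation.

```python
import logging

import torch

logger = logging.getLogger(__name__)


def _try_backend(backend: str) -> bool:
    """Assumed sketch: import the backend package and touch the CUDA device,
    returning False (with a debug log) on any failure."""
    try:
        if backend == "flashinfer":
            import flashinfer  # noqa: F401

            # Allocate a small workspace buffer to confirm the CUDA context works
            # (the review below suggests keeping this allocation small).
            _ = torch.empty(1024 * 1024, dtype=torch.uint8, device="cuda")
        elif backend == "fa3":
            import sgl_kernel  # noqa: F401
        return True
    except Exception as e:
        logger.debug(f"{backend} runtime check failed: {e}")
        return False
```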


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment


Code Review

This pull request introduces an automatic selection mechanism for the attention backend, which is a great feature for improving performance out-of-the-box. The implementation correctly prioritizes backends (FA3 > FlashInfer > Triton) and includes availability and runtime validation checks. The changes to the CLI arguments are also correct. My review includes a couple of suggestions to make the validation logic more robust and efficient: one to improve error reporting in the availability check, and another to reduce memory usage during the runtime validation of the flashinfer backend. Overall, the changes are well-implemented.

Comment on lines 60 to 61

    except Exception:
        return False


Severity: medium

The broad except Exception: can hide underlying issues and make debugging difficult. For instance, a CUDA error within is_hopper() would be silently ignored. It's better to catch specific exceptions or at least log the error for better diagnostics.

Suggested change

    - except Exception:
    -     return False
    + except Exception as e:
    +     logger.debug(f"FA3 availability check failed: {e}")
    +     return False

    import flashinfer  # noqa: F401

    # Try creating a minimal workspace buffer to verify flashinfer works
    _ = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")


Severity: medium

Allocating 128MB of CUDA memory just for a runtime validation check seems excessive and can cause an unnecessary memory spike. A much smaller allocation should be sufficient to verify that flashinfer can initialize correctly on the CUDA device.

Suggested change

    - _ = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
    + _ = torch.empty(1 * 1024 * 1024, dtype=torch.uint8, device="cuda")

@sufubao force-pushed the auto-select-att-backend branch from da291ce to 037c6c5 on January 20, 2026 at 07:06
sufubao and others added 7 commits January 20, 2026 09:43
When users don't explicitly specify an attention backend (--llm_prefill_att_backend
or --llm_decode_att_backend), LightLLM now automatically selects the best available
backend with priority: FA3 > FlashInfer > Triton.

This improves user experience by removing the need to manually configure backends
while ensuring optimal performance based on available hardware and packages.

Run actual attention operations in a spawned subprocess to properly validate
backends. This catches runtime failures and crashes that simple import checks
miss, without affecting the main server process.

- Create lightllm/utils/backend_validators/ with modular structure:
  - base.py: Abstract BackendValidator class with ground truth comparison
  - fa3.py: FA3 validator using flash_attn_varlen_func
  - flashinfer.py: FlashInfer validator using BatchPrefillWithRaggedKVCacheWrapper
  - triton.py: Triton validator using softmax kernel
  - __init__.py: Exports and subprocess validation orchestration

- Each validator compares output against torch.nn.functional.scaled_dot_product_attention

- Refactor create_utils.py to use the new validators, removing duplicated logic

Replace over-engineered backend_validators/ directory (5 files, abstract
classes) with single backend_validator.py using plain functions.

Keeps subprocess isolation and ground truth validation, removes unnecessary
abstraction layers.
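A minimal, self-contained sketch of the subprocess ground-truth check these commits describe follows. The function names, timeout, and tensor shapes are illustrative assumptions, and the real backend_validator.py runs the actual backend kernels rather than SDPA on both sides of the comparison.

```python
import multiprocessing as mp

import torch
import torch.nn.functional as F


def _validate_against_sdpa(result_queue):
    """Run a tiny attention op and compare it to F.scaled_dot_product_attention.
    In the real validator the first call would be the backend kernel
    (FA3 / FlashInfer / Triton); SDPA is used on both sides here so the
    sketch runs anywhere."""
    try:
        q = torch.randn(1, 4, 16, 64)
        k = torch.randn(1, 4, 16, 64)
        v = torch.randn(1, 4, 16, 64)
        out = F.scaled_dot_product_attention(q, k, v)  # stand-in for the backend op
        ref = F.scaled_dot_product_attention(q, k, v)  # ground truth
        result_queue.put(torch.allclose(out, ref, atol=1e-3))
    except Exception:
        result_queue.put(False)


def validate_backend_in_subprocess() -> bool:
    """Spawn a fresh process so a crashing or hanging kernel cannot take down
    the main server process."""
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    proc = ctx.Process(target=_validate_against_sdpa, args=(queue,))
    proc.start()
    proc.join(timeout=60)
    if proc.is_alive():  # hung kernel: kill it and treat as a failed validation
        proc.terminate()
        proc.join()
        return False
    return (not queue.empty()) and queue.get()


if __name__ == "__main__":
    print("backend validation passed:", validate_backend_in_subprocess())
```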
@hiworldwzj force-pushed the auto-select-att-backend branch from 8491756 to 8e0a892 on January 20, 2026 at 09:46
