
Conversation

@Simon-ss7
Contributor

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

Bug information

version: 3.12.0
infer_engine: vllm
model: qwen3

When I tried to follow the ToolCallScheduler demo under example/train/grpo/plugin.py to run multi-turn training, I ran into the following issues:

In colocate mode, the launch fails because self.infer_engine is None, so looking up the tokenizer through it raises an error.

In server mode, when current_turn > 1, the returned responses contain duplicated inference results. After inspecting the _add_request method in vllm_engine.py, I found that the prompt built for each step does not include the results of the previous steps.

PR information

I found that in colocate mode a tokenizer is passed in when the multi_turn_scheduler is initialized in rollout_mixin.py, and in server mode self.tokenizer is also available inside multi_turn_scheduler. Based on this, I adjusted how the tokenizer is obtained in the step method of ToolCallScheduler in plugin.py.
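
The sketch below shows the shape of the adjusted lookup. The attribute names self.tokenizer and self.infer_engine.default_template.tokenizer are the ones discussed in this PR; the class skeleton and the fallback branch are only an illustration of the idea, not the exact diff.

    class ToolCallSchedulerSketch:
        """Illustrative stand-in for ToolCallScheduler, not the real class."""

        def __init__(self, tokenizer=None, infer_engine=None):
            self.tokenizer = tokenizer        # available in both colocate and server mode
            self.infer_engine = infer_engine  # None in colocate mode

        def _get_tokenizer(self):
            # Old behavior: always reach through self.infer_engine.default_template.tokenizer,
            # which fails in colocate mode because self.infer_engine is None.
            # New behavior: prefer self.tokenizer, which rollout_mixin.py passes in when
            # building the multi_turn_scheduler; fall back to the engine only if it exists.
            if self.tokenizer is not None:
                return self.tokenizer
            return self.infer_engine.default_template.tokenizer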

I found that in server mode, the run method of MultiTurnScheduler appends {'role': 'assistant', 'content': None} at lines 352–354 as a placeholder for the inference result, but when current_turn > 1 this entry was removed again at lines 236–238. As a result, when I inspected the _add_request method in vllm_engine.py, the input_ids always corresponded to the initial prompt content.

After adjusting the run method so that the appended assistant placeholder is no longer removed before inference when current_turn > 1 (it is now removed afterwards, just before the new completion is appended), the inference content became correct.
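
A minimal, self-contained sketch of the fixed ordering follows. The helpers here (infer_async, remove_response, step) are simplified stand-ins with made-up bodies; only the ordering, keeping the dummy assistant entry through the inference call and removing it right before the new completion is appended, reflects the change in this PR.

    import asyncio

    def remove_response(messages):
        # Simplified stand-in for the real helper: drop a trailing assistant entry.
        if messages and messages[-1]['role'] == 'assistant':
            messages.pop()

    async def infer_async(messages):
        # Stand-in for the real inference call; the point is that it now sees the
        # full history, including the trailing dummy assistant placeholder.
        return f'completion generated over {len(messages)} messages'

    async def step(messages, current_turn):
        completion = await infer_async(messages)
        if current_turn > 1 and not messages[-1]['content']:
            # Removed only after inference, right before appending the new completion.
            remove_response(messages)
        messages.append({'role': 'assistant', 'content': completion})
        return messages

    async def main():
        messages = [
            {'role': 'user', 'content': 'call the tool'},
            {'role': 'assistant', 'content': None},  # dummy placeholder kept during inference
        ]
        await step(messages, current_turn=2)
        print(messages)

    asyncio.run(main())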

Experiment results

(Screenshots attached: colocate_step_error, server_repeat_error)

@gemini-code-assist
Contributor

Summary of Changes

Hello @Simon-ss7, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers crucial bug fixes for the multi-turn scheduler, specifically targeting issues that arise during training with vllm as the inference engine. It rectifies a null pointer error in colocate mode by ensuring proper tokenizer access and eliminates the problem of duplicated inference results in server mode by refining the conversation history management. These adjustments are vital for accurate and stable multi-turn inference workflows.

Highlights

  • Colocate Mode Fix: Resolved a null pointer error in colocate mode by adjusting the tokenizer acquisition logic in plugin.py. The tokenizer is now accessed via self.tokenizer instead of self.infer_engine.default_template.tokenizer, so the code no longer dereferences self.infer_engine when it is None.
  • Server Mode Inference Duplication Fix: Addressed an issue in server mode where inference results were duplicated for current_turn > 1. The logic in multi_turn.py was refined to prevent the premature removal of the {'role': 'assistant', 'content': None} entry, ensuring that subsequent inference steps correctly append to the conversation history.



@gemini-code-assist (bot) left a comment


Code Review

This pull request effectively resolves two bugs related to multi-turn inference. The first fix in plugin.py correctly sources the tokenizer, preventing a None access error in colocate mode, which is a solid improvement for robustness. The second, more subtle fix in multi_turn.py adjusts the logic for handling dummy assistant messages in multi-turn conversations. By moving the removal of the dummy message to after the inference call, it ensures the conversation history is correctly passed to the inference engine, fixing the repeated inference issue. The changes are logical and well-explained. I've added one suggestion to improve the clarity of the fix in multi_turn.py for future maintainability.

Comment on lines 245 to 247
if current_turn > 1 and not messages[-1]['content']:
    # If it's not the first turn and the last message content is dummy, remove the response
    remove_response(messages)

Severity: medium

This change correctly removes the dummy assistant message after the inference call, which is the key to fixing the repeated inference bug. To improve code clarity and future maintainability, I suggest enhancing the comment to explain why the dummy message was intentionally kept during the infer_async call. This will help other developers understand this subtle but crucial part of the multi-turn logic.

Suggested change

- if current_turn > 1 and not messages[-1]['content']:
-     # If it's not the first turn and the last message content is dummy, remove the response
-     remove_response(messages)
+ if current_turn > 1 and not messages[-1]['content']:
+     # The dummy assistant message was intentionally kept during `infer_async`
+     # to ensure correct history processing by the template. It is now removed
+     # before appending the new completion.
+     remove_response(messages)

@hjh0119
Collaborator

hjh0119 commented Jan 6, 2026

LGTM
thanks

@hjh0119 merged commit 28c2479 into modelscope:main on Jan 6, 2026
3 checks passed