
[Bug] EAGLE3 Deepseek-r1-distill-llama8b training IndexError #368

@Sylvan820

Description


Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please open a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via init_process_group or barrier . Using the current device set by the user.
warnings.warn( # warn only once
[rank3]:[W1216 04:21:52.388082556 ProcessGroupNCCL.cpp:5023] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
[rank0]: multiprocess.pool.RemoteTraceback:
[rank0]: """
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/multiprocess/pool.py", line 125, in worker
[rank0]: result = (True, func(*args, **kwds))
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 586, in _write_generator_to_queue
[rank0]: for i, result in enumerate(func(**kwargs)):
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3697, in _map_single
[rank0]: for i, batch in iter_outputs(shard_iterable):
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3647, in iter_outputs
[rank0]: yield i, apply_function(example, i, offset=offset)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3570, in apply_function
[rank0]: processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/specforge/data/preprocessing.py", line 370, in preprocess_function
[rank0]: processed = preprocess_conversations(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/specforge/data/preprocessing.py", line 157, in preprocess_conversations
[rank0]: input_ids, loss_mask = parser.parse(
[rank0]: ^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/specforge/data/parse.py", line 82, in parse
[rank0]: conversation = self.tokenizer.apply_chat_template(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1649, in apply_chat_template
[rank0]: isinstance(conversation[0], (list, tuple)) or hasattr(conversation[0], "messages")
[rank0]: ~~~~~~~~~~~~^^^
[rank0]: IndexError: list index out of range
[rank0]: """

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/SpecForge/scripts/train_eagle3.py", line 836, in
[rank0]: main()
[rank0]: File "/workspace/SpecForge/scripts/train_eagle3.py", line 633, in main
[rank0]: train_dataloader, vocab_mapping_path, eval_dataloader = build_dataloaders(
[rank0]: ^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/scripts/train_eagle3.py", line 388, in build_dataloaders
[rank0]: train_eagle3_dataset = build_eagle3_dataset(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/specforge/data/preprocessing.py", line 403, in build_eagle3_dataset
[rank0]: dataset = dataset.map(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 562, in wrapper
[rank0]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3332, in map
[rank0]: for rank, done, content in iflatmap_unordered(
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 626, in iflatmap_unordered
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 626, in
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/multiprocess/pool.py", line 774, in get
[rank0]: raise self._value
[rank0]: IndexError: list index out of range
[rank0]:[W1216 04:21:54.710972321 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1216 04:21:55.101000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104920 closing signal SIGTERM
W1216 04:21:55.106000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104921 closing signal SIGTERM
W1216 04:21:55.110000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104922 closing signal SIGTERM
W1216 04:21:55.116000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104923 closing signal SIGTERM
W1216 04:21:55.121000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104924 closing signal SIGTERM
W1216 04:21:55.126000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104925 closing signal SIGTERM
W1216 04:21:55.131000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104926 closing signal SIGTERM
E1216 04:21:56.578000 104852 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 104919) of binary: /workspace/SpecForge/.venv/bin/python3
Traceback (most recent call last):
File "/workspace/SpecForge/.venv/bin/torchrun", line 10, in
sys.exit(main())
^^^^^^
File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 357, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 143, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/workspace/SpecForge/scripts/train_eagle3.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-12-16_04:21:55
host : tre-1-h20-ipv6-uc-sh-0180
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 104919)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
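From the traceback, apply_chat_template indexes conversation[0] to detect batched input, so the IndexError points to the parser handing it an empty message list for at least one sample. A minimal sketch of that trigger (only the empty-list case, not the SpecForge code path; the model path is the one from my reproduction script):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/workspace/deepseek-8b")
try:
    # transformers checks conversation[0] before rendering, so an empty
    # conversation fails here with the same "list index out of range"
    tokenizer.apply_chat_template([], tokenize=False)
except IndexError as e:
    print("empty conversation ->", e)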

Reproduction

Use the official deepseek-r1-distill chat template with the script below:

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
ROOT_DIR=$(dirname $SCRIPT_DIR)
NUM_GPUS=${1:-1}
TP_SIZE=${2:-1}

TARGET_MODEL_PATH="/workspace/deepseek-8b"

MODEL_NAME=$(basename "$TARGET_MODEL_PATH")

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun \
    --standalone \
    --nproc_per_node 8 \
    $ROOT_DIR/scripts/train_eagle3.py \
    --target-model-path $TARGET_MODEL_PATH \
    --draft-model-config $ROOT_DIR/configs/deepseek-8B-eagle3.json \
    --train-data-path $ROOT_DIR/cache/dataset/sharegpt_train.jsonl \
    --output-dir $ROOT_DIR/outputs/deepseek-8b-eagle3-sharegpt \
    --num-epochs 10 \
    --batch-size 1 \
    --tp-size 1 \
    --build-dataset-num-proc 32 \
    --dist-timeout 1200 \
    --learning-rate 1e-4 \
    --max-length 4096 \
    --chat-template deepseek-r1-distill \
    --cache-dir $ROOT_DIR/cache \
    --attention-backend sdpa \
    --target-model-backend sglang \
    --log-interval 10
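
To check whether the data itself contains records that collapse to an empty conversation, a quick pre-scan of the jsonl can help. This is only a diagnostic sketch; the field names ("conversations"/"messages", "from"/"role", "value"/"content") are assumptions based on the usual ShareGPT layout and may not match this file exactly:

import json

bad = []
with open("cache/dataset/sharegpt_train.jsonl") as f:
    for lineno, line in enumerate(f, 1):
        record = json.loads(line)
        # field names below are guesses based on the common ShareGPT schema
        turns = record.get("conversations") or record.get("messages") or []
        usable = [
            t for t in turns
            if (t.get("from") or t.get("role")) and (t.get("value") or t.get("content"))
        ]
        if not usable:
            bad.append(lineno)

print(len(bad), "records with no usable turns; first offending lines:", bad[:10])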

Environment

pre-commit
torch==2.8.0
torchaudio==2.8.0
torchvision==0.23.0
transformers==4.57.1
qwen-vl-utils==0.0.11
datasets
setuptools
tqdm
wandb
psutil
numpy
accelerate
pydantic
sglang[all]==0.5.5
openai-harmony

8 × A100 80GB
