Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via init_process_group or barrier . Using the current device set by the user.
warnings.warn( # warn only once
[rank3]:[W1216 04:21:52.388082556 ProcessGroupNCCL.cpp:5023] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
[rank0]: multiprocess.pool.RemoteTraceback:
[rank0]: """
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/multiprocess/pool.py", line 125, in worker
[rank0]: result = (True, func(*args, **kwds))
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 586, in _write_generator_to_queue
[rank0]: for i, result in enumerate(func(**kwargs)):
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3697, in _map_single
[rank0]: for i, batch in iter_outputs(shard_iterable):
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3647, in iter_outputs
[rank0]: yield i, apply_function(example, i, offset=offset)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3570, in apply_function
[rank0]: processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/specforge/data/preprocessing.py", line 370, in preprocess_function
[rank0]: processed = preprocess_conversations(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/specforge/data/preprocessing.py", line 157, in preprocess_conversations
[rank0]: input_ids, loss_mask = parser.parse(
[rank0]: ^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/specforge/data/parse.py", line 82, in parse
[rank0]: conversation = self.tokenizer.apply_chat_template(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1649, in apply_chat_template
[rank0]: isinstance(conversation[0], (list, tuple)) or hasattr(conversation[0], "messages")
[rank0]: ~~~~~~~~~~~~^^^
[rank0]: IndexError: list index out of range
[rank0]: """
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/SpecForge/scripts/train_eagle3.py", line 836, in
[rank0]: main()
[rank0]: File "/workspace/SpecForge/scripts/train_eagle3.py", line 633, in main
[rank0]: train_dataloader, vocab_mapping_path, eval_dataloader = build_dataloaders(
[rank0]: ^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/scripts/train_eagle3.py", line 388, in build_dataloaders
[rank0]: train_eagle3_dataset = build_eagle3_dataset(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/specforge/data/preprocessing.py", line 403, in build_eagle3_dataset
[rank0]: dataset = dataset.map(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 562, in wrapper
[rank0]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3332, in map
[rank0]: for rank, done, content in iflatmap_unordered(
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 626, in iflatmap_unordered
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 626, in
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/multiprocess/pool.py", line 774, in get
[rank0]: raise self._value
[rank0]: IndexError: list index out of range
[rank0]:[W1216 04:21:54.710972321 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1216 04:21:55.101000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104920 closing signal SIGTERM
W1216 04:21:55.106000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104921 closing signal SIGTERM
W1216 04:21:55.110000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104922 closing signal SIGTERM
W1216 04:21:55.116000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104923 closing signal SIGTERM
W1216 04:21:55.121000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104924 closing signal SIGTERM
W1216 04:21:55.126000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104925 closing signal SIGTERM
W1216 04:21:55.131000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104926 closing signal SIGTERM
E1216 04:21:56.578000 104852 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 104919) of binary: /workspace/SpecForge/.venv/bin/python3
Traceback (most recent call last):
File "/workspace/SpecForge/.venv/bin/torchrun", line 10, in
sys.exit(main())
^^^^^^
File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 357, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 143, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/workspace/SpecForge/scripts/train_eagle3.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-12-16_04:21:55
host : tre-1-h20-ipv6-uc-sh-0180
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 104919)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
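
The failing frame indexes conversation[0] before any other validation, so the error points to at least one sample whose parsed message list is empty by the time apply_chat_template is called. As a minimal sketch (assuming the same transformers version and using a hypothetical local checkpoint path), the same IndexError can be reproduced in isolation:

from transformers import AutoTokenizer

# Hypothetical path; substitute the actual target model checkpoint.
tokenizer = AutoTokenizer.from_pretrained("/workspace/deepseek-8b")

# apply_chat_template reads conversation[0] before validating its contents,
# so an empty message list raises the same IndexError as in the traceback.
tokenizer.apply_chat_template([], tokenize=True)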
Reproduction
Using the official training script with the deepseek-r1-distill chat template:
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
ROOT_DIR=$(dirname $SCRIPT_DIR)
NUM_GPUS=${1:-1}
TP_SIZE=${2:-1}
TARGET_MODEL_PATH="/workspace/deepseek-8b"
MODEL_NAME=$(basename "$TARGET_MODEL_PATH")
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun \
    --standalone \
    --nproc_per_node 8 \
    $ROOT_DIR/scripts/train_eagle3.py \
    --target-model-path $TARGET_MODEL_PATH \
    --draft-model-config $ROOT_DIR/configs/deepseek-8B-eagle3.json \
    --train-data-path $ROOT_DIR/cache/dataset/sharegpt_train.jsonl \
    --output-dir $ROOT_DIR/outputs/deepseek-8b-eagle3-sharegpt \
    --num-epochs 10 \
    --batch-size 1 \
    --tp-size 1 \
    --build-dataset-num-proc 32 \
    --dist-timeout 1200 \
    --learning-rate 1e-4 \
    --max-length 4096 \
    --chat-template deepseek-r1-distill \
    --cache-dir $ROOT_DIR/cache \
    --attention-backend sdpa \
    --target-model-backend sglang \
    --log-interval 10
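
A quick way to narrow this down is to check whether any record in the training file yields an empty conversation list before it ever reaches the tokenizer. The snippet below is only a sketch: it assumes ShareGPT-style records with a top-level "conversations" key, so adjust the key if the jsonl uses a different schema.

import json

path = "cache/dataset/sharegpt_train.jsonl"  # same file as --train-data-path
with open(path) as f:
    for lineno, line in enumerate(f, start=1):
        record = json.loads(line)
        # Flag records with no messages; these would produce an empty
        # conversation in preprocessing and trigger the IndexError above.
        if not record.get("conversations"):
            print(f"line {lineno}: empty or missing conversations")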
Environment
pre-commit
torch==2.8.0
torchaudio==2.8.0
torchvision==0.23.0
transformers==4.57.1
qwen-vl-utils==0.0.11
datasets
setuptools
tqdm
wandb
psutil
numpy
accelerate
pydantic
sglang[all]==0.5.5
openai-harmony
Hardware: 8 x A100 80GB