Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via init_process_group or barrier . Using the current device set by the user.
warnings.warn( # warn only once
[rank3]:[W1216 04:21:52.388082556 ProcessGroupNCCL.cpp:5023] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
[rank0]: multiprocess.pool.RemoteTraceback:
[rank0]: """
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/multiprocess/pool.py", line 125, in worker
[rank0]: result = (True, func(*args, **kwds))
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 586, in _write_generator_to_queue
[rank0]: for i, result in enumerate(func(**kwargs)):
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3697, in _map_single
[rank0]: for i, batch in iter_outputs(shard_iterable):
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3647, in iter_outputs
[rank0]: yield i, apply_function(example, i, offset=offset)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3570, in apply_function
[rank0]: processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/specforge/data/preprocessing.py", line 370, in preprocess_function
[rank0]: processed = preprocess_conversations(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/specforge/data/preprocessing.py", line 157, in preprocess_conversations
[rank0]: input_ids, loss_mask = parser.parse(
[rank0]: ^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/specforge/data/parse.py", line 82, in parse
[rank0]: conversation = self.tokenizer.apply_chat_template(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1649, in apply_chat_template
[rank0]: isinstance(conversation[0], (list, tuple)) or hasattr(conversation[0], "messages")
[rank0]: ~~~~~~~~~~~~^^^
[rank0]: IndexError: list index out of range
[rank0]: """
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/SpecForge/scripts/train_eagle3.py", line 836, in
[rank0]: main()
[rank0]: File "/workspace/SpecForge/scripts/train_eagle3.py", line 633, in main
[rank0]: train_dataloader, vocab_mapping_path, eval_dataloader = build_dataloaders(
[rank0]: ^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/scripts/train_eagle3.py", line 388, in build_dataloaders
[rank0]: train_eagle3_dataset = build_eagle3_dataset(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/specforge/data/preprocessing.py", line 403, in build_eagle3_dataset
[rank0]: dataset = dataset.map(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 562, in wrapper
[rank0]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3332, in map
[rank0]: for rank, done, content in iflatmap_unordered(
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 626, in iflatmap_unordered
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 626, in
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/multiprocess/pool.py", line 774, in get
[rank0]: raise self._value
[rank0]: IndexError: list index out of range
[rank0]:[W1216 04:21:54.710972321 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1216 04:21:55.101000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104920 closing signal SIGTERM
W1216 04:21:55.106000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104921 closing signal SIGTERM
W1216 04:21:55.110000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104922 closing signal SIGTERM
W1216 04:21:55.116000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104923 closing signal SIGTERM
W1216 04:21:55.121000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104924 closing signal SIGTERM
W1216 04:21:55.126000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104925 closing signal SIGTERM
W1216 04:21:55.131000 104852 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 104926 closing signal SIGTERM
E1216 04:21:56.578000 104852 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 104919) of binary: /workspace/SpecForge/.venv/bin/python3
Traceback (most recent call last):
File "/workspace/SpecForge/.venv/bin/torchrun", line 10, in
sys.exit(main())
^^^^^^
File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 357, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 143, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/SpecForge/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/workspace/SpecForge/scripts/train_eagle3.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-12-16_04:21:55
host : tre-1-h20-ipv6-uc-sh-0180
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 104919)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
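
The failing frame indexes conversation[0] before any other validation, so the error points to at least one sample whose parsed message list is empty by the time apply_chat_template is called. As a minimal sketch (assuming the same transformers version and using a hypothetical local checkpoint path), the same IndexError can be reproduced in isolation:

from transformers import AutoTokenizer

# Hypothetical path; substitute the actual target model checkpoint.
tokenizer = AutoTokenizer.from_pretrained("/workspace/deepseek-8b")

# apply_chat_template reads conversation[0] before validating its contents,
# so an empty message list raises the same IndexError as in the traceback.
tokenizer.apply_chat_template([], tokenize=True)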
Reproduction
Using the official training script with the deepseek-r1-distill chat template:
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
ROOT_DIR=$(dirname $SCRIPT_DIR)
NUM_GPUS=${1:-1}
TP_SIZE=${2:-1}
TARGET_MODEL_PATH="/workspace/deepseek-8b"
MODEL_NAME=$(basename "$TARGET_MODEL_PATH")
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun \
    --standalone \
    --nproc_per_node 8 \
    $ROOT_DIR/scripts/train_eagle3.py \
    --target-model-path $TARGET_MODEL_PATH \
    --draft-model-config $ROOT_DIR/configs/deepseek-8B-eagle3.json \
    --train-data-path $ROOT_DIR/cache/dataset/sharegpt_train.jsonl \
    --output-dir $ROOT_DIR/outputs/deepseek-8b-eagle3-sharegpt \
    --num-epochs 10 \
    --batch-size 1 \
    --tp-size 1 \
    --build-dataset-num-proc 32 \
    --dist-timeout 1200 \
    --learning-rate 1e-4 \
    --max-length 4096 \
    --chat-template deepseek-r1-distill \
    --cache-dir $ROOT_DIR/cache \
    --attention-backend sdpa \
    --target-model-backend sglang \
    --log-interval 10
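
A quick way to narrow this down is to check whether any record in the training file yields an empty conversation list before it ever reaches the tokenizer. The snippet below is only a sketch: it assumes ShareGPT-style records with a top-level "conversations" key, so adjust the key if the jsonl uses a different schema.

import json

path = "cache/dataset/sharegpt_train.jsonl"  # same file as --train-data-path
with open(path) as f:
    for lineno, line in enumerate(f, start=1):
        record = json.loads(line)
        # Flag records with no messages; these would produce an empty
        # conversation in preprocessing and trigger the IndexError above.
        if not record.get("conversations"):
            print(f"line {lineno}: empty or missing conversations")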
Environment
pre-commit
torch==2.8.0
torchaudio==2.8.0
torchvision==0.23.0
transformers==4.57.1
qwen-vl-utils==0.0.11
datasets
setuptools
tqdm
wandb
psutil
numpy
accelerate
pydantic
sglang[all]==0.5.5
openai-harmony
Hardware: 8 x A100 80GB