
GPU memory usage during training #7254

@yang-chenyu104

Description


I am running LoRA fine-tuning of qwen3-vl-4b-instruct to test the Megatron framework. Without the vision module in training, GPU memory usage is 6.4 GB; after adding the vision module, memory usage does not seem to change. How do I enable training of the vision module? My run script is below.

```shell
# 2 * 80GiB
export MEGATRON_LM_PATH=/home/Megatron-LM
export NVTE_FLASH_ATTN=1
export NVTE_FUSED_ATTN=0
export PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True'

# 2. Completely disable the APEX dependency (fixes the earlier RuntimeError)
export NVTE_UB_GATHER_RS_GRAD=0
export NVTE_UB_REDUCE_SCATTER_GRAD=0
export MEGATRON_CORE_GATHER_RS_GRAD=0
export MEGATRON_CORE_REDUCE_SCATTER_GRAD=0

# 3. Performance and connection tuning
export CUDA_DEVICE_MAX_CONNECTIONS=1

MASTER_PORT=29555 \
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
    --model /home/models/Qwen3-VL-4B-Instruct \
    --no_initialization false \
    --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#2000' \
    --train_type lora \
    --attention-backend flash \
    --tensor_model_parallel_size 2 \
    --sequence_parallel true \
    --micro_batch_size 1 \
    --global_batch_size 2 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --target_modules all-linear \
    --freeze_vit false \
    --freeze_aligner false \
    --vit_gradient_checkpointing true \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dtype bfloat16 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --vit_lr 1e-5 \
    --aligner_lr 1e-5 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-6 \
    --max_epochs 1 \
    --save megatron_output/Qwen3-VL-4B-Instruct \
    --save_interval 100 \
    --max_length 2048 \
    --num_workers 4 \
    --no_save_optim true \
    --no_save_rng true \
    --dataset_num_proc 4 \
    --model_author swift \
    --model_name swift-robot | tee megatron_config_log.txt
```
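
A minimal sketch (not part of ms-swift) for checking whether `--freeze_vit false` actually took effect: it groups parameters by their top-level name prefix and reports how many are trainable versus frozen. The helper name `summarize_trainable` is hypothetical, and the exact prefix the vision tower appears under (e.g. something like `visual`) is an assumption about Qwen3-VL parameter naming, so adjust to whatever your model actually prints. Call it on the model right after it is built in the training entry point (or on the HF model after applying the same freeze settings).

```python
# Hedged diagnostic sketch: summarize trainable vs. frozen parameters per
# top-level module prefix, to see whether the vision tower is really unfrozen.
from collections import defaultdict

import torch.nn as nn


def summarize_trainable(model: nn.Module) -> None:
    """Print trainable vs. frozen parameter counts grouped by top-level prefix."""
    stats = defaultdict(lambda: [0, 0])  # prefix -> [trainable_numel, frozen_numel]
    for name, param in model.named_parameters():
        prefix = name.split(".")[0]
        stats[prefix][0 if param.requires_grad else 1] += param.numel()
    for prefix, (trainable, frozen) in sorted(stats.items()):
        print(f"{prefix:>24}: trainable={trainable / 1e6:8.2f}M  frozen={frozen / 1e6:8.2f}M")
```

If the vision-tower prefix reports zero trainable parameters, the freeze flags are not reaching the ViT; if it reports a non-zero count, the ViT is training, and the small memory delta may simply reflect that the vision tower and its adapters are small relative to the language model's weights and activations.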
