Add unified logging configuration for all algorithms #34
📝 Walkthrough

Adds optional logging/config parameters (MLflow, W&B, TensorBoard) across SFT, LoRA-SFT, and OSFT: public APIs extended to accept and forward logging params, loggers auto-detected from params/env, environment fallbacks set, example CLI scripts added, and documentation updated.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User CLI / Script
    participant Algo as Algorithm (SFT / LoRA / OSFT)
    participant Env as Env Vars
    participant Trainer as Trainer Backend
    participant Logger as Logging Backends (MLflow/W&B/TensorBoard)
    User->>Algo: call train(..., mlflow_tracking_uri?, wandb_*, tensorboard_log_dir?, run_name?)
    Algo->>Env: set/override MLFLOW_*/WANDB_* env vars if provided
    Algo->>Algo: resolve enabled loggers via _get_report_to(params)
    Algo->>Trainer: build TrainingArgs (report_to, logging dirs, run_name, others)
    Trainer->>Logger: initialize selected loggers
    Trainer->>Trainer: run training loop (metrics, checkpoints)
    Logger-->>User: persist logs/metrics / UI URLs
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/training_hub/algorithms/lora.py (1)
**313-329: `tensorboard_log_dir` parameter is accepted but not utilized.**

The `tensorboard_log_dir` parameter is added to the function signature (Line 414) and optional params, but `_build_training_args` never uses it. When users specify TensorBoard via `loggers=["tensorboard"]`, HuggingFace's Trainer will log to `output_dir/runs` by default, not the user-specified directory. Consider setting the TensorBoard log directory if provided:

💡 Suggested enhancement

```diff
 # Set MLflow tracking URI if provided
 if params.get('mlflow_tracking_uri'):
     os.environ['MLFLOW_TRACKING_URI'] = params.get('mlflow_tracking_uri')
 if params.get('mlflow_experiment_name'):
     os.environ['MLFLOW_EXPERIMENT_NAME'] = params.get('mlflow_experiment_name')

+# Set TensorBoard log directory if provided
+if params.get('tensorboard_log_dir'):
+    # HF Trainer uses logging_dir for TensorBoard
+    training_args.logging_dir = params.get('tensorboard_log_dir')
+
 return training_args
```

Alternatively, pass `logging_dir` directly in the `SFTConfig` constructor.
🤖 Fix all issues with AI agents
In `@examples/scripts/sft_mlflow_example.py`:
- Around line 31-33: The default for the argparse option defined by
parser.add_argument('--data-path', ...) uses a hardcoded user-specific absolute
path; replace it with a portable default (e.g., a relative path like
"./data/train.jsonl" or a clear placeholder like "<PATH_TO_TRAINING_DATA>") so
other users won't get broken defaults, and keep the existing help text or update
it to mention the expected JSONL format and that the value can be a relative or
absolute path.
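As a minimal sketch of the portable default the prompt describes (the flag name and help text mirror the example script; the relative default path is an assumption, not the repo's actual data location):

```python
import argparse

parser = argparse.ArgumentParser(description="SFT + MLflow example")
# A relative default keeps the example portable across machines;
# users can still pass an absolute path explicitly.
parser.add_argument(
    '--data-path',
    default='./data/train.jsonl',
    help='Path to training data (JSONL format); relative or absolute.',
)

# Users override the default on the command line as usual.
args = parser.parse_args(['--data-path', 'my/train.jsonl'])
print(args.data_path)
```

argparse converts `--data-path` to the `data_path` attribute automatically, so no other call sites need to change.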
In `@src/training_hub/algorithms/osft.py`:
- Around line 412-437: Update the two warnings.warn calls to include
stacklevel=2 so the warning points to the caller (i.e., change warnings.warn(...
) to warnings.warn(..., stacklevel=2)), and when handling the 'loggers' key in
algorithm_params (using SUPPORTED_LOGGERS and the local variable loggers)
forward the validated supported logger values to the backend instead of
discarding them—e.g., capture and filter loggers =
algorithm_params.pop('loggers', None), compute supported_filtered =
list(set(loggers) & SUPPORTED_LOGGERS) and set a backend-facing param such as
algorithm_params['logger_type'] = supported_filtered (matching the convention
used in sft.py) so the mini-trainer receives the allowed loggers.
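The effect of `stacklevel=2` can be seen in a tiny standalone demo (function names here are illustrative, not from the repo): the recorded warning is attributed to the caller's frame rather than the validation helper itself.

```python
import warnings

def validate(params):
    # With stacklevel=2 the reported location is the caller of
    # validate(), not this line inside the helper.
    if 'loggers' in params:
        warnings.warn("unsupported loggers ignored", stacklevel=2)

def user_code():
    validate({'loggers': ['tensorboard']})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    user_code()

print(str(caught[0].message))
```

This is why linters like Ruff flag `warnings.warn` calls without an explicit `stacklevel` (rule B028): library-internal line numbers are rarely actionable for the user.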
In `@src/training_hub/algorithms/sft.py`:
- Around line 33-43: The loop variable `l` in the loggers mapping is ambiguous
and should be renamed to `logger` for clarity (see the handling of 'loggers' and
the creation of `mapped` in the block that populates
`algorithm_params['logger_type']`); additionally, `logging_steps` is currently
listed in `unsupported_logging_keys` but is also exposed in the public API and
fed into `optional_params` and `TrainingArgs` (references:
`unsupported_logging_keys`, `training_params`, `TrainingArgs`,
`optional_params`, and the function signature), so either remove `logging_steps`
from `unsupported_logging_keys` so it is preserved into `training_params` and
passed to `TrainingArgs`, or remove it from the public
API/docs/optional_params/signature so it is not advertised; update the
code/comments accordingly to keep the source of truth consistent.
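A sketch of the loggers-to-`logger_type` mapping convention the prompt references (the helper name `map_loggers` and the exact key names are illustrative, assumed from the description of `sft.py`):

```python
SUPPORTED_LOGGERS = {"wandb", "jsonl"}

def map_loggers(algorithm_params: dict) -> list:
    """Filter user-supplied loggers down to backend-supported ones
    and expose them under a backend-facing key, preserving order."""
    loggers = algorithm_params.pop('loggers', None) or []
    mapped = [logger for logger in loggers if logger in SUPPORTED_LOGGERS]
    if mapped:
        algorithm_params['logger_type'] = mapped
    return mapped

params = {'loggers': ['wandb', 'tensorboard', 'jsonl']}
mapped = map_loggers(params)
print(mapped)
print(params)
```

A list comprehension (rather than set intersection) keeps the user's ordering stable, which matters if the backend treats the first logger as primary.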
🧹 Nitpick comments (4)
src/training_hub/algorithms/lora.py (1)
**332-345: Rename ambiguous loop variable `l` to improve readability.**

Static analysis flags `l` as ambiguous (easily confused with `1` or `I`). Use a more descriptive name.

✏️ Suggested fix

```diff
 def _get_report_to(self, params: Dict[str, Any]) -> Optional[Union[str, List[str]]]:
     """Determine which loggers to report to based on params."""
     loggers = params.get('loggers')
     if loggers:
         # HuggingFace accepts list of loggers like ["wandb", "tensorboard", "mlflow"]
         # Filter out 'jsonl' as HF doesn't support it directly
         hf_supported = ["wandb", "tensorboard", "mlflow"]
-        filtered = [l for l in loggers if l in hf_supported]
+        filtered = [logger for logger in loggers if logger in hf_supported]
         if filtered:
             return filtered if len(filtered) > 1 else filtered[0]
         return None
     elif params.get('wandb_project'):
         return "wandb"
     return None
```

src/training_hub/algorithms/sft.py (2)
**1-3: Unused import: `warnings`.**

The `warnings` module is imported but not used anywhere in this file. Remove it or add the intended usage.

✏️ Suggested fix

```diff
 import os
-import warnings
 from typing import Any, Dict, List, Type, Optional
```
**22-28: Consider documenting the parameter support matrix in code comments.**

The distinction between supported parameters (`run_name`, `log_level`, `logger_type`, `mlflow_*`, `wandb_*`) and unsupported ones (`loggers`, `logging_steps`, `wandb_run_name`, `tensorboard_log_dir`) could benefit from a brief explanation of why each is categorized as such. This helps future maintainers understand the backend constraints.

examples/scripts/sft_mlflow_example.py (1)
**77-112: Remove unused `result` variable.**

The `result` variable is assigned but never used. Either remove the assignment or add a comment explaining the return value structure.

✏️ Suggested fix

```diff
 try:
-    result = sft(
+    sft(
         # Model and data
         model_path=args.model_path,
```

Or, if you want to demonstrate the return value:

```diff
 print(f"View results in MLflow: {args.mlflow_uri}")
+print(f"Training result: {result}")
 print("=" * 60)
```
```python
# Handle logging parameters - validate and warn for unsupported ones
SUPPORTED_LOGGERS = {"wandb", "jsonl"}
if 'loggers' in algorithm_params:
    loggers = algorithm_params.pop('loggers', None)
    if loggers:
        unsupported = set(loggers) - SUPPORTED_LOGGERS
        if unsupported:
            warnings.warn(
                f"Loggers {unsupported} are not supported by mini-trainer backend. "
                f"Only {SUPPORTED_LOGGERS} are supported. Ignoring unsupported loggers."
            )

# Validate and remove unsupported logging parameters with warnings
UNSUPPORTED_PARAMS = {
    'log_level': 'log level configuration',
    'logging_steps': 'logging steps interval',
    'tensorboard_log_dir': 'TensorBoard logging',
    'mlflow_tracking_uri': 'MLflow logging',
    'mlflow_experiment_name': 'MLflow logging',
    'run_name': 'run name (use wandb_run_name instead)',
}
for param, description in UNSUPPORTED_PARAMS.items():
    if algorithm_params.pop(param, None) is not None:
        warnings.warn(
            f"Parameter '{param}' ({description}) is not supported by mini-trainer backend. Ignoring."
        )
```
🧩 Analysis chain
🏁 Scripts executed:

```bash
# First, check the file structure and find mini-trainer backend references
fd -t f "train" src/ | head -20

# Search for TrainingArgs and how loggers are used
rg -n "class TrainingArgs" -A 10 --type py

# Search for where loggers might be used in the training backend
rg -n "loggers" --type py -B 2 -A 2

# Check the specific file context around the warning calls
sed -n '410,440p' src/training_hub/algorithms/osft.py

# See if supported loggers are referenced anywhere after being popped
rg -n "SUPPORTED_LOGGERS\|supported.*logger" --type py -B 2 -A 2
```
**Add `stacklevel=2` to `warnings.warn` and forward supported loggers to the backend.**

- The `warnings.warn()` calls at lines 419 and 435 are missing `stacklevel=2`, which is needed to point warnings to the caller rather than internal code.
- After validating supported loggers (`wandb`, `jsonl`), the code pops them from `algorithm_params` but never forwards them to the backend. Other algorithms like `sft.py` convert loggers to backend-specific formats (e.g., `logger_type`). The mini-trainer backend should receive the supported loggers to function correctly.
✏️ Suggested fix
```diff
 if 'loggers' in algorithm_params:
     loggers = algorithm_params.pop('loggers', None)
     if loggers:
         unsupported = set(loggers) - SUPPORTED_LOGGERS
         if unsupported:
             warnings.warn(
                 f"Loggers {unsupported} are not supported by mini-trainer backend. "
-                f"Only {SUPPORTED_LOGGERS} are supported. Ignoring unsupported loggers."
+                f"Only {SUPPORTED_LOGGERS} are supported. Ignoring unsupported loggers.",
+                stacklevel=2
             )
+        # Forward supported loggers to the backend
+        supported = [lg for lg in loggers if lg in SUPPORTED_LOGGERS]
+        if supported:
+            algorithm_params['loggers'] = supported

 # Validate and remove unsupported logging parameters with warnings
 UNSUPPORTED_PARAMS = {
     'log_level': 'log level configuration',
     'logging_steps': 'logging steps interval',
     'tensorboard_log_dir': 'TensorBoard logging',
     'mlflow_tracking_uri': 'MLflow logging',
     'mlflow_experiment_name': 'MLflow logging',
     'run_name': 'run name (use wandb_run_name instead)',
 }
 for param, description in UNSUPPORTED_PARAMS.items():
     if algorithm_params.pop(param, None) is not None:
         warnings.warn(
-            f"Parameter '{param}' ({description}) is not supported by mini-trainer backend. Ignoring."
+            f"Parameter '{param}' ({description}) is not supported by mini-trainer backend. Ignoring.",
+            stacklevel=2
         )
```
🧰 Tools
🪛 Ruff (0.14.14)
419-419: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
435-435: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/training_hub/algorithms/sft.py`:
- Around line 210-220: The SFT flow currently drops wandb_run_name via the
unsupported_logging_keys list instead of forwarding it to HuggingFace
TrainingArguments; update the SFT TrainingArgs builder to map wandb_run_name ->
run_name (same behavior as LoRA's mapping in lora.py) when constructing
TrainingArguments (or, if you prefer not to map, emit a clear warning when
wandb_run_name is present and ignored). Locate the use of
unsupported_logging_keys and the code block that assembles the TrainingArguments
(symbols: wandb_run_name, run_name, TrainingArgs, unsupported_logging_keys) and
either copy the LoRA mapping logic to set run_name = wandb_run_name when
run_name is missing, or add a logger.warn/error indicating wandb_run_name was
ignored.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/training_hub/algorithms/osft.py`:
- Line 163: The docstring for the parameter "loggers" in
src/training_hub/algorithms/osft.py is out of sync with the implementation:
update the docstring text that currently says only "wandb" and "jsonl" are
supported to also include "mlflow" so it matches the SUPPORTED_LOGGERS constant
(SUPPORTED_LOGGERS) used in the module; locate the docstring where the loggers
parameter is documented (around the current description at the top of the
function/class) and add "mlflow" to the supported list wording.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/training_hub/algorithms/sft.py`:
- Line 32: Remove the redundant inner "import warnings" statement (the duplicate
import found near the top of the SFT implementation) so that only the
module-level import is used; locate the duplicate import statement (the literal
"import warnings") in src/training_hub/algorithms/sft.py and delete it, keeping
all existing uses of the warnings module unchanged.
🧹 Nitpick comments (3)
examples/scripts/osft_mlflow_example.py (1)
**90-122: Consider removing `wandb_run_name` from this MLflow-focused example.**

Line 114 sets `wandb_run_name=run_name` with comment "Also set for fallback", but this is an MLflow-focused example and W&B isn't enabled in the loggers list. This may confuse users about which logging parameters are actually in effect. Also, the `result` variable is assigned but never used.

✏️ Suggested fix

```diff
-    result = osft(
+    osft(
         # Model and data
         model_path=args.model_path,
         data_path=args.data_path,
         ckpt_output_dir=args.ckpt_output_dir,
         # OSFT-specific parameters
         unfreeze_rank_ratio=args.unfreeze_rank_ratio,
         # Training parameters
         num_epochs=args.num_epochs,
         effective_batch_size=args.effective_batch_size,
         learning_rate=args.learning_rate,
         max_seq_len=4096,
         max_tokens_per_gpu=args.max_tokens_per_gpu,
         warmup_steps=10,
         # Checkpointing
         checkpoint_at_epoch=True,
         # Logging - enable MLflow
         mlflow_tracking_uri=args.mlflow_uri,
         mlflow_experiment_name=args.mlflow_experiment,
         mlflow_run_name=run_name,
-        wandb_run_name=run_name,  # Also set for fallback
         # Distributed training
         nproc_per_node=args.nproc_per_node,
         nnodes=1,
         node_rank=0,
         rdzv_id=100,
         rdzv_endpoint="127.0.0.1:29500",
     )
```

docs/guides/logging.md (1)
**24-28: Add language specifier to fenced code block.**

The configuration precedence code block lacks a language specifier.

✏️ Suggested fix

````diff
-```
+```text
 1. Function kwargs (highest priority)
 2. Environment variables
 3. Backend defaults (lowest priority)
````

examples/scripts/sft_mlflow_example.py (1)

**78-112: Minor: Consider the `/dev/shm` assumption and unused `result` variable.**

1. The `result` variable is assigned but never used.
2. Using `/dev/shm` for `data_output_dir` assumes a Linux system with sufficient shared memory. Consider making this configurable or documenting the assumption.

✏️ Suggested fix

```diff
 try:
-    result = sft(
+    sft(
         # Model and data
         model_path=args.model_path,
         data_path=args.data_path,
         ckpt_output_dir=args.ckpt_output_dir,
         # Training parameters
         num_epochs=args.num_epochs,
         effective_batch_size=32,
         learning_rate=1e-5,
         max_seq_len=4096,
         max_tokens_per_gpu=args.max_tokens_per_gpu,
-        # Data processing
-        data_output_dir="/dev/shm",
+        # Data processing (use shared memory for performance on Linux)
+        data_output_dir="/dev/shm",  # Change to /tmp on non-Linux systems
```
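The three-level precedence that the logging guide documents can be sketched as a small resolver (the helper name `resolve` is illustrative, not an API from the repo):

```python
import os

def resolve(key: str, kwargs: dict, env_var: str, default=None):
    """Function kwargs beat environment variables, which beat backend defaults."""
    if kwargs.get(key) is not None:
        return kwargs[key]
    if os.environ.get(env_var):
        return os.environ[env_var]
    return default

os.environ['MLFLOW_TRACKING_URI'] = 'http://env-server:5000'

# An explicit kwarg wins over the environment variable.
a = resolve('mlflow_tracking_uri',
            {'mlflow_tracking_uri': 'http://kwarg-server:5000'},
            'MLFLOW_TRACKING_URI')
# With no kwarg, the environment variable is used.
b = resolve('mlflow_tracking_uri', {}, 'MLFLOW_TRACKING_URI')
print(a)
print(b)
```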
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@docs/api/classes/OSFTAlgorithm.md`:
- Around line 151-163: The two markdown tables (the "Logger" table and the
"Parameter" table in OSFTAlgorithm docs) are misaligned and violate the
table-column-style lint rule; fix by aligning all pipe separators and column
content so each column cell width is consistent and headers/separators match the
number of columns (e.g., ensure the Logger table has three properly aligned
columns with header row "Logger | Enabled By | Env Variable Fallback" and the
Parameter table has four aligned columns with header "Parameter | Type | Default
| Description"), and keep parameter names like mlflow_tracking_uri,
mlflow_experiment_name, mlflow_run_name, wandb_project, wandb_entity,
wandb_run_name in their respective rows while matching spacing so markdownlint
passes.
In `@docs/api/functions/osft.md`:
- Around line 121-134: The two Markdown tables in docs/api/functions/osft.md
(the "Logger" table and the "Parameter" table) violate the configured
table-column-style; update both tables so each column separator uses consistent
single-space padding around pipes and the header separator row aligns with the
number of columns (e.g., "| Col | Col |" and "|-----|-----|") — ensure the
Logger table has three columns (Logger, Enabled By, Env Variable Fallback) and
the Parameter table has four columns (Parameter, Type, Default, Description)
with matching separator rows and uniform spacing to satisfy markdownlint.
In `@docs/api/functions/sft.md`:
- Around line 106-121: Adjust the two Markdown tables so they comply with the
table-column-style rule by ensuring consistent single-space padding around pipe
separators and aligned header separators; specifically update the "Logger |
Enabled By | Env Variable Fallback" table and the "Parameter | Type | Default |
Description" table so every cell has one space after and before each pipe (e.g.,
"| MLflow | `mlflow_tracking_uri` | `MLFLOW_TRACKING_URI` |"), and ensure the
header separator rows (the --- lines) match the column widths and have the same
spacing pattern.
In `@examples/scripts/osft_mlflow_example.py`:
- Around line 35-36: The default checkpoint directory for the CLI arg defined in
parser.add_argument('--ckpt-output-dir', ...) should not be /tmp; change the
default to a repo-relative path (e.g., 'checkpoints' or './checkpoints') so
example runs store artifacts inside the project instead of the world-writable
/tmp directory; update the default value in the parser.add_argument call and
adjust any downstream code that reads ckpt_output_dir if it assumes an absolute
path.
In `@src/training_hub/algorithms/osft.py`:
- Around line 163-171: Docstring currently implies run_name is supported but the
mini-trainer backend warns and drops it; update the parameter entry for run_name
in src/training_hub/algorithms/osft.py to explicitly state that run_name is
ignored by the mini-trainer backend (it will be dropped and a warning emitted),
so callers know it has no effect; reference the run_name parameter name and
mention the "mini-trainer backend" and the warning behavior so the
function/class docstring accurately reflects runtime behavior.
- Around line 195-196: The warning emitted when block_size is used with
is_pretraining=False should include a stacklevel so it points to the caller;
update the warnings.warn call inside the conditional (the expression using
is_pretraining and block_size) to pass stacklevel=2 (e.g.,
warnings.warn("block_size only valid with is_pretraining=True", stacklevel=2))
so the warning location refers to the caller rather than internal code.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@examples/scripts/sft_mlflow_example.py`:
- Line 111: The example sets data_output_dir to the Linux-specific "/dev/shm",
which breaks on macOS/Windows; change the SFT/MLflow invocation to use a
portable path (e.g., tempfile.gettempdir() or a subdirectory of the existing
checkpoint path) or add a platform check and fallback—update the call that
passes data_output_dir (look for the data_output_dir= argument in
sft_mlflow_example.py) to compute a cross-platform directory
(Path(checkpoint_output)/"data" or tempfile.gettempdir()/ "sft_data") and ensure
the directory is created before use.
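A portable replacement for the hard-coded `/dev/shm` could be sketched like this (the `sft_data` subdirectory name and the helper name are assumptions for illustration):

```python
import tempfile
from pathlib import Path

def default_data_output_dir() -> Path:
    """Pick a writable scratch directory that exists on every platform."""
    shm = Path("/dev/shm")
    # Prefer Linux shared memory when available; otherwise fall back
    # to the platform temp dir (e.g. /tmp on macOS, %TEMP% on Windows).
    base = shm if shm.is_dir() else Path(tempfile.gettempdir())
    out = base / "sft_data"
    out.mkdir(parents=True, exist_ok=True)  # ensure it exists before use
    return out

d = default_data_output_dir()
print(d.name)
```

The example script could then pass `data_output_dir=str(default_data_output_dir())` instead of a literal path.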
In `@src/training_hub/algorithms/lora.py`:
- Around line 407-415: Add a missing mlflow_run_name param to the LoRA API to
match SFT: update the train() signature to accept mlflow_run_name (same type as
mlflow_experiment_name), add mlflow_run_name to the optional_params dict where
other MLflow options are collected, include it in get_optional_params() so it is
passed through, and add mlflow_run_name to the lora_sft() function signature so
downstream SFT-style calls receive the run name; search for symbols train,
optional_params, get_optional_params, and lora_sft in this file to locate the
exact insertion points and ensure all three places use the same param name and
type.
🧹 Nitpick comments (1)
src/training_hub/algorithms/lora.py (1)
**339-339: Rename ambiguous loop variable `l` to `logger`.**

The variable name `l` is ambiguous (looks like `1` in some fonts) and is flagged by the linter.

✏️ Suggested fix

```diff
-        filtered = [l for l in loggers if l in hf_supported]
+        filtered = [logger for logger in loggers if logger in hf_supported]
```
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/training_hub/algorithms/lora.py (1)
**296-316: ⚠️ Potential issue | 🟠 Major — Wire `tensorboard_log_dir` to SFTConfig via the `logging_dir` parameter.**

TensorBoard is enabled when `tensorboard_log_dir` is provided (via `report_to`), but the directory path is never passed to `SFTConfig`. This causes TensorBoard to log to its default directory instead of the user-specified path.

🔧 Proposed fix

```diff
 training_args = SFTConfig(
     output_dir=params['ckpt_output_dir'],
     num_train_epochs=num_epochs,
     per_device_train_batch_size=micro_batch_size,
     gradient_accumulation_steps=gradient_accumulation_steps,
     learning_rate=params.get('learning_rate', 2e-4),
     weight_decay=params.get('weight_decay', 0.01),
     fp16=params.get('fp16', False),
     bf16=params.get('bf16', True),
     max_grad_norm=params.get('max_grad_norm', 0.3),
     warmup_steps=params.get('warmup_steps', 10),
     lr_scheduler_type=params.get('lr_scheduler', 'linear'),
     # Logging
     logging_steps=params.get('logging_steps', 1),
+    logging_dir=params.get('tensorboard_log_dir'),
     save_steps=params.get('save_steps', 500),
     eval_steps=params.get('eval_steps', 500),
     save_total_limit=params.get('save_total_limit', 3),
```
🤖 Fix all issues with AI agents
In `@src/training_hub/algorithms/lora.py`:
- Around line 332-345: Update `_get_report_to` to first check for an explicit
"loggers" key in params and honor it (allowing users to disable auto-detection
or force reporters): if params.get("loggers") is provided accept either a str or
list, validate entries against the allowed set {"wandb","mlflow","tensorboard"}
and raise ValueError for unknown entries, normalize a single string to return
that string and a list of multiple entries to return the list; if "loggers" is
not provided, preserve the existing auto-detection logic (checking
wandb_project, MLFLOW_TRACKING_URI, tensorboard_log_dir) and return None when
nothing detected. Ensure you reference the _get_report_to function and update
its input validation and return handling accordingly.
- Around line 324-329: The current logic only sets
os.environ['MLFLOW_EXPERIMENT_NAME'] when params.get('mlflow_tracking_uri') is
present; change it so that if params.get('mlflow_experiment_name') is truthy it
always sets os.environ['MLFLOW_EXPERIMENT_NAME'] regardless of
params.get('mlflow_tracking_uri'), while keeping the existing behavior for
os.environ['MLFLOW_TRACKING_URI'] when params.get('mlflow_tracking_uri') is
provided; update the block around params, 'mlflow_tracking_uri', and
'mlflow_experiment_name' to set the experiment name independently.
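The decoupled behavior the prompt asks for can be sketched as follows (the helper name `apply_mlflow_env` is illustrative; the env var names match MLflow's standard `MLFLOW_TRACKING_URI` / `MLFLOW_EXPERIMENT_NAME` conventions):

```python
import os

def apply_mlflow_env(params: dict) -> None:
    # Tracking URI and experiment name are set independently, so an
    # experiment name supplied without a URI is no longer silently dropped.
    if params.get('mlflow_tracking_uri'):
        os.environ['MLFLOW_TRACKING_URI'] = params['mlflow_tracking_uri']
    if params.get('mlflow_experiment_name'):
        os.environ['MLFLOW_EXPERIMENT_NAME'] = params['mlflow_experiment_name']

# Experiment name provided without a tracking URI still takes effect.
os.environ.pop('MLFLOW_TRACKING_URI', None)
apply_mlflow_env({'mlflow_experiment_name': 'demo-exp'})
print(os.environ['MLFLOW_EXPERIMENT_NAME'])
```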
```python
def _get_report_to(self, params: Dict[str, Any]) -> Optional[Union[str, List[str]]]:
    """Determine which loggers to report to based on config params."""
    # Auto-detect loggers from their config parameters
    detected = []
    if params.get('wandb_project') or os.environ.get('WANDB_PROJECT'):
        detected.append("wandb")
    if params.get('mlflow_tracking_uri') or os.environ.get('MLFLOW_TRACKING_URI'):
        detected.append("mlflow")
    if params.get('tensorboard_log_dir'):
        detected.append("tensorboard")

    if detected:
        return detected if len(detected) > 1 else detected[0]
    return None
```
🧩 Analysis chain
🏁 Scripts executed:

```bash
cat -n src/training_hub/algorithms/lora.py | sed -n '320,360p'
rg "_get_report_to" src/training_hub/algorithms/lora.py -B 2 -A 2
rg "loggers" src/training_hub/algorithms/lora.py -B 2 -A 2
rg "report_to" src/training_hub/algorithms/lora.py -B 2 -A 2
cat -n src/training_hub/algorithms/lora.py | sed -n '310,330p'
rg "loggers" src/training_hub/algorithms/ -B 1 -A 1
```
**Implement explicit `loggers` parameter selection with validation.**

The comment at line 313 states "use loggers list if provided, otherwise fallback to wandb_project", but `_get_report_to()` ignores any explicit `loggers` parameter and only auto-detects. This prevents users from disabling auto-detection or forcing specific reporters.
🔧 Proposed fix
```diff
 def _get_report_to(self, params: Dict[str, Any]) -> Optional[Union[str, List[str]]]:
     """Determine which loggers to report to based on config params."""
+    explicit = params.get('loggers')
+    if explicit is not None:
+        if isinstance(explicit, str):
+            explicit = [explicit]
+        allowed = {"wandb", "mlflow", "tensorboard"}
+        invalid = set(explicit) - allowed
+        if invalid:
+            raise ValueError(f"Unsupported loggers: {sorted(invalid)}")
+        return explicit if len(explicit) > 1 else explicit[0]
+
     # Auto-detect loggers from their config parameters
     detected = []
     if params.get('wandb_project') or os.environ.get('WANDB_PROJECT'):
         detected.append("wandb")
     if params.get('mlflow_tracking_uri') or os.environ.get('MLFLOW_TRACKING_URI'):
         detected.append("mlflow")
     if params.get('tensorboard_log_dir'):
         detected.append("tensorboard")
```
🤖 Prompt for AI Agents
In `@src/training_hub/algorithms/lora.py` around lines 332-345, update
`_get_report_to` to first check for an explicit "loggers" key in params and honor
it (allowing users to disable auto-detection or force reporters): if
params.get("loggers") is provided accept either a str or list, validate entries
against the allowed set {"wandb","mlflow","tensorboard"} and raise ValueError
for unknown entries, normalize a single string to return that string and a list
of multiple entries to return the list; if "loggers" is not provided, preserve
the existing auto-detection logic (checking wandb_project, MLFLOW_TRACKING_URI,
tensorboard_log_dir) and return None when nothing detected. Ensure you reference
the _get_report_to function and update its input validation and return handling
accordingly.
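For readers skimming the thread, the reviewer's suggestion condenses to the following standalone, runnable sketch. Names here are illustrative: the real method lives on the algorithm class and also consults environment variables.

```python
from typing import Any, Dict, List, Optional, Union

ALLOWED = {"wandb", "mlflow", "tensorboard"}

def get_report_to(params: Dict[str, Any]) -> Optional[Union[str, List[str]]]:
    # Honor an explicit 'loggers' entry (string or list), validating it
    # against the supported set; otherwise fall back to auto-detection.
    explicit = params.get("loggers")
    if explicit is not None:
        if isinstance(explicit, str):
            explicit = [explicit]
        invalid = set(explicit) - ALLOWED
        if invalid:
            raise ValueError(f"Unsupported loggers: {sorted(invalid)}")
        return explicit if len(explicit) > 1 else explicit[0]
    # Auto-detect each logger from its defining config parameter.
    detected = [
        name
        for name, key in [
            ("wandb", "wandb_project"),
            ("mlflow", "mlflow_tracking_uri"),
            ("tensorboard", "tensorboard_log_dir"),
        ]
        if params.get(key)
    ]
    if detected:
        return detected if len(detected) > 1 else detected[0]
    return None
```

A single logger returns a plain string while multiple loggers return a list, mirroring the shape `report_to` accepts.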
```python
# Logging configuration
wandb_project: Optional[str] = None,
wandb_entity: Optional[str] = None,
wandb_run_name: Optional[str] = None,
tensorboard_log_dir: Optional[str] = None,
mlflow_tracking_uri: Optional[str] = None,
mlflow_experiment_name: Optional[str] = None,
```
Add loggers, run_name, and log_level parameters to LoRA APIs to match PR objectives.
The LoRA implementation currently only exposes vendor-specific logging parameters (wandb_project, wandb_entity, etc.) but lacks the generic loggers, run_name, and log_level parameters referenced in the PR intent. Line 313 of _build_training_args contains a comment indicating "use loggers list if provided" but this parameter is not actually exposed in the function signatures. Users cannot explicitly select loggers, set a run name, or control log level without relying on environment variables.
Thread these parameters through:

- `train()` method signature (lines 407-413 and other locations)
- `optional_params` and `get_optional_params`
- `lora_sft()` function
- `_build_training_args()` to use them when building SFTConfig
In _build_training_args(), use run_name=(run_name or wandb_run_name) prioritizing the generic parameter, and pass log_level to the config.
This PR does not intend to include those arguments, including them in the description was an oversight.
@RobotSail, thank you for the clarification! I understand now that the mention of loggers, run_name, and log_level in the PR description was an oversight, and those parameters are not intended to be part of this PR's scope for LoRA. The current implementation with vendor-specific parameters (wandb_project, mlflow_tracking_uri, etc.) is correct as designed.
I appreciate you clearing up the confusion! 🐰
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/training_hub/algorithms/lora.py`:
- Around line 340-341: The code detects TensorBoard via
params.get('tensorboard_log_dir') but never passes that path into the training
args; update the _build_training_args function so when constructing the
SFTConfig (or training args object) you add logging_dir=
params.get('tensorboard_log_dir') (or None when absent) so SFTConfig receives
the TensorBoard directory; locate the SFTConfig instantiation in
_build_training_args and set the logging_dir kwarg accordingly so the detected
"tensorboard" logger actually writes events to the specified directory.
- Around line 313-316: The SFTConfig run_name is only set from
params.get('wandb_run_name') while params may include 'mlflow_run_name'; update
the code that constructs SFTConfig (the call using
report_to=self._get_report_to(params), run_name=...) to accept a generic
run_name (e.g., params.get('run_name') falling back to wandb_run_name) or, if
keeping separate names, set the MLflow run name when 'mlflow_run_name' is
present (for example by setting
os.environ['MLFLOW_RUN_NAME']=params['mlflow_run_name'] before trainer/logging
initialization or invoking mlflow.start_run(run_name=...)). Ensure this change
touches the same SFTConfig construction site and uses params['mlflow_run_name']
or a unified params['run_name'] so MLflow receives the intended run name.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/training_hub/algorithms/sft.py (1)
124-214: ⚠️ Potential issue | 🟠 Major — Add parameter filtering before passing to TrainingArgs, matching the OSFT pattern.

The logging parameters (`wandb_*`, `mlflow_*`, `tensorboard_*`) are passed directly to `TrainingArgs(**training_params)` without filtering. Unlike the OSFT backend, which uses `fields(TrainingArgs)` to filter unsupported parameters, the SFT backend risks runtime failures if the installed `instructlab-training` version doesn't support these fields. Either filter parameters using field introspection (like OSFT does) or confirm that all versions ≥0.13.0 support these fields.
🤖 Fix all issues with AI agents
In `@src/training_hub/algorithms/sft.py`:
- Around line 25-36: The current auto-detection in sft.py collects
detected_loggers and unconditionally sets algorithm_params['logger_type'], which
can override an explicit caller-provided value and misses some logger signals;
update the detection to (1) check all relevant keys/envs for MLflow and W&B
(include 'mlflow_experiment_name', 'mlflow_run_name', 'wandb_run_name',
'wandb_entity' and corresponding env vars) when building detected_loggers, and
(2) only set algorithm_params['logger_type'] = ','.join(detected_loggers) if
algorithm_params does not already contain a truthy 'logger_type' (i.e., respect
an explicit value); keep the detection logic in the same block around
detected_loggers and algorithm_params so changes are localized.
```python
# Auto-detect loggers from config parameters
detected_loggers = []
if algorithm_params.get('mlflow_tracking_uri') or os.environ.get('MLFLOW_TRACKING_URI'):
    detected_loggers.append('mlflow')
if algorithm_params.get('wandb_project') or os.environ.get('WANDB_PROJECT'):
    detected_loggers.append('wandb')
if algorithm_params.get('tensorboard_log_dir'):
    detected_loggers.append('tensorboard')

if detected_loggers:
    algorithm_params['logger_type'] = ','.join(detected_loggers)
```
Avoid overriding explicit logger_type and detect from all logger params.
Right now, auto-detection can override a caller-provided `logger_type`, and it won't enable W&B/MLflow when only `wandb_run_name`/`wandb_entity` or `mlflow_experiment_name`/`mlflow_run_name` are set. That can silently disable logging or change logger selection.
🔧 Proposed adjustment
```diff
-    detected_loggers = []
-    if algorithm_params.get('mlflow_tracking_uri') or os.environ.get('MLFLOW_TRACKING_URI'):
+    detected_loggers = []
+    if (
+        algorithm_params.get('mlflow_tracking_uri')
+        or algorithm_params.get('mlflow_experiment_name')
+        or algorithm_params.get('mlflow_run_name')
+        or os.environ.get('MLFLOW_TRACKING_URI')
+    ):
         detected_loggers.append('mlflow')
-    if algorithm_params.get('wandb_project') or os.environ.get('WANDB_PROJECT'):
+    if (
+        algorithm_params.get('wandb_project')
+        or algorithm_params.get('wandb_entity')
+        or algorithm_params.get('wandb_run_name')
+        or os.environ.get('WANDB_PROJECT')
+    ):
         detected_loggers.append('wandb')
     if algorithm_params.get('tensorboard_log_dir'):
         detected_loggers.append('tensorboard')
-    if detected_loggers:
+    if detected_loggers and 'logger_type' not in algorithm_params:
         algorithm_params['logger_type'] = ','.join(detected_loggers)
```

🤖 Prompt for AI Agents
In `@src/training_hub/algorithms/sft.py` around lines 25 - 36, The current
auto-detection in sft.py collects detected_loggers and unconditionally sets
algorithm_params['logger_type'], which can override an explicit caller-provided
value and misses some logger signals; update the detection to (1) check all
relevant keys/envs for MLflow and W&B (include 'mlflow_experiment_name',
'mlflow_run_name', 'wandb_run_name', 'wandb_entity' and corresponding env vars)
when building detected_loggers, and (2) only set algorithm_params['logger_type']
= ','.join(detected_loggers) if algorithm_params does not already contain a
truthy 'logger_type' (i.e., respect an explicit value); keep the detection logic
in the same block around detected_loggers and algorithm_params so changes are
localized.
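Put together, the adjusted behavior the review asks for — detect from any logger-related kwarg or env var, but never clobber an explicit `logger_type` — can be exercised as a standalone sketch (function name is illustrative, not the library's):

```python
import os

def detect_logger_type(algorithm_params: dict) -> dict:
    # Auto-detect loggers from all of their config parameters, but
    # respect a caller-provided 'logger_type' if one is present.
    detected = []
    mlflow_keys = ("mlflow_tracking_uri", "mlflow_experiment_name", "mlflow_run_name")
    wandb_keys = ("wandb_project", "wandb_entity", "wandb_run_name")
    if any(algorithm_params.get(k) for k in mlflow_keys) or os.environ.get("MLFLOW_TRACKING_URI"):
        detected.append("mlflow")
    if any(algorithm_params.get(k) for k in wandb_keys) or os.environ.get("WANDB_PROJECT"):
        detected.append("wandb")
    if algorithm_params.get("tensorboard_log_dir"):
        detected.append("tensorboard")
    if detected and "logger_type" not in algorithm_params:
        algorithm_params["logger_type"] = ",".join(detected)
    return algorithm_params
```

With this shape, setting only `wandb_entity` is enough to turn on W&B, and an explicit `logger_type` always wins over detection.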
Force-pushed from 2f9c3a9 to db822be
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@docs/api/functions/sft.md`:
- Around line 106-110: Add a new row to the logger enablement table for the
JSONL logger: add "JSONL" as the Logger, use the configuration key
`jsonl_log_path` as the Enabled By value, and add an environment variable
fallback `JSONL_LOG_PATH` as the Env Variable Fallback so the table now
documents jsonl logging enablement alongside MLflow, W&B, and TensorBoard.
- Around line 112-120: Add the missing logging parameters to the SFT API
parameter table: add rows for `loggers` (type `list[str]`, default e.g. `[]` or
env-aware, description: explicit list of loggers like
["wandb","tensorboard","mlflow","jsonl"]), `run_name` (type `str`, default
`None`, description: base run name supporting `{time}` and `{rank}`
placeholders), `log_level` (type `str`, default e.g. `INFO`, description:
logging verbosity like DEBUG/INFO/WARNING/ERROR/CRITICAL), and `logging_steps`
(type `int`, default `None` or a sensible integer, description: frequency of
logging steps); update the existing rows (`mlflow_tracking_uri`,
`mlflow_experiment_name`, `wandb_project`, `tensorboard_log_dir`, etc.)
formatting to match table style so the new entries align with the same columns
and conventions.
- Line 116: Update the parameter descriptions to explicitly state override
precedence: explain that the global base parameter run_name is used by default,
but logger-specific parameters (e.g., wandb_run_name and mlflow_run_name) take
precedence and override run_name when provided (supporting placeholders like
{time}, {utc_time}, {rank}); update the mlflow_run_name and wandb_run_name
descriptions to mention they override the base run_name and that their
placeholders behave the same as run_name, and if you add run_name to the table
include a short note that it is the default unless a logger-specific run name is
supplied.
In `@examples/scripts/sft_mlflow_example.py`:
- Around line 38-41: The default checkpoint directory uses an absolute
Unix-specific path ("/tmp/sft-mlflow-checkpoints") for the "--ckpt-output-dir"
argument; change its default to a relative, cross-platform path (e.g.,
"sft-mlflow-checkpoints" or "./sft-mlflow-checkpoints") in the add_argument call
so the example is portable and avoids relying on /tmp.
- Around line 7-15: Update the example usage in sft_mlflow_example.py to include
the required CLI flag --data-path (e.g. add "--data-path /path/to/data" to each
shown invocation), so the shown commands (python sft_mlflow_example.py, python
sft_mlflow_example.py --model-path ..., python sft_mlflow_example.py
--mlflow-uri ...) include the mandatory --data-path argument when demonstrating
how to run the script.
| Logger      | Enabled By            | Env Variable Fallback |
| ----------- | --------------------- | --------------------- |
| MLflow      | `mlflow_tracking_uri` | `MLFLOW_TRACKING_URI` |
| W&B         | `wandb_project`       | `WANDB_PROJECT`       |
| TensorBoard | `tensorboard_log_dir` | -                     |
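The enablement rules in this table — kwarg first, env variable as a fallback where one exists — can be expressed directly as data. This is a sketch of the documented behavior, not the library's code:

```python
import os

# logger -> (enabling kwarg, env-variable fallback or None)
ENABLEMENT = {
    "mlflow": ("mlflow_tracking_uri", "MLFLOW_TRACKING_URI"),
    "wandb": ("wandb_project", "WANDB_PROJECT"),
    "tensorboard": ("tensorboard_log_dir", None),
}

def enabled_loggers(params: dict) -> list:
    enabled = []
    for logger, (key, env_var) in ENABLEMENT.items():
        if params.get(key) or (env_var and os.environ.get(env_var)):
            enabled.append(logger)
    return enabled
```

Keeping the mapping in one place makes it easy to document and to extend (e.g., a jsonl row) without touching the detection loop.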
Add jsonl logger to the enablement table.
According to the PR objectives, SFT supports jsonl logging, but the table only documents MLflow, W&B, and TensorBoard. The jsonl logger should be included with its enablement condition and any applicable environment variable fallback.
🤖 Prompt for AI Agents
In `@docs/api/functions/sft.md` around lines 106 - 110, Add a new row to the
logger enablement table for the JSONL logger: add "JSONL" as the Logger, use the
configuration key `jsonl_log_path` as the Enabled By value, and add an
environment variable fallback `JSONL_LOG_PATH` as the Env Variable Fallback so
the table now documents jsonl logging enablement alongside MLflow, W&B, and
TensorBoard.
Exposes logging configuration (tensorboard, wandb, mlflow, jsonl) through
flat kwargs in sft(), osft(), and lora_sft() convenience functions.
- `loggers`: List of loggers to enable (e.g., ["wandb", "mlflow", "jsonl"])
- `run_name`: Run name with placeholder support ({time}, {rank})
- `log_level`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `logging_steps`: How often to log metrics
- `wandb_project`, `wandb_entity`, `wandb_run_name`: W&B configuration
- `tensorboard_log_dir`: TensorBoard output directory
- `mlflow_tracking_uri`, `mlflow_experiment_name`: MLflow configuration
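The `{time}`/`{rank}` placeholder support mentioned for `run_name` can be sketched as a small helper; this is an illustrative expansion, and the library's exact timestamp format may differ:

```python
from datetime import datetime, timezone

def resolve_run_name(template: str, rank: int = 0) -> str:
    # Expand {time} and {rank} placeholders in a run-name template.
    return template.format(
        time=datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S"),
        rank=rank,
    )

print(resolve_run_name("sft-{time}-rank{rank}", rank=2))
```

Per-rank names matter in distributed runs so that each worker's logs land under a distinct, sortable identifier.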
| Logger | SFT | OSFT | LoRA |
|-------------|-----|------|------|
| wandb | Yes | Yes | Yes |
| tensorboard | Yes | No | Yes |
| mlflow | Yes | No | Yes |
| jsonl | Yes | Yes | No |
OSFT emits warnings for unsupported loggers/params and continues.
Depends on: instructlab/training#680
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pass tensorboard_log_dir through to TrainingArgs
Update SFT backend to pass tensorboard_log_dir to TrainingArgs now that
instructlab-training supports configurable TensorBoard log directories.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable MLflow support for OSFT backend
- Add "mlflow" to SUPPORTED_LOGGERS
- Remove mlflow params from UNSUPPORTED_PARAMS
- Add mlflow_run_name parameter throughout
- Update docstrings to reflect MLflow is now supported
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address PR review comments
- Replace hardcoded data path with required=True in sft_mlflow_example.py
- Add stacklevel=2 to warnings.warn calls in osft.py
- Rename ambiguous loop variable 'l' to 'lg' in sft.py
- Add warnings for unsupported logging params in sft.py backend
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add OSFT MLflow example script
Similar to sft_mlflow_example.py but for OSFT training with
unfreeze_rank_ratio and other OSFT-specific parameters.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add logging and experiment tracking documentation
Create comprehensive documentation for the unified logging API:
- MLflow, Weights & Biases, TensorBoard, and JSONL logging
- Configuration precedence (kwarg > env var > defaults)
- Backend-specific notes and supported loggers
- Example usage for all algorithms (SFT, OSFT, LoRA)
- Run naming with placeholder support
Update sidebar to include the new logging guide.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove duplicate warnings import in sft.py
The warnings module is already imported at module level (line 2).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address review nitpicks in docs and examples
- Add language specifier to fenced code block in logging.md
- Remove unused result variable in example scripts
- Remove unnecessary wandb_run_name from MLflow-focused OSFT example
- Add comment clarifying /dev/shm usage for Linux shared memory
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Auto-detect loggers from config parameters
Loggers are now automatically enabled based on their configuration:
- mlflow_tracking_uri or MLFLOW_TRACKING_URI env → enables MLflow
- wandb_project or WANDB_PROJECT env → enables W&B
- tensorboard_log_dir → enables TensorBoard (SFT only)
The 'loggers' parameter is deprecated and emits a DeprecationWarning
when used. Updated API documentation with new logging configuration
sections and removed outdated logging guide.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address PR review feedback
- Fix markdown table alignment in docs for markdownlint compliance
- Change osft_mlflow_example.py default checkpoint dir from /tmp to ./
- Document that run_name is not supported by mini-trainer backend
- Add stacklevel=2 to block_size warning in osft.py
- Update loggers docstring to indicate deprecated status
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
remove extraneous api kwargs
removes more extraneous api additions
reintroduce mlflow_run_name into lora
docstrings
drop formatting changes
read envs when present
review comments
Force-pushed from db822be to c5f3dd2
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@examples/scripts/osft_mlflow_example.py`:
- Around line 160-165: The except block catching Exception in
examples/scripts/osft_mlflow_example.py (the "except Exception as e:" after the
osft() training call) is triggering BLE001; either narrow the handler to the
most likely exception types raised by osft()/model loading/file I/O/mini_trainer
(e.g., RuntimeError, ValueError, OSError) and handle or re-raise unexpected
ones, or if you intentionally want a catch-all for CLI robustness, keep the
broad handler but add a lint waiver comment on that line ("# noqa: BLE001") to
suppress the ruff warning.
In `@examples/scripts/sft_mlflow_example.py`:
- Around line 135-140: The broad "except Exception as e:" handler in the
training error block should be narrowed or explicitly waived; either replace it
with specific exceptions (e.g., "except (RuntimeError, ValueError, OSError) as
e:" or the set appropriate for the training code) so the handler is precise, or
if a catch-all is intentional for this CLI example add a linter waiver by
changing the line to "except Exception as e: # noqa: BLE001" (and apply the
same change to other example scripts with identical patterns); locate the block
around the training/CLI exit handling (the except block that prints "Training
failed: {e}" and calls sys.exit(1)) to make this change.
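The narrowed-handler option the review describes looks like this in a CLI script; `run_training` is a hypothetical stand-in for the `sft()`/`osft()` call:

```python
import sys

def run_training():
    # Hypothetical training entry point, used only for illustration.
    raise ValueError("bad learning rate")

def main() -> None:
    try:
        run_training()
    except (RuntimeError, ValueError, OSError) as e:
        # Narrowed handler: specific enough to satisfy ruff's BLE001
        # while still giving CLI users a clean failure message.
        print(f"Training failed: {e}")
        sys.exit(1)
```

If a true catch-all is preferred for example scripts, `except Exception as e:  # noqa: BLE001` keeps the behavior and waives the lint instead.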
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@examples/scripts/osft_mlflow_example.py`:
- Line 119: Print formatting is inconsistent: update the f-string that prints
the MLflow experiment so there's a space after the colon to match adjacent
lines; locate the print using args.mlflow_experiment (the statement currently
"print(f\"MLflow Experiment:{args.mlflow_experiment}\")") and change the string
to include a space after the colon so output becomes "MLflow Experiment:
{args.mlflow_experiment}".
🧹 Nitpick comments (1)
src/training_hub/algorithms/lora.py (1)
349-373: Consider removing redundant environment variable checks.

Since lines 250-262 already populate `params` from environment variables before `_get_report_to()` is called, the direct `os.environ.get()` checks here are redundant. The params dict already contains the env var values.

♻️ Optional simplification

```diff
     def _get_report_to(self, params: Dict[str, Any]) -> Optional[Union[str, List[str]]]:
         """Determine which loggers to report to based on config params."""
         # Auto-detect loggers from their config parameters
         detected = []
         if (
             params.get('wandb_project')
             or params.get('wandb_run_name')
             or params.get('wandb_entity')
-            or os.environ.get('WANDB_PROJECT')
         ):
             detected.append("wandb")
         if (
             params.get('mlflow_tracking_uri')
             or params.get('mlflow_experiment_name')
             or params.get('mlflow_run_name')
-            or os.environ.get('MLFLOW_TRACKING_URI')
-            or os.environ.get('MLFLOW_EXPERIMENT_NAME')
         ):
             detected.append("mlflow")
         if params.get('tensorboard_log_dir'):
             detected.append("tensorboard")
         if detected:
             return detected if len(detected) > 1 else detected[0]
         return None
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@docs/algorithms/lora.md`:
- Around line 282-285: The example uses an unsupported parameter mlflow_run_name
which will fail; update the example to use the supported run_name parameter (or
remove the run name line entirely if run_name is not supported), ensuring the
block still includes mlflow_tracking_uri and mlflow_experiment_name; change the
key mlflow_run_name -> run_name in the code snippet where the client/config is
constructed so it matches the supported API.
revert transformers v5 lora fix
Maxusmusti
left a comment
LGTM, all my review comments have been addressed
Updates minimum versions to include MLflow active run fix and other recent changes. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
Exposes logging configuration (tensorboard, wandb, mlflow, jsonl) through flat kwargs in `sft()`, `osft()`, and `lora_sft()` convenience functions.

New Parameters

All training functions now accept these optional logging parameters:

| Parameter                | Type  |
| ------------------------ | ----- |
| `wandb_project`          | `str` |
| `wandb_entity`           | `str` |
| `wandb_run_name`         | `str` |
| `tensorboard_log_dir`    | `str` |
| `mlflow_tracking_uri`    | `str` |
| `mlflow_experiment_name` | `str` |

Backend Support
*OSFT always logs to JSONL internally
Behavior for unsupported loggers/params: OSFT emits a warning and continues with supported options.
Example Usage
Dependencies
Depends on: instructlab/training#680
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Style
Chores