Add unified logging configuration for all algorithms #34
📝 Walkthrough

Adds optional logging/config parameters (MLflow, W&B, TensorBoard) across SFT, LoRA-SFT, and OSFT: public APIs extended to accept and forward logging params, loggers auto-detected from params/env, environment fallbacks set, example CLI scripts added, and documentation updated.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User CLI / Script
    participant Algo as Algorithm (SFT / LoRA / OSFT)
    participant Env as Env Vars
    participant Trainer as Trainer Backend
    participant Logger as Logging Backends (MLflow/W&B/TensorBoard)
    User->>Algo: call train(..., mlflow_tracking_uri?, wandb_*, tensorboard_log_dir?, run_name?)
    Algo->>Env: set/override MLFLOW_*/WANDB_* env vars if provided
    Algo->>Algo: resolve enabled loggers via _get_report_to(params)
    Algo->>Trainer: build TrainingArgs (report_to, logging dirs, run_name, others)
    Trainer->>Logger: initialize selected loggers
    Trainer->>Trainer: run training loop (metrics, checkpoints)
    Logger-->>User: persist logs/metrics / UI URLs
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/training_hub/algorithms/lora.py (1)
**313-329: `tensorboard_log_dir` parameter is accepted but not utilized.**

The `tensorboard_log_dir` parameter is added to the function signature (Line 414) and optional params, but `_build_training_args` never uses it. When users specify TensorBoard via `loggers=["tensorboard"]`, HuggingFace's Trainer will log to `output_dir/runs` by default, not the user-specified directory. Consider setting the TensorBoard log directory if provided:

💡 Suggested enhancement

```diff
 # Set MLflow tracking URI if provided
 if params.get('mlflow_tracking_uri'):
     os.environ['MLFLOW_TRACKING_URI'] = params.get('mlflow_tracking_uri')
 if params.get('mlflow_experiment_name'):
     os.environ['MLFLOW_EXPERIMENT_NAME'] = params.get('mlflow_experiment_name')

+# Set TensorBoard log directory if provided
+if params.get('tensorboard_log_dir'):
+    # HF Trainer uses logging_dir for TensorBoard
+    training_args.logging_dir = params.get('tensorboard_log_dir')
+
 return training_args
```

Alternatively, pass `logging_dir` directly in the `SFTConfig` constructor.
🤖 Fix all issues with AI agents
In `@examples/scripts/sft_mlflow_example.py`:
- Around line 31-33: The default for the argparse option defined by
parser.add_argument('--data-path', ...) uses a hardcoded user-specific absolute
path; replace it with a portable default (e.g., a relative path like
"./data/train.jsonl" or a clear placeholder like "<PATH_TO_TRAINING_DATA>") so
other users won't get broken defaults, and keep the existing help text or update
it to mention the expected JSONL format and that the value can be a relative or
absolute path.
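As a minimal sketch of the portable default the prompt describes (the flag name and help text mirror the example script; the relative default path is an assumption, not the repo's actual data location):

```python
import argparse

parser = argparse.ArgumentParser(description="SFT + MLflow example")
# A relative default keeps the example portable across machines;
# users can still pass an absolute path explicitly.
parser.add_argument(
    '--data-path',
    default='./data/train.jsonl',
    help='Path to training data (JSONL format); relative or absolute.',
)

# Users override the default on the command line as usual.
args = parser.parse_args(['--data-path', 'my/train.jsonl'])
print(args.data_path)
```

argparse converts `--data-path` to the `data_path` attribute automatically, so no other call sites need to change.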
In `@src/training_hub/algorithms/osft.py`:
- Around line 412-437: Update the two warnings.warn calls to include
stacklevel=2 so the warning points to the caller (i.e., change warnings.warn(...
) to warnings.warn(..., stacklevel=2)), and when handling the 'loggers' key in
algorithm_params (using SUPPORTED_LOGGERS and the local variable loggers)
forward the validated supported logger values to the backend instead of
discarding them—e.g., capture and filter loggers =
algorithm_params.pop('loggers', None), compute supported_filtered =
list(set(loggers) & SUPPORTED_LOGGERS) and set a backend-facing param such as
algorithm_params['logger_type'] = supported_filtered (matching the convention
used in sft.py) so the mini-trainer receives the allowed loggers.
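The effect of `stacklevel=2` can be seen in a tiny standalone demo (function names here are illustrative, not from the repo): the recorded warning is attributed to the caller's frame rather than the validation helper itself.

```python
import warnings

def validate(params):
    # With stacklevel=2 the reported location is the caller of
    # validate(), not this line inside the helper.
    if 'loggers' in params:
        warnings.warn("unsupported loggers ignored", stacklevel=2)

def user_code():
    validate({'loggers': ['tensorboard']})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    user_code()

print(str(caught[0].message))
```

This is why linters like Ruff flag `warnings.warn` calls without an explicit `stacklevel` (rule B028): library-internal line numbers are rarely actionable for the user.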
In `@src/training_hub/algorithms/sft.py`:
- Around line 33-43: The loop variable `l` in the loggers mapping is ambiguous
and should be renamed to `logger` for clarity (see the handling of 'loggers' and
the creation of `mapped` in the block that populates
`algorithm_params['logger_type']`); additionally, `logging_steps` is currently
listed in `unsupported_logging_keys` but is also exposed in the public API and
fed into `optional_params` and `TrainingArgs` (references:
`unsupported_logging_keys`, `training_params`, `TrainingArgs`,
`optional_params`, and the function signature), so either remove `logging_steps`
from `unsupported_logging_keys` so it is preserved into `training_params` and
passed to `TrainingArgs`, or remove it from the public
API/docs/optional_params/signature so it is not advertised; update the
code/comments accordingly to keep the source of truth consistent.
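A sketch of the loggers-to-`logger_type` mapping convention the prompt references (the helper name `map_loggers` and the exact key names are illustrative, assumed from the description of `sft.py`):

```python
SUPPORTED_LOGGERS = {"wandb", "jsonl"}

def map_loggers(algorithm_params: dict) -> list:
    """Filter user-supplied loggers down to backend-supported ones
    and expose them under a backend-facing key, preserving order."""
    loggers = algorithm_params.pop('loggers', None) or []
    mapped = [logger for logger in loggers if logger in SUPPORTED_LOGGERS]
    if mapped:
        algorithm_params['logger_type'] = mapped
    return mapped

params = {'loggers': ['wandb', 'tensorboard', 'jsonl']}
mapped = map_loggers(params)
print(mapped)
print(params)
```

A list comprehension (rather than set intersection) keeps the user's ordering stable, which matters if the backend treats the first logger as primary.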
🧹 Nitpick comments (4)
src/training_hub/algorithms/lora.py (1)
**332-345: Rename ambiguous loop variable `l` to improve readability.**

Static analysis flags `l` as ambiguous (easily confused with `1` or `I`). Use a more descriptive name.

✏️ Suggested fix

```diff
 def _get_report_to(self, params: Dict[str, Any]) -> Optional[Union[str, List[str]]]:
     """Determine which loggers to report to based on params."""
     loggers = params.get('loggers')
     if loggers:
         # HuggingFace accepts list of loggers like ["wandb", "tensorboard", "mlflow"]
         # Filter out 'jsonl' as HF doesn't support it directly
         hf_supported = ["wandb", "tensorboard", "mlflow"]
-        filtered = [l for l in loggers if l in hf_supported]
+        filtered = [logger for logger in loggers if logger in hf_supported]
         if filtered:
             return filtered if len(filtered) > 1 else filtered[0]
         return None
     elif params.get('wandb_project'):
         return "wandb"
     return None
```

src/training_hub/algorithms/sft.py (2)
**1-3: Unused import: `warnings`.**

The `warnings` module is imported but not used anywhere in this file. Remove it or add the intended usage.

✏️ Suggested fix

```diff
 import os
-import warnings
 from typing import Any, Dict, List, Type, Optional
```
**22-28: Consider documenting the parameter support matrix in code comments.**

The distinction between supported parameters (`run_name`, `log_level`, `logger_type`, `mlflow_*`, `wandb_*`) and unsupported ones (`loggers`, `logging_steps`, `wandb_run_name`, `tensorboard_log_dir`) could benefit from a brief explanation of why each is categorized as such. This helps future maintainers understand the backend constraints.

examples/scripts/sft_mlflow_example.py (1)
**77-112: Remove unused `result` variable.**

The `result` variable is assigned but never used. Either remove the assignment or add a comment explaining the return value structure.

✏️ Suggested fix

```diff
 try:
-    result = sft(
+    sft(
         # Model and data
         model_path=args.model_path,
```

Or, if you want to demonstrate the return value:

```diff
 print(f"View results in MLflow: {args.mlflow_uri}")
+print(f"Training result: {result}")
 print("=" * 60)
```
```python
# Handle logging parameters - validate and warn for unsupported ones
SUPPORTED_LOGGERS = {"wandb", "jsonl"}
if 'loggers' in algorithm_params:
    loggers = algorithm_params.pop('loggers', None)
    if loggers:
        unsupported = set(loggers) - SUPPORTED_LOGGERS
        if unsupported:
            warnings.warn(
                f"Loggers {unsupported} are not supported by mini-trainer backend. "
                f"Only {SUPPORTED_LOGGERS} are supported. Ignoring unsupported loggers."
            )

# Validate and remove unsupported logging parameters with warnings
UNSUPPORTED_PARAMS = {
    'log_level': 'log level configuration',
    'logging_steps': 'logging steps interval',
    'tensorboard_log_dir': 'TensorBoard logging',
    'mlflow_tracking_uri': 'MLflow logging',
    'mlflow_experiment_name': 'MLflow logging',
    'run_name': 'run name (use wandb_run_name instead)',
}
for param, description in UNSUPPORTED_PARAMS.items():
    if algorithm_params.pop(param, None) is not None:
        warnings.warn(
            f"Parameter '{param}' ({description}) is not supported by mini-trainer backend. Ignoring."
        )
```
🧩 Analysis chain
🏁 Scripts executed:

```bash
# First, check the file structure and find mini-trainer backend references
fd -t f "train" src/ | head -20

# Search for TrainingArgs and how loggers are used
rg -n "class TrainingArgs" -A 10 --type py

# Search for where loggers might be used in the training backend
rg -n "loggers" --type py -B 2 -A 2

# Check the specific file context around the warning calls
sed -n '410,440p' src/training_hub/algorithms/osft.py

# See if supported loggers are referenced anywhere after being popped
rg -n "SUPPORTED_LOGGERS\|supported.*logger" --type py -B 2 -A 2
```
**Add `stacklevel=2` to `warnings.warn` and forward supported loggers to the backend.**

- The `warnings.warn()` calls at lines 419 and 435 are missing `stacklevel=2`, which is needed to point warnings to the caller rather than internal code.
- After validating supported loggers (`wandb`, `jsonl`), the code pops them from `algorithm_params` but never forwards them to the backend. Other algorithms like `sft.py` convert loggers to backend-specific formats (e.g., `logger_type`). The mini-trainer backend should receive the supported loggers to function correctly.
✏️ Suggested fix
```diff
 if 'loggers' in algorithm_params:
     loggers = algorithm_params.pop('loggers', None)
     if loggers:
         unsupported = set(loggers) - SUPPORTED_LOGGERS
         if unsupported:
             warnings.warn(
                 f"Loggers {unsupported} are not supported by mini-trainer backend. "
-                f"Only {SUPPORTED_LOGGERS} are supported. Ignoring unsupported loggers."
+                f"Only {SUPPORTED_LOGGERS} are supported. Ignoring unsupported loggers.",
+                stacklevel=2
             )
+        # Forward supported loggers to the backend
+        supported = [lg for lg in loggers if lg in SUPPORTED_LOGGERS]
+        if supported:
+            algorithm_params['loggers'] = supported

 # Validate and remove unsupported logging parameters with warnings
 UNSUPPORTED_PARAMS = {
     'log_level': 'log level configuration',
     'logging_steps': 'logging steps interval',
     'tensorboard_log_dir': 'TensorBoard logging',
     'mlflow_tracking_uri': 'MLflow logging',
     'mlflow_experiment_name': 'MLflow logging',
     'run_name': 'run name (use wandb_run_name instead)',
 }
 for param, description in UNSUPPORTED_PARAMS.items():
     if algorithm_params.pop(param, None) is not None:
         warnings.warn(
-            f"Parameter '{param}' ({description}) is not supported by mini-trainer backend. Ignoring."
+            f"Parameter '{param}' ({description}) is not supported by mini-trainer backend. Ignoring.",
+            stacklevel=2
         )
```
🧰 Tools
🪛 Ruff (0.14.14)
419-419: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
435-435: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/training_hub/algorithms/sft.py`:
- Around line 210-220: The SFT flow currently drops wandb_run_name via the
unsupported_logging_keys list instead of forwarding it to HuggingFace
TrainingArguments; update the SFT TrainingArgs builder to map wandb_run_name ->
run_name (same behavior as LoRA's mapping in lora.py) when constructing
TrainingArguments (or, if you prefer not to map, emit a clear warning when
wandb_run_name is present and ignored). Locate the use of
unsupported_logging_keys and the code block that assembles the TrainingArguments
(symbols: wandb_run_name, run_name, TrainingArgs, unsupported_logging_keys) and
either copy the LoRA mapping logic to set run_name = wandb_run_name when
run_name is missing, or add a logger.warn/error indicating wandb_run_name was
ignored.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/training_hub/algorithms/osft.py`:
- Line 163: The docstring for the parameter "loggers" in
src/training_hub/algorithms/osft.py is out of sync with the implementation:
update the docstring text that currently says only "wandb" and "jsonl" are
supported to also include "mlflow" so it matches the SUPPORTED_LOGGERS constant
(SUPPORTED_LOGGERS) used in the module; locate the docstring where the loggers
parameter is documented (around the current description at the top of the
function/class) and add "mlflow" to the supported list wording.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/training_hub/algorithms/sft.py`:
- Line 32: Remove the redundant inner "import warnings" statement (the duplicate
import found near the top of the SFT implementation) so that only the
module-level import is used; locate the duplicate import statement (the literal
"import warnings") in src/training_hub/algorithms/sft.py and delete it, keeping
all existing uses of the warnings module unchanged.
🧹 Nitpick comments (3)
examples/scripts/osft_mlflow_example.py (1)
**90-122: Consider removing `wandb_run_name` from this MLflow-focused example.**

Line 114 sets `wandb_run_name=run_name` with comment "Also set for fallback", but this is an MLflow-focused example and W&B isn't enabled in the loggers list. This may confuse users about which logging parameters are actually in effect. Also, the `result` variable is assigned but never used.

✏️ Suggested fix

```diff
-    result = osft(
+    osft(
         # Model and data
         model_path=args.model_path,
         data_path=args.data_path,
         ckpt_output_dir=args.ckpt_output_dir,
         # OSFT-specific parameters
         unfreeze_rank_ratio=args.unfreeze_rank_ratio,
         # Training parameters
         num_epochs=args.num_epochs,
         effective_batch_size=args.effective_batch_size,
         learning_rate=args.learning_rate,
         max_seq_len=4096,
         max_tokens_per_gpu=args.max_tokens_per_gpu,
         warmup_steps=10,
         # Checkpointing
         checkpoint_at_epoch=True,
         # Logging - enable MLflow
         mlflow_tracking_uri=args.mlflow_uri,
         mlflow_experiment_name=args.mlflow_experiment,
         mlflow_run_name=run_name,
-        wandb_run_name=run_name,  # Also set for fallback
         # Distributed training
         nproc_per_node=args.nproc_per_node,
         nnodes=1,
         node_rank=0,
         rdzv_id=100,
         rdzv_endpoint="127.0.0.1:29500",
     )
```

docs/guides/logging.md (1)
**24-28: Add language specifier to fenced code block.**

The configuration precedence code block lacks a language specifier.

✏️ Suggested fix

````diff
-```
+```text
 1. Function kwargs (highest priority)
 2. Environment variables
 3. Backend defaults (lowest priority)
````

examples/scripts/sft_mlflow_example.py (1)

**78-112: Minor: Consider the `/dev/shm` assumption and unused `result` variable.**

1. The `result` variable is assigned but never used.
2. Using `/dev/shm` for `data_output_dir` assumes a Linux system with sufficient shared memory. Consider making this configurable or documenting the assumption.

✏️ Suggested fix

```diff
 try:
-    result = sft(
+    sft(
         # Model and data
         model_path=args.model_path,
         data_path=args.data_path,
         ckpt_output_dir=args.ckpt_output_dir,
         # Training parameters
         num_epochs=args.num_epochs,
         effective_batch_size=32,
         learning_rate=1e-5,
         max_seq_len=4096,
         max_tokens_per_gpu=args.max_tokens_per_gpu,
-        # Data processing
-        data_output_dir="/dev/shm",
+        # Data processing (use shared memory for performance on Linux)
+        data_output_dir="/dev/shm",  # Change to /tmp on non-Linux systems
```
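The three-level precedence that the logging guide documents can be sketched as a small resolver (the helper name `resolve` is illustrative, not an API from the repo):

```python
import os

def resolve(key: str, kwargs: dict, env_var: str, default=None):
    """Function kwargs beat environment variables, which beat backend defaults."""
    if kwargs.get(key) is not None:
        return kwargs[key]
    if os.environ.get(env_var):
        return os.environ[env_var]
    return default

os.environ['MLFLOW_TRACKING_URI'] = 'http://env-server:5000'

# An explicit kwarg wins over the environment variable.
a = resolve('mlflow_tracking_uri',
            {'mlflow_tracking_uri': 'http://kwarg-server:5000'},
            'MLFLOW_TRACKING_URI')
# With no kwarg, the environment variable is used.
b = resolve('mlflow_tracking_uri', {}, 'MLFLOW_TRACKING_URI')
print(a)
print(b)
```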
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@docs/api/classes/OSFTAlgorithm.md`:
- Around line 151-163: The two markdown tables (the "Logger" table and the
"Parameter" table in OSFTAlgorithm docs) are misaligned and violate the
table-column-style lint rule; fix by aligning all pipe separators and column
content so each column cell width is consistent and headers/separators match the
number of columns (e.g., ensure the Logger table has three properly aligned
columns with header row "Logger | Enabled By | Env Variable Fallback" and the
Parameter table has four aligned columns with header "Parameter | Type | Default
| Description"), and keep parameter names like mlflow_tracking_uri,
mlflow_experiment_name, mlflow_run_name, wandb_project, wandb_entity,
wandb_run_name in their respective rows while matching spacing so markdownlint
passes.
In `@docs/api/functions/osft.md`:
- Around line 121-134: The two Markdown tables in docs/api/functions/osft.md
(the "Logger" table and the "Parameter" table) violate the configured
table-column-style; update both tables so each column separator uses consistent
single-space padding around pipes and the header separator row aligns with the
number of columns (e.g., "| Col | Col |" and "|-----|-----|") — ensure the
Logger table has three columns (Logger, Enabled By, Env Variable Fallback) and
the Parameter table has four columns (Parameter, Type, Default, Description)
with matching separator rows and uniform spacing to satisfy markdownlint.
In `@docs/api/functions/sft.md`:
- Around line 106-121: Adjust the two Markdown tables so they comply with the
table-column-style rule by ensuring consistent single-space padding around pipe
separators and aligned header separators; specifically update the "Logger |
Enabled By | Env Variable Fallback" table and the "Parameter | Type | Default |
Description" table so every cell has one space after and before each pipe (e.g.,
"| MLflow | `mlflow_tracking_uri` | `MLFLOW_TRACKING_URI` |"), and ensure the
header separator rows (the --- lines) match the column widths and have the same
spacing pattern.
In `@examples/scripts/osft_mlflow_example.py`:
- Around line 35-36: The default checkpoint directory for the CLI arg defined in
parser.add_argument('--ckpt-output-dir', ...) should not be /tmp; change the
default to a repo-relative path (e.g., 'checkpoints' or './checkpoints') so
example runs store artifacts inside the project instead of the world-writable
/tmp directory; update the default value in the parser.add_argument call and
adjust any downstream code that reads ckpt_output_dir if it assumes an absolute
path.
In `@src/training_hub/algorithms/osft.py`:
- Around line 163-171: Docstring currently implies run_name is supported but the
mini-trainer backend warns and drops it; update the parameter entry for run_name
in src/training_hub/algorithms/osft.py to explicitly state that run_name is
ignored by the mini-trainer backend (it will be dropped and a warning emitted),
so callers know it has no effect; reference the run_name parameter name and
mention the "mini-trainer backend" and the warning behavior so the
function/class docstring accurately reflects runtime behavior.
- Around line 195-196: The warning emitted when block_size is used with
is_pretraining=False should include a stacklevel so it points to the caller;
update the warnings.warn call inside the conditional (the expression using
is_pretraining and block_size) to pass stacklevel=2 (e.g.,
warnings.warn("block_size only valid with is_pretraining=True", stacklevel=2))
so the warning location refers to the caller rather than internal code.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@examples/scripts/sft_mlflow_example.py`:
- Line 111: The example sets data_output_dir to the Linux-specific "/dev/shm",
which breaks on macOS/Windows; change the SFT/MLflow invocation to use a
portable path (e.g., tempfile.gettempdir() or a subdirectory of the existing
checkpoint path) or add a platform check and fallback—update the call that
passes data_output_dir (look for the data_output_dir= argument in
sft_mlflow_example.py) to compute a cross-platform directory
(Path(checkpoint_output)/"data" or tempfile.gettempdir()/ "sft_data") and ensure
the directory is created before use.
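A portable replacement for the hard-coded `/dev/shm` could be sketched like this (the `sft_data` subdirectory name and the helper name are assumptions for illustration):

```python
import tempfile
from pathlib import Path

def default_data_output_dir() -> Path:
    """Pick a writable scratch directory that exists on every platform."""
    shm = Path("/dev/shm")
    # Prefer Linux shared memory when available; otherwise fall back
    # to the platform temp dir (e.g. /tmp on macOS, %TEMP% on Windows).
    base = shm if shm.is_dir() else Path(tempfile.gettempdir())
    out = base / "sft_data"
    out.mkdir(parents=True, exist_ok=True)  # ensure it exists before use
    return out

d = default_data_output_dir()
print(d.name)
```

The example script could then pass `data_output_dir=str(default_data_output_dir())` instead of a literal path.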
In `@src/training_hub/algorithms/lora.py`:
- Around line 407-415: Add a missing mlflow_run_name param to the LoRA API to
match SFT: update the train() signature to accept mlflow_run_name (same type as
mlflow_experiment_name), add mlflow_run_name to the optional_params dict where
other MLflow options are collected, include it in get_optional_params() so it is
passed through, and add mlflow_run_name to the lora_sft() function signature so
downstream SFT-style calls receive the run name; search for symbols train,
optional_params, get_optional_params, and lora_sft in this file to locate the
exact insertion points and ensure all three places use the same param name and
type.
🧹 Nitpick comments (1)
src/training_hub/algorithms/lora.py (1)
**339-339: Rename ambiguous loop variable `l` to `logger`.**

The variable name `l` is ambiguous (looks like `1` in some fonts) and is flagged by the linter.

✏️ Suggested fix

```diff
-        filtered = [l for l in loggers if l in hf_supported]
+        filtered = [logger for logger in loggers if logger in hf_supported]
```
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/training_hub/algorithms/lora.py (1)
**296-316: ⚠️ Potential issue | 🟠 Major — Wire `tensorboard_log_dir` to SFTConfig via the `logging_dir` parameter.**

TensorBoard is enabled when `tensorboard_log_dir` is provided (via `report_to`), but the directory path is never passed to `SFTConfig`. This causes TensorBoard to log to its default directory instead of the user-specified path.

🔧 Proposed fix

```diff
 training_args = SFTConfig(
     output_dir=params['ckpt_output_dir'],
     num_train_epochs=num_epochs,
     per_device_train_batch_size=micro_batch_size,
     gradient_accumulation_steps=gradient_accumulation_steps,
     learning_rate=params.get('learning_rate', 2e-4),
     weight_decay=params.get('weight_decay', 0.01),
     fp16=params.get('fp16', False),
     bf16=params.get('bf16', True),
     max_grad_norm=params.get('max_grad_norm', 0.3),
     warmup_steps=params.get('warmup_steps', 10),
     lr_scheduler_type=params.get('lr_scheduler', 'linear'),
     # Logging
     logging_steps=params.get('logging_steps', 1),
+    logging_dir=params.get('tensorboard_log_dir'),
     save_steps=params.get('save_steps', 500),
     eval_steps=params.get('eval_steps', 500),
     save_total_limit=params.get('save_total_limit', 3),
```
🤖 Fix all issues with AI agents
In `@src/training_hub/algorithms/lora.py`:
- Around line 332-345: Update `_get_report_to` to first check for an explicit
"loggers" key in params and honor it (allowing users to disable auto-detection
or force reporters): if params.get("loggers") is provided accept either a str or
list, validate entries against the allowed set {"wandb","mlflow","tensorboard"}
and raise ValueError for unknown entries, normalize a single string to return
that string and a list of multiple entries to return the list; if "loggers" is
not provided, preserve the existing auto-detection logic (checking
wandb_project, MLFLOW_TRACKING_URI, tensorboard_log_dir) and return None when
nothing detected. Ensure you reference the _get_report_to function and update
its input validation and return handling accordingly.
- Around line 324-329: The current logic only sets
os.environ['MLFLOW_EXPERIMENT_NAME'] when params.get('mlflow_tracking_uri') is
present; change it so that if params.get('mlflow_experiment_name') is truthy it
always sets os.environ['MLFLOW_EXPERIMENT_NAME'] regardless of
params.get('mlflow_tracking_uri'), while keeping the existing behavior for
os.environ['MLFLOW_TRACKING_URI'] when params.get('mlflow_tracking_uri') is
provided; update the block around params, 'mlflow_tracking_uri', and
'mlflow_experiment_name' to set the experiment name independently.
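The decoupled behavior the prompt asks for can be sketched as follows (the helper name `apply_mlflow_env` is illustrative; the env var names match MLflow's standard `MLFLOW_TRACKING_URI` / `MLFLOW_EXPERIMENT_NAME` conventions):

```python
import os

def apply_mlflow_env(params: dict) -> None:
    # Tracking URI and experiment name are set independently, so an
    # experiment name supplied without a URI is no longer silently dropped.
    if params.get('mlflow_tracking_uri'):
        os.environ['MLFLOW_TRACKING_URI'] = params['mlflow_tracking_uri']
    if params.get('mlflow_experiment_name'):
        os.environ['MLFLOW_EXPERIMENT_NAME'] = params['mlflow_experiment_name']

# Experiment name provided without a tracking URI still takes effect.
os.environ.pop('MLFLOW_TRACKING_URI', None)
apply_mlflow_env({'mlflow_experiment_name': 'demo-exp'})
print(os.environ['MLFLOW_EXPERIMENT_NAME'])
```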
```python
def _get_report_to(self, params: Dict[str, Any]) -> Optional[Union[str, List[str]]]:
    """Determine which loggers to report to based on config params."""
    # Auto-detect loggers from their config parameters
    detected = []
    if params.get('wandb_project') or os.environ.get('WANDB_PROJECT'):
        detected.append("wandb")
    if params.get('mlflow_tracking_uri') or os.environ.get('MLFLOW_TRACKING_URI'):
        detected.append("mlflow")
    if params.get('tensorboard_log_dir'):
        detected.append("tensorboard")

    if detected:
        return detected if len(detected) > 1 else detected[0]
    return None
```
🧩 Analysis chain
🏁 Scripts executed:

```bash
cat -n src/training_hub/algorithms/lora.py | sed -n '320,360p'
rg "_get_report_to" src/training_hub/algorithms/lora.py -B 2 -A 2
rg "loggers" src/training_hub/algorithms/lora.py -B 2 -A 2
rg "report_to" src/training_hub/algorithms/lora.py -B 2 -A 2
cat -n src/training_hub/algorithms/lora.py | sed -n '310,330p'
rg "loggers" src/training_hub/algorithms/ -B 1 -A 1
```
**Implement explicit `loggers` parameter selection with validation.**

The comment at line 313 states "use loggers list if provided, otherwise fallback to wandb_project", but `_get_report_to()` ignores any explicit `loggers` parameter and only auto-detects. This prevents users from disabling auto-detection or forcing specific reporters.
🔧 Proposed fix
```diff
 def _get_report_to(self, params: Dict[str, Any]) -> Optional[Union[str, List[str]]]:
     """Determine which loggers to report to based on config params."""
+    explicit = params.get('loggers')
+    if explicit is not None:
+        if isinstance(explicit, str):
+            explicit = [explicit]
+        allowed = {"wandb", "mlflow", "tensorboard"}
+        invalid = set(explicit) - allowed
+        if invalid:
+            raise ValueError(f"Unsupported loggers: {sorted(invalid)}")
+        return explicit if len(explicit) > 1 else explicit[0]
+
     # Auto-detect loggers from their config parameters
     detected = []
     if params.get('wandb_project') or os.environ.get('WANDB_PROJECT'):
         detected.append("wandb")
     if params.get('mlflow_tracking_uri') or os.environ.get('MLFLOW_TRACKING_URI'):
         detected.append("mlflow")
     if params.get('tensorboard_log_dir'):
         detected.append("tensorboard")
```
🤖 Prompt for AI Agents
In `@src/training_hub/algorithms/lora.py` around lines 332-345, update
`_get_report_to` to first check for an explicit "loggers" key in params and honor
it (allowing users to disable auto-detection or force reporters): if
params.get("loggers") is provided accept either a str or list, validate entries
against the allowed set {"wandb","mlflow","tensorboard"} and raise ValueError
for unknown entries, normalize a single string to return that string and a list
of multiple entries to return the list; if "loggers" is not provided, preserve
the existing auto-detection logic (checking wandb_project, MLFLOW_TRACKING_URI,
tensorboard_log_dir) and return None when nothing detected. Ensure you reference
the _get_report_to function and update its input validation and return handling
accordingly.
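For readers skimming the thread, the reviewer's suggestion condenses to the following standalone, runnable sketch. Names here are illustrative: the real method lives on the algorithm class and also consults environment variables.

```python
from typing import Any, Dict, List, Optional, Union

ALLOWED = {"wandb", "mlflow", "tensorboard"}

def get_report_to(params: Dict[str, Any]) -> Optional[Union[str, List[str]]]:
    # Honor an explicit 'loggers' entry (string or list), validating it
    # against the supported set; otherwise fall back to auto-detection.
    explicit = params.get("loggers")
    if explicit is not None:
        if isinstance(explicit, str):
            explicit = [explicit]
        invalid = set(explicit) - ALLOWED
        if invalid:
            raise ValueError(f"Unsupported loggers: {sorted(invalid)}")
        return explicit if len(explicit) > 1 else explicit[0]
    # Auto-detect each logger from its defining config parameter.
    detected = [
        name
        for name, key in [
            ("wandb", "wandb_project"),
            ("mlflow", "mlflow_tracking_uri"),
            ("tensorboard", "tensorboard_log_dir"),
        ]
        if params.get(key)
    ]
    if detected:
        return detected if len(detected) > 1 else detected[0]
    return None
```

A single logger returns a plain string while multiple loggers return a list, mirroring the shape `report_to` accepts.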
```python
# Logging configuration
wandb_project: Optional[str] = None,
wandb_entity: Optional[str] = None,
wandb_run_name: Optional[str] = None,
tensorboard_log_dir: Optional[str] = None,
mlflow_tracking_uri: Optional[str] = None,
mlflow_experiment_name: Optional[str] = None,
```
Add loggers, run_name, and log_level parameters to LoRA APIs to match PR objectives.
The LoRA implementation currently only exposes vendor-specific logging parameters (wandb_project, wandb_entity, etc.) but lacks the generic loggers, run_name, and log_level parameters referenced in the PR intent. Line 313 of _build_training_args contains a comment indicating "use loggers list if provided" but this parameter is not actually exposed in the function signatures. Users cannot explicitly select loggers, set a run name, or control log level without relying on environment variables.
Thread these parameters through:

- `train()` method signature (lines 407-413 and other locations)
- `optional_params` and `get_optional_params`
- `lora_sft()` function
- `_build_training_args()` to use them when building SFTConfig
In _build_training_args(), use run_name=(run_name or wandb_run_name) prioritizing the generic parameter, and pass log_level to the config.
This PR does not intend to include those arguments, including them in the description was an oversight.
@RobotSail, thank you for the clarification! I understand now that the mention of loggers, run_name, and log_level in the PR description was an oversight, and those parameters are not intended to be part of this PR's scope for LoRA. The current implementation with vendor-specific parameters (wandb_project, mlflow_tracking_uri, etc.) is correct as designed.
I appreciate you clearing up the confusion! 🐰
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/training_hub/algorithms/lora.py`:
- Around line 340-341: The code detects TensorBoard via
params.get('tensorboard_log_dir') but never passes that path into the training
args; update the _build_training_args function so when constructing the
SFTConfig (or training args object) you add logging_dir=
params.get('tensorboard_log_dir') (or None when absent) so SFTConfig receives
the TensorBoard directory; locate the SFTConfig instantiation in
_build_training_args and set the logging_dir kwarg accordingly so the detected
"tensorboard" logger actually writes events to the specified directory.
- Around line 313-316: The SFTConfig run_name is only set from
params.get('wandb_run_name') while params may include 'mlflow_run_name'; update
the code that constructs SFTConfig (the call using
report_to=self._get_report_to(params), run_name=...) to accept a generic
run_name (e.g., params.get('run_name') falling back to wandb_run_name) or, if
keeping separate names, set the MLflow run name when 'mlflow_run_name' is
present (for example by setting
os.environ['MLFLOW_RUN_NAME']=params['mlflow_run_name'] before trainer/logging
initialization or invoking mlflow.start_run(run_name=...)). Ensure this change
touches the same SFTConfig construction site and uses params['mlflow_run_name']
or a unified params['run_name'] so MLflow receives the intended run name.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/training_hub/algorithms/sft.py (1)
124-214: ⚠️ Potential issue | 🟠 Major — Add parameter filtering before passing to TrainingArgs, matching the OSFT pattern.

The logging parameters (`wandb_*`, `mlflow_*`, `tensorboard_*`) are passed directly to `TrainingArgs(**training_params)` without filtering. Unlike the OSFT backend, which uses `fields(TrainingArgs)` to filter unsupported parameters, the SFT backend risks runtime failures if the installed `instructlab-training` version doesn't support these fields. Either filter parameters using field introspection (like OSFT does) or confirm that all versions ≥0.13.0 support these fields.
🤖 Fix all issues with AI agents
In `@src/training_hub/algorithms/sft.py`:
- Around line 25-36: The current auto-detection in sft.py collects
detected_loggers and unconditionally sets algorithm_params['logger_type'], which
can override an explicit caller-provided value and misses some logger signals;
update the detection to (1) check all relevant keys/envs for MLflow and W&B
(include 'mlflow_experiment_name', 'mlflow_run_name', 'wandb_run_name',
'wandb_entity' and corresponding env vars) when building detected_loggers, and
(2) only set algorithm_params['logger_type'] = ','.join(detected_loggers) if
algorithm_params does not already contain a truthy 'logger_type' (i.e., respect
an explicit value); keep the detection logic in the same block around
detected_loggers and algorithm_params so changes are localized.
```python
# Auto-detect loggers from config parameters
detected_loggers = []
if algorithm_params.get('mlflow_tracking_uri') or os.environ.get('MLFLOW_TRACKING_URI'):
    detected_loggers.append('mlflow')
if algorithm_params.get('wandb_project') or os.environ.get('WANDB_PROJECT'):
    detected_loggers.append('wandb')
if algorithm_params.get('tensorboard_log_dir'):
    detected_loggers.append('tensorboard')

if detected_loggers:
    algorithm_params['logger_type'] = ','.join(detected_loggers)
```
Avoid overriding explicit logger_type and detect from all logger params.
Right now, auto-detection can override a caller-provided `logger_type`, and it won't enable W&B/MLflow when only `wandb_run_name`/`wandb_entity` or `mlflow_experiment_name`/`mlflow_run_name` are set. That can silently disable logging or change logger selection.
🔧 Proposed adjustment
```diff
-    detected_loggers = []
-    if algorithm_params.get('mlflow_tracking_uri') or os.environ.get('MLFLOW_TRACKING_URI'):
+    detected_loggers = []
+    if (
+        algorithm_params.get('mlflow_tracking_uri')
+        or algorithm_params.get('mlflow_experiment_name')
+        or algorithm_params.get('mlflow_run_name')
+        or os.environ.get('MLFLOW_TRACKING_URI')
+    ):
         detected_loggers.append('mlflow')
-    if algorithm_params.get('wandb_project') or os.environ.get('WANDB_PROJECT'):
+    if (
+        algorithm_params.get('wandb_project')
+        or algorithm_params.get('wandb_entity')
+        or algorithm_params.get('wandb_run_name')
+        or os.environ.get('WANDB_PROJECT')
+    ):
         detected_loggers.append('wandb')
     if algorithm_params.get('tensorboard_log_dir'):
         detected_loggers.append('tensorboard')
-    if detected_loggers:
+    if detected_loggers and 'logger_type' not in algorithm_params:
         algorithm_params['logger_type'] = ','.join(detected_loggers)
```

🤖 Prompt for AI Agents
In `@src/training_hub/algorithms/sft.py` around lines 25 - 36, The current
auto-detection in sft.py collects detected_loggers and unconditionally sets
algorithm_params['logger_type'], which can override an explicit caller-provided
value and misses some logger signals; update the detection to (1) check all
relevant keys/envs for MLflow and W&B (include 'mlflow_experiment_name',
'mlflow_run_name', 'wandb_run_name', 'wandb_entity' and corresponding env vars)
when building detected_loggers, and (2) only set algorithm_params['logger_type']
= ','.join(detected_loggers) if algorithm_params does not already contain a
truthy 'logger_type' (i.e., respect an explicit value); keep the detection logic
in the same block around detected_loggers and algorithm_params so changes are
localized.
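Put together, the adjusted behavior the review asks for — detect from any logger-related kwarg or env var, but never clobber an explicit `logger_type` — can be exercised as a standalone sketch (function name is illustrative, not the library's):

```python
import os

def detect_logger_type(algorithm_params: dict) -> dict:
    # Auto-detect loggers from all of their config parameters, but
    # respect a caller-provided 'logger_type' if one is present.
    detected = []
    mlflow_keys = ("mlflow_tracking_uri", "mlflow_experiment_name", "mlflow_run_name")
    wandb_keys = ("wandb_project", "wandb_entity", "wandb_run_name")
    if any(algorithm_params.get(k) for k in mlflow_keys) or os.environ.get("MLFLOW_TRACKING_URI"):
        detected.append("mlflow")
    if any(algorithm_params.get(k) for k in wandb_keys) or os.environ.get("WANDB_PROJECT"):
        detected.append("wandb")
    if algorithm_params.get("tensorboard_log_dir"):
        detected.append("tensorboard")
    if detected and "logger_type" not in algorithm_params:
        algorithm_params["logger_type"] = ",".join(detected)
    return algorithm_params
```

With this shape, setting only `wandb_entity` is enough to turn on W&B, and an explicit `logger_type` always wins over detection.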
Force-pushed from 2f9c3a9 to db822be
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@docs/api/functions/sft.md`:
- Around line 106-110: Add a new row to the logger enablement table for the
JSONL logger: add "JSONL" as the Logger, use the configuration key
`jsonl_log_path` as the Enabled By value, and add an environment variable
fallback `JSONL_LOG_PATH` as the Env Variable Fallback so the table now
documents jsonl logging enablement alongside MLflow, W&B, and TensorBoard.
- Around line 112-120: Add the missing logging parameters to the SFT API
parameter table: add rows for `loggers` (type `list[str]`, default e.g. `[]` or
env-aware, description: explicit list of loggers like
["wandb","tensorboard","mlflow","jsonl"]), `run_name` (type `str`, default
`None`, description: base run name supporting `{time}` and `{rank}`
placeholders), `log_level` (type `str`, default e.g. `INFO`, description:
logging verbosity like DEBUG/INFO/WARNING/ERROR/CRITICAL), and `logging_steps`
(type `int`, default `None` or a sensible integer, description: frequency of
logging steps); update the existing rows (`mlflow_tracking_uri`,
`mlflow_experiment_name`, `wandb_project`, `tensorboard_log_dir`, etc.)
formatting to match table style so the new entries align with the same columns
and conventions.
- Line 116: Update the parameter descriptions to explicitly state override
precedence: explain that the global base parameter run_name is used by default,
but logger-specific parameters (e.g., wandb_run_name and mlflow_run_name) take
precedence and override run_name when provided (supporting placeholders like
{time}, {utc_time}, {rank}); update the mlflow_run_name and wandb_run_name
descriptions to mention they override the base run_name and that their
placeholders behave the same as run_name, and if you add run_name to the table
include a short note that it is the default unless a logger-specific run name is
supplied.
In `@examples/scripts/sft_mlflow_example.py`:
- Around line 38-41: The default checkpoint directory uses an absolute
Unix-specific path ("/tmp/sft-mlflow-checkpoints") for the "--ckpt-output-dir"
argument; change its default to a relative, cross-platform path (e.g.,
"sft-mlflow-checkpoints" or "./sft-mlflow-checkpoints") in the add_argument call
so the example is portable and avoids relying on /tmp.
- Around line 7-15: Update the example usage in sft_mlflow_example.py to include
the required CLI flag --data-path (e.g. add "--data-path /path/to/data" to each
shown invocation), so the shown commands (python sft_mlflow_example.py, python
sft_mlflow_example.py --model-path ..., python sft_mlflow_example.py
--mlflow-uri ...) include the mandatory --data-path argument when demonstrating
how to run the script.
| Logger      | Enabled By            | Env Variable Fallback |
| ----------- | --------------------- | --------------------- |
| MLflow      | `mlflow_tracking_uri` | `MLFLOW_TRACKING_URI` |
| W&B         | `wandb_project`       | `WANDB_PROJECT`       |
| TensorBoard | `tensorboard_log_dir` | -                     |
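The enablement rules in this table — kwarg first, env variable as a fallback where one exists — can be expressed directly as data. This is a sketch of the documented behavior, not the library's code:

```python
import os

# logger -> (enabling kwarg, env-variable fallback or None)
ENABLEMENT = {
    "mlflow": ("mlflow_tracking_uri", "MLFLOW_TRACKING_URI"),
    "wandb": ("wandb_project", "WANDB_PROJECT"),
    "tensorboard": ("tensorboard_log_dir", None),
}

def enabled_loggers(params: dict) -> list:
    enabled = []
    for logger, (key, env_var) in ENABLEMENT.items():
        if params.get(key) or (env_var and os.environ.get(env_var)):
            enabled.append(logger)
    return enabled
```

Keeping the mapping in one place makes it easy to document and to extend (e.g., a jsonl row) without touching the detection loop.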
Add jsonl logger to the enablement table.
According to the PR objectives, SFT supports jsonl logging, but the table only documents MLflow, W&B, and TensorBoard. The jsonl logger should be included with its enablement condition and any applicable environment variable fallback.
🤖 Prompt for AI Agents
In `@docs/api/functions/sft.md` around lines 106 - 110, Add a new row to the
logger enablement table for the JSONL logger: add "JSONL" as the Logger, use the
configuration key `jsonl_log_path` as the Enabled By value, and add an
environment variable fallback `JSONL_LOG_PATH` as the Env Variable Fallback so
the table now documents jsonl logging enablement alongside MLflow, W&B, and
TensorBoard.
Exposes logging configuration (tensorboard, wandb, mlflow, jsonl) through
flat kwargs in sft(), osft(), and lora_sft() convenience functions.
- `loggers`: List of loggers to enable (e.g., ["wandb", "mlflow", "jsonl"])
- `run_name`: Run name with placeholder support ({time}, {rank})
- `log_level`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `logging_steps`: How often to log metrics
- `wandb_project`, `wandb_entity`, `wandb_run_name`: W&B configuration
- `tensorboard_log_dir`: TensorBoard output directory
- `mlflow_tracking_uri`, `mlflow_experiment_name`: MLflow configuration
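The `{time}`/`{rank}` placeholder support mentioned for `run_name` can be sketched as a small helper; this is an illustrative expansion, and the library's exact timestamp format may differ:

```python
from datetime import datetime, timezone

def resolve_run_name(template: str, rank: int = 0) -> str:
    # Expand {time} and {rank} placeholders in a run-name template.
    return template.format(
        time=datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S"),
        rank=rank,
    )

print(resolve_run_name("sft-{time}-rank{rank}", rank=2))
```

Per-rank names matter in distributed runs so that each worker's logs land under a distinct, sortable identifier.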
| Logger | SFT | OSFT | LoRA |
|-------------|-----|------|------|
| wandb | Yes | Yes | Yes |
| tensorboard | Yes | No | Yes |
| mlflow | Yes | No | Yes |
| jsonl | Yes | Yes | No |
OSFT emits warnings for unsupported loggers/params and continues.
Depends on: instructlab/training#680
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pass tensorboard_log_dir through to TrainingArgs
Update SFT backend to pass tensorboard_log_dir to TrainingArgs now that
instructlab-training supports configurable TensorBoard log directories.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable MLflow support for OSFT backend
- Add "mlflow" to SUPPORTED_LOGGERS
- Remove mlflow params from UNSUPPORTED_PARAMS
- Add mlflow_run_name parameter throughout
- Update docstrings to reflect MLflow is now supported
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address PR review comments
- Replace hardcoded data path with required=True in sft_mlflow_example.py
- Add stacklevel=2 to warnings.warn calls in osft.py
- Rename ambiguous loop variable 'l' to 'lg' in sft.py
- Add warnings for unsupported logging params in sft.py backend
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add OSFT MLflow example script
Similar to sft_mlflow_example.py but for OSFT training with
unfreeze_rank_ratio and other OSFT-specific parameters.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add logging and experiment tracking documentation
Create comprehensive documentation for the unified logging API:
- MLflow, Weights & Biases, TensorBoard, and JSONL logging
- Configuration precedence (kwarg > env var > defaults)
- Backend-specific notes and supported loggers
- Example usage for all algorithms (SFT, OSFT, LoRA)
- Run naming with placeholder support
Update sidebar to include the new logging guide.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove duplicate warnings import in sft.py
The warnings module is already imported at module level (line 2).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address review nitpicks in docs and examples
- Add language specifier to fenced code block in logging.md
- Remove unused result variable in example scripts
- Remove unnecessary wandb_run_name from MLflow-focused OSFT example
- Add comment clarifying /dev/shm usage for Linux shared memory
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Auto-detect loggers from config parameters
Loggers are now automatically enabled based on their configuration:
- mlflow_tracking_uri or MLFLOW_TRACKING_URI env → enables MLflow
- wandb_project or WANDB_PROJECT env → enables W&B
- tensorboard_log_dir → enables TensorBoard (SFT only)
The 'loggers' parameter is deprecated and emits a DeprecationWarning
when used. Updated API documentation with new logging configuration
sections and removed outdated logging guide.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address PR review feedback
- Fix markdown table alignment in docs for markdownlint compliance
- Change osft_mlflow_example.py default checkpoint dir from /tmp to ./
- Document that run_name is not supported by mini-trainer backend
- Add stacklevel=2 to block_size warning in osft.py
- Update loggers docstring to indicate deprecated status
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
remove extraneous api kwargs
removes more extraneous api additions
reintroduce mlflow_run_name into lora
docstrings
drop formatting changes
read envs when present
review comments
Force-pushed from db822be to c5f3dd2
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@examples/scripts/osft_mlflow_example.py`:
- Around line 160-165: The except block catching Exception in
examples/scripts/osft_mlflow_example.py (the "except Exception as e:" after the
osft() training call) is triggering BLE001; either narrow the handler to the
most likely exception types raised by osft()/model loading/file I/O/mini_trainer
(e.g., RuntimeError, ValueError, OSError) and handle or re-raise unexpected
ones, or if you intentionally want a catch-all for CLI robustness, keep the
broad handler but add a lint waiver comment on that line ("# noqa: BLE001") to
suppress the ruff warning.
In `@examples/scripts/sft_mlflow_example.py`:
- Around line 135-140: The broad "except Exception as e:" handler in the
training error block should be narrowed or explicitly waived; either replace it
with specific exceptions (e.g., "except (RuntimeError, ValueError, OSError) as
e:" or the set appropriate for the training code) so the handler is precise, or
if a catch-all is intentional for this CLI example add a linter waiver by
changing the line to "except Exception as e: # noqa: BLE001" (and apply the
same change to other example scripts with identical patterns); locate the block
around the training/CLI exit handling (the except block that prints "Training
failed: {e}" and calls sys.exit(1)) to make this change.
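The narrowed-handler option the review describes looks like this in a CLI script; `run_training` is a hypothetical stand-in for the `sft()`/`osft()` call:

```python
import sys

def run_training():
    # Hypothetical training entry point, used only for illustration.
    raise ValueError("bad learning rate")

def main() -> None:
    try:
        run_training()
    except (RuntimeError, ValueError, OSError) as e:
        # Narrowed handler: specific enough to satisfy ruff's BLE001
        # while still giving CLI users a clean failure message.
        print(f"Training failed: {e}")
        sys.exit(1)
```

If a true catch-all is preferred for example scripts, `except Exception as e:  # noqa: BLE001` keeps the behavior and waives the lint instead.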
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@examples/scripts/osft_mlflow_example.py`:
- Line 119: Print formatting is inconsistent: update the f-string that prints
the MLflow experiment so there's a space after the colon to match adjacent
lines; locate the print using args.mlflow_experiment (the statement currently
"print(f\"MLflow Experiment:{args.mlflow_experiment}\")") and change the string
to include a space after the colon so output becomes "MLflow Experiment:
{args.mlflow_experiment}".
🧹 Nitpick comments (1)
src/training_hub/algorithms/lora.py (1)
349-373: Consider removing redundant environment variable checks.

Since lines 250-262 already populate `params` from environment variables before `_get_report_to()` is called, the direct `os.environ.get()` checks here are redundant. The params dict already contains the env var values.

♻️ Optional simplification

```diff
     def _get_report_to(self, params: Dict[str, Any]) -> Optional[Union[str, List[str]]]:
         """Determine which loggers to report to based on config params."""
         # Auto-detect loggers from their config parameters
         detected = []
         if (
             params.get('wandb_project')
             or params.get('wandb_run_name')
             or params.get('wandb_entity')
-            or os.environ.get('WANDB_PROJECT')
         ):
             detected.append("wandb")
         if (
             params.get('mlflow_tracking_uri')
             or params.get('mlflow_experiment_name')
             or params.get('mlflow_run_name')
-            or os.environ.get('MLFLOW_TRACKING_URI')
-            or os.environ.get('MLFLOW_EXPERIMENT_NAME')
         ):
             detected.append("mlflow")
         if params.get('tensorboard_log_dir'):
             detected.append("tensorboard")
         if detected:
             return detected if len(detected) > 1 else detected[0]
         return None
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@docs/algorithms/lora.md`:
- Around line 282-285: The example uses an unsupported parameter mlflow_run_name
which will fail; update the example to use the supported run_name parameter (or
remove the run name line entirely if run_name is not supported), ensuring the
block still includes mlflow_tracking_uri and mlflow_experiment_name; change the
key mlflow_run_name -> run_name in the code snippet where the client/config is
constructed so it matches the supported API.
revert transformers v5 lora fix
Maxusmusti
left a comment
LGTM, all my review comments have been addressed
Updates minimum versions to include MLflow active run fix and other recent changes. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
Exposes logging configuration (tensorboard, wandb, mlflow, jsonl) through flat kwargs in `sft()`, `osft()`, and `lora_sft()` convenience functions.

New Parameters

All training functions now accept these optional logging parameters:

| Parameter                | Type  |
| ------------------------ | ----- |
| `wandb_project`          | `str` |
| `wandb_entity`           | `str` |
| `wandb_run_name`         | `str` |
| `tensorboard_log_dir`    | `str` |
| `mlflow_tracking_uri`    | `str` |
| `mlflow_experiment_name` | `str` |

Backend Support
*OSFT always logs to JSONL internally
Behavior for unsupported loggers/params: OSFT emits a warning and continues with supported options.
Example Usage
Dependencies
Depends on: instructlab/training#680
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Style
Chores