ROCm · prosenjitdhole · Jan 30, 2026 · Jan 30, 2026
diff --git a/docs/QUICK_RUN_GUIDE.md b/docs/QUICK_RUN_GUIDE.md
@@ -0,0 +1,224 @@
+# Quick Run Guide: aorta-report Pipelines
+
+This guide demonstrates how to use the `aorta-report` CLI to analyze PyTorch profiler traces.
+
+---
+
+## 1. Input Directory Structures
+
+### GEMM Sweep Directory (`gemm-sweep/`)
+
+Used for analyzing GEMM kernel variance across multiple thread/channel configurations.
+
+```
+experiments/2026-01-10/gemm-sweep/
+├── 256thread/
+│   ├── nccl_28channels/
+│   │   └── torch_profiler/
+│   │       ├── rank0/trace/pt.trace.json
+│   │       ├── rank1/trace/pt.trace.json
+│   │       └── ... (rank2-7)
+│   └── nccl_56channels/
+│       └── torch_profiler/
+│           └── rank*/trace/pt.trace.json
+├── 512thread/
+│   ├── nccl_28channels/
+│   │   └── torch_profiler/rank*/...
+│   └── nccl_56channels/
+│       └── torch_profiler/rank*/...
+└── tracelens_analysis/           # Generated by TraceLens
+    ├── 256thread/individual_reports/
+    └── 512thread/individual_reports/
+```
+
+### RCCL Warp Speed Directory (`rccl-warp-speed/`)
+
+Used for comparing baseline vs test configurations (A/B comparison).
+
+```
+experiments/2026-01-10/rccl-warp-speed/
+├── 32cu_512threads/              # Baseline configuration
+│   ├── torch_profiler/
+│   │   ├── rank0/*.json
+│   │   ├── rank1/*.json
+│   │   └── ... (rank2-7)
+│   └── tracelens_analysis/       # Generated by TraceLens
+│       ├── individual_reports/
+│       └── collective_reports/
+├── 37cu_384threads/              # Test configuration
+│   ├── torch_profiler/rank*/...
+│   └── tracelens_analysis/...
+└── 56cu_256threads/              # Another configuration
+    └── ...
+```
+
+---
+
+## 2. Pipeline Commands
+
+### GEMM Variance Analysis Pipeline
+
+Analyzes GEMM kernel time variance across thread/channel configurations.
+
+```bash
+aorta-report pipeline gemm \
+    --sweep-dir ./experiments/2026-01-10/gemm-sweep/ \
+    -o ./comparison_gemm_1/
+```
+
+**Options:**
+- `--sweep-dir` - Path to sweep directory with thread/channel subdirectories
+- `-o, --output` - Output directory for results
+- `--skip-tracelens` - Skip TraceLens analysis if reports already exist
+- `--top-k` - Number of top GEMM kernels to extract (default: 5)
+- `-t, --threads` - Thread configs to analyze (default: 256, 512)
+- `-c, --channels` - Channel configs to analyze (default: 28, 42, 56, 70)
+- `--no-plots` - Skip plot generation
+- `--no-html` - Skip HTML report generation
+
+**Example with options:**
+```bash
+aorta-report pipeline gemm \
+    --sweep-dir ./experiments/2026-01-10/gemm-sweep/ \
+    -o ./comparison_gemm/ \
+    --skip-tracelens \
+    --top-k 10 \
+    -t 256 -t 512 \
+    -c 28 -c 56
+```
+
+---
+
+### Summary Comparison Pipeline
+
+Compares two configurations (baseline vs test) with comprehensive analysis.
+
+```bash
+aorta-report pipeline summary \
+    --baseline ./experiments/2026-01-10/rccl-warp-speed/32cu_512threads/ \
+    --test ./experiments/2026-01-10/rccl-warp-speed/37cu_384threads/ \
+    --baseline-label 32c_512t \
+    --test-label 37c_384t \
+    --output ./comparison_rccl/
+```
+
+**Options:**
+- `--baseline` - Path to baseline trace directory
+- `--test` - Path to test trace directory
+- `--baseline-label` - Label for baseline in reports
+- `--test-label` - Label for test in reports
+- `--output` - Output directory for results
+- `--skip-tracelens` - Skip TraceLens analysis if reports already exist
+- `--gpu-timeline/--no-gpu-timeline` - Include or skip the GPU timeline comparison
+- `--collective/--no-collective` - Include or skip the collective/NCCL comparison
+
+**Example with options:**
+```bash
+aorta-report pipeline summary \
+    --baseline ./experiments/2026-01-10/rccl-warp-speed/32cu_512threads/ \
+    --test ./experiments/2026-01-10/rccl-warp-speed/56cu_256threads/ \
+    --baseline-label baseline_32cu \
+    --test-label test_56cu \
+    --output ./comparison_output/ \
+    --skip-tracelens
+```
+
+---
+
+## 3. Output Directory Structures
+
+### GEMM Pipeline Output (`comparison_gemm_1/`)
+
+```
+comparison_gemm_1/
+├── top5_gemm_kernels_time_variance.csv           # Raw GEMM variance data
+├── top5_gemm_kernels_time_variance_with_timestamps.csv  # Enhanced with timestamps
+├── plots/
+│   ├── variance_by_threads_boxplot.png           # Variance by thread config
+│   ├── variance_by_channels_boxplot.png          # Variance by channel config
+│   ├── variance_by_ranks_boxplot.png             # Variance by rank
+│   ├── variance_thread_channel_interaction.png   # Thread × Channel interaction
+│   └── variance_violin_combined.png              # Combined violin plot
+└── gemm_variance_report.html                     # Self-contained HTML report
+```
+
+**Key outputs:**
+- **CSV files**: Raw data for further analysis
+- **Boxplots**: Identify which configs have highest variance
+- **HTML report**: Share with team (includes all plots embedded)
+
+---
+
+### Summary Pipeline Output (`comparison_rccl/`)
+
+```
+comparison_rccl/
+├── gpu_timeline_comparison.xlsx                  # GPU timeline comparison
+├── gpu_timeline_combined.xlsx                    # Combined timeline data
+├── collective_comparison.xlsx                    # NCCL collective comparison
+├── collective_combined.xlsx                      # Combined collective data
+├── final_analysis_report.xlsx                    # Comprehensive analysis
+├── plots/
+│   ├── abs_time_comparison.png                   # Absolute time comparison
+│   ├── computation_time_by_rank.png              # Computation time per rank
+│   ├── idle_time_by_rank.png                     # Idle time per rank
+│   ├── total_time_by_rank.png                    # Total time per rank
+│   ├── total_comm_time_by_rank.png               # Communication time per rank
+│   ├── gpu_time_heatmap.png                      # GPU time heatmap
+│   ├── gpu_time_change_percentage_summary_by_rank.png  # % change summary
+│   ├── improvement_chart.png                     # Overall improvement chart
+│   ├── NCCL_Algorithm_Bandwidth_comparison.png   # NCCL bandwidth comparison
+│   ├── NCCL_Bus_Bandwidth_comparison.png         # Bus bandwidth comparison
+│   ├── NCCL_Communication_Latency_comparison.png # Latency comparison
+│   ├── NCCL_Total_Communication_Latency_comparison.png
+│   └── NCCL_Performance_Percentage_Change_comparison.png
+└── performance_analysis_report.html              # Self-contained HTML report
+```
+
+**Key outputs:**
+- **Excel files**: Detailed data for spreadsheet analysis
+- **Plots**: Visual comparisons between baseline and test
+- **HTML report**: Share comprehensive results with team
+
+---
+
+## 4. Quick Start Examples
+
+### Analyze a new sweep directory
+```bash
+# Full pipeline (runs TraceLens + GEMM analysis)
+aorta-report pipeline gemm --sweep-dir /path/to/sweep -o ./output/
+
+# If TraceLens was already run
+aorta-report pipeline gemm --sweep-dir /path/to/sweep -o ./output/ --skip-tracelens
+```
+
+### Compare two configurations
+```bash
+# Full comparison (runs TraceLens + comparison)
+aorta-report pipeline summary \
+    --baseline /path/to/baseline \
+    --test /path/to/test \
+    --baseline-label "Baseline" \
+    --test-label "Test" \
+    --output ./comparison/
+```
+
+### Run only TraceLens analysis
+```bash
+# Single configuration
+aorta-report analyze single /path/to/traces
+
+# Sweep directory (multiple configs)
+aorta-report analyze sweep /path/to/sweep
+```
+
+---
+
+## 5. Tips
+
+1. **First run**: Let the pipeline run TraceLens (don't use `--skip-tracelens`)
+2. **Subsequent runs**: Use `--skip-tracelens` to save time
+3. **Large datasets**: Use `--no-plots --no-html` (listed above for the GEMM pipeline) for faster processing
+4. **Custom analysis**: Use the CSV/Excel outputs for custom visualization
+
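Reviewer sketch (not part of the patch): the GEMM sweep layout documented in the guide can be checked before running the pipeline by listing the trace files it contains. The helper name `find_sweep_traces` is hypothetical, and the glob pattern assumes the exact `<N>thread/nccl_<M>channels/torch_profiler/rank*/trace/pt.trace.json` layout shown in section 1. It does not reflect how `aorta-report` discovers traces internally.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def find_sweep_traces(sweep_dir: Path) -> Dict[Tuple[str, str], List[Path]]:
    """Group pt.trace.json files by (thread directory, channel directory)."""
    traces: Dict[Tuple[str, str], List[Path]] = {}
    # Assumed layout, copied from the guide's directory tree.
    pattern = "*thread/nccl_*channels/torch_profiler/rank*/trace/pt.trace.json"
    for path in sorted(sweep_dir.glob(pattern)):
        # The first two components below sweep_dir name the configuration.
        thread_cfg, channel_cfg = path.relative_to(sweep_dir).parts[:2]
        traces.setdefault((thread_cfg, channel_cfg), []).append(path)
    return traces
```

A configuration with fewer trace files than expected ranks is likely to indicate an incomplete profiling run, which is cheaper to catch here than midway through the pipeline.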
diff --git a/scripts/gemm_analysis/run_tracelens_analysis.sh b/scripts/gemm_analysis/run_tracelens_analysis.sh
@@ -264,7 +264,10 @@ else
             # trace file in the rank folder to the canonical `trace/pt.trace.json` path.
             # This will satisfy TraceLens's requirement of only one `*` being present in the trace pattern
             # while also avoiding FileNotFoundErrors due to different filenames.
-            find $TRACE_DIR/rank* -name "*.json" -exec sh -c 'mkdir -p "$(dirname "$0")/trace" && mv "$0" "$(dirname "$0")/trace/pt.trace.json"' {} \;
+            # OLD (not idempotent - running twice creates trace/trace/pt.trace.json):
+            # find $TRACE_DIR/rank* -name "*.json" -exec sh -c 'mkdir -p "$(dirname "$0")/trace" && mv "$0" "$(dirname "$0")/trace/pt.trace.json"' {} \;
+            # NEW: -not -path "*/trace/*" ensures this is idempotent (safe to run multiple times)
+            find "$TRACE_DIR"/rank* -name "*.json" -not -path "*/trace/*" -exec sh -c 'mkdir -p "$(dirname "$0")/trace" && mv "$0" "$(dirname "$0")/trace/pt.trace.json"' {} \;
 
             TraceLens_generate_multi_rank_collective_report_pytorch \
                 --trace_pattern "$TRACE_DIR/rank*/trace/pt.trace.json" \

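Reviewer sketch (not part of the patch): the idempotency fix in the `find` command above can be illustrated in Python. The function name `normalize_rank_traces` is hypothetical. One assumed difference from the shell version: this sketch only inspects path components below `trace_dir`, whereas `-not -path "*/trace/*"` matches against the full path.

```python
import shutil
from pathlib import Path


def normalize_rank_traces(trace_dir: Path) -> int:
    """Move each rank's JSON trace to rank*/trace/pt.trace.json; return the number moved."""
    moved = 0
    for rank_dir in sorted(trace_dir.glob("rank*")):
        if not rank_dir.is_dir():
            continue
        # sorted() materializes the listing before any file is moved.
        for json_file in sorted(rank_dir.rglob("*.json")):
            # Skip files already under a trace/ directory; this is the
            # check that makes a second run a no-op.
            if "trace" in json_file.relative_to(trace_dir).parts[:-1]:
                continue
            target = json_file.parent / "trace" / "pt.trace.json"
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(json_file), str(target))
            moved += 1
    return moved
```

Without the skip, a second run would pick up `rank0/trace/pt.trace.json` and move it to `rank0/trace/trace/pt.trace.json`, which is the failure described in the OLD comment.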
diff --git a/src/aorta/report/analysis/__init__.py b/src/aorta/report/analysis/__init__.py
@@ -3,12 +3,13 @@
 from .tracelens_wrapper import TraceLensWrapper
 from .analyze_gemm import analyze_gemm_reports
 from .analyze_single import analyze_single_config
-from .analyze_sweep import analyze_sweep_config
+from .analyze_sweep import analyze_sweep_config, discover_and_run_tracelens
 
 __all__ = [
     "TraceLensWrapper",
     "analyze_gemm_reports",
     "analyze_single_config",
     "analyze_sweep_config",
+    "discover_and_run_tracelens",
 ]
 
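Reviewer sketch (not part of the patch): the `analyze_gemm.py` diff that follows replaces fixed Excel column letters (X, AG, AH) with lookup by header name. The condensed stand-in below, with the hypothetical name `locate_columns`, shows why that matters: the same required columns resolve correctly even when their positions differ between reports. The two header rows are made-up examples, not real TraceLens output.

```python
from typing import Any, Dict, List


def locate_columns(header_row: List[Any], required: Dict[str, str]) -> Dict[str, int]:
    """Map logical names to 0-based column indices by matching header text."""
    header_map = {str(name): idx for idx, name in enumerate(header_row) if name is not None}
    missing = [name for name in required.values() if name not in header_map]
    if missing:
        raise ValueError(f"Required columns not found: {missing}")
    return {logical: header_map[name] for logical, name in required.items()}


# Column names taken from REQUIRED_COLUMNS in the patch.
REQUIRED = {
    "kernel_info": "kernel_details__summarize_kernel_stats",
    "time_min": "Kernel Time (µs)_min",
    "time_max": "Kernel Time (µs)_max",
}

# Two hypothetical header rows: same columns, different positions.
layout_a = ["name", REQUIRED["kernel_info"], REQUIRED["time_min"], REQUIRED["time_max"]]
layout_b = [REQUIRED["time_max"], None, REQUIRED["kernel_info"], "extra", REQUIRED["time_min"]]

indices_a = locate_columns(layout_a, REQUIRED)
indices_b = locate_columns(layout_b, REQUIRED)
```

With fixed letters, a report that gained or lost a column ahead of X would fail validation. With name lookup, only a renamed or missing header raises `ValueError`.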
diff --git a/src/aorta/report/analysis/analyze_gemm.py b/src/aorta/report/analysis/analyze_gemm.py
@@ -38,12 +38,48 @@ def extract_name_from_kernel_info(kernel_info_str: str) -> Optional[str]:
         return None
 
 
-def column_letter_to_index(letter: str) -> int:
-    """Convert Excel column letter to 0-based index."""
-    index = 0
-    for i, char in enumerate(reversed(letter.upper())):
-        index += (ord(char) - ord("A") + 1) * (26**i)
-    return index - 1
+def find_column_indices(
+    header_row: List[Any],
+    required_columns: Dict[str, str],
+) -> Dict[str, int]:
+    """
+    Find column indices by matching column names in header row.
+
+    Args:
+        header_row: List of column header values
+        required_columns: Dict mapping logical names to expected column names
+                         e.g., {"kernel_info": "kernel_details__summarize_kernel_stats"}
+
+    Returns:
+        Dict mapping logical names to column indices (0-based)
+
+    Raises:
+        ValueError: If any required column is not found
+    """
+    # Create a mapping of column name -> index
+    header_map = {}
+    for idx, col_name in enumerate(header_row):
+        if col_name is not None:
+            header_map[str(col_name)] = idx
+
+    # Find indices for required columns
+    column_indices = {}
+    missing_columns = []
+
+    for logical_name, expected_name in required_columns.items():
+        if expected_name in header_map:
+            column_indices[logical_name] = header_map[expected_name]
+        else:
+            missing_columns.append(f"'{expected_name}' (for {logical_name})")
+
+    if missing_columns:
+        available = list(header_map.keys())[:20]  # Show first 20 columns
+        raise ValueError(
+            f"Required columns not found: {', '.join(missing_columns)}\n"
+            f"Available columns (first 20): {available}"
+        )
+
+    return column_indices
 
 
 def process_excel_file(
@@ -66,6 +102,13 @@ def process_excel_file(
     Returns:
         List of dictionaries containing kernel data
     """
+    # Define required columns by their expected names
+    REQUIRED_COLUMNS = {
+        "kernel_info": "kernel_details__summarize_kernel_stats",
+        "time_min": "Kernel Time (µs)_min",
+        "time_max": "Kernel Time (µs)_max",
+    }
+
     try:
         # Open the workbook
         wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)
@@ -77,62 +120,24 @@ def process_excel_file(
 
         sheet = wb["GEMM"]
 
-        # Expected column positions (0-based indices)
-        col_kernel_info = column_letter_to_index("X")  # Column X
-        col_time_min = column_letter_to_index("AG")  # Column AG
-        col_time_max = column_letter_to_index("AH")  # Column AH
-
-        # Read header row to validate column names
         rows_data = []
         header_row = None
+        col_indices = None
 
         for i, row in enumerate(sheet.iter_rows(values_only=True)):
             if i == 0:
-                # This is the header - validate column names match expectations
+                # Parse header row and find column indices dynamically
                 header_row = list(row)
-
-                # Expected column names (match what TraceLens generates)
-                expected_x = "kernel_details__summarize_kernel_stats"
-                expected_ag = "Kernel Time (µs)_min"
-                expected_ah = "Kernel Time (µs)_max"
-
-                # Validate each expected column
-                errors = []
-
-                if col_kernel_info < len(header_row):
-                    header_x = str(header_row[col_kernel_info]) if header_row[col_kernel_info] else ""
-                    if header_x != expected_x:
-                        errors.append(f"Column X: expected '{expected_x}', found '{header_x}'")
-                else:
-                    errors.append(f"Column X: not found (only {len(header_row)} columns)")
-
-                if col_time_min < len(header_row):
-                    header_ag = str(header_row[col_time_min]) if header_row[col_time_min] else ""
-                    if header_ag != expected_ag:
-                        errors.append(f"Column AG: expected '{expected_ag}', found '{header_ag}'")
-                else:
-                    errors.append(f"Column AG: not found (only {len(header_row)} columns)")
-
-                if col_time_max < len(header_row):
-                    header_ah = str(header_row[col_time_max]) if header_row[col_time_max] else ""
-                    if header_ah != expected_ah:
-                        errors.append(f"Column AH: expected '{expected_ah}', found '{header_ah}'")
-                else:
-                    errors.append(f"Column AH: not found (only {len(header_row)} columns)")
-
-                if errors:
-                    raise ValueError(
-                        f"Column validation failed in {file_path}:\n  " + "\n  ".join(errors)
-                    )
-
+                col_indices = find_column_indices(header_row, REQUIRED_COLUMNS)
                 continue
 
-            if row is None or len(row) <= max(col_kernel_info, col_time_min, col_time_max):
+            if row is None or col_indices is None:
                 continue
 
-            kernel_info = row[col_kernel_info] if col_kernel_info < len(row) else None
-            kernel_time_min = row[col_time_min] if col_time_min < len(row) else None
-            kernel_time_max = row[col_time_max] if col_time_max < len(row) else None
+            # Extract values using dynamically found indices
+            kernel_info = row[col_indices["kernel_info"]] if col_indices["kernel_info"] < len(row) else None
+            kernel_time_min = row[col_indices["time_min"]] if col_indices["time_min"] < len(row) else None
+            kernel_time_max = row[col_indices["time_max"]] if col_indices["time_max"] < len(row) else None
 
             # Extract kernel name
             kernel_name = extract_name_from_kernel_info(kernel_info)