AI-assisted pipeline remediation system that automatically detects, analyzes, and generates code fixes for Azure DevOps pipeline failures.
This system reduces mean time to resolution (MTTR) from 30 minutes to under 2 minutes by automating failure investigation and code generation, while maintaining human review for all changes.
For detailed analysis, screenshots, and in-depth explanations, read the full article on OpsCart:
📖 AI-Powered Pipeline Remediation: Complete Guide
The article includes:
- Architecture deep dive with diagrams
- Step-by-step code explanations
- Screenshots of actual pipeline failures and fixes
- Lessons learned and design decisions
- Production deployment considerations
Results:
- 95%+ detection accuracy across test scenarios
- 93% time reduction (30 min → 2 min)
- Production-ready code generation
- Safe handling of syntax errors (never auto-fixed)
Pipeline Failure → Azure Function → Analysis Engine → Decision Logic → GitHub PR → Human Review → Merge
(Timeline View)
TIME: 0s 1s 10s 13s 15s 2min
│ │ │ │ │ │
FLOW: Pipeline ──> Pattern ──> AI Analysis ──> Decision ──> PR Created ──> Human ──> Merged ✅
Fails Detection (if needed) Logic (GitHub) Review
│ │ │ │ │ │
DETAIL: ❌ Error • Missing Var GPT-5.2 80%: Code Branch Approve
in logs • Wrong Region analyzes 65%: Tips Commit Modify
• Syntax Err last 5000 <65%: Work Changes or
95%+ conf. chars Item Reject
Components:
- Azure Function - Webhook receiver and orchestrator
- Pattern Detection - Fast local analysis (95%+ confidence)
- AI Analysis - GPT-5.2 for complex failures
- Code Generator - Language-specific fix generation
- GitHub Integration - PR creation and management
Decision Logic:
- Confidence ≥80% + Safe → Generate code and create PR
- Confidence 65-80% OR Unsafe → Create PR with suggestions
- Confidence <65% → Create work item for investigation
- Python 3.11+
- Azure Functions Core Tools 4.x
- Azure DevOps account with pipelines
- GitHub account and personal access token
- OpenAI API key (GPT-5.2 access)
- Clone the repository:
git clone https://github.com/opscart/agentic-devops-healing.git
cd agentic-devops-healing- Create Python virtual environment:
python3.11 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate- Install dependencies:
cd src/agents/infra-healer
pip install -r requirements.txt --break-system-packages- Configure environment variables:
cp local.settings.json.example local.settings.jsonEdit local.settings.json with your credentials:
{
"IsEncrypted": false,
"Values": {
"AzureWebJobsStorage": "",
"FUNCTIONS_WORKER_RUNTIME": "python",
"OPENAI_API_KEY": "your-openai-api-key",
"OPENAI_DEPLOYMENT_NAME": "gpt-5.2",
"ADO_ORGANIZATION_URL": "https://dev.azure.com/yourorg",
"ADO_PROJECT_NAME": "your-project",
"ADO_PAT": "your-azure-devops-pat",
"GITHUB_TOKEN": "your-github-token",
"GITHUB_REPO_OWNER": "your-github-username",
"GITHUB_REPO_NAME": "your-repo-name"
}
}Navigate to the function directory and start:
cd ~/Source/agentic-devops-healing/src/agents/infra-healer
# Verify you're in the right directory
pwd
# Should output: /Users/username/Source/agentic-devops-healing/src/agents/infra-healer
# Check directory contents
ls -l
# Should show: function_app.py, analyzers/, fixers/, handlers/, etc.
# Start the Azure Function
func startExpected output:
Azure Functions Core Tools
Core Tools Version: 4.6.0
Function Runtime Version: 4.1045.200.25556
Functions:
HandleFailure: [POST] http://localhost:7071/api/HandleFailure
For detailed output, run func with --verbose flag.
The function is now listening on http://localhost:7071/api/HandleFailure
Trigger the function manually with a pipeline failure:
curl -X POST http://localhost:7071/api/HandleFailure \
-H "Content-Type: application/json" \
-d '{
"pipelineId": "23",
"buildId": "575",
"projectName": "AI-DevOps-POC"
}'Response (within 15 seconds):
{
"status": "success",
"rca": {
"category": "TERRAFORM_MISSING_VARIABLE",
"confidence": 0.95,
"explanation": "Variable 'azure_region' not defined in variables.tf"
},
"action_taken": "AUTO_FIX_PR_CREATED",
"pr_url": "https://github.com/your-repo/pull/28"
}- Deploy to Azure Functions:
func azure functionapp publish <your-function-app-name>-
Configure Azure DevOps webhook:
- Go to Project Settings → Service Hooks
- Create new webhook for "Build completed" events
- Filter: Status = Failed
- URL:
https://<your-function-app>.azurewebsites.net/api/HandleFailure
-
The system will now automatically respond to pipeline failures
Three test scenarios demonstrate the system's capabilities:
Error:
Error: Reference to undeclared input variable
on main.tf line 18:
18: location = var.azure_region
Result:
- Detection: 100% accurate
- Confidence: 95%
- Action: Generated code and created PR
- PR: #28
- Generated fix:
variable "azure_region" {
description = "Azure region for resource deployment"
type = string
default = "eastus"
}Error:
Error: "east-us" was not found in the list of supported Azure Locations
on main.tf line 18:
18: location = "east-us"
Result:
- Detection: 100% accurate
- Confidence: 99%
- Action: Fetched file from GitHub, fixed region, created PR
- PR: #32
- Generated fix:
- location = "east-us"
+ location = "eastus"Error:
Error: Missing closing brace in interpolation
on main.tf line 25:
25: name = "rg-${var.prefix-${var.environment}"
Result:
- Detection: 100% accurate
- Confidence: 98%
- Action: Created PR with fix suggestions (no code generation)
- PR: #33
- Reasoning: Syntax errors require human review for safety
agentic-devops-healing/
├── src/
│ ├── agents/
│ │ └── infra-healer/ # Main Azure Function
│ │ ├── analyzers/ # Failure pattern detection
│ │ │ ├── terraform_analyzer.py
│ │ │ └── pipeline_analyzer.py
│ │ ├── function_app.py # Entry point (22KB)
│ │ ├── requirements.txt # Python dependencies
│ │ └── local.settings.json # Configuration
│ └── shared/
│ ├── ado_client.py # Azure DevOps integration
│ ├── code_generator.py # Code generation engine
│ ├── github_operations.py # GitHub PR management
│ └── openai_client.py # AI analysis client
├── infrastructure/
│ ├── core/ # Production infrastructure
│ └── test-apps/ # Test scenarios
│ └── infra-only/terraform/scenarios/
│ ├── missing-variable/ # Scenario 1
│ ├── wrong-region/ # Scenario 2
│ └── invalid-syntax/ # Scenario 3
└── .azure-pipelines/ # Test pipeline definitions
Pipeline fails → Azure DevOps webhook → Azure Function triggered
- Fetch build logs from failed pipeline
- Fetch last successful build for comparison
- Fetch pipeline YAML definition
- Pattern Detection: Check for known failure patterns (95%+ confidence)
- AI Analysis: Use GPT-5.2 for complex/novel failures
- Override Logic: If pattern confidence ≥90%, override AI classification
IF confidence ≥ 80% AND safe to auto-fix:
→ Generate code and create PR
ELIF confidence ≥ 65% OR unsafe (syntax errors):
→ Create PR with suggestions only
ELSE:
→ Create work item for human investigation
- Extract context from build logs (no hardcoding)
- Generate language-specific fixes
- For file modifications: Fetch from GitHub, modify, return
- Create feature branch
- Commit changes (code or suggestion document)
- Open pull request with detailed description
- Add labels:
ai-generated,automated - Check for duplicates (prevent redundant PRs)
Critical: All fixes require human approval
- Developer reviews AI-generated code
- Approves if correct, or modifies if needed
- Merges when satisfied
Pipeline re-runs after merge → Should now succeed
| Metric | Traditional | AI-Assisted | Improvement |
|---|---|---|---|
| Detection Time | Manual | < 1 second | Instant |
| Investigation | 15 minutes | 10 seconds | 99% faster |
| Coding Fix | 10 minutes | 3 seconds | 99% faster |
| Human Review | N/A | 1-2 minutes | New step |
| Total MTTR | 30 minutes | 2 minutes | 93% faster |
Time Breakdown (AI-Assisted):
- Automated (detection + analysis + code gen): 15 seconds
- Human (review + approve): 1-2 minutes
- Total: ~2 minutes
- Human-in-the-Loop: All changes require human approval
- Syntax Errors Never Auto-Fixed: Even at 98% confidence
- Tiered Confidence Model: Different actions based on risk
- Duplicate Prevention: Won't create redundant PRs
- Git History: Full audit trail of AI generation + human approval
Adjust in src/agents/infra-healer/function_app.py:
# High confidence threshold (auto-generate code)
HIGH_CONFIDENCE = 0.80
# Medium confidence threshold (suggestions only)
MEDIUM_CONFIDENCE = 0.65Add new patterns in src/agents/infra-healer/analyzers/terraform_analyzer.py:
def detect_error_pattern(build_logs: str) -> tuple:
if 'your new pattern' in build_logs:
return ('YOUR_CATEGORY', 0.95)Add new generators in src/shared/code_generator.py:
def generate_your_fix(explanation: str, context: dict) -> dict:
# Your code generation logic
return {filepath: generated_code}- Terraform-Specific: Currently handles only Terraform failures
- Single-File Changes: Best for single-file modifications
- No Rollback: Manual intervention required if fix breaks something
- API Rate Limits: OpenAI API calls are rate-limited
- Missing Terraform variables
- Invalid Azure regions
- Terraform syntax errors (suggestions only)
- No direct commits
- Confidence thresholds
- Syntax errors never auto-fixed
- Multi-language support (Python, Docker, Kubernetes)
- Multi-file code generation
- Automatic rollback on failure
- Learning from merged vs. closed PRs
- Cost optimization (cache AI responses)
- Fork the repository
- Create feature branch:
git checkout -b feature/amazing-feature - Commit changes:
git commit -m 'Add amazing feature' - Push to branch:
git push origin feature/amazing-feature - Open Pull Request
MIT License - see LICENSE file for details
Shamsher Khan
- Senior DevOps Engineer at GlobalLogic (Hitachi)
- IEEE Senior Member
- DZone Core Member
- Blog: OpsCart.com
- GitHub: @opscart
- Azure Functions for serverless hosting
- OpenAI GPT-5.2 for AI analysis
- Python 3.11 and PyGithub library
- Azure DevOps for pipeline automation