Skip to content

divake/ai_intel_llm_research

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

14 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿš€ Intel AI Workload Intelligence Platform

Intel Workload Intelligence Monitor Demo

Real-Time Predictive AI Workload Management with Uncertainty Quantification
Featuring Vision-Language Models, Conformal Prediction, and Intel Hardware Acceleration


๐ŸŽฏ Project Overview

The Intel AI Workload Intelligence Platform represents a cutting-edge research initiative that combines state-of-the-art Large Language Models (LLMs), Vision-Language Models (VLMs), and advanced predictive analytics to create an intelligent edge computing system. This platform demonstrates real-time hardware monitoring, predictive workload offloading, and uncertainty quantification using Intel's latest hardware acceleration technologies.

๐Ÿ”ฌ Key Research Innovations

  • Conformal Prediction for Edge Computing: First-of-its-kind implementation of uncertainty-aware workload forecasting with statistical guarantees
  • Multi-Modal AI Integration: Seamless fusion of LLMs, VLMs, and real-time sensor data (Intel RealSense D455)
  • Hardware-Aware Intelligence: Direct integration with Intel Arc Graphics, utilizing all GPU engines (RCS/BCS/VCS/VECS/CCS)
  • Predictive Offloading Engine: 30-second ahead workload prediction with confidence intervals
  • Real-Time Uncertainty Quantification: Distribution-free prediction intervals for robust decision-making

๐Ÿ—๏ธ Technical Architecture

Built specifically for Intel's heterogeneous compute architecture:

  • Intel Core Ultra 7 165H (14 cores, 20 threads) - Meteor Lake architecture
  • Intel Arc Graphics (128 EUs, 8 Xe Cores) - Integrated GPU with dedicated AI acceleration
  • Intel AI Boost NPU (3rd Gen, 34 TOPS) - Low-power neural processing unit
  • Intel RealSense D455 - Depth camera for multi-modal AI applications

๐ŸŽฌ Featured: Intel Workload Intelligence Monitor

The flagship demonstration showcases our Predictive AI Workload Management System that combines:

  • Real-Time Hardware Telemetry: Direct GPU engine monitoring (RCS/BCS/VCS/VECS/CCS)
  • Vision-Language Understanding: LLaVA 7B model processing RealSense camera feeds
  • Conformal Prediction Engine: Statistical guarantees for workload forecasting
  • Intelligent Offloading: Proactive edge-to-cloud migration decisions

๐Ÿง  Mathematical Foundation

Our system implements conformal prediction for uncertainty quantification:

P(y_future โˆˆ [ลท - q_ฮฑ, ลท + q_ฮฑ]) โ‰ฅ 1 - ฮฑ

Where:

  • ลท = predicted workload (linear regression with sliding window)
  • q_ฮฑ = quantile of calibration residuals
  • ฮฑ = miscoverage rate (0.1 for 90% confidence)

This provides distribution-free prediction intervals with finite-sample guarantees.

๐Ÿš€ Quick Start

1. Launch Intel Workload Intelligence Monitor

cd vlm_env_description
./launch_workload_monitor.sh

2. Start the Complete AI Stack

./start-ai-stack.sh

This initializes:

  • IPEX-LLM Optimized Ollama server with Intel Arc GPU acceleration
  • Open WebUI on http://localhost:8080 (ChatGPT-like interface)
  • Hardware acceleration via Level Zero and SYCL runtime

3. Access Services

Project Structure

AI-Intel-Research/
โ”œโ”€โ”€ models/              # Model storage
โ”‚   โ”œโ”€โ”€ llm/            # Language models
โ”‚   โ””โ”€โ”€ vlm/            # Vision-language models
โ”œโ”€โ”€ frameworks/         # Inference frameworks
โ”‚   โ””โ”€โ”€ ipex-llm/      # Intel optimized Ollama
โ”œโ”€โ”€ benchmarks/        # Performance testing
โ”‚   โ”œโ”€โ”€ scripts/       # Benchmark scripts
โ”‚   โ””โ”€โ”€ results/       # Test results
โ”œโ”€โ”€ vlm_env_description/ # VLM + RealSense camera demos
โ”‚   โ”œโ”€โ”€ realsense_vlm_demo.py      # Direct VLM scene analysis
โ”‚   โ”œโ”€โ”€ realsense_yolo_vlm_demo.py # YOLO + VLM pipeline
โ”‚   โ””โ”€โ”€ run_demo.sh               # Easy launcher
โ””โ”€โ”€ configs/           # Configuration files

Available Scripts

Launch Scripts

  • start-ai-stack.sh - Start Ollama and Web UI
  • stop-ai-stack.sh - Stop all services

Benchmark Scripts

  • benchmarks/scripts/benchmark_llm.py - Run performance benchmarks
  • benchmarks/scripts/test_hardware_modes.sh - Test CPU/GPU/Auto modes

Running Benchmarks

Basic Benchmark

cd benchmarks/scripts
python3 benchmark_llm.py --models tinyllama:latest

Hardware Mode Testing

cd benchmarks/scripts
./test_hardware_modes.sh

This will test performance on:

  1. CPU-only mode
  2. GPU-accelerated mode
  3. Auto mode (CPU+GPU)

Adding New Models

Download Models

cd frameworks/ollama-ipex-llm-*/
./ollama pull <model-name>

Recommended Models

  • Small/Fast: tinyllama (1.1B params)
  • Balanced: phi3:mini (3.8B params)
  • Vision: llava:7b (7B params + vision)
  • Code: deepseek-coder:1.3b (1.3B params)

Vision Models for Camera Integration

  • LLaVA:7b: Best for scene description (~15-17 tokens/s)
  • BakLLaVA:7b: Alternative vision model
  • Qwen-VL: Good for detailed object analysis
  • CogVLM: Advanced vision-language understanding

Environment Variables

GPU Optimization

export OLLAMA_NUM_GPU=999  # Use all GPU layers
export ONEAPI_DEVICE_SELECTOR=level_zero:0  # Select GPU
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1  # Performance

NPU Usage (Experimental)

Currently, NPU support is through Intel NPU Acceleration Library only. Direct Ollama NPU support is pending.

Performance Tips

  1. GPU Acceleration: Ensure OLLAMA_NUM_GPU=999 for best performance
  2. Memory: Models under 4B parameters work best with 8GB VRAM
  3. Context Length: Adjust with OLLAMA_NUM_CTX (default: 2048)

Troubleshooting

Ollama not detecting GPU

# Check GPU availability
clinfo -l
ls -la /dev/dri/

# Restart with explicit GPU selection
export ONEAPI_DEVICE_SELECTOR=level_zero:0

Web UI not accessible

# Check Docker status
docker ps
docker logs open-webui

Performance issues

  • Ensure Intel GPU drivers are up to date
  • Check GPU utilization with sudo intel_gpu_top
  • Monitor system resources with htop

๐Ÿค– AI Model Ecosystem

Supported Foundation Models

Language Models (LLMs)

  • Llama 3.2 (1B, 3B) - Latest generation with improved reasoning
  • Phi-3 (mini, medium) - Microsoft's efficient models
  • Qwen 2.5 (0.5B-7B) - Multilingual capabilities
  • DeepSeek Coder (1.3B, 6.7B) - Specialized for code generation
  • TinyLlama (1.1B) - Ultra-efficient for edge deployment

Vision-Language Models (VLMs)

  • LLaVA 1.6 (7B, 13B) - State-of-the-art visual understanding
  • BakLLaVA - Alternative architecture for robustness
  • Qwen-VL - Multi-modal with OCR capabilities
  • CogVLM - Advanced visual reasoning

๐Ÿ”ง Technical Implementation

Intel Extension for PyTorch (IPEX)

  • XPU Backend: Optimized kernels for Intel Arc Graphics
  • Graph Optimization: Automatic operator fusion
  • Mixed Precision: BF16/INT8 quantization support
  • Memory Management: Efficient VRAM utilization

OpenVINO Integration

  • Model Optimization: FP16/INT8 post-training quantization
  • Hardware Mapping: Automatic CPU/GPU/NPU distribution
  • Inference Pipeline: Async execution with batching

๐ŸŽฏ VLM + RealSense Camera Integration

Advanced Multi-Modal AI System

Our Vision-Language integration demonstrates cutting-edge capabilities:

๐Ÿ”ฌ Technical Features

  • Real-Time Sensor Fusion: RGB-D data at 30 FPS with depth alignment
  • Temporal Coherence: Frame-to-frame object tracking
  • 3D Spatial Understanding: Point cloud generation and analysis
  • Low-Latency Pipeline: < 100ms capture-to-inference

๐Ÿ“Š Performance Metrics

  • LLaVA 7B on Intel Arc: 15-17 tokens/second
  • End-to-End Latency: 2-3 seconds per scene description
  • GPU Utilization: 95%+ on CCS (Compute) engine
  • Power Efficiency: 35-40W total system power

๐Ÿš€ Research Applications

  1. Predictive Maintenance

    • Anomaly detection in manufacturing
    • Proactive failure prediction
  2. Autonomous Navigation

    • Real-time obstacle avoidance
    • Semantic mapping
  3. Human-Robot Interaction

    • Natural language scene understanding
    • Context-aware responses
  4. Edge AI Research

    • Workload distribution optimization
    • Uncertainty-aware computing

๐Ÿ“š Research Contributions

๐Ÿ† Intel Workload Intelligence Monitor

Our flagship research contribution demonstrates state-of-the-art techniques in edge computing:

Theoretical Advances

  • Conformal Prediction for Systems: Novel application of CP to workload forecasting
  • Multi-Modal Uncertainty: Unified framework combining hardware, thermal, and prediction uncertainty
  • Proactive Offloading: Predictive algorithms with guaranteed coverage probability

Technical Innovations

  1. Direct GPU Engine Monitoring

    • Non-blocking I/O for intel_gpu_top integration
    • Real-time parsing of all Intel Arc engines (RCS/BCS/VCS/VECS/CCS)
    • Sub-second latency with caching mechanism
  2. Sliding Window Feature Engineering

    • 5-dimensional temporal features for trend capture
    • Automatic model retraining with online learning
    • Robust to non-stationary workloads
  3. Risk-Aware Decision Engine

    total_risk = current_load ร— temperature_factor ร— prediction_uncertainty
    • Temperature-aware throttling prediction
    • Uncertainty-weighted decisions
    • Conservative upper-bound utilization

Empirical Results

  • Prediction Accuracy: 92% coverage rate (target: 90%)
  • False Positive Rate: < 5% unnecessary offloads
  • Lead Time: 30-second advance warning
  • Overhead: < 3% CPU utilization for monitoring

๐Ÿ”ฌ Academic Impact

This research contributes to multiple domains:

  • Edge Computing: First CP-based workload prediction system
  • Uncertainty Quantification: Practical implementation with guarantees
  • Systems ML: Bridging theory and systems engineering
  • Intel Hardware: Optimized for Meteor Lake architecture

๐Ÿš€ Performance Benchmarks & Results

๐Ÿ“Š Token Generation Performance Across Models

LLM/VLM Performance Benchmarks

Our comprehensive benchmarking demonstrates exceptional performance on Intel Core Ultra 7 165H with Arc Graphics:

๐Ÿ† Peak Performance Metrics

  • TinyLlama (1.1B): 84.4 tokens/second - Ultra-fast edge inference
  • DeepSeek Coder (1.3B): 65 tokens/second - Optimized for code generation
  • Llama 3.2 (3B): 26 tokens/second - Advanced reasoning capabilities
  • Phi-3 Mini (3.8B): 26 tokens/second - Microsoft's efficient architecture
  • Mistral 7B: 15 tokens/second - Production-ready performance
  • LLaVA 7B: 15 tokens/second - Vision-language understanding

๐Ÿ”ฅ Key Achievements

  • 5.1x Speed Advantage over baseline implementations
  • <30ms First Token Latency for real-time applications
  • 6 Models Tested across different parameter sizes
  • 64GB DDR5 Memory enabling large model inference

๐ŸŽฌ Real-Time LLM Response Demo

LLM Real-time Response Demo

Live demonstration of Llama 3.2 generating responses at 26 tokens/second on Intel Arc Graphics, showcasing:

  • Smooth, real-time text generation
  • Low latency response initiation
  • Consistent throughput during extended generation
  • Professional Open WebUI interface

๐Ÿ“ˆ Model Portfolio & Optimization Strategy

Model Size and Memory Requirements

Model Size & Memory Analysis

Our testing reveals optimal configurations for different use cases:

Edge Deployment (1-3B Parameters)

  • TinyLlama 1.1B: 2GB memory, 80 tokens/sec
  • DeepSeek Coder 1.3B: 3GB memory, 65 tokens/sec
  • Llama 3.2 3B: 6GB memory, 26 tokens/sec

Professional Applications (3-7B Parameters)

  • Phi-3 Mini 3.8B: 8GB memory, 26 tokens/sec
  • Mistral 7B: 14GB memory, 15 tokens/sec
  • LLaVA 7B: 14GB memory, 15 tokens/sec (with vision)

Inference Speed Optimization

The graph demonstrates inverse correlation between model size and inference speed:

  • Sub-2B models: >60 tokens/sec (real-time applications)
  • 3-4B models: 25-30 tokens/sec (interactive use)
  • 7B models: 15-20 tokens/sec (quality-focused tasks)

๐ŸŽฏ Intel-Optimized Model Capabilities

Intel-Optimized Model Portfolio

Specialized Model Performance

๐Ÿ”ต TinyLlama (1.1B)

  • 80 tokens/sec on Intel Arc
  • Quick responses, basic Q&A
  • Lightweight edge deployment

๐ŸŸฆ DeepSeek Coder (1.3B)

  • 65 tokens/sec performance
  • Code generation & debugging
  • Technical documentation

๐ŸŸฉ Llama 3.2 (3B)

  • 26 tokens/sec throughput
  • Conversational AI
  • Creative writing & general tasks

๐ŸŸช Phi-3 Mini (3.8B)

  • 26 tokens/sec speed
  • Complex reasoning
  • Educational content & analysis

๐ŸŸจ Mistral 7B

  • 15 tokens/sec generation
  • Advanced reasoning
  • Long-form content generation

๐ŸŸง LLaVA 7B

  • 15 tokens/sec with vision
  • Image analysis & visual Q&A
  • Multi-modal understanding

๐Ÿ–ผ๏ธ Open WebUI Integration

Open WebUI Interface

Professional ChatGPT-like Interface featuring:

  • Vision-Language Model Support: Direct image upload and analysis
  • Real-time Streaming: Smooth token-by-token display
  • Model Switching: Instant model selection without restart
  • Conversation Management: Save, export, and organize chats
  • IPEX-LLM Backend: Intel-optimized inference engine

๐Ÿ”ง Technical Optimizations

Intel Extension for PyTorch (IPEX)

# Automatic optimization with IPEX
import intel_extension_for_pytorch as ipex
model = ipex.optimize(model, dtype=torch.bfloat16)

Performance Tuning Parameters

# Environment variables for optimal performance
export OLLAMA_NUM_GPU=999          # Use all GPU layers
export OLLAMA_GPU_OVERHEAD=0       # Minimize overhead
export SYCL_CACHE_PERSISTENT=1     # Enable kernel caching
export ONEAPI_DEVICE_SELECTOR=level_zero:0  # Intel Arc GPU

Memory Management

  • Automatic Layer Offloading: CPU โ†” GPU dynamic allocation
  • KV Cache Optimization: Reduced memory footprint
  • Batch Processing: Multi-query attention for throughput

๐Ÿ“Š Comparative Analysis

Model Parameters Memory Tokens/sec Use Case
TinyLlama 1.1B 2GB 84.4 Edge AI, IoT
DeepSeek Coder 1.3B 3GB 65 Code Assistant
Llama 3.2 3B 6GB 26 Conversational AI
Phi-3 Mini 3.8B 8GB 26 Education, Analysis
Mistral 7B 7B 14GB 15 Professional Writing
LLaVA 7B 7B 14GB 15 Visual Understanding

๐Ÿ… Performance Highlights

  1. Industry-Leading Edge Performance

    • TinyLlama achieves 84.4 tokens/sec - fastest in its class
    • 5.1x faster than baseline CPU implementations
  2. Production-Ready Latency

    • First token latency: <30ms
    • Time to first token comparable to cloud APIs
  3. Scalable Architecture

    • Seamless scaling from 1B to 7B parameters
    • Automatic CPU-GPU workload distribution
  4. Vision-Language Excellence

    • LLaVA 7B: Real-time image understanding at 15 tokens/sec
    • Integrated with RealSense for live scene analysis

๐Ÿ”ฎ Future Research Directions

Short-term Goals

  • NPU Direct Inference: Native Intel AI Boost integration
  • Distributed CP: Federated conformal prediction
  • Adaptive Windows: Context-aware prediction horizons
  • Multi-Device Orchestration: Swarm intelligence for edge clusters

Long-term Vision

  • Neuromorphic Integration: Event-driven workload prediction
  • Quantum-Classical Hybrid: Uncertainty quantification with QC
  • Self-Optimizing Systems: AutoML for system parameters
  • Energy-Aware Offloading: Carbon footprint optimization

๐Ÿ“– Publications & Presentations

Papers in Preparation

  1. "Conformal Prediction for Edge Computing: Statistical Guarantees for AI Workload Management"

    • Target: MLSys 2025
    • Status: Experimental validation phase
  2. "Uncertainty-Aware Vision-Language Models for Edge Intelligence"

    • Target: CVPR 2025 Workshop
    • Status: Writing phase

Technical Reports

  • Intel Technical Brief: "Optimizing AI Workloads on Meteor Lake Architecture"
  • White Paper: "Predictive Resource Management for Edge AI Systems"

๐Ÿ’ป System Requirements

Hardware

  • CPU: Intel Core Ultra (Meteor Lake) or newer
  • GPU: Intel Arc Graphics (Xe architecture)
  • Memory: 16GB+ DDR5 recommended
  • Camera: Intel RealSense D400 series (optional)

Software

  • OS: Ubuntu 24.04 LTS
  • Drivers: Intel GPU drivers 2024.2+
  • Runtime: Intel oneAPI Base Toolkit
  • Python: 3.8+ with Intel optimizations

๐Ÿค Collaboration

This project is a collaboration between:

  • AI Research Lab - Algorithm development
  • Intel Corporation - Hardware optimization
  • Academic Partners - Theoretical foundations

๐Ÿ“„ License

Licensed under Apache 2.0 with additional terms for research use. See LICENSE for details.


Advancing Edge AI with Predictive Intelligence
Intel Core Ultra ร— Vision-Language Models ร— Uncertainty Quantification

About

LLM/VLM inference on Intel Core Ultra hardware with CPU/GPU/NPU acceleration

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors