MMCTAgent

Multi-Modal Critical Thinking Agent Framework for Complex Visual Reasoning

Overview

MMCTAgent is a state-of-the-art multi-modal AI framework that brings human-like critical thinking to visual reasoning tasks. It combines planning, self-critique, and tool-based reasoning to improve performance on complex image and video understanding tasks.

Why MMCTAgent?

🧠 Self-Reflection Framework: MMCTAgent emulates human critical thinking by iteratively analyzing multi-modal information, decomposing complex queries, planning strategies, and dynamically evolving its reasoning. Designed as a research framework, it integrates critical-thinking elements such as verification of final answers and self-reflection, using a vision-based critic and task-specific evaluation criteria to improve its decision-making.
🔬 Querying over Multimodal Collections: Its modular design lets you plug in the right audio and visual extraction and processing tools and combine them with multimodal LLMs to ingest and query large collections of videos and images.
🚀 Easy Integration: Its modular design allows for easy integration into existing workflows and adding domain-specific tools, facilitating adoption across various domains requiring advanced visual reasoning capabilities.

Key Features

Critical Thinking Architecture

MMCTAgent is inspired by human cognitive processes and integrates a structured reasoning loop:

Planner:
Generates an initial response using relevant tools for visual or multi-modal input.
Critic:
Evaluates the Planner’s response and provides feedback to improve accuracy and decision-making.
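
The Planner/Critic loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not MMCTAgent's actual implementation: `reasoning_loop`, `Feedback`, `plan`, `critique`, and `max_rounds` are hypothetical names, and the toy planner and critic stand in for LLM-backed components.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Feedback:
    """Critic verdict (hypothetical structure, for illustration only)."""
    accepted: bool
    comment: str = ""


def reasoning_loop(
    query: str,
    plan: Callable[[str, Optional[str]], str],
    critique: Callable[[str, str], Feedback],
    max_rounds: int = 3,
) -> str:
    """Alternate between planner and critic until the critic accepts or rounds run out."""
    comment: Optional[str] = None
    answer = ""
    for _ in range(max_rounds):
        answer = plan(query, comment)        # planner drafts (or revises) an answer
        feedback = critique(query, answer)   # critic evaluates the draft
        if feedback.accepted:
            break
        comment = feedback.comment           # feed the critique into the next draft
    return answer


# Toy stand-ins for the LLM-backed planner and critic.
def toy_plan(query: str, comment: Optional[str]) -> str:
    return "a cat" if comment is None else "a cat sitting on a red sofa"


def toy_critique(query: str, answer: str) -> Feedback:
    if "sofa" in answer:
        return Feedback(accepted=True)
    return Feedback(accepted=False, comment="Mention where the cat is.")


final_answer = reasoning_loop("What is in the image?", toy_plan, toy_critique)
print(final_answer)
```

Capping the number of rounds keeps the loop from running indefinitely when the critic never accepts a draft.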

Modular Agents

MMCTAgent includes two specialized agents:

ImageAgent

A reasoning engine tailored for static image understanding.
It supports a configurable set of tools via the ImageQnaTools enum:

object_detection – Detects objects in an image.
ocr – Extracts embedded text content.
recog – Recognizes scenes, faces, or objects.
vit – Applies a vision LLM for high-level visual reasoning.

The Critic can be toggled via the use_critic_agent flag.

VideoAgent

Optimized for deep video understanding:

Video Question Answering

Applies a fixed toolchain orchestrated by the Planner:

GET_VIDEO_SUMMARY – Retrieves the most relevant video for the query, along with its summary.
GET_OBJECT_COLLECTION – Retrieves the most relevant video for the query, along with its detected objects.
GET_CONTEXT – Extracts transcript, visual summary chunks and object collection info relevant to the query.
GET_RELEVANT_FRAMES – Returns keyframes that are semantically similar to the query. This tool is based on CLIP embeddings.
QUERY_FRAME – Queries specific video keyframes to extract detailed information and provide additional visual context to the Planner.
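
To make the idea behind embedding-based keyframe retrieval concrete, here is a minimal sketch that ranks frames by cosine similarity to a query vector. It is an illustration only: the toy 3-dimensional vectors stand in for real CLIP embeddings, and `cosine_similarity` and `top_k_frames` are hypothetical helpers, not part of the MMCTAgent API.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_frames(
    query_vec: Sequence[float],
    frame_vecs: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k frame IDs whose embeddings are most similar to the query."""
    scored = [(fid, cosine_similarity(query_vec, vec)) for fid, vec in frame_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; real CLIP vectors have far more dimensions.
frames = {
    "frame_0001": [1.0, 0.0, 0.0],
    "frame_0002": [0.6, 0.8, 0.0],
    "frame_0003": [0.0, 0.0, 1.0],
}
query = [1.0, 0.0, 0.0]
best = top_k_frames(query, frames, k=2)
print([fid for fid, _ in best])
```

In practice the similarity search would typically be delegated to a vector index rather than computed in a Python loop.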

The Critic agent helps validate and refine answers, improving reasoning depth.

For more details, refer to the full research article:

MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning
Published on arXiv – arxiv.org/abs/2405.18358

Getting Started

Installation

Clone the Repository

```
git clone https://github.com/microsoft/MMCTAgent.git
cd MMCTAgent
```

System Dependencies

Install FFmpeg

Linux/Ubuntu:
```
sudo apt-get update
sudo apt-get install ffmpeg libsm6 libxext6 -y
```
Windows:
- Download FFmpeg from ffmpeg.org
- Add the bin folder to your system PATH
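
To confirm that FFmpeg is reachable after installation, a quick check from Python can help. This is a small sketch using only the standard library; `ffmpeg_available` is a hypothetical helper name, not something MMCTAgent provides.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on the current PATH."""
    return shutil.which("ffmpeg") is not None


if ffmpeg_available():
    print("FFmpeg found at:", shutil.which("ffmpeg"))
else:
    print("FFmpeg not found; install it and make sure it is on PATH.")
```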

Python Environment Setup

Option A: Using Conda (Recommended)

```
conda create -n mmct-agent python=3.11
conda activate mmct-agent
```

Option B: Using venv

```
python -m venv mmct-agent
# Linux/Mac
source mmct-agent/bin/activate
# Windows
mmct-agent\Scripts\activate.bat
```

Install Dependencies

Choose the installation option based on your needs:

Option A: Image Pipeline
```
pip install --upgrade pip
pip install ".[image-agent]"
```
Option B: Video Pipeline
```
pip install --upgrade pip
pip install ".[video-agent]"
```
Option C: All Features (Image + Video + MCP Server)
```
pip install --upgrade pip
pip install ".[all]"
```
Quick Start Examples

Image Analysis with MMCTAgent

```python
import asyncio

from azure.identity import AzureCliCredential, ChainedTokenCredential, DefaultAzureCredential
from mmct.config.providers import ImageAgentProviderConfig
from mmct.image_pipeline import ImageAgent, ImageQnaTools
from mmct.providers.azure import AzureLLMProvider

# Try Azure CLI credentials first, then the default chain (or directly use api_key)
credentials = ChainedTokenCredential(AzureCliCredential(), DefaultAzureCredential())

# Initialize the provider
provider = ImageAgentProviderConfig(
    llm_provider=AzureLLMProvider(
        endpoint="<your_endpoint>",
        deployment_name="<deployment_name>",
        model_name="<model_name>",
        api_version="<api_version>",
        credentials=credentials,
    )
)

# Initialize the Image Agent with the desired tools
image_agent = ImageAgent(
    query="What objects are visible in this image and what text can you read?",
    image_path="path/to/your/image.jpg",
    tools=[ImageQnaTools.object_detection, ImageQnaTools.ocr, ImageQnaTools.vit],
    use_critic_agent=True,  # Enable critical thinking
    stream=False,
    provider=provider,
)

# Run the analysis (the agent is awaitable, so drive it with asyncio)
response = asyncio.run(image_agent())
print(f"Analysis Result: {response.response}")
```

Video Analysis with VideoAgent

Ingest a video through the MMCT Video Ingestion Pipeline.

```python
import asyncio

from azure.identity import AzureCliCredential, ChainedTokenCredential, DefaultAzureCredential
from mmct.config.providers import IngestionProviderConfig
from mmct.providers.azure import (
    AISearchChapterProvider,
    AISearchKeyframesProvider,
    AISearchObjectCollectionProvider,
    AzureEmbeddingProvider,
    AzureLLMProvider,
    AzureStorageProvider,
    WhisperTranscriptionProvider,
)
from mmct.providers.local import ClipImageEmbeddingProvider
from mmct.video_pipeline import IngestionPipeline, Languages

credentials = ChainedTokenCredential(AzureCliCredential(), DefaultAzureCredential())

# Initialize the provider
provider = IngestionProviderConfig(
    llm_provider=AzureLLMProvider(
        endpoint="https://<your-openai-endpoint>.openai.azure.com/",
        deployment_name="<your-llm-deployment-name>",
        model_name="<your-llm-model-name>",
        api_version="<your-api-version>",
        credentials=credentials,
    ),
    embedding_provider=AzureEmbeddingProvider(
        endpoint="https://<your-openai-endpoint>.openai.azure.com/",
        deployment_name="<your-embedding-deployment-name>",
        api_version="<your-api-version>",
        credentials=credentials,
    ),
    image_embedding_provider=ClipImageEmbeddingProvider(),
    vectordb_chapter=AISearchChapterProvider(
        endpoint="https://<your-search-service>.search.windows.net",
        index_name="<your-chapter-index-name>",
        credentials=credentials,
    ),
    vectordb_keyframes=AISearchKeyframesProvider(
        endpoint="https://<your-search-service>.search.windows.net",
        index_name="<your-keyframe-index-name>",
        credentials=credentials,
    ),
    vectordb_object_registry=AISearchObjectCollectionProvider(
        endpoint="https://<your-search-service>.search.windows.net",
        index_name="<your-object-registry-index-name>",
        credentials=credentials,
    ),
    storage_provider=AzureStorageProvider(
        storage_account_name="<your-storage-account-name>",
        keyframe_container_name="<your-keyframe-container-name>",
        credentials=credentials,
    ),
    transcription_provider=WhisperTranscriptionProvider(
        endpoint="https://<your-openai-endpoint>.openai.azure.com/",
        api_version="<your-api-version>",
        deployment_name="<your-whisper-deployment-name>",
        credentials=credentials,
    ),
)

ingestion = IngestionPipeline(
    video_path="path-of-your-video",
    language=Languages.ENGLISH_INDIA,
    provider=provider,
)

# Run the ingestion pipeline.
# In a notebook (where an event loop is already running), use `await ingestion.run()` instead.
asyncio.run(ingestion.run())
```

Perform Q&A through MMCT's Video Agent.

```python
import asyncio

from azure.identity import AzureCliCredential, ChainedTokenCredential, DefaultAzureCredential
from mmct.config.providers import VideoAgentProviderConfig
from mmct.providers.azure import (
    AISearchChapterProvider,
    AISearchKeyframesProvider,
    AISearchObjectCollectionProvider,
    AzureEmbeddingProvider,
    AzureLLMProvider,
    AzureStorageProvider,
)
from mmct.providers.local import ClipImageEmbeddingProvider
from mmct.video_pipeline import VideoAgent

credentials = ChainedTokenCredential(AzureCliCredential(), DefaultAzureCredential())

# Initialize the provider
provider = VideoAgentProviderConfig(
    llm_provider=AzureLLMProvider(
        endpoint="https://<your-openai-endpoint>.openai.azure.com/",
        deployment_name="<your-llm-deployment-name>",
        model_name="<your-llm-model-name>",
        api_version="<your-api-version>",
        credentials=credentials,
    ),
    embedding_provider=AzureEmbeddingProvider(
        endpoint="https://<your-openai-endpoint>.openai.azure.com/",
        deployment_name="<your-embedding-deployment>",
        api_version="<your-api-version>",
        credentials=credentials,
    ),
    image_embedding_provider=ClipImageEmbeddingProvider(),
    vectordb_chapter=AISearchChapterProvider(
        endpoint="https://<your-search-service>.search.windows.net",
        index_name="<your-chapter-index-name>",
        credentials=credentials,
    ),
    vectordb_keyframes=AISearchKeyframesProvider(
        endpoint="https://<your-search-service>.search.windows.net",
        index_name="<your-keyframe-index-name>",
        credentials=credentials,
    ),
    vectordb_object_registry=AISearchObjectCollectionProvider(
        endpoint="https://<your-search-service>.search.windows.net",
        index_name="<your-object-registry-index-name>",
        credentials=credentials,
    ),
    storage_provider=AzureStorageProvider(
        storage_account_name="<your-storage-account-name>",
        keyframe_container_name="<your-keyframe-container-name>",
        credentials=credentials,
    ),
)

# Configure the Video Agent
video_agent = VideoAgent(
    query="input-query",
    video_id=None,  # Optional: specify a video ID
    url=None,  # Optional: URL used to filter the search results
    use_critic_agent=True,  # Enable the critic agent
    stream=False,  # Set to True to stream the response
    cache=False,  # Optional: enable caching
    provider=provider,
)

# Execute video analysis.
# In a notebook (where an event loop is already running), use `await video_agent()` instead.
response = asyncio.run(video_agent())
print(f"Video Analysis: {response}")
```

For more comprehensive examples, see the examples/ directory.

Provider System

Multi-Cloud & Vendor-Agnostic Architecture

MMCTAgent now features a modular provider system that allows you to seamlessly switch between different cloud providers and AI services without changing your application code. This makes the framework truly vendor-agnostic and suitable for various deployment scenarios.

Supported Providers

| Service Type | Supported Providers | Use Cases |
|---|---|---|
| LLM | Azure OpenAI, OpenAI, + Custom | Text generation, chat completion |
| Search | Azure AI Search, FAISS | Document search and retrieval |
| Transcription | Azure Speech Services, OpenAI Whisper | Audio-to-text conversion |
| Storage | Azure Blob Storage, Local Storage | File storage and management |

Note: All provider types support custom implementations. See the Custom LLM Provider Example (Anthropic Claude) or read the Providers Guide for implementation details.

For detailed configuration instructions, see our Provider Configuration Guide.
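
The idea of swapping providers without touching application code can be illustrated with a small interface sketch. Everything here is an assumption made for illustration: `LLMProvider`, `complete`, `EchoProvider`, and `UppercaseProvider` are hypothetical names, and MMCTAgent's real provider base classes may look different (see the Providers Guide).

```python
from typing import Protocol


class LLMProvider(Protocol):
    """Hypothetical provider interface; not MMCTAgent's actual base class."""

    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Toy provider standing in for one vendor."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UppercaseProvider:
    """Toy provider standing in for a different vendor."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def answer(provider: LLMProvider, question: str) -> str:
    """Application code depends only on the interface, not on a specific vendor."""
    return provider.complete(question)


print(answer(EchoProvider(), "hello"))
print(answer(UppercaseProvider(), "hello"))
```

Because `answer` only relies on the `complete` method, either provider can be passed in without changing the calling code.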

Configuration

System Requirements for CLIP embeddings (openai/clip-vit-base-patch32)

Minimum (development / small-scale):

CPU: 4-core modern i5/i7, ~8 GB RAM
Disk: ~500 MB for the cached model, plus space for image/text data
GPU: none (works but slow)

Recommended (for decent speed / batching):

CPU: 8+ cores, 16 GB RAM
GPU: NVIDIA with ≥ 4-6 GB VRAM (e.g. RTX 2060/3060)
PyTorch + CUDA installed, with mixed precision support

High-throughput (fast, large batches):

16+ cores CPU, 32+ GB RAM
GPU: 8-16 GB+ VRAM, fast memory bandwidth (e.g. RTX 3090, A100)
Use float16 / bfloat16, efficient batching, parallel preprocessing
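
As a rough sketch of how the tiers above might translate into code, the helper below picks a device and numeric precision based on what is available at runtime. `pick_device_and_dtype` is a hypothetical helper, not part of MMCTAgent, and it falls back to CPU with float32 when PyTorch or a CUDA GPU is missing.

```python
from typing import Tuple


def pick_device_and_dtype() -> Tuple[str, str]:
    """Return a (device, dtype name) pair suited to the available hardware."""
    try:
        import torch
    except ImportError:
        return "cpu", "float32"  # PyTorch is not installed
    if not torch.cuda.is_available():
        return "cpu", "float32"  # no CUDA GPU: works, but slower
    if torch.cuda.is_bf16_supported():
        return "cuda", "bfloat16"  # newer GPUs support bfloat16
    return "cuda", "float16"


device, dtype = pick_device_and_dtype()
print(device, dtype)
```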

Project Structure

The project structure below highlights the key entry-point scripts for the three main pipelines: Image QnA, Video Ingestion, and Video Agent.

```
MMCTAgent
│
├── infra
│   └── INFRA_DEPLOYMENT_GUIDE.md    # Guide for deploying the Azure infrastructure
├── app                              # FastAPI application over the MMCT pipelines
├── mcp_server
│   ├── main.py                      # Run this to start the MCP server
│   ├── client.py                    # Client for connecting to the MCP server
│   ├── notebooks/                   # Examples of using the MCP server through different agentic frameworks
│   └── README.md                    # Guide for the MCP server
├── mmct
│   ├── .
│   ├── image_pipeline
│   │   ├── agents
│   │   │   └── image_agent.py       # Entry point for the MMCT Image Agentic Workflow
│   │   └── README.md                # Guide for the Image Pipeline
│   └── video_pipeline
│       ├── agents
│       │   └── video_agent.py       # Entry point for the MMCT Video Agentic Workflow
│       ├── core
│       │   └── ingestion
│       │       └── ingestion_pipeline.py   # Entry point for the Video Ingestion Workflow
│       └── README.md                # Guide for the Video Pipeline
├── pyproject.toml                   # Project configuration and dependencies
└── README.md
```

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Note: This project is currently under active research and continuous development. While contributions are encouraged, please note that the codebase may evolve as the project matures.

Citation

If you find MMCTAgent useful in your research, please cite our paper:

@inproceedings{kumar2024mmctagent,
  title={MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning},
  author={Kumar, Somnath and Gadhia, Yash and Ganu, Tanuja and Nambi, Akshay},
  booktitle={NeurIPS OWA-2024},
  year={2024},
  url={https://www.microsoft.com/en-us/research/publication/mmctagent-multi-modal-critical-thinking-agent-framework-for-complex-visual-reasoning}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
