New evaluation results: Typed conflict resolution achieves 12% on FC-MH #18

@Sohamp2809


Hi @HUST-AI-HYZ — congrats on the ICLR 2026 acceptance!

I built Mnemos, an open-source memory engine with typed conflict resolution,
and evaluated it on your full Conflict_Resolution split (800 questions across
8 examples). Sharing the results in case they're useful for future benchmark versions.

Approach: classify contradictions into types (preference evolution,
factual correction, context-dependent) and apply a different resolution
strategy per type at ingestion time, before retrieval.
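To make the per-type strategies concrete, here is a minimal sketch of the idea. The names (`ConflictType`, `resolve`) are illustrative, not the actual Mnemos API:

```python
# Hypothetical sketch of typed conflict resolution at ingestion time.
# The type taxonomy mirrors the three categories named above; the
# resolution strategies are illustrative assumptions, not Mnemos internals.
from enum import Enum

class ConflictType(Enum):
    PREFERENCE_EVOLUTION = "preference_evolution"  # "I liked X" -> "now I prefer Y"
    FACTUAL_CORRECTION = "factual_correction"      # "I live in NY" -> "actually, SF"
    CONTEXT_DEPENDENT = "context_dependent"        # both facts hold in different contexts

def resolve(old_fact: str, new_fact: str, ctype: ConflictType) -> list[str]:
    """Return the facts that survive ingestion, by conflict type."""
    if ctype is ConflictType.FACTUAL_CORRECTION:
        return [new_fact]                          # supersede: drop the stale fact
    if ctype is ConflictType.PREFERENCE_EVOLUTION:
        return [f"(past) {old_fact}", new_fact]    # keep history, mark recency
    return [old_fact, new_fact]                    # context-dependent: keep both

print(resolve("lives in NY", "lives in SF", ConflictType.FACTUAL_CORRECTION))
# ['lives in SF']
```

The key design point is that resolution happens before retrieval, so the retriever never has to arbitrate between stale and current facts at query time.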

Results (full Conflict_Resolution split):

| Split | Context | Mnemos | Naive baseline |
|-------|---------|--------|----------------|
| FC-MH | 6K   | 27.0% | 9.0%  |
| FC-MH | 32K  | 11.0% | 3.0%  |
| FC-MH | 64K  | 8.0%  | 6.0%  |
| FC-MH | 262K | 2.0%  | 2.0%  |
| FC-SH | 6K   | 90.0% | 69.0% |
| FC-SH | 32K  | 65.0% | 80.0% |
| FC-SH | 64K  | 55.0% | 76.0% |
| FC-SH | 262K | 28.0% | 76.0% |
| **MH average** | | **12.0%** | **5.0%** |
  • Backbone: GPT-4.1-mini
  • Embeddings: all-MiniLM-L6-v2
  • The naive baseline uses the identical LLM, embeddings, and retrieval;
    the delta comes purely from conflict resolution.

Known limitation: on long-context single-hop (32K+), the engine
over-deletes facts that are similar but not contradictory. Adapting the
similarity threshold to the context length would address this.
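One way to sketch that fix, under the assumption that the deletion cutoff is a cosine-similarity threshold: scale it with context length so longer contexts demand higher similarity before a fact is treated as a contradiction. The constants (`base`, `ceiling`, reference lengths) are illustrative, not tuned values:

```python
# Hypothetical adaptive threshold: interpolate the cosine-similarity cutoff
# from `base` at the 6K setting toward `ceiling` at the 262K setting.
# All constants are assumptions for illustration.
import math

def deletion_threshold(context_tokens: int,
                       base: float = 0.80,
                       ceiling: float = 0.95,
                       lo: int = 6_000,
                       hi: int = 262_000) -> float:
    """Log-scale interpolation of the deletion cutoff over context length."""
    clamped = min(max(context_tokens, lo), hi)
    scale = math.log(clamped / lo) / math.log(hi / lo)  # 0.0 at 6K, 1.0 at 262K
    return base + (ceiling - base) * scale

for n in (6_000, 32_000, 64_000, 262_000):
    print(n, round(deletion_threshold(n), 3))
```

A log scale matches the benchmark's roughly geometric context-length steps, so the threshold rises evenly across the four settings rather than jumping at the top end.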

Repo: https://github.com/Sohamp2809/mnemos
Results JSON: https://github.com/Sohamp2809/mnemos/blob/main/results/mabench_cr_full.json
Reproducible with: `python Benchmark_sample/run_MemoryAgentBench.py --llm openai`

A few open questions I'd value your perspective on:

  1. The over-deletion on long-context SH (32K+) suggests a fixed
    similarity threshold breaks down at scale. Did you observe similar
    patterns with Mem0's or Cognee's conflict handling?

  2. Would it be useful for the benchmark to include a "conflict
    detection recall" metric alongside QA accuracy, measuring how many
    annotated contradictions a system actually catches versus misses?
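For question 2, the metric I have in mind is simple. Assuming the benchmark annotated each contradiction with an ID, and a system reports the set of contradictions it flagged (both formats are hypothetical):

```python
# Sketch of the proposed "conflict detection recall" metric: of the
# contradictions annotated in the benchmark, what fraction did the
# system flag? The ID-set representation is an assumption.
def conflict_detection_recall(gold: set[str], predicted: set[str]) -> float:
    """Recall of annotated contradictions; false positives are ignored here."""
    if not gold:
        return 1.0  # nothing to catch
    return len(gold & predicted) / len(gold)

gold = {"c1", "c2", "c3", "c4"}
pred = {"c1", "c3", "c5"}  # c5 is a false positive; recall does not penalize it
print(conflict_detection_recall(gold, pred))  # 0.5
```

Pairing this with a precision counterpart would also expose the over-deletion failure mode directly, which QA accuracy only shows indirectly.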

Happy to run additional evaluations or adapt the approach if useful
for future versions of the benchmark. Open to collaboration.
