Hi @HUST-AI-HYZ — congrats on the ICLR 2026 acceptance!
I built Mnemos, an open-source memory engine with typed conflict resolution,
and evaluated it on your full Conflict_Resolution split (800 questions,
8 examples). Sharing results in case they're useful for future benchmark versions.
Approach: Classify contradictions into types (preference evolution /
factual correction / context-dependent) and apply different resolution
strategies at ingestion time, before retrieval.
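To make the idea concrete, here is a minimal sketch of typed resolution at ingestion time. The names (`ConflictType`, `resolve`) and per-type strategies are illustrative assumptions, not the actual Mnemos API:

```python
from enum import Enum

class ConflictType(Enum):
    PREFERENCE_EVOLUTION = "preference_evolution"
    FACTUAL_CORRECTION = "factual_correction"
    CONTEXT_DEPENDENT = "context_dependent"

def resolve(old_fact: str, new_fact: str, ctype: ConflictType) -> list[str]:
    """Apply a per-type resolution strategy before anything is retrieved."""
    if ctype == ConflictType.FACTUAL_CORRECTION:
        return [new_fact]            # overwrite: keep only the correction
    if ctype == ConflictType.PREFERENCE_EVOLUTION:
        return [new_fact]            # latest stated preference wins
    return [old_fact, new_fact]      # context-dependent: both can be true
```

The key point is that the decision happens when the contradiction is ingested, so retrieval never has to arbitrate between stale and current facts.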
Results (full Conflict_Resolution split):
| Split | Mnemos | Naive Baseline |
|---|---|---|
| FC-MH 6K | 27.0% | 9.0% |
| FC-MH 32K | 11.0% | 3.0% |
| FC-MH 64K | 8.0% | 6.0% |
| FC-MH 262K | 2.0% | 2.0% |
| FC-SH 6K | 90.0% | 69.0% |
| FC-SH 32K | 65.0% | 80.0% |
| FC-SH 64K | 55.0% | 76.0% |
| FC-SH 262K | 28.0% | 76.0% |
| MH Average | 12.0% | 5.0% |
- Backbone: GPT-4.1-mini
- Embeddings: all-MiniLM-L6-v2
- The naive baseline uses identical LLM + embeddings + retrieval —
the delta is purely from conflict resolution.
Known limitation: On long-context single-hop (32K+), the engine
over-deletes similar-but-non-contradictory facts. Adaptive thresholds
per context length would address this.
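One possible shape for such an adaptive threshold, offered purely as a sketch (the base value, slope, and 8K pivot are made-up parameters, not anything tuned or shipped in Mnemos): tighten the delete threshold as context grows, so that at 32K+ only very-high-similarity pairs are treated as contradictions.

```python
import math

def adaptive_threshold(context_tokens: int,
                       base: float = 0.80,
                       slope: float = 0.02) -> float:
    """Raise the deletion threshold by `slope` per doubling of context past 8K.

    Longer contexts contain more similar-but-distinct facts, so the bar for
    calling two facts contradictory should rise with scale.
    """
    if context_tokens <= 8_000:
        return base
    doublings = math.log2(context_tokens / 8_000)
    return min(0.99, base + slope * doublings)
```

At 8K this leaves the threshold at 0.80; at 32K it rises to 0.84, and it is capped at 0.99 so deletion never becomes impossible.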
Repo: https://github.com/Sohamp2809/mnemos
Results JSON: https://github.com/Sohamp2809/mnemos/blob/main/results/mabench_cr_full.json
Reproducible with: `python Benchmark_sample/run_MemoryAgentBench.py --llm openai`
A few open questions I'd value your perspective on:
- The over-deletion on long-context SH (32K+) suggests the fixed
  similarity threshold breaks down at scale. Did you observe similar
  patterns with Mem0 or Cognee's conflict handling?
- Would it be useful for the benchmark to include a "conflict
  detection recall" metric alongside QA accuracy to measure how
  many contradictions a system actually catches vs. misses?
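For clarity, the metric I have in mind would be as simple as the following sketch (field names are my own assumptions, not part of MemoryAgentBench):

```python
def conflict_detection_recall(gold: set[str], detected: set[str]) -> float:
    """Fraction of annotated contradictions the memory system flagged.

    `gold` is the set of contradiction IDs annotated in the benchmark;
    `detected` is the set the system raised during ingestion.
    """
    if not gold:
        return 1.0  # vacuously perfect when there is nothing to catch
    return len(gold & detected) / len(gold)
```

Reporting this alongside QA accuracy would separate "missed the contradiction entirely" from "caught it but resolved it wrongly," which the QA score alone conflates.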
Happy to run additional evaluations or adapt the approach if useful
for future versions of the benchmark. Open to collaboration.