Hi @HUST-AI-HYZ — congrats on the ICLR 2026 acceptance!
I built Mnemos, an open-source memory engine with typed conflict resolution,
and evaluated it on your full Conflict_Resolution split (800 questions,
8 examples). Sharing results in case they're useful for future benchmark versions.
Approach: Classify contradictions into types (preference evolution /
factual correction / context-dependent) and apply different resolution
strategies at ingestion time, before retrieval.
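To make the idea concrete, here is a minimal sketch of typed resolution at ingestion time. The names (`ConflictType`, `resolve`) and per-type strategies are illustrative assumptions, not the actual Mnemos API:

```python
from enum import Enum

class ConflictType(Enum):
    PREFERENCE_EVOLUTION = "preference_evolution"
    FACTUAL_CORRECTION = "factual_correction"
    CONTEXT_DEPENDENT = "context_dependent"

def resolve(old_fact: str, new_fact: str, ctype: ConflictType) -> list[str]:
    """Apply a per-type resolution strategy before anything is retrieved."""
    if ctype == ConflictType.FACTUAL_CORRECTION:
        return [new_fact]            # overwrite: keep only the correction
    if ctype == ConflictType.PREFERENCE_EVOLUTION:
        return [new_fact]            # latest stated preference wins
    return [old_fact, new_fact]      # context-dependent: both can be true
```

The key point is that the decision happens when the contradiction is ingested, so retrieval never has to arbitrate between stale and current facts.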
Results (full Conflict_Resolution split):
| Split | Mnemos | Naive Baseline |
|---|---|---|
| FC-MH 6K | 27.0% | 9.0% |
| FC-MH 32K | 11.0% | 3.0% |
| FC-MH 64K | 8.0% | 6.0% |
| FC-MH 262K | 2.0% | 2.0% |
| FC-SH 6K | 90.0% | 69.0% |
| FC-SH 32K | 65.0% | 80.0% |
| FC-SH 64K | 55.0% | 76.0% |
| FC-SH 262K | 28.0% | 76.0% |
| MH Average | 12.0% | 5.0% |
- Backbone: GPT-4.1-mini
- Embeddings: all-MiniLM-L6-v2
- The naive baseline uses identical LLM + embeddings + retrieval —
the delta is purely from conflict resolution.
Known limitation: On long-context single-hop (32K+), the engine
over-deletes similar-but-non-contradictory facts. Adaptive thresholds
per context length would address this.
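One possible shape for such an adaptive threshold, offered purely as a sketch (the base value, slope, and 8K pivot are made-up parameters, not anything tuned or shipped in Mnemos): tighten the delete threshold as context grows, so that at 32K+ only very-high-similarity pairs are treated as contradictions.

```python
import math

def adaptive_threshold(context_tokens: int,
                       base: float = 0.80,
                       slope: float = 0.02) -> float:
    """Raise the deletion threshold by `slope` per doubling of context past 8K.

    Longer contexts contain more similar-but-distinct facts, so the bar for
    calling two facts contradictory should rise with scale.
    """
    if context_tokens <= 8_000:
        return base
    doublings = math.log2(context_tokens / 8_000)
    return min(0.99, base + slope * doublings)
```

At 8K this leaves the threshold at 0.80; at 32K it rises to 0.84, and it is capped at 0.99 so deletion never becomes impossible.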
Repo: https://github.com/Sohamp2809/mnemos
Results JSON: https://github.com/Sohamp2809/mnemos/blob/main/results/mabench_cr_full.json
Reproducible with: `python Benchmark_sample/run_MemoryAgentBench.py --llm openai`
A few open questions I'd value your perspective on:
- The over-deletion on long-context SH (32K+) suggests the fixed
  similarity threshold breaks down at scale. Did you observe similar
  patterns with Mem0 or Cognee's conflict handling?
- Would it be useful for the benchmark to include a "conflict
  detection recall" metric alongside QA accuracy to measure how
  many contradictions a system actually catches vs. misses?
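For clarity, the metric I have in mind would be as simple as the following sketch (field names are my own assumptions, not part of MemoryAgentBench):

```python
def conflict_detection_recall(gold: set[str], detected: set[str]) -> float:
    """Fraction of annotated contradictions the memory system flagged.

    `gold` is the set of contradiction IDs annotated in the benchmark;
    `detected` is the set the system raised during ingestion.
    """
    if not gold:
        return 1.0  # vacuously perfect when there is nothing to catch
    return len(gold & detected) / len(gold)
```

Reporting this alongside QA accuracy would separate "missed the contradiction entirely" from "caught it but resolved it wrongly," which the QA score alone conflates.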
Happy to run additional evaluations or adapt the approach if useful
for future versions of the benchmark. Open to collaboration.