LongMemEval
Long-term interactive memory across 7 question types (ICLR 2025). Currently using the oracle split; the longmemeval_s variant adds a realistic distractor haystack.
No per-category data.
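The oracle/haystack distinction above can be sketched in a few lines. This is an illustrative assumption about the data shape, not the actual LongMemEval loader: the field names `sessions` and `is_evidence` are hypothetical.

```python
# Hypothetical sketch of the split difference: the oracle split keeps
# only the sessions known to contain the answer, while longmemeval_s
# retains the full distractor haystack. Field names are assumptions.

def build_context(question, use_oracle=True):
    """Select which past sessions the model gets to see."""
    sessions = question["sessions"]
    if use_oracle:
        # Oracle split: evidence sessions only.
        return [s for s in sessions if s["is_evidence"]]
    # longmemeval_s: evidence buried among realistic distractors.
    return sessions

question = {
    "sessions": [
        {"id": "s1", "is_evidence": False, "text": "chat about weather"},
        {"id": "s2", "is_evidence": True, "text": "user mentions new job"},
    ]
}

oracle_ctx = build_context(question, use_oracle=True)
full_ctx = build_context(question, use_oracle=False)
```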
| Suite | Headline metric | Questions | Wall (min) | Cost ($) |
|---|---|---|---|---|
| longmemeval | – | 0 | 0.0 | $0.00 |
| locomo | – | 0 | 0.0 | $0.00 |
| graphrag_full | – | 0 | 0.0 | $0.00 |
| multihop | – | 0 | 0.0 | $0.00 |
| throughput | – | 0 | 0.0 | $0.00 |
LoCoMo
Long Conversation Memory benchmark: multi-session dialogue with single-hop, multi-hop, counterfactual, and abstention question types.
No per-category data.
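The abstention category above rewards declining to answer rather than guessing. A minimal illustrative scorer, assuming this is not LoCoMo's official implementation and the decline-phrase list is a made-up placeholder:

```python
# Illustrative sketch: abstention questions count as correct only when
# the model declines; ordinary questions use a simple substring match.
# DECLINE_MARKERS is an assumed list, not LoCoMo's actual phrase set.

DECLINE_MARKERS = ("i don't know", "not mentioned", "no information")

def score(prediction: str, gold: str, is_abstention: bool) -> bool:
    pred = prediction.strip().lower()
    if is_abstention:
        return any(marker in pred for marker in DECLINE_MARKERS)
    return gold.strip().lower() in pred
```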
graphrag_full
End-to-end graph-RAG quality (ICLR 2026). Measures answer correctness, rationale quality (R score), and the combined AR metric.
Multi-hop (HotpotQA)
Multi-hop reasoning over HotpotQA. Bridge questions require chaining evidence across passages; comparison questions contrast two entities.
No per-category data.
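HotpotQA-style answers are conventionally scored with exact match and token-level F1 over normalized strings; a minimal sketch of that normalization and scoring (whether this suite uses exactly these metrics is an assumption):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace
    (the usual SQuAD/HotpotQA answer normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def f1(pred: str, gold: str) -> float:
    """Token-level F1 between prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```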
Throughput
Query throughput as concurrency increases. The first level (concurrency=1) establishes baseline latency; later levels show saturation behavior.
No per-category data.
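The concurrency sweep described above can be sketched as follows. `run_query` is a hypothetical stand-in for the real client call, and the concurrency levels are illustrative:

```python
# Measure queries/sec at increasing concurrency levels; concurrency=1
# gives the baseline, higher levels reveal where throughput saturates.
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(q: str) -> str:
    time.sleep(0.01)  # placeholder for a real network round-trip
    return "ok"

def measure_qps(queries, concurrency: int) -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(run_query, queries))
    return len(queries) / (time.perf_counter() - start)

queries = [f"q{i}" for i in range(32)]
results = {level: measure_qps(queries, level) for level in (1, 2, 4, 8)}
```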