Khora Benchmarks

Memory system performance across LongMemEval, LoCoMo, GraphRAG-Bench, and more.

Latest run: demo (demo) — 20587 days ago

Headline

LongMemEval accuracy: 0.0% (demo)
LoCoMo accuracy: 0.0%
Retrieval MRR: 0.000 (demo)
Median query latency: 0 ms
Cost per question: $0.0000
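Retrieval MRR is mean reciprocal rank: for each question, take the reciprocal of the rank at which the first relevant item appears in the retrieved list, then average over all questions. A minimal sketch (the function name and data shapes are illustrative, not taken from the Khora harness):

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """MRR: average over queries of 1/rank of the first relevant hit.

    ranked_results: list of ranked doc-id lists, one per query.
    relevant: list of sets of relevant doc ids, one per query.
    A query with no relevant hit contributes 0 to the average.
    """
    total = 0.0
    for hits, gold in zip(ranked_results, relevant):
        for rank, doc_id in enumerate(hits, start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# First query hits at rank 1, second at rank 2: (1 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([["a", "b"], ["x", "y"]], [{"a"}, {"y"}]))
```

An MRR of 1.0 means the first retrieved item was always relevant; 0.000 here simply reflects the empty demo run.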

Per-suite breakdown

Suite          Headline metric   Questions   Wall (min)   Cost ($)
longmemeval    —                 0           0.0          $0.00
locomo         —                 0           0.0          $0.00
graphrag_full  —                 0           0.0          $0.00
multihop       —                 0           0.0          $0.00
throughput     —                 0           0.0          $0.00

Deep dive

LongMemEval

Long-term interactive memory across 7 question types (ICLR 2025). Currently using the oracle split; longmemeval_s adds the realistic distractor haystack.

Per question type

No per-category data.

LoCoMo

Long Conversation Memory benchmark — multi-session dialogue with single-hop, multi-hop, counterfactual, and abstention question types.

Per category

No per-category data.

GraphRAG-Bench

End-to-end graph-RAG quality (ICLR 2026). Measures answer correctness, rationale quality (R score), and the combined AR metric.

Multi-hop (HotpotQA)

Multi-hop reasoning over HotpotQA. Bridge questions require chaining evidence across passages; comparison questions contrast two entities.

Per question type

No per-category data.

Throughput

Query throughput as concurrency increases. The first level (concurrency=1) establishes baseline latency; later levels show saturation behavior.
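The sweep described above can be sketched with a thread pool: run a fixed batch of queries at each concurrency level, record per-query latencies, and derive QPS from wall time and p95 from the latency distribution. This is a generic sketch, assuming a stand-in `run_query`; it is not the Khora harness itself:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(_):
    # Stand-in for a real memory-system query; sleep simulates work.
    time.sleep(0.01)

def measure(concurrency, n_queries=50):
    """Run n_queries at the given concurrency; return (QPS, p95 latency in s)."""
    latencies = []
    def timed(i):
        t0 = time.perf_counter()
        run_query(i)
        latencies.append(time.perf_counter() - t0)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed, range(n_queries)))
    wall = time.perf_counter() - start
    qps = n_queries / wall
    p95 = statistics.quantiles(latencies, n=20)[-1]  # 19th of 19 cut points = p95
    return qps, p95

for c in (1, 2, 4, 8):
    qps, p95 = measure(c)
    print(f"concurrency={c}: {qps:.0f} QPS, p95={p95 * 1000:.1f} ms")
```

With an ideal system, QPS scales linearly with concurrency while p95 stays flat; saturation shows up as QPS plateauing and p95 climbing.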

QPS and p95 latency by concurrency

No per-category data.

Across releases

All runs