The result
LongMemEval is a 500-question benchmark for evaluating long-term memory in conversational AI systems. It was published at ICLR 2025 by researchers at the University of Michigan. It tests whether a system can remember facts, track updates, reason about time, and recall preferences across extended multi-session conversations.
SIBYL scored 95.6% using Claude Opus 4.6 and 93.6% using Claude Sonnet. Both runs used the same memory architecture: hierarchical file memory with a glossary-indexed journal. No vectors. No embeddings. No retrieval pipeline. The model reads JSON files directly.
This places SIBYL at #2 on the community leaderboard, behind only agentmemory V4 (96.2%) and tied with PwC's Chronos at 95.6%. Every other system in the top 10 uses vector stores, embeddings, or hybrid retrieval pipelines.
Community leaderboard
Self-reported results. No official leaderboard exists. Judges and generator models vary across entries. Updated April 15, 2026.
| # | System | Score | Architecture |
|---|---|---|---|
| 1 | agentmemory V4 | 96.2% | BM25 + vector hybrid |
| 2 | SIBYL (Opus) | 95.6% | Hierarchical file memory |
| 2 | Chronos (PwC) | 95.6% | Unknown |
| 4 | Mastra Observational Memory | 94.9% | Vector + LLM extraction |
| 5 | SIBYL (Sonnet) | 93.6% | Hierarchical file memory |
| 6 | Backboard | 93.4% | Unknown |
| 7 | OMEGA | 93.2% | bge-small ONNX embeddings |
| 8 | Hindsight (Vectorize) | 91.4% | Semantic + BM25 hybrid |
| 9 | HydraDB | 90.8% | Closed |
| 10 | Appleseed Memory | 90.2% | Open |
| 11 | Neutrally | 89.4% | Unknown |
| 12 | sociomemory | 86.6% | 10-step Hyper Search RAG |
| 13 | Emergence AI | 86.0% | RAG |
| 14 | Supermemory | 85.9% | Cloud embeddings |
| - | Full-context GPT-4o (baseline) | 60.2% | Entire history in context |
Per-category breakdown
LongMemEval tests six categories. The table below compares v1 (the Sonnet baseline), v2 with Sonnet, and v2 with Opus.
| Category | v1 Sonnet | v2 Sonnet | v2 Opus | n |
|---|---|---|---|---|
| single-session-user | 95.7% | 100% | 100% | 70 |
| single-session-assistant | 92.9% | 100% | 100% | 56 |
| temporal-reasoning | 75.2% | 94.7% | 96.2% | 133 |
| knowledge-update | 94.9% | 96.2% | 92.3% | 78 |
| multi-session | 90.1% | 88.0% | 93.2% | 133 |
| single-session-preference | 70.0% | 80.0% | 93.3% | 30 |
| Overall | 86.7% | 93.6% | 95.6% | 500 |
The upgrade that mattered
v1 scored 86.7%. The main weakness was temporal reasoning at 75.2%. The architecture stored all conversation data in a chronological journal, but the agent had no map of which journal lines contained temporal information. It had to scan everything for every question.
The v2 upgrade added glossary-indexed ingest. During data preparation, every journal line is classified into categories: temporal references, preferences, facts, and knowledge updates. The resulting INDEX.json contains line-level pointers with date metadata and content previews.
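The ingest step can be sketched as follows. This is an illustration only: the category rules, the `date`/`text` field names, and the index layout are assumptions for the example, not SIBYL's actual schema.

```python
import json
import re

# Hypothetical classification rules; SIBYL's real classifier may differ.
RULES = {
    "temporal":   re.compile(r"\b(yesterday|last (week|month|year)|ago|on \d{4}-\d{2}-\d{2})\b", re.I),
    "preference": re.compile(r"\b(prefer|favorite|like to|always|never)\b", re.I),
    "update":     re.compile(r"\b(changed|switched|no longer|now use[sd]?)\b", re.I),
}

def build_index(journal_lines):
    """Classify each JSONL journal line and record a line-level pointer
    with date metadata and a short content preview."""
    index = {cat: [] for cat in RULES}
    index["fact"] = []  # fallback bucket for unclassified lines
    for lineno, raw in enumerate(journal_lines):
        entry = json.loads(raw)
        pointer = {
            "line": lineno,
            "date": entry.get("date"),
            "preview": entry["text"][:80],
        }
        matched = False
        for cat, pattern in RULES.items():
            if pattern.search(entry["text"]):
                index[cat].append(pointer)
                matched = True
        if not matched:
            index["fact"].append(pointer)
    return index

journal = [
    json.dumps({"date": "2026-03-01", "text": "User said they prefer window seats."}),
    json.dumps({"date": "2026-03-08", "text": "User switched from Lyft to public transit."}),
]
index = build_index(journal)
```

The output of this pass is what would be serialized to INDEX.json: per-category lists of pointers into the journal, built once at ingest time.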
```
v1: question arrives -> agent reads entire journal -> searches for answer
v2: question arrives -> agent reads INDEX.json -> jumps to indexed lines -> answers
```

The glossary tells the agent WHERE to look before it reads anything. For temporal questions, it checks the temporal index first; for preference questions, the preference index. The journal is the source of truth. The index is the routing layer.
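The lookup side of that routing layer could look like this. The journal entries and index shape here are toy data for illustration; in practice the index would be the one produced at ingest.

```python
import json

# Toy journal and index; in a real deployment these would live in
# a JSONL journal file and INDEX.json.
journal = [
    '{"date": "2026-03-01", "text": "User prefers aisle seats."}',
    '{"date": "2026-03-08", "text": "Meeting moved to next Tuesday."}',
]
index = {
    "preference": [{"line": 0, "date": "2026-03-01", "preview": "User prefers aisle seats."}],
    "temporal":   [{"line": 1, "date": "2026-03-08", "preview": "Meeting moved to next Tuesday."}],
}

def lookup(category):
    """Route by question category, then read only the indexed lines
    instead of scanning the whole journal."""
    return [json.loads(journal[p["line"]]) for p in index.get(category, [])]

hits = lookup("temporal")
```

The cost of a question becomes proportional to the number of relevant lines, not the length of the journal.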
This is the same architectural pattern SIBYL uses in production. INDEX.json is the master map. Entity files are single sources of truth. The journal is append-only raw data. The agent reads the index first, then navigates to what it needs.
Temporal reasoning went from 75.2% to 96.2%. The overall score went from 86.7% to 95.6%. The architectural insight: giving a model a map of where to look is more effective than making it scan everything, and it costs nothing. No embedding model. No vector similarity search. A JSON file with line numbers.
Architecture
SIBYL's memory is a hierarchical tiered file system. It was built from operational necessity over 50+ days of continuous autonomous operation, not designed in a lab for a benchmark.
- HOT (read every session): INDEX.json, session.json, priorities.json, treasury.json
- WARM (read on demand): entity files, one JSON per project, person, or product
- COLD (append-only): journal (JSONL), error logs, revenue logs
- FROZEN (archive): closed items, old journals, passed evaluations
- REFERENCE (static docs): evaluation framework, operational guidelines
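A minimal sketch of how a session might consume these tiers, assuming the directory layout above (paths, the `entities/` subdirectory, and function names are illustrative, not SIBYL's actual code):

```python
import json
import tempfile
from pathlib import Path

HOT = ["INDEX.json", "session.json", "priorities.json", "treasury.json"]

def boot(memory: Path):
    """HOT tier: load every file up front at the start of a session."""
    return {name: json.loads((memory / name).read_text()) for name in HOT}

def read_entity(memory: Path, name: str):
    """WARM tier: one JSON file per entity, read only on demand."""
    return json.loads((memory / "entities" / f"{name}.json").read_text())

# Demo with a throwaway memory directory.
root = Path(tempfile.mkdtemp())
(root / "entities").mkdir()
for name in HOT:
    (root / name).write_text("{}")
(root / "entities" / "longmemeval.json").write_text('{"status": "benchmarked"}')

hot = boot(root)
entity = read_entity(root, "longmemeval")
```

Nothing here requires a server or an API key, which is the point of the comparison table that follows.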
| Component | SIBYL | Typical vector system |
|---|---|---|
| Storage | JSON files + JSONL journal | Vector DB (Pinecone, Chroma, etc.) |
| Indexing | INDEX.json (master catalog with line pointers) | Embedding model (ada-002, etc.) |
| Retrieval | Model reads files directly | Cosine similarity search |
| Update | Edit the file. Instant. | Re-embed chunks |
| Infrastructure | None. Filesystem only. | Vector DB server + embedding API |
| Monthly cost | $0 | $19 - $249+ |
| Portability | cp -r (any system, any LLM) | Locked to embedding model |
What this does not prove
This benchmark measures answer accuracy on a specific dataset with a specific evaluation methodology. It is worth stating what it does not establish:
- There is no official LongMemEval leaderboard. Community results are self-reported with varying judges and generator models. Direct comparison across entries carries caveats.
- SIBYL uses Claude (Opus 4.6 and Sonnet) as both the answering model and the evaluation judge for preference questions. Other entries use GPT-4o or GPT-4o-mini as judges. Judge choice affects scores.
- Benchmark performance does not guarantee production performance. SIBYL's architecture was built for production use and happens to benchmark well. The reverse (benchmark-first design) is a different optimization target.
- File-based memory has scaling limits that vector systems do not. At tens of thousands of entity files, direct file reads become slower than indexed retrieval. SIBYL operates well within those limits today.
Test conditions
Dataset: LongMemEval Oracle (ICLR 2025, University of Michigan)
Questions: 500 total, 6 categories
Models: Claude Opus 4.6, Claude Sonnet
Hardware: 4 vCPU / 16GB RAM (AWS)
Ingest: Glossary-indexed v2 (temporal, preference, fact, update line indexes)
Architecture: INDEX.json routing layer + chronological JSONL journal + entity files
Scoring: Programmatic v3 matcher (substring, number, off-by-one tolerance,
abstention detection, phrase overlap) with manual review of all
flagged incorrect answers. Preference questions judged by Claude
using official LongMemEval rubric.
Data: hypotheses-sonnet.jsonl, hypotheses-opus.jsonl (raw model answers)
scores-sonnet.jsonl, scores-opus.jsonl (every judgment with method)
Available on request for independent verification.
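The matcher strategy listed under Scoring (substring, number, off-by-one tolerance, abstention detection, phrase overlap) can be sketched roughly as below. The abstention phrases and the 0.6 overlap threshold are assumptions for the example; this illustrates the approach, not the actual v3 matcher.

```python
import re

# Hypothetical abstention phrases; the real matcher's list may differ.
ABSTAIN = ("i don't know", "no information", "not mentioned")

def numbers(text):
    """Extract all numeric tokens as floats."""
    return [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", text)]

def judge(hypothesis, gold):
    """Return True if the model answer matches the gold answer
    under the tolerances described above."""
    h, g = hypothesis.lower().strip(), gold.lower().strip()
    if any(a in h for a in ABSTAIN):
        # Abstention detection: correct only when the gold answer
        # is itself an abstention.
        return any(a in g for a in ABSTAIN)
    if g in h:
        return True  # substring match
    gn, hn = numbers(g), numbers(h)
    if gn and hn:
        # Exact numeric match, with off-by-one tolerance for counts.
        return any(abs(a - b) <= 1 for a in gn for b in hn)
    # Crude phrase-overlap fallback.
    overlap = set(g.split()) & set(h.split())
    return len(overlap) / max(len(g.split()), 1) > 0.6

verdict = judge("They visited Kyoto twice, in 2019 and 2021.", "2019")
```

Answers this matcher flags as incorrect are the ones that go to manual review, per the scoring description above.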