The result
LongMemEval is a 500-question benchmark for evaluating long-term memory in conversational AI systems. It was published at ICLR 2025 by researchers at the University of Michigan. It tests whether a system can remember facts, track updates, reason about time, and recall preferences across extended multi-session conversations.
SIBYL scored 95.6% using Claude Opus 4.6 and 93.6% using Claude Sonnet. Both runs used the same memory architecture: hierarchical file memory with a glossary-indexed journal. No vectors. No embeddings. No retrieval pipeline. The model reads JSON files directly.
This places SIBYL at #2 on the community leaderboard, behind only agentmemory V4 (96.2%) and tied with PwC's Chronos at 95.6%. Every other system in the top 10 uses vector stores, embeddings, or hybrid retrieval pipelines.
Community leaderboard
Self-reported results. No official leaderboard exists. Judges and generator models vary across entries. Updated April 15, 2026.
| # | System | Score | Architecture |
|---|---|---|---|
| 1 | agentmemory V4 | 96.2% | BM25 + vector hybrid |
| 2 | SIBYL (Opus) | 95.6% | Hierarchical file memory |
| 2 | Chronos (PwC) | 95.6% | Unknown |
| 4 | Mastra Observational Memory | 94.9% | Vector + LLM extraction |
| 5 | SIBYL (Sonnet) | 93.6% | Hierarchical file memory |
| 6 | Backboard | 93.4% | Unknown |
| 7 | OMEGA | 93.2% | bge-small ONNX embeddings |
| 8 | Hindsight (Vectorize) | 91.4% | Semantic + BM25 hybrid |
| 9 | HydraDB | 90.8% | Closed |
| 10 | Appleseed Memory | 90.2% | Open |
| 11 | Neutrally | 89.4% | Unknown |
| 12 | sociomemory | 86.6% | 10-step Hyper Search RAG |
| 13 | Emergence AI | 86.0% | RAG |
| 14 | Supermemory | 85.9% | Cloud embeddings |
| - | Full-context GPT-4o (baseline) | 60.2% | Entire history in context |
Per-category breakdown
LongMemEval tests six categories. The table below compares v1 (the Sonnet baseline), v2 with Sonnet, and v2 with Opus.
| Category | v1 Sonnet | v2 Sonnet | v2 Opus | n |
|---|---|---|---|---|
| single-session-user | 95.7% | 100% | 100% | 70 |
| single-session-assistant | 92.9% | 100% | 100% | 56 |
| temporal-reasoning | 75.2% | 94.7% | 96.2% | 133 |
| knowledge-update | 94.9% | 96.2% | 92.3% | 78 |
| multi-session | 90.1% | 88.0% | 93.2% | 133 |
| single-session-preference | 70.0% | 80.0% | 93.3% | 30 |
| Overall | 86.7% | 93.6% | 95.6% | 500 |
The upgrade that mattered
v1 scored 86.7%. The main weakness was temporal reasoning at 75.2%. The architecture stored all conversation data in a chronological journal, but the agent had no map of which journal lines contained temporal information. It had to scan everything for every question.
The v2 upgrade added glossary-indexed ingest. During data preparation, every journal line is classified into categories: temporal references, preferences, facts, and knowledge updates. The resulting INDEX.json contains line-level pointers with date metadata and content previews.
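The ingest step can be sketched as follows. This is an illustration only: the category rules, the `date`/`text` field names, and the index layout are assumptions for the example, not SIBYL's actual schema.

```python
import json
import re

# Hypothetical classification rules; SIBYL's real classifier may differ.
RULES = {
    "temporal":   re.compile(r"\b(yesterday|last (week|month|year)|ago|on \d{4}-\d{2}-\d{2})\b", re.I),
    "preference": re.compile(r"\b(prefer|favorite|like to|always|never)\b", re.I),
    "update":     re.compile(r"\b(changed|switched|no longer|now use[sd]?)\b", re.I),
}

def build_index(journal_lines):
    """Classify each JSONL journal line and record a line-level pointer
    with date metadata and a short content preview."""
    index = {cat: [] for cat in RULES}
    index["fact"] = []  # fallback bucket for unclassified lines
    for lineno, raw in enumerate(journal_lines):
        entry = json.loads(raw)
        pointer = {
            "line": lineno,
            "date": entry.get("date"),
            "preview": entry["text"][:80],
        }
        matched = False
        for cat, pattern in RULES.items():
            if pattern.search(entry["text"]):
                index[cat].append(pointer)
                matched = True
        if not matched:
            index["fact"].append(pointer)
    return index

journal = [
    json.dumps({"date": "2026-03-01", "text": "User said they prefer window seats."}),
    json.dumps({"date": "2026-03-08", "text": "User switched from Lyft to public transit."}),
]
index = build_index(journal)
```

The output of this pass is what would be serialized to INDEX.json: per-category lists of pointers into the journal, built once at ingest time.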
```
v1: question arrives -> agent reads entire journal -> searches for answer
v2: question arrives -> agent reads INDEX.json -> jumps to indexed lines -> answers
```

The glossary tells the agent WHERE to look before it reads anything. For temporal questions, it checks the temporal index first; for preference questions, the preference index. The journal is the source of truth. The index is the routing layer.
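The lookup side of that routing layer could look like this. The journal entries and index shape here are toy data for illustration; in practice the index would be the one produced at ingest.

```python
import json

# Toy journal and index; in a real deployment these would live in
# a JSONL journal file and INDEX.json.
journal = [
    '{"date": "2026-03-01", "text": "User prefers aisle seats."}',
    '{"date": "2026-03-08", "text": "Meeting moved to next Tuesday."}',
]
index = {
    "preference": [{"line": 0, "date": "2026-03-01", "preview": "User prefers aisle seats."}],
    "temporal":   [{"line": 1, "date": "2026-03-08", "preview": "Meeting moved to next Tuesday."}],
}

def lookup(category):
    """Route by question category, then read only the indexed lines
    instead of scanning the whole journal."""
    return [json.loads(journal[p["line"]]) for p in index.get(category, [])]

hits = lookup("temporal")
```

The cost of a question becomes proportional to the number of relevant lines, not the length of the journal.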
This is the same architectural pattern SIBYL uses in production. INDEX.json is the master map. Entity files are single sources of truth. The journal is append-only raw data. The agent reads the index first, then navigates to what it needs.
Temporal reasoning went from 75.2% to 96.2%. The overall score went from 86.7% to 95.6%. The architectural insight: giving a model a map of where to look is more effective than making it scan everything, and it costs nothing. No embedding model. No vector similarity search. A JSON file with line numbers.
Architecture
SIBYL's memory is a hierarchical tiered file system. It was built from operational necessity over 50+ days of continuous autonomous operation, not designed in a lab for a benchmark.
- HOT (read every session): INDEX.json, session.json, priorities.json, treasury.json
- WARM (read on demand): entity files, one JSON per project, person, or product
- COLD (append-only): journal (JSONL), error logs, revenue logs
- FROZEN (archive): closed items, old journals, passed evaluations
- REFERENCE (static docs): evaluation framework, operational guidelines
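A minimal sketch of how a session might consume these tiers, assuming the directory layout above (paths, the `entities/` subdirectory, and function names are illustrative, not SIBYL's actual code):

```python
import json
import tempfile
from pathlib import Path

HOT = ["INDEX.json", "session.json", "priorities.json", "treasury.json"]

def boot(memory: Path):
    """HOT tier: load every file up front at the start of a session."""
    return {name: json.loads((memory / name).read_text()) for name in HOT}

def read_entity(memory: Path, name: str):
    """WARM tier: one JSON file per entity, read only on demand."""
    return json.loads((memory / "entities" / f"{name}.json").read_text())

# Demo with a throwaway memory directory.
root = Path(tempfile.mkdtemp())
(root / "entities").mkdir()
for name in HOT:
    (root / name).write_text("{}")
(root / "entities" / "longmemeval.json").write_text('{"status": "benchmarked"}')

hot = boot(root)
entity = read_entity(root, "longmemeval")
```

Nothing here requires a server or an API key, which is the point of the comparison table that follows.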
| Component | SIBYL | Typical vector system |
|---|---|---|
| Storage | JSON files + JSONL journal | Vector DB (Pinecone, Chroma, etc.) |
| Indexing | INDEX.json (master catalog with line pointers) | Embedding model (ada-002, etc.) |
| Retrieval | Model reads files directly | Cosine similarity search |
| Update | Edit the file. Instant. | Re-embed chunks |
| Infrastructure | None. Filesystem only. | Vector DB server + embedding API |
| Monthly cost | $0 | $19 - $249+ |
| Portability | cp -r (any system, any LLM) | Locked to embedding model |
What this does not prove
This benchmark measures answer accuracy on a specific dataset with a specific evaluation methodology. It is worth stating what it does not establish:
- There is no official LongMemEval leaderboard. Community results are self-reported with varying judges and generator models. Direct comparison across entries carries caveats.
- SIBYL uses Claude (Opus 4.6 and Sonnet) as both the answering model and the evaluation judge for preference questions. Other entries use GPT-4o or GPT-4o-mini as judges. Judge choice affects scores.
- Benchmark performance does not guarantee production performance. SIBYL's architecture was built for production use and happens to benchmark well. The reverse (benchmark-first design) is a different optimization target.
- File-based memory has scaling limits that vector systems do not. At tens of thousands of entity files, direct file reads become slower than indexed retrieval. SIBYL operates well within those limits today.
Test conditions
Dataset: LongMemEval Oracle (ICLR 2025, University of Michigan)
Questions: 500 total, 6 categories
Models: Claude Opus 4.6, Claude Sonnet
Hardware: 4 vCPU / 16GB RAM (AWS)
Ingest: Glossary-indexed v2 (temporal, preference, fact, update line indexes)
Architecture: INDEX.json routing layer + chronological JSONL journal + entity files
Scoring: Programmatic v3 matcher (substring, number, off-by-one tolerance,
abstention detection, phrase overlap) with manual review of all
flagged incorrect answers. Preference questions judged by Claude
using official LongMemEval rubric.
Data: hypotheses-sonnet.jsonl, hypotheses-opus.jsonl (raw model answers)
scores-sonnet.jsonl, scores-opus.jsonl (every judgment with method)
Available on request for independent verification.
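The matcher strategy listed under Scoring (substring, number, off-by-one tolerance, abstention detection, phrase overlap) can be sketched roughly as below. The abstention phrases and the 0.6 overlap threshold are assumptions for the example; this illustrates the approach, not the actual v3 matcher.

```python
import re

# Hypothetical abstention phrases; the real matcher's list may differ.
ABSTAIN = ("i don't know", "no information", "not mentioned")

def numbers(text):
    """Extract all numeric tokens as floats."""
    return [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", text)]

def judge(hypothesis, gold):
    """Return True if the model answer matches the gold answer
    under the tolerances described above."""
    h, g = hypothesis.lower().strip(), gold.lower().strip()
    if any(a in h for a in ABSTAIN):
        # Abstention detection: correct only when the gold answer
        # is itself an abstention.
        return any(a in g for a in ABSTAIN)
    if g in h:
        return True  # substring match
    gn, hn = numbers(g), numbers(h)
    if gn and hn:
        # Exact numeric match, with off-by-one tolerance for counts.
        return any(abs(a - b) <= 1 for a in gn for b in hn)
    # Crude phrase-overlap fallback.
    overlap = set(g.split()) & set(h.split())
    return len(overlap) / max(len(g.split()), 1) > 0.6

verdict = judge("They visited Kyoto twice, in 2019 and 2021.", "2019")
```

Answers this matcher flags as incorrect are the ones that go to manual review, per the scoring description above.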