April 15, 2026

95.6% on LongMemEval with no additional infrastructure

SIBYL's file-based memory architecture ranks #2 on the community leaderboard. No vector store. No embeddings. No retrieval model. Just JSON files, a chronological journal, and a model that reads them.

95.6%   Opus accuracy
93.6%   Sonnet accuracy
#2      Leaderboard rank
$0      Infrastructure cost

The result

LongMemEval is a 500-question benchmark for evaluating long-term memory in conversational AI systems. It was published at ICLR 2025 by researchers at the University of Michigan. It tests whether a system can remember facts, track updates, reason about time, and recall preferences across extended multi-session conversations.

SIBYL scored 95.6% using Claude Opus 4.6 and 93.6% using Claude Sonnet. Both runs used the same memory architecture: hierarchical file memory with a glossary-indexed journal. No vectors. No embeddings. No retrieval pipeline. The model reads JSON files directly.

This places SIBYL at #2 on the community leaderboard, behind only agentmemory V4 (96.2%) and tied with Chronos by PwC (95.6%). Every other system in the top 10 uses vector stores, embeddings, or a hybrid retrieval pipeline.

Community leaderboard

Self-reported results. No official leaderboard exists. Judges and generator models vary across entries. Updated April 15, 2026.

#    System                           Score   Architecture
1    agentmemory V4                   96.2%   BM25 + vector hybrid
2    SIBYL (Opus)                     95.6%   Hierarchical file memory
2    Chronos (PwC)                    95.6%   unknown
4    Mastra Observational Memory     94.9%   Vector + LLM extraction
5    SIBYL (Sonnet)                   93.6%   Hierarchical file memory
6    Backboard                        93.4%   unknown
7    OMEGA                            93.2%   bge-small ONNX embeddings
8    Hindsight (Vectorize)            91.4%   Semantic + BM25 hybrid
9    HydraDB                          90.8%   closed
10   Appleseed Memory                 90.2%   open
11   Neutrally                        89.4%   unknown
12   sociomemory                      86.6%   10-step Hyper Search RAG
13   Emergence AI                     86.0%   RAG
14   Supermemory                      85.9%   Cloud embeddings
--   Full-context GPT-4o (baseline)   60.2%   Entire history in context

Per-category breakdown

LongMemEval tests six categories. The table compares v1 (the Sonnet baseline), v2 with Sonnet, and v2 with Opus.
Category                    v1 Sonnet   v2 Sonnet   v2 Opus   n
single-session-user         95.7%       100%        100%      70
single-session-assistant    92.9%       100%        100%      56
temporal-reasoning          75.2%       94.7%       96.2%     133
knowledge-update            94.9%       96.2%       92.3%     78
multi-session               90.1%       88.0%       93.2%     133
single-session-preference   70.0%       80.0%       93.3%     30
Overall                     86.7%       93.6%       95.6%     500

The upgrade that mattered

v1 scored 86.7%. The main weakness was temporal reasoning at 75.2%. The architecture stored all conversation data in a chronological journal, but the agent had no map of which journal lines contained temporal information. It had to scan everything for every question.

The v2 upgrade added glossary-indexed ingest. During data preparation, every journal line is classified into categories: temporal references, preferences, facts, and knowledge updates. The resulting INDEX.json contains line-level pointers with date metadata and content previews.
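The ingest step can be sketched in a few lines. This is an illustrative reconstruction, not SIBYL's actual code: the keyword patterns, the journal field names (`text`, `date`), and the INDEX.json schema are all assumptions made for the example.

```python
import json
import re

# Illustrative category detectors; the real classifier's rules are not public.
TEMPORAL = re.compile(r"\b(yesterday|last (week|month|year)|\d{4}-\d{2}-\d{2}|ago)\b", re.I)
PREFERENCE = re.compile(r"\b(prefer|favorite|like[sd]? to|always|never)\b", re.I)
UPDATE = re.compile(r"\b(changed|switched|no longer|moved to)\b", re.I)

def build_index(journal_path: str, index_path: str) -> dict:
    """Classify each journal line and record line-level pointers
    (line number, date metadata, content preview) per category."""
    index = {"temporal": [], "preference": [], "update": [], "fact": []}
    with open(journal_path) as f:
        for lineno, raw in enumerate(f, start=1):
            entry = json.loads(raw)
            text = entry.get("text", "")
            pointer = {"line": lineno, "date": entry.get("date"),
                       "preview": text[:80]}
            matched = False
            for name, pattern in (("temporal", TEMPORAL),
                                  ("preference", PREFERENCE),
                                  ("update", UPDATE)):
                if pattern.search(text):
                    index[name].append(pointer)
                    matched = True
            if not matched:  # everything else lands in the fact index
                index["fact"].append(pointer)
    with open(index_path, "w") as f:
        json.dump(index, f, indent=2)
    return index
```

A line can land in more than one index (a dated preference is both temporal and preference), which is why each category keeps its own pointer list rather than the journal being partitioned.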

v1: question arrives -> agent reads entire journal -> searches for answer
v2: question arrives -> agent reads INDEX.json -> jumps to indexed lines -> answers

The glossary tells the agent WHERE to look before it reads anything.
For temporal questions, it checks the temporal index first.
For preference questions, the preference index.
The journal is the source of truth. The index is the routing layer.
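The question-time side of that routing can be sketched too. The category detection here is a deliberately crude keyword stub (in practice the agent itself decides where to look); the index schema matches the hypothetical one above and is an assumption.

```python
import json
import linecache

def retrieve(question: str, index_path: str, journal_path: str) -> list[str]:
    """Read INDEX.json first, pick a category index, then jump straight
    to the indexed journal lines instead of scanning the whole file."""
    with open(index_path) as f:
        index = json.load(f)
    q = question.lower()
    if any(w in q for w in ("when", "how long", "before", "after")):
        category = "temporal"
    elif any(w in q for w in ("prefer", "favorite", "like")):
        category = "preference"
    else:
        category = "fact"
    # linecache fetches individual lines by number without loading the file
    return [linecache.getline(journal_path, p["line"]).strip()
            for p in index.get(category, [])]
```

The point of the sketch is the shape, not the stub: the index turns retrieval into a handful of targeted line reads.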

This is the same architectural pattern SIBYL uses in production. INDEX.json is the master map. Entity files are single sources of truth. The journal is append-only raw data. The agent reads the index first, then navigates to what it needs.

Temporal reasoning went from 75.2% to 96.2%. The overall score went from 86.7% to 95.6%. The architectural insight: giving a model a map of where to look is more effective than making it scan everything, and it costs nothing. No embedding model. No vector similarity search. A JSON file with line numbers.

Architecture

SIBYL's memory is a hierarchical tiered file system. It was built from operational necessity over 50+ days of continuous autonomous operation, not designed in a lab for a benchmark.

HOT  (read every session)    INDEX.json, session.json, priorities.json, treasury.json
WARM (read on demand)         Entity files: one JSON per project, person, or product
COLD (append-only)            Journal (JSONL), error logs, revenue logs
FROZEN (archive)              Closed items, old journals, passed evaluations
REFERENCE (static docs)       Evaluation framework, operational guidelines
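The COLD tier's append-only discipline is trivial to implement, which is part of the argument. A minimal sketch, with field names (`ts`, `event`) that are illustrative rather than SIBYL's actual schema:

```python
import json
from datetime import datetime, timezone

def journal_append(path: str, event: str, **fields) -> None:
    """Append one JSON object per line; existing lines are never rewritten,
    so line numbers stay stable for the index to point at."""
    entry = {"ts": datetime.now(timezone.utc).isoformat(),
             "event": event, **fields}
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

Stable line numbers are what make line-level pointers in INDEX.json safe: an append-only file never invalidates an existing pointer.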
Component        SIBYL                                            Typical vector system
Storage          JSON files + JSONL journal                       Vector DB (Pinecone, Chroma, etc.)
Indexing         INDEX.json (master catalog with line pointers)   Embedding model (ada-002, etc.)
Retrieval        Model reads files directly                       Cosine similarity search
Update           Edit the file. Instant.                          Re-embed chunks
Infrastructure   None. Filesystem only.                           Vector DB server + embedding API
Monthly cost     $0                                               $19-$249+
Portability      cp -r (any system, any LLM)                      Locked to embedding model
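The update row deserves a concrete illustration: a fact change is a file edit, with no re-embedding pass. A hypothetical helper, assuming entity files are flat JSON objects (the filename and schema are made up for the example):

```python
import json

def update_entity(path: str, key: str, value) -> None:
    """Rewrite one field of an entity file in place. The edit is
    immediately visible to the next read; nothing downstream to refresh."""
    with open(path) as f:
        entity = json.load(f)
    entity[key] = value
    with open(path, "w") as f:
        json.dump(entity, f, indent=2)
```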

What this does not prove

This benchmark measures answer accuracy on a specific dataset with a specific evaluation methodology. It is worth stating what it does not establish:

Test conditions

Dataset:        LongMemEval Oracle (ICLR 2025, University of Michigan)
Questions:      500 total, 6 categories
Models:         Claude Opus 4.6, Claude Sonnet
Hardware:       4 vCPU / 16GB RAM (AWS)

Ingest:         Glossary-indexed v2 (temporal, preference, fact, update line indexes)
Architecture:   INDEX.json routing layer + chronological JSONL journal + entity files

Scoring:        Programmatic v3 matcher (substring, number, off-by-one tolerance,
                abstention detection, phrase overlap) with manual review of all
                flagged incorrect answers. Preference questions judged by Claude
                using official LongMemEval rubric.

Data:           hypotheses-sonnet.jsonl, hypotheses-opus.jsonl (raw model answers)
                scores-sonnet.jsonl, scores-opus.jsonl (every judgment with method)
                Available on request for independent verification.
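The scoring description above can be made concrete with a toy matcher. This sketches three of the listed rules (substring match, numeric off-by-one tolerance, abstention detection); the actual v3 matcher, its abstention phrase list, and its phrase-overlap logic are not reproduced here.

```python
import re

# Illustrative abstention phrases; not the benchmark's official list.
ABSTAIN = ("i don't know", "i do not know", "not mentioned", "no information")

def judge(answer: str, expected: str) -> str:
    """Return 'correct', 'abstain', or 'flag_for_review' (manual check)."""
    a, e = answer.lower().strip(), expected.lower().strip()
    if any(p in a for p in ABSTAIN):
        return "abstain"
    if e in a:  # substring match
        return "correct"
    # Numeric tolerance: accept answers off by one (e.g. day counting).
    nums_a = [int(n) for n in re.findall(r"-?\d+", a)]
    nums_e = [int(n) for n in re.findall(r"-?\d+", e)]
    if nums_e and any(abs(x - y) <= 1 for x in nums_a for y in nums_e):
        return "correct"
    return "flag_for_review"
```

Anything that fails the programmatic rules is flagged rather than marked wrong, which matches the stated protocol of manually reviewing all flagged incorrect answers.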

Sources