
Our numbers, out in the open — and reproducible.

We publish how ULTRAMEMORY actually performs on recall, latency, and model quality — with the exact setup, data, and scripts to re-run it yourself. No cherry-picked wins, no marketing math.


RECALL p95: 188ms

Our measured p95, held to a maintained sub-200ms SLO. 2026-05 run.

RECALL QUALITY: 0.91
our LongMemEval-style eval (v1)
MODEL DEGRADATION: near-zero
with vs without our context

Our results on our system, our methodology — dated, and runnable below.

RESEARCH [1 / 7]

Why this page exists.

Memory vendors benchmark themselves, and the results don't agree. Competing vendors publish conflicting LOCOMO/LongMemEval-style numbers on the same public datasets, each from its own harness, so those figures are disputed vendor self-benchmarks rather than neutral ground truth. Independent, reproducible memory benchmarking barely exists, and that gap is the reason for this page. So we publish our numbers in the open, with a transparent, reproducible methodology, and show exactly how to check them.

OUR RESULTS · OUR SYSTEM

These are our results on our system, with our methodology stated — not an independent or neutral benchmark, and not a claim about anyone else's system. We don't publish competitor numbers we can't reproduce, and where a vendor figure (including the contested LOCOMO/LongMemEval claims) comes up, we frame it as a self-reported, disputed benchmark rather than an established result.

THE NUMBERS [2 / 7]

The proof, shown — not claimed.

Three things we measure: recall quality on a named public eval, p95 latency against our maintained sub-200ms SLO, and near-zero model degradation. Always p95, never just average.
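Why p95 and not the average: a small sketch with made-up latencies (not our measurements) shows how a mean can sit under a 200ms target while one request in ten is very slow. The nearest-rank percentile used here is an assumption for illustration; the harness may compute percentiles differently.

```python
import math
from statistics import mean

def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile (0 < pct <= 100) by the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]

# Illustrative latencies in ms (invented for this example):
# 90 fast requests and 10 slow ones.
latencies_ms = [90] * 90 + [900] * 10

avg = mean(latencies_ms)
p50 = percentile_nearest_rank(latencies_ms, 50)
p95 = percentile_nearest_rank(latencies_ms, 95)
print(f"mean={avg:.0f}ms p50={p50}ms p95={p95}ms")
# prints: mean=171ms p50=90ms p95=900ms
```

The mean (171ms) looks healthy and the median (90ms) looks great, but the p95 (900ms) exposes the slow tail. That is the reason every latency figure on this page is a percentile.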

RECALL p95: 188ms

Our measured p95 end-to-end recall, held to a maintained sub-200ms SLO — a target we measure and hold to, not a structural guarantee (the hot path includes a synchronous query-embed step). 2026-05 run.

RECALL QUALITY: 0.91
answer accuracy on our LongMemEval-style public eval (v1) · run config →
LATENCY p50: 96ms
median end-to-end recall, warm · 2026-05 run · run config →
LATENCY p95: 188ms
measured against our maintained sub-200ms p95 SLO; even the slow ones are fast · 2026-05 run · run config →
MODEL DEGRADATION: near-zero
downstream quality delta, with vs without our context; clean, token-budgeted context doesn't drag the model down · run config →

Our results on our system, with our methodology stated — dated, never implied as independent or neutral.

Bars are our measured results; the monospaced numbers are the exact figures from the linked runs. Each bar links to its dated run config.

METHODOLOGY [3 / 7]

How we measured it.

Dataset → harness → metric → environment. With so little independent benchmarking available, trust has to be earned, so every number on this page traces back to one of these steps.

01 · DATASET
Which public eval, and why
A LongMemEval-style public eval set (v1), chosen for long-horizon, multi-session recall that matches how agents actually use memory. We state any preprocessing and link the raw data so you can diff it.
02 · HARNESS
What we ran, how many trials
The open benchmark harness, warm and cold runs reported separately, fixed trial count, and a stated concurrency level — we test under multi-agent concurrency, the case our governance is built for.
03 · METRIC DEFINITIONS
Exactly how each number is computed
Recall is answer accuracy on the eval set; latency is p95 end-to-end recall (not just average); degradation is the downstream quality delta with vs without our context. No ambiguous "accuracy."
04 · ENVIRONMENT
Region, tier, models, date
Region, hardware/tier, and the exact model(s) used, date-stamped on every figure. Every number here is dated because the system changes, and an old number is not a current claim.
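To make step 02 concrete, here is a minimal sketch of a harness that reports cold and warm passes separately, with a fixed trial count and a stated concurrency level. It is not the open benchmark harness itself: `run_benchmark`, `fake_recall`, and the choice to treat the first pass as "cold" are all assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def timed_call(recall, query):
    """Run one recall and return its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    recall(query)
    return (time.perf_counter() - start) * 1000.0

def run_pass(recall, queries, concurrency):
    """Issue every query once, `concurrency` at a time; return per-query latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda q: timed_call(recall, q), queries))

def run_benchmark(recall, queries, trials=3, concurrency=4):
    """Report the first pass as cold and the next `trials` passes as warm."""
    cold = run_pass(recall, queries, concurrency)
    warm = []
    for _ in range(trials):
        warm.extend(run_pass(recall, queries, concurrency))
    return {"cold_ms": cold, "warm_ms": warm,
            "trials": trials, "concurrency": concurrency}

# Hypothetical stand-in for the system under test:
# the first lookup of each query is slow, later lookups are cached.
_cache = set()

def fake_recall(query):
    if query not in _cache:
        time.sleep(0.02)
        _cache.add(query)
    return f"memory for {query}"

result = run_benchmark(fake_recall, [f"q{i}" for i in range(8)])
print(len(result["cold_ms"]), "cold samples,", len(result["warm_ms"]), "warm samples")
```

Keeping the cold and warm samples in separate lists means neither can flatter the other, and recording `trials` and `concurrency` alongside the samples keeps the run conditions attached to the numbers.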

If a number here can't be reproduced from this section, treat it as a bug — tell us.
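The recall and degradation definitions in step 03 can be sketched as below. This is an illustration under stated assumptions, not the harness's code: the `Figure` record, the function names, the sign convention for the delta, and the toy inputs are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Figure:
    """One published number, always carrying the date of the run behind it."""
    name: str
    value: float
    run_date: str

def answer_accuracy(graded):
    """Fraction of eval questions answered correctly; `graded` is a list of booleans."""
    if not graded:
        raise ValueError("empty eval")
    return sum(graded) / len(graded)

def degradation_delta(quality_without, quality_with):
    """Downstream quality without our context minus quality with it.

    Assumed sign convention: positive means the added context hurt the model,
    and a value near zero means it did not drag the model down.
    """
    return quality_without - quality_with

# Toy inputs, invented for the example (placeholder date, not a real run).
graded = [True, True, False, True]
recall_quality = Figure("recall_quality", answer_accuracy(graded), "2000-01")
delta = degradation_delta(quality_without=0.80, quality_with=0.80)

print(recall_quality)
print("degradation delta:", delta)
```

Attaching `run_date` to the record itself, rather than to the page, reflects the rule that an old number is not a current claim.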

REPRODUCE IT [4 / 7]

One command, your own eyes.

Reproducibility is the claim — so we make it one command. Clone the public benchmark repo, set your API key, run it, and get a results table you can diff against ours.

# clone the public benchmark repo
git clone https://github.com/ultramemory/benchmarks
cd benchmarks

# point it at your own account
export ULTRAMEMORY_API_KEY=sk-...

# one command → a results table you can diff against ours
npx @ultramemory/bench run --eval longmemeval-v1 --report table

One command, your own machine, your own eyes. Public benchmark repo ↗ · Pinned results (JSON) ↗
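Once you have your own results, diffing them against the pinned ones could look like the sketch below. The flat `{metric: number}` layout, the metric names, and the 5% tolerance are assumptions; the real schema of the pinned JSON may differ. The two pinned values are the figures quoted on this page, and the "local" values are invented.

```python
import json

def diff_results(ours, yours, rel_tol=0.05):
    """Compare two {metric: number} dicts; return (metric, ours, yours, status) rows."""
    rows = []
    for metric in sorted(set(ours) | set(yours)):
        a, b = ours.get(metric), yours.get(metric)
        if a is None or b is None:
            status = "missing"
        elif abs(a - b) <= rel_tol * max(abs(a), abs(b)):
            status = "match"
        else:
            status = "differs"
        rows.append((metric, a, b, status))
    return rows

# Hypothetical file contents, inlined so the example is self-contained.
pinned = json.loads('{"recall_quality": 0.91, "latency_p95_ms": 188}')
local = json.loads('{"recall_quality": 0.90, "latency_p95_ms": 240}')

for row in diff_results(pinned, local):
    print(row)
# prints:
# ('latency_p95_ms', 188, 240, 'differs')
# ('recall_quality', 0.91, 0.9, 'match')
```

A relative tolerance is used because latency depends on your region and network, so an exact match is not the expected outcome; a "differs" row is a prompt to compare run configs, not proof of an error.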

HONEST LIMITS [5 / 7]

What we're still working on.

Credibility through candor. We'd rather show a real limit than a fake win.

Datasets we haven't covered yet
We start with one long-horizon recall eval. Broader public sets (and more domains) are next — and we'll add them dated, not back-dated.
Where our numbers are weakest
Cold-start latency before the cache warms, and very large single-query token budgets, are where we have the most room to improve. We show both p50 and p95 so the slow tail is visible.
What we won't claim
Not independent or neutral, not third-party-verified, and never a structural latency guarantee — the hot read path includes a synchronous query-embed step, so the number is measured and dated, not asserted.

QUOTED ELSEWHERE [6 / 7]

The single source of truth.

Other pages link here for any number they quote. Read the plain-language stories behind these metrics:

Instant recall → · Keeps your AI sharp → · Shared memory → · Compare → · For developers →

Transparency

Don't take our word for it — run it.

Our results on our system, our methodology, dated — and one command away from your own.

