⭐ reproducible · methodology open
Research · Benchmarks

Our numbers, out in the open — and reproducible.

We publish how ULTRAMEMORY actually performs on recall, latency, and model quality — with the exact setup, data, and scripts to re-run it yourself. No cherry-picked wins, no marketing math.

RECALL p95 188ms

Our measured p95, held to a maintained sub-200ms SLO. 2026-05 run.

  • RECALL QUALITY 0.91
    our LongMemEval-style eval (v1)
  • MODEL DEGRADATION near-zero
    with vs without our context

Our results on our system, our methodology — dated, and runnable below.

RESEARCH [1 / 7]

Why this page exists.

Memory vendors benchmark themselves, and the results don't agree. Competing vendors publish conflicting LOCOMO/LongMemEval-style numbers on the same public datasets, each running its own harness, so those figures are disputed vendor self-benchmarks rather than neutral ground truth. There is almost no independent, reproducible memory benchmarking, and that gap is what we are responding to. So we publish ours in the open, with a transparent, reproducible methodology, and show exactly how to check it.

OUR RESULTS · OUR SYSTEM

These are our results on our system, with our methodology stated — not an independent or neutral benchmark, and not a claim about anyone else's system. We don't publish competitor numbers we can't reproduce, and where a vendor figure (including the contested LOCOMO/LongMemEval claims) comes up, we frame it as a self-reported, disputed benchmark rather than an established result.

THE NUMBERS [2 / 7]

The proof, shown — not claimed.

Three things we measure: recall quality on a named public eval, p95 latency against our maintained sub-200ms SLO, and near-zero model degradation. Always p95, never just average.

RECALL p95 188ms

Our measured p95 end-to-end recall, held to a maintained sub-200ms SLO — a target we measure and hold to, not a structural guarantee (the hot path includes a synchronous query-embed step). 2026-05 run.

  • RECALL QUALITY 0.91
    answer accuracy on our LongMemEval-style public eval (v1) · run config →
  • LATENCY p50 96ms
    median end-to-end recall, warm · 2026-05 run · run config →
  • LATENCY p95 188ms
    measured against our maintained sub-200ms p95 SLO — even the slow ones are fast · 2026-05 run · run config →
  • MODEL DEGRADATION near-zero
    downstream quality delta, with vs without our context — clean, token-budgeted context doesn't drag the model down · run config →

Our results on our system, with our methodology stated — dated, never implied as independent or neutral.

Bars are our measured results; numbers in DM Mono are the exact figures from the linked runs. Each bar links to its dated run config.

METHODOLOGY [3 / 7]

How we measured it.

Dataset → harness → metric → environment. Every number on this page traces back to a step here.

  1. DATASET

    Which public eval, and why

    A LongMemEval-style public eval set (v1), chosen for long-horizon, multi-session recall that matches how agents actually use memory. We state any preprocessing and link the raw data so you can diff it.

  2. HARNESS

    What we ran, how many trials

    We run the open benchmark harness with a fixed trial count and a stated concurrency level, and report warm and cold runs separately. We test under multi-agent concurrency, the case our governance is built for.

  3. METRIC DEFINITIONS

    Exactly how each number is computed

    Recall is answer accuracy on the eval set; latency is p95 end-to-end recall (not just average); degradation is the downstream quality delta with vs without our context. No ambiguous "accuracy."

  4. ENVIRONMENT

    Region, tier, models, date

    We record the region, hardware/tier, and exact model(s) used, and date-stamp every figure. Every number here is dated because the system changes, and an old number is not a current claim.

If a number here can't be reproduced from this section, treat it as a bug — tell us.
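
As a rough illustration of the metric definitions above, here is a minimal Python sketch. It is not the harness code: the function names, the nearest-rank percentile method, and the sign convention for the quality delta (with context minus without) are assumptions for illustration, and the sample values are made up rather than taken from our runs.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def answer_accuracy(graded):
    """Fraction of eval questions graded correct (graded is a list of booleans)."""
    if not graded:
        raise ValueError("need at least one graded answer")
    return sum(graded) / len(graded)


def quality_delta(score_without_context, score_with_context):
    """Downstream quality delta (assumed convention: negative means the context hurt the model)."""
    return score_with_context - score_without_context


# Illustrative latencies in milliseconds (invented, not our published data).
latencies_ms = [88, 91, 94, 96, 97, 102, 110, 125, 150, 190]
print("p50:", percentile(latencies_ms, 50), "ms")
print("p95:", percentile(latencies_ms, 95), "ms")
print("accuracy:", answer_accuracy([True, True, False, True]))
```

Reporting p50 and p95 from the same sorted sample keeps the slow tail visible next to the median, which an average alone would hide.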

REPRODUCE IT [4 / 7]

One command, your own eyes.

Reproducibility is the claim — so we make it one command. Clone the public benchmark repo, set your API key, run it, and get a results table you can diff against ours.

# clone the public benchmark repo
git clone https://github.com/ultramemory/benchmarks
cd benchmarks

# point it at your own account
export ULTRAMEMORY_API_KEY=sk-...

# one command → a results table you can diff against ours
npx @ultramemory/bench run --eval longmemeval-v1 --report table

One command, your own machine, your own eyes. Public benchmark repo ↗ · Pinned results (JSON) ↗
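
Once you have a results table, one way to diff it against the pinned results is a tolerance check like the sketch below. The JSON key names and the 10% relative tolerance are hypothetical (the pinned file's actual schema may differ). The pinned values are the figures quoted on this page, and the local values are invented to show a mismatch.

```python
import json


def compare_results(pinned, measured, rel_tol=0.10):
    """Return metrics that are missing or differ from the pinned value by more than rel_tol."""
    diffs = {}
    for name, pinned_value in pinned.items():
        if name not in measured:
            diffs[name] = (pinned_value, None)  # metric absent from the local run
            continue
        local_value = measured[name]
        if abs(local_value - pinned_value) > rel_tol * abs(pinned_value):
            diffs[name] = (pinned_value, local_value)
    return diffs


# Hypothetical key names; pinned figures are the ones quoted on this page.
pinned = json.loads('{"recall_quality": 0.91, "latency_p50_ms": 96, "latency_p95_ms": 188}')
# Invented local run, for illustration only.
local = json.loads('{"recall_quality": 0.90, "latency_p50_ms": 101, "latency_p95_ms": 240}')

print(compare_results(pinned, local))  # {'latency_p95_ms': (188, 240)}
```

A relative tolerance is used because latency depends on your region and network, so small differences from the pinned run are expected.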

HONEST LIMITS [5 / 7]

What we're still working on.

Credibility through candor. We'd rather show a real limit than a fake win.

  • Datasets we haven't covered yet

    We start with one long-horizon recall eval. Broader public sets (and more domains) are next — and we'll add them dated, not back-dated.

  • Where our numbers are weakest

    Cold-start latency before the cache warms, and very large single-query token budgets, are where we have the most room to improve. We show both p50 and p95 so the slow tail is visible.

  • What we won't claim

    Not independent or neutral, not third-party-verified, and never a structural latency guarantee — the hot read path includes a synchronous query-embed step, so the number is measured and dated, not asserted.
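
To make the cold-versus-warm distinction above concrete, here is a small Python sketch that times calls and reports the first (cold) call separately from the rest. Treating only the first call as cold is a simplifying assumption, and `recall` is a stand-in for whatever client call you are measuring, not a real ULTRAMEMORY API.

```python
import time


def time_calls(recall, queries, clock=time.perf_counter):
    """Time each call in milliseconds; the first call is labelled cold, the rest warm."""
    cold, warm = [], []
    for index, query in enumerate(queries):
        start = clock()
        recall(query)  # hypothetical stand-in for the call under test
        elapsed_ms = (clock() - start) * 1000
        (cold if index == 0 else warm).append(elapsed_ms)
    return {"cold": cold, "warm": warm}


# Stub usage: a no-op in place of a real recall call.
result = time_calls(lambda query: None, ["q1", "q2", "q3"])
print(len(result["cold"]), "cold sample,", len(result["warm"]), "warm samples")
```

Keeping the two groups apart means a slow first request cannot be averaged away by many fast warm ones.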

QUOTED ELSEWHERE [6 / 7]

The single source of truth.

Other pages link here for any number they quote. Read the plain-language stories behind these metrics:

Instant recall → · Keeps your AI sharp → · Shared memory → · Compare → · For developers →

Transparency

Don't take our word for it — run it.

Our results on our system, our methodology, dated — and one command away from your own.