
Best-of-N on agent-memory queries: the regression check most people skip

"Run the composer three times and take majority" is supposed to be a free accuracy boost. On our LongMemEval-S run, the failure-side gain was +7 points and the win-side loss was about the same. Net zero, at 3× the cost. The regression-side measurement is the work most people skip, and it is the thing that turns "+9 points free" into a wash.

Published April 21, 2026 · By Jacob Davis and Ben Meyerson

This post is the long version of a footnote in our LongMemEval-S writeup. We mentioned that majority voting across three composer samples did not move the score the way the framing suggested it should. Several people wrote in asking how we measured that. This is the answer, with the raw counts, the confidence interval, and the methodology we think anyone doing best-of-N work on agent-memory benchmarks should follow.

The headline is simple. If you only measure the upside of best-of-N, you almost always find some. If you also measure the downside, wins that flip to losses when you re-sample, you frequently find that the two cancel. Reporting one without the other is how "free 9 points" turns into "we spent 3× per query and the score did not move."

The appeal of best-of-N

Best-of-N is a tough thing to argue against on the surface. Composer outputs are stochastic — temperature isn't zero, the model picks slightly different phrasings and reasoning paths and occasionally a different numeric answer on the same input — and so if you sample N times and aggregate the results (majority vote, judge picks the best, heuristic over the candidates), you have strictly more information to work with than a single draw gives you. More information producing a better answer is the kind of thing nobody wants to bet against.

The arithmetic backs it up at first glance. Take a question the model gets right about 70% of the time. Three independent samples push the right-on-majority probability close to 0.78. Across a five-hundred-task benchmark, that framing implies a lift of roughly eight accuracy points, call it forty tasks, free.
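The arithmetic in code form, for anyone who wants to poke at it. The independence assumption is doing the heavy lifting; real draws share the same prompt and the same failure modes, which is where the rest of this post goes.

```python
# Minimal sketch of the best-of-N framing: probability that a majority
# of n independent samples is correct, given per-draw accuracy p.
# Independence is the idealization the argument below pushes back on.
from math import comb

def p_majority_correct(p: float, n: int = 3) -> float:
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

print(p_majority_correct(0.70))                 # ~0.784
print(500 * (p_majority_correct(0.70) - 0.70))  # ~42 tasks, ~8 points on 500
```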

Recent agent papers lean on this pretty hard. The standard move over the last year has been to take a baseline system, identify its failures, re-sample N times, and report the fraction of those failures that recover. The recovery rate becomes the headline. "Best-of-3 lifts accuracy 9 points." "Self-consistency adds 12." None of those claims is technically wrong — the failure-side measurement is usually real — but they're systematically incomplete in a way that matters once you start actually paying for the 3× compute they imply.

The flaw in the framing

The framing reports best-of-N gains on the failure side only. You take the tasks that previously failed, re-sample, and count how many now pass. That gives you a positive number. The positive number gets put on a slide.

What is almost never reported is the regression-side measurement: of the tasks that previously passed, how many flip to failing when you re-sample and aggregate? This is the same composer with the same temperature, but a fresh draw is a fresh draw. If the original answer happened to land on the right side of a coin flip, a fresh three-sample majority may not. The win-side flip rate is rarely zero. Sometimes it is comparable to the failure-side recovery rate. Occasionally it is higher.

The asymmetry is the trap. Failure-side analysis only measures the direction you want the result to go. The framing implies that more samples cannot hurt; you are voting, after all, not picking one out of a hat. But voting only protects you from outliers when the underlying distribution is biased toward the right answer. On hard tasks where the model is near the decision boundary, the majority of three draws can be the wrong answer just as easily as it can be the right one. Wins on the boundary are exactly as vulnerable to re-sampling as losses are recoverable.

If the win-side flip rate is comparable to the failure-side recovery rate, the expected accuracy change is roughly zero. You have spent three times the per-query compute to move points around between tasks, not to gain any net.

Our numbers

We ran LongMemEval-S end to end with our v44 composer prompt and a canonical user-profile pass. The full 500-task run scored 458 pass, 42 fail. We wrote up the system itself in the main benchmark post; the relevant fact for this post is the pass/fail split.

The natural question after looking at 42 failures: can we sample the composer multiple times and recover a meaningful chunk of those? And the natural follow-up, if you are honest about methodology: at what cost to the wins?

The failure-side analysis

We re-ran the composer N=3 times on each of the 42 failures. Same composer, same temperature, same retrieved memories, same profile. The only thing that varied was the random seed. Each of the three samples was scored independently by the GPT-4o judge against the gold answer.

Result across 3 samples | Count | What it means under N=3 majority
Pass on 2 or 3 of 3 | 7 | Captured (majority vote flips them to pass)
Pass on exactly 1 of 3 | 7 | Not captured by majority vote; needs a critic-as-selector
Pass on 0 of 3 | 28 | Sampling does not help

So under a straight majority-of-three rule, the upside is +7 tasks. That is a 1.4 percentage point lift on the 500-task denominator. A larger ensemble or a smarter aggregator might capture some of the 7 "lucky-flip" cases (the ones that pass on exactly 1 of 3), but a vote is a vote: 1-of-3 loses.
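For concreteness, the bucketing rule as code. The input here is hypothetical (we only report the 2-or-3 bucket as a whole, so the 2-of-3 vs 3-of-3 split below is made up for illustration) and `bucket_failures` is not our harness, just the rule the table applies.

```python
# Hypothetical sketch of the majority-of-3 bucketing on the 42 original
# failures. Each entry in passes_per_task is how many of a task's three
# fresh samples the judge passed.
from collections import Counter

def bucket_failures(passes_per_task: list[int]) -> Counter:
    buckets = Counter()
    for k in passes_per_task:
        if k >= 2:
            buckets["captured by majority vote"] += 1
        elif k == 1:
            buckets["needs a selector; 1-of-3 loses the vote"] += 1
        else:
            buckets["sampling does not help"] += 1
    return buckets

# Illustrative input matching the table's totals (the 2-vs-3 split is invented):
print(bucket_failures([2] * 7 + [1] * 7 + [0] * 28))
```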

If we stopped the analysis here and put +7 (or, framed more aggressively, "up to +14 if we add a selector") on a slide, that would match the failure-side-only methodology you see in a lot of recent agent work. The number is real. The number is also misleading.

The regression-side analysis (the work most people skip)

The regression-side measurement is the work most people skip. It is also the work that determines whether the failure-side number means anything.

We turned the same procedure around on the wins. The 500-task run had 458 passes. Re-running the composer on 458 tasks costs real money, so we sampled. We drew a stratified random sample across the six LongMemEval categories, targeting 200 wins. We hit 191; the single-session-preference category only had 8 untested wins available at the time, which capped the stratum and pulled the total down. The other five strata hit their targets.
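A sketch of the stratified draw, under the assumption that each win is tagged with its LongMemEval category; the helper and its inputs are illustrative, not our tooling.

```python
# Illustrative stratified sample over the wins. A stratum smaller than
# its per-category target (single-session-preference had only 8 eligible
# wins) caps the draw, which is how a 200-win target lands at 191.
import random
from collections import defaultdict

def stratified_sample(wins: list[tuple[str, str]], per_category: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    by_cat: dict[str, list[str]] = defaultdict(list)
    for task_id, category in wins:
        by_cat[category].append(task_id)
    sample: list[str] = []
    for task_ids in by_cat.values():
        sample.extend(rng.sample(task_ids, min(per_category, len(task_ids))))
    return sample
```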

On each of the 191 sampled wins, we ran the composer N=3 times with fresh seeds, graded each sample with the same GPT-4o judge against the same gold, and took the majority across the three grades. The original passes had been graded once each (single-grade methodology, the same protocol everyone uses on LongMemEval); the re-runs were graded once per sample and then aggregated.

Outcome under N=3 majority | Count | Rate
Still pass | 188 | 98.4%
Regressed to fail | 3 | 1.6%

Point estimate: 1.6% regression rate. With 3 of 191, the Wilson 95% confidence interval is approximately [0.5%, 4.5%]. The upper bound is what matters for an accuracy decision. In the worst plausible world consistent with the sample, you flip roughly one in twenty wins each time you re-sample.
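The interval is reproducible from the counts alone; here is a hand-rolled Wilson computation, so nothing depends on our tooling.

```python
# Wilson 95% interval for 3 regressions out of 191 re-sampled wins.
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(3, 191)
print(f"[{lo:.1%}, {hi:.1%}]")  # roughly [0.5%, 4.5%]
```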

Multiplying back to the population

The full run had 458 wins. At the point estimate of 1.6%, expected regressions are 458 × 0.016 = 7.3 tasks. At the Wilson upper bound of 4.5%, expected regressions are 458 × 0.045 ≈ 20.6 tasks.

Expected change under N=3 majority | Tasks
Failure-side recovery (measured directly) | +7
Win-side regression (point estimate) | −7.3
Net at point estimate | ≈ 0
Net at Wilson upper bound on regression (4.5%) | ≈ −14
Net at Wilson lower bound on regression (0.5%) | ≈ +4.5
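The table is three lines of arithmetic; here it is as code so the scenarios are easy to re-run with different bounds.

```python
# Expected net change under N=3 majority: direct failure-side recovery
# minus the win-side regression implied by scaling the measured flip
# rate (and its Wilson bounds) to the 458-win population.
WINS, RECOVERED = 458, 7

def expected_net(regression_rate: float) -> float:
    return RECOVERED - WINS * regression_rate

print(expected_net(3 / 191))  # point estimate: ~ -0.2, a wash
print(expected_net(0.045))    # Wilson upper bound: ~ -13.6
print(expected_net(0.0054))   # Wilson lower bound (~0.5%): ~ +4.5
```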

The central estimate is wash. The pessimistic end of the confidence interval has us losing points by sampling more. The optimistic end has us gaining roughly one percentage point, at 3× cost. There is no version of this where the headline-friendly framing ("+7 from re-sampling") survives contact with the regression measurement.

The point that bears repeating: the +7 on the failure side is real. We measured it. If we shipped majority-of-three and only re-graded the original failures, our internal dashboard would say accuracy is up. The win-side measurement is what catches the offsetting loss. Without it, the dashboard is a mirror.

The cost side

Best-of-3 costs 3× per query in composer compute and 3× in judge calls if you grade per sample. Token-wise it is the same composer prompt and the same retrieved memories on each draw, so prompt caching reduces the marginal cost relative to "three completely independent calls," but the output side still triples and the output side is where most of the cost lives on long-form composer responses.
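To make "less than 3×, but close" concrete, a toy cost model; the output-cost share and the cache discount below are assumptions for illustration, not measured numbers from our run.

```python
# Illustrative composer-side cost multiplier for best-of-n with a cached
# prompt: output tokens are paid n times, input tokens once at full price
# plus (n - 1) times at the cache-hit price.
def best_of_n_cost_multiplier(n: int, output_cost_share: float, cache_hit_price_fraction: float) -> float:
    input_share = 1.0 - output_cost_share
    return n * output_cost_share + input_share * (1 + (n - 1) * cache_hit_price_fraction)

# Assumed: output is 80% of a single call's cost, cached input costs 10% of full price.
print(best_of_n_cost_multiplier(3, 0.8, 0.1))  # ~2.64x
```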

So the question on the table is not "should we do best-of-3?" It is "should we pay 3× per query for an expected accuracy change of approximately zero, with downside risk up to negative double-digit task counts?" The answer is no. We did not ship majority-of-three as a default. It is not even on the roadmap as a default. The data is in benchmarks/results/20260420/ if you want to verify.

Where best-of-N could still be worth it

"Majority-of-three as a default is not worth shipping" is not the same as "best-of-N is never useful." The variant we still think is worth investigating is a critic-as-selector.

The 7 tasks that passed on exactly 1 of 3 are the interesting ones. A judge that reads all three candidates and picks the one whose answer is best-supported by the cited memories might capture them, provided that same judge does not also pick the wrong candidate on tasks where the majority would have been right. We have not built this. The regression-side measurement is going to be the gating question there too: a selector that recovers 7 wins and creates 7 false picks is the same trap with extra steps, and the literature on LLM-as-judge selectors is full of variants that look great on the failure side and have never been measured on the win side. We think a well-calibrated selector can beat majority vote here; we do not yet have the evidence.
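For what it's worth, the shape of the thing we have not built. `judge_pick` is a hypothetical LLM call, and the regression-side question applies to it exactly as it does to the vote.

```python
# Hypothetical critic-as-selector: instead of voting, ask a judge which
# candidate is best supported by the cited memories. Not built, not
# measured; the selector's own win-side flip rate is the open question.
from typing import Callable

def select_candidate(candidates: list[str], memories: list[str],
                     judge_pick: Callable[[list[str], list[str]], int]) -> str:
    if len(set(candidates)) == 1:
        return candidates[0]  # unanimous draws need no arbitration
    return candidates[judge_pick(candidates, memories)]
```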

Two narrower cases are also worth flagging. In accuracy-critical workloads (medical, legal, material financial decisions) where the cost of one wrong answer dominates 3× compute, a 1.6% regression rate may still net positive once you outcome-weight the tasks. And if your sampling model is much cheaper than your primary composer (draft with a small model, arbitrate with a large one), the cost multiplier collapses toward 1.2–1.5× and the threshold for "worth shipping" drops. Neither changes the methodology; they change the cost side or the loss-weighting.

What an honest best-of-N analysis needs

The short version:

- Always re-run the win side. Failure-side recovery is only half the ledger, and without the other half the net is unknown.
- Stratify by category if your benchmark has them. Flip rates are not uniform across question types, and a non-stratified sample over-weights the easy strata.
- Put a Wilson or Clopper-Pearson interval on the regression rate and report the upper bound. With small counts like 3 of 191, the binomial uncertainty is wide enough to swing the decision.
- Grade both sides with the same judge protocol, or you are measuring best-of-N plus a judge change.
- State the cost multiplier explicitly. "Best-of-N lifts accuracy" without "for X cost" is one of the most misleading framings in this corner of the literature.

None of this is novel statistics. The reason the win-side measurement is the work most people skip is that it is expensive and the result is usually the unfun one. Most experiments do not survive a clean regression check. The ones that do are worth shipping.

Honest caveats on our own numbers

Two caveats worth stating, and they pull on our conclusion in opposite directions.

The judge itself has variance. We re-graded our full 500-task run three additional times to bound judge noise in the main writeup; those grades came in at 448, 449, and 450 against the original 458. That implies a single-grade judge standard deviation of roughly 4 tasks on the 500 denominator, or about ±0.8 percentage points. Our measured regression rate is 3 out of 191, or 1.6 percentage points. That is on the same scale as the judge noise. Some non-trivial fraction of our 3 "regressions" are probably not real composer variance at all; they are the judge giving a different grade to a similar answer. We did not separate those out, partly because doing so cleanly would require multiple grades on each candidate, and at that point the cost of the regression check itself starts to dominate.
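The judge-noise figure is reproducible from the four totals; a quick check using the population standard deviation, which is presumably what the rounded number above reflects.

```python
# Spread of the four single-grade totals on the 500-task run.
from statistics import pstdev

grades = [458, 448, 449, 450]
sd = pstdev(grades)
print(sd)              # ~4.0 tasks
print(100 * sd / 500)  # ~0.8 percentage points
```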

The sample is small. 191 wins is enough to put a useful upper bound on the regression rate, but it is not enough to distinguish "1% regression" from "2% regression" with confidence. If we doubled the sample, the point estimate could move and the Wilson upper bound would tighten substantially. We did not do that doubling because the policy decision was already clear at the sample size we had: even at the optimistic end of the interval, best-of-3 majority is not a slam dunk at 3× cost, and at the pessimistic end it is negative. Tightening the estimate would change our confidence in the conclusion, not the conclusion itself.

The combined effect of the two caveats is that the true central estimate of the win-side flip rate is probably a bit lower than 1.6% (because some of the measured flips are judge noise) but the right-tail uncertainty is also wider than the binomial interval alone suggests (because we have judge noise stacked on top of composer variance). The qualitative conclusion is robust to both: the expected net under N=3 majority is somewhere in a band that includes zero, with meaningful probability of being negative.

What to take from this

Best-of-N is a hypothesis, not a free win. When you see a paper or a blog post quote a clean accuracy lift from re-sampling, the first thing worth checking is whether they ran the regression-side measurement at all. If they didn't, the headline number is the upside of an asymmetric trade with the downside unmeasured, and that's not a result — it's half of one.

The work that turns "+9 points free" into "net zero, costs 3×" isn't glamorous. It's sampling a few hundred wins, re-running the composer on each, paying the inference bill twice, and computing a Wilson interval at the end. The slide it produces is one nobody volunteers to put up. It's also the slide that separates a methodology you can ship from one that ships a regression with it.

None of this is an argument against best-of-N as a research direction. The argument is narrower: the literature reports it inconsistently, the inconsistency runs in a predictable direction, and the fix is one extra experiment that almost nobody runs. Run the win side. Then run it again on the next claim like this you see.
