How LatticeZero evaluates absolute and relative binding predictions across held-out and independent panels.
A transparent, plain-English summary of what we tested, what passed, and exactly which results are production, which are product-candidates, and which are beta — with the caveats stated up front.
We validate affinity predictions the way a careful lab would: on held-out and independent panels, with metrics reported in full — including where the model is weaker.
Absolute affinity: predicting a single binding free energy per complex.
Relative affinity: ranking how a change to a molecule shifts binding within a series.
Independent panel: fresh, externally sourced public complexes the model was not tuned or validated against.
The live production path is deployed and reproducible in-browser; candidate and beta modes are labeled and are not the default.
Production: the shipped default. Live and reproducible in-browser.
Candidate: evaluated as evidence. Not the production default; not promoted.
Beta: under evaluation on broader panels. Separate from the promoted product mode.
A single validated production scorer is the live default for absolute affinity — the path used for in-app scoring and for full-feature scoring of user-prepared complexes.
Strong agreement with experiment on a held-out blind panel. Reported as evidence only.
| Metric | Value |
|---|---|
| Rank agreement (Spearman ρ) | 0.9004 |
| Mean absolute error | 0.85 kcal/mol |
| Cases off by ≥ 2 kcal/mol | 2 |
Status: candidate-only. Shown as evidence; it is not the production default and not a shipped promise.
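For readers who want to reproduce this kind of summary on their own data, the three metrics in the table above can be sketched as follows. This is a minimal illustration in plain Python, not LatticeZero's evaluation code: the function names, the 2 kcal/mol default threshold argument, and the sample values are all assumptions made for the example.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(predicted, measured):
    """Spearman rho: the Pearson correlation of the two rank vectors.

    Assumes each input contains at least two distinct values.
    """
    rp, rm = average_ranks(predicted), average_ranks(measured)
    mp, mm = mean(rp), mean(rm)
    cov = sum((a - mp) * (b - mm) for a, b in zip(rp, rm))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_m = sum((b - mm) ** 2 for b in rm)
    return cov / (var_p * var_m) ** 0.5


def panel_metrics(predicted, measured, threshold=2.0):
    """Summarize a panel: rank agreement, MAE, and count of large misses."""
    abs_errors = [abs(p - m) for p, m in zip(predicted, measured)]
    return {
        "spearman_rho": spearman_rho(predicted, measured),
        "mae": mean(abs_errors),
        "n_off_by_threshold": sum(e >= threshold for e in abs_errors),
    }


# Made-up affinities in kcal/mol, for illustration only.
predicted = [-9.1, -7.4, -8.0, -10.2, -6.0]
measured = [-9.5, -7.0, -8.6, -9.9, -8.3]
metrics = panel_metrics(predicted, measured)
```

Rank agreement and absolute error answer different questions: in the toy data the ordering is nearly right even though one complex misses by more than 2 kcal/mol, which is why the table reports both alongside the count of large misses.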
20 public complexes with measured affinities, sourced so they were not part of any prior LatticeZero validation set, scored through the production full-feature path with no model adjustment of any kind.
| Independent challenge panel | n | Spearman ρ | MAE (kcal/mol) | RMSE (kcal/mol) |
|---|---|---|---|---|
| Combined panel | 20 | 0.77 | 2.11 | 2.86 |
| Combined, excluding one documented outlier | 19 | 0.80 | 1.74 | 2.05 |
| Held-out public-database subset | 15 | 0.75–0.79 | 1.6–2.1 | — |
| Fully-independent external subset | 5 | 0.90 | 2.11 | 2.35 |
One carbohydrate-rich case exposed a known chemistry limitation; it is reported both included and excluded rather than hidden. On this independent panel the model ranks affinities well (ρ ≈ 0.8), with a mean absolute error of about 2 kcal/mol. This is a ranking and sanity-check result, not a universal calibrated-ΔG guarantee.
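Reporting a panel both with and without a documented outlier can be sketched as below. This is an illustration under stated assumptions, not the production pipeline: the `error_summary` helper, the record layout, the `glycan_case` identifier, and all numbers are hypothetical.

```python
from math import sqrt


def error_summary(cases, exclude=()):
    """Return n, MAE, and RMSE over cases, skipping any IDs listed in exclude.

    Assumes at least one case remains after exclusion.
    """
    kept = [c for c in cases if c["id"] not in exclude]
    errors = [c["predicted"] - c["measured"] for c in kept]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = sqrt(sum(e * e for e in errors) / n)
    return {"n": n, "mae": mae, "rmse": rmse}


# Hypothetical panel (kcal/mol); "glycan_case" stands in for a documented outlier.
panel = [
    {"id": "a", "predicted": -8.0, "measured": -9.0},
    {"id": "b", "predicted": -7.0, "measured": -6.0},
    {"id": "c", "predicted": -10.0, "measured": -9.0},
    {"id": "glycan_case", "predicted": -5.0, "measured": -10.0},
]

with_outlier = error_summary(panel)
without_outlier = error_summary(panel, exclude={"glycan_case"})
```

Because RMSE squares each error before averaging, a single large miss moves it more than it moves MAE. That is consistent with the table above, where excluding one case lowers RMSE by more than it lowers MAE.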
Customers can submit their own quality-controlled, prepared complexes and receive a score through the same production full-feature scoring path used in these benchmarks. The workflow is proprietary; what matters publicly is that the production path — not a candidate or beta mode — produces these scores.
Relative affinity asks a different question: within a molecular series, does the model rank the effect of each change correctly?
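One way to make that question concrete is to compare compounds pairwise within a series. The sketch below is illustrative only: pairwise direction agreement is not the metric reported on this page (the tables use Spearman ρ), and the function names, compound names, and values are assumptions for the example.

```python
from itertools import combinations


def relative_to_reference(values, reference):
    """Express each compound's value as a difference from a reference compound."""
    return {name: v - values[reference] for name, v in values.items()}


def pairwise_direction_agreement(predicted, measured):
    """Fraction of compound pairs whose predicted ordering matches experiment.

    Pairs tied in either set are skipped. Assumes at least one untied pair.
    """
    agree = total = 0
    for a, b in combinations(sorted(predicted), 2):
        dp = predicted[a] - predicted[b]
        dm = measured[a] - measured[b]
        if dp == 0 or dm == 0:
            continue
        total += 1
        agree += (dp > 0) == (dm > 0)
    return agree / total


# Made-up series (kcal/mol); the predictions carry a constant offset of
# roughly 1 kcal/mol relative to the measurements.
predicted = {"parent": -8.0, "methyl": -8.5, "chloro": -9.5, "fluoro": -8.25}
measured = {"parent": -7.0, "methyl": -7.75, "chloro": -8.0, "fluoro": -8.5}

agreement = pairwise_direction_agreement(predicted, measured)

# Re-expressing both sets relative to the parent compound leaves the result unchanged.
agreement_relative = pairwise_direction_agreement(
    relative_to_reference(predicted, "parent"),
    relative_to_reference(measured, "parent"),
)
```

The second call shows why relative ranking is a separate question from absolute accuracy: a constant offset across the series changes every absolute error but has no effect on which compound is predicted to bind more tightly.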
The shipped production default for relative ranking, evaluated on blind lead-optimization targets.
| Metric | Value |
|---|---|
| Rank agreement (Spearman ρ) | 0.8144 |
A separate beta mode under evaluation on broader holdout panels. Not the promoted product mode.
| Panel | Spearman ρ |
|---|---|
| Lead-optimization holdout — holdout_27 (exp_schrodinger_holdout) | 0.8091 |
| Broad holdout — global_69 | 0.7334 |
holdout_27 is not a PDBbind-relative panel — it is a separate lead-optimization holdout. The beta mode is a distinct evaluation track from the promoted product mode and should not be read as the shipped number.
Absolute affinity: held-out blind panel + independent 20-complex challenge.
Relative affinity: promoted product mode + beta broad-holdout mode.
Validation summaries · June 2026 · Production scoring is live and reproducible in-browser on the benchmark pages.