Spatial Detection Analysis¶
This page documents the investigation into the combined detection capability of the library's two spatial analysis systems, the false positive characterisation, and the design decisions that resulted from it.
The two detection systems¶
Spatial pattern classifier (classifyPattern, called by analyzeWaferMap
when enablePatternClassification is true) uses pure geometry — connected
components, radial distance distributions, eccentricity, and linear scores —
to label the whole-wafer failure signature as edge-ring, center, scratch, etc.
It is rule-based with no trained model.
Statistical regional analysis (ring, quadrant, sector, cluster, edge-arc findings) compares specific zones of the wafer against the rest using significance tests. It runs independently of the classifier and produces its own findings.
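As a rough illustration of a zone-versus-rest significance test, here is a minimal sketch of a pooled two-proportion z-test, the test named in the known-limitation discussion further down. The function names, the two-sided alternative, and the `erf` approximation are assumptions made for this example and are not taken from the library's source.

```javascript
// Abramowitz-Stegun 7.1.26 approximation of erf (absolute error around 1.5e-7).
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function.
function normalCdf(z) {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Hypothetical helper: compares the failure rate inside a zone against the
// failure rate of the rest of the wafer, using the pooled standard error.
function twoProportionZTest(zoneFails, zoneTotal, restFails, restTotal) {
  const zoneRate = zoneFails / zoneTotal;
  const restRate = restFails / restTotal;
  const pooled = (zoneFails + restFails) / (zoneTotal + restTotal);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / zoneTotal + 1 / restTotal));
  if (se === 0) return { z: 0, pValue: 1 }; // all dies pass or all dies fail
  const z = (zoneRate - restRate) / se;
  const pValue = 2 * (1 - normalCdf(Math.abs(z))); // two-sided (assumption)
  return { z, pValue };
}

// A zone failing at 30% against a 10% background on the rest of the wafer.
const zoneResult = twoProportionZTest(30, 100, 90, 900);
console.log(zoneResult.z.toFixed(2), zoneResult.pValue < 0.001);
```

Because the standard error shrinks as die counts grow, the same rate difference yields a larger z on a larger wafer.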
Both systems run together when you call analyzeWaferMap with
enablePatternClassification set to true. When the classifier identifies a
pattern with high or medium confidence, correlated regional findings (e.g. the
ring finding that supports an edge-ring classification) are downgraded to info
severity to avoid visual redundancy: the classifier finding is the headline,
and the regional findings are supporting evidence.
Benchmark dataset¶
All benchmarks were run against WM-811K — 25,519 labelled wafers from TSMC 300mm fabrication (Wu et al., IEEE Transactions on Semiconductor Manufacturing, 2015). This is the standard public dataset for wafer map pattern classification.
Scripts are in scripts/:
| Script | Purpose |
|---|---|
| run-benchmark-npz.mjs | Classifier-only per-class recall/precision |
| run-benchmark-combined.mjs | Combined classifier + regional detection rates |
| run-benchmark-fp.mjs | FP rate on WM-811K "Random"/"none" wafers |
| run-benchmark-synthetic-fp.mjs | FP rate on synthetic i.i.d. random wafers |
Dataset source: /home/paul/projects/LSWMD.pkl → converted to
tests/fixtures/wm811k-benchmark.npz via scripts/prepare-benchmark.py.
Detection rates¶
Classifier alone¶
| Pattern | Recall | Notes |
|---|---|---|
| Near-full | 100% | |
| Edge-ring | 74% | |
| Edge-local | 65% | |
| Center | 60% | |
| Random | 59% | |
| Scratch | 26% | Fragmented patterns miss |
| Donut | 15% | Geometric overlap with center |
Overall exact-match: 64% · Detection rate (any pattern flagged): 86.2%
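The three metrics reported above differ only in what counts as a hit. The sketch below shows one way they could be computed from label/prediction pairs. The record shape (`label`, `predicted`, with `null` meaning no pattern flagged) is an assumption for illustration and is not the benchmark scripts' actual format.

```javascript
// Hypothetical scoring helper. `predicted === null` means the classifier
// flagged no pattern at all for that wafer.
function scoreClassifier(records) {
  let exact = 0;
  let detected = 0;
  const perClass = new Map();
  for (const { label, predicted } of records) {
    const stats = perClass.get(label) ?? { total: 0, hits: 0 };
    stats.total += 1;
    if (predicted !== null) detected += 1; // any pattern flagged
    if (predicted === label) {
      exact += 1; // flagged pattern equals the dataset label
      stats.hits += 1;
    }
    perClass.set(label, stats);
  }
  const recall = {};
  for (const [label, { total, hits }] of perClass) recall[label] = hits / total;
  return {
    exactMatch: exact / records.length,
    detectionRate: detected / records.length,
    recall,
  };
}

// Toy input: a donut labelled as center is a detection but not an exact match.
const sampleScores = scoreClassifier([
  { label: "donut", predicted: "center" },
  { label: "center", predicted: "center" },
  { label: "scratch", predicted: null },
  { label: "center", predicted: "center" },
]);
console.log(sampleScores);
```

This is why the detection rate can sit well above the exact-match figure: confusions between classes (such as donut vs center) still count as detections.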
Combined (classifier + regional analysis)¶
Of the 2,915 wafers the classifier missed, the regional analysis recovered:
| Measure | Rate |
|---|---|
| Any regional finding fired | 92.9% of misses |
| Semantically matched finding | 92.0% of misses |
Combined detection rate: 99.0% (any) / 98.9% (semantically matched)
The any/match gap is small — regional findings on classifier misses are almost always semantically correct, not noise.
Per-label rescue breakdown (classifier misses only):
| Label | Misses | Any rescue | Match rescue |
|---|---|---|---|
| center | 690 | 98.8% | 98.8% |
| donut | 238 | 98.7% | 98.3% |
| edge-local | 940 | 93.1% | 93.1% |
| edge-ring | 488 | 81.1% | 79.5% |
| scratch | 559 | 93.0% | 90.2% |
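The aggregate figures follow from the per-label table by simple weighting. The sketch below recomputes them from the published numbers only; the variable and function names are illustrative.

```javascript
// Per-label rescue rates copied from the table above (classifier misses only).
const rescueTable = [
  { label: "center", misses: 690, any: 0.988, match: 0.988 },
  { label: "donut", misses: 238, any: 0.987, match: 0.983 },
  { label: "edge-local", misses: 940, any: 0.931, match: 0.931 },
  { label: "edge-ring", misses: 488, any: 0.811, match: 0.795 },
  { label: "scratch", misses: 559, any: 0.93, match: 0.902 },
];

// Miss-weighted average of a per-label rate.
function weightedRate(rows, key) {
  const total = rows.reduce((sum, r) => sum + r.misses, 0);
  const rescued = rows.reduce((sum, r) => sum + r.misses * r[key], 0);
  return rescued / total;
}

const totalMisses = rescueTable.reduce((sum, r) => sum + r.misses, 0);
const anyRescue = weightedRate(rescueTable, "any");
const matchRescue = weightedRate(rescueTable, "match");

// Combined rate = classifier detections + rescued share of classifier misses.
const classifierDetection = 0.862;
const combinedAny = classifierDetection + (1 - classifierDetection) * anyRescue;
const combinedMatch = classifierDetection + (1 - classifierDetection) * matchRescue;

console.log(totalMisses); // 2915
console.log((anyRescue * 100).toFixed(1), (matchRescue * 100).toFixed(1)); // 92.9 92.0
console.log((combinedAny * 100).toFixed(1), (combinedMatch * 100).toFixed(1)); // 99.0 98.9
```

The per-label rows sum to the 2,915 misses and reproduce the 92.9% / 92.0% rescue rates and the 99.0% / 98.9% combined rates quoted above.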
False positive characterisation¶
WM-811K "Random"/"none" wafers (N = 4,459)¶
| Method | FP rate |
|---|---|
| Classifier only | 41.2% |
| Regional analysis only | 87.7% |
| Combined | 91.8% |
Important caveat: WM-811K "Random" is a catch-all label — ambiguous, multi-modal, or low-confidence wafers all end up there. Some fraction genuinely do have spatial structure. This FP rate is not a clean baseline.
Synthetic random wafers (i.i.d. Bernoulli, N = 500/cell)¶
To get a clean FP baseline, wafers were generated with purely random die failures at varying failure rates and grid sizes (20×20 to 52×52, matching WM-811K's range):
| Fail rate | Regional FP rate |
|---|---|
| 2% | 13.9% |
| 5% | 15.1% |
| 10% | 55.9% ← known issue |
| 20% | 2.6% |
| 40% | 2.2% |
| 60% | 3.7% |
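For reference, an i.i.d. Bernoulli wafer of the kind described above can be generated along these lines. This is a sketch under stated assumptions, not the code in run-benchmark-synthetic-fp.mjs: the circular die mask, the seeded mulberry32 generator, and the die record shape are all choices made for this example.

```javascript
// Small seeded PRNG (mulberry32) so that runs are reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical generator: keeps dies whose centre lies inside the circle
// inscribed in a gridSize x gridSize grid, and fails each die independently
// with probability failRate.
function generateRandomWafer(gridSize, failRate, rng) {
  const centre = gridSize / 2;
  const dies = [];
  for (let row = 0; row < gridSize; row++) {
    for (let col = 0; col < gridSize; col++) {
      const dy = row + 0.5 - centre;
      const dx = col + 0.5 - centre;
      if (dx * dx + dy * dy > centre * centre) continue; // outside the wafer
      dies.push({ row, col, fail: rng() < failRate });
    }
  }
  return dies;
}

const syntheticWafer = generateRandomWafer(52, 0.1, mulberry32(1));
const failCount = syntheticWafer.filter((d) => d.fail).length;
console.log(syntheticWafer.length, (failCount / syntheticWafer.length).toFixed(3));
```

Any regional finding raised on such a wafer is a false positive by construction, which is what makes this a clean baseline.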
The spike at a 10% failure rate is a fundamental statistical power problem: with 1,700+ dies on a large wafer, the two-proportion z-test is powerful enough to flag chance regional clustering as significant. Post-hoc count gates cannot fix this, because any gate that reduces the number of candidates fed into the Bonferroni correction also relaxes the correction, cancelling the benefit. The correct fix, replacing Bonferroni with a globally applied Benjamini-Hochberg FDR procedure, is a larger refactor, so the spike is documented as a known limitation.
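The mechanics of the two corrections can be sketched as follows. This only illustrates the textbook procedures and the gate/threshold interaction described above. It does not reproduce the library's implementation, and the function names and sample p-values are made up for this example.

```javascript
// Bonferroni: every test is compared against alpha / m.
function bonferroniThreshold(alpha, testCount) {
  return alpha / testCount;
}

function bonferroni(pValues, alpha = 0.05) {
  const threshold = bonferroniThreshold(alpha, pValues.length);
  return pValues.map((p) => p <= threshold);
}

// Benjamini-Hochberg step-up: find the largest rank k (1-based) with
// p_(k) <= (k / m) * alpha, then reject the k smallest p-values.
function benjaminiHochberg(pValues, alpha = 0.05) {
  const m = pValues.length;
  const order = pValues.map((p, index) => ({ p, index })).sort((a, b) => a.p - b.p);
  let cutoff = -1;
  order.forEach(({ p }, rank) => {
    if (p <= ((rank + 1) / m) * alpha) cutoff = rank;
  });
  const rejected = new Array(m).fill(false);
  for (let r = 0; r <= cutoff; r++) rejected[order[r].index] = true;
  return rejected;
}

// Illustrative p-values for ten candidate regions (not benchmark data).
const samplePValues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216];
console.log(bonferroni(samplePValues).filter(Boolean).length); // 1
console.log(benjaminiHochberg(samplePValues).filter(Boolean).length); // 2

// Gating away half the candidates doubles the per-test Bonferroni threshold.
console.log(bonferroniThreshold(0.05, 10), bonferroniThreshold(0.05, 5));
```

The last line shows the interaction in miniature: a count gate that halves the candidate list doubles the per-test threshold that the surviving candidates must meet.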
Comparison to published methods¶
| Method | Accuracy on WM-811K | Notes |
|---|---|---|
| Our classifier (exact match) | 64% | Rule-based, no training |
| Our system (any detection) | 86.2% / 99.0% combined | |
| Decision tree + Radon features | >98% | 59 handcrafted features, trained ensemble |
| CNN-based (various) | 96–99.9% | Trained on balanced/oversampled subsets |
Our system is not directly comparable to CNN figures — they solve a different problem: they require labelled training data, operate on pixel images, and typically evaluate on class-balanced subsets. Our system works from die coordinates and bin data with no training, produces fully interpretable named findings, and functions on any wafer regardless of diameter or die pitch.
The most relevant comparison is the Decision Tree + Radon transform approach. The accuracy gap (64% vs >98%) is largely in the hard classes (donut at 15% recall, scratch at 26%), which benefit from the global projection information of the Radon transform (line integrals taken across the whole wafer) that local geometry cannot capture.
Our unique strength: the combined 99% detection rate is competitive as a detection system, and the integration with statistical regional analysis (which CNN papers do not provide) means the library surfaces both a pattern label and statistical evidence for the finding — giving process engineers more actionable information than a bare classification alone.
Design decisions¶
Adaptive thresholds¶
minimumClusterSize scales with wafer die count (max(5, round(N × 0.003))).
A 5-die cluster is meaningful on a small wafer but noise on a 2,500-die wafer.
This is the only regional-analysis threshold adapted at runtime. The others
(minimumSampleSize, significanceLevel) must not be adapted, because they
interact with the Bonferroni correction: changing the number of tests fed into
it alters the correction itself, producing unpredictable FP rate changes.
The classifier's minimumFailingDies and salienceSize floor also scale by
the same formula, for the same reason.
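The scaling rule is small enough to write out. The sketch below transcribes the formula quoted above; the function name is illustrative and not part of the library's API.

```javascript
// max(5, round(N × 0.003)), where N is the wafer's die count.
function adaptiveMinimumClusterSize(dieCount) {
  return Math.max(5, Math.round(dieCount * 0.003));
}

console.log(adaptiveMinimumClusterSize(400)); // 5 (floor applies)
console.log(adaptiveMinimumClusterSize(2000)); // 6
console.log(adaptiveMinimumClusterSize(3000)); // 9
```

With the 0.003 factor, the floor of 5 applies until the die count reaches roughly 1,800, after which the threshold grows linearly with wafer size.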
Removed from public API¶
The following were removed from AnalyzeWaferMapOptions to prevent users from
inadvertently degrading accuracy:
- minimumSampleSize: interacts with Bonferroni; user adjustment produces unpredictable FP rate changes
- minimumClusterSize: now auto-scaled; no user knob needed
- patternThresholds / PatternThresholds / DEFAULT_PATTERN_THRESHOLDS: 17 inter-dependent thresholds calibrated on 25k wafers; adjusting one without understanding the others silently degrades classification accuracy
RELATED_FAMILIES mapping¶
When the classifier identifies a pattern with high/medium confidence, correlated
regional findings are downgraded to info to avoid double-counting. The mapping
was updated to include sector and quadrant for edge-local and scratch —
both patterns can produce a sector or quadrant finding when the failure is
sufficiently localised.
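A minimal sketch of how such a mapping could drive the downgrade is shown below. Only the pairings named on this page are listed, so the table is deliberately partial. The finding and classification shapes (`family`, `severity`, `pattern`, `confidence`) and the helper name are assumptions for illustration, not the library's actual types.

```javascript
// Partial, illustrative mapping: pattern label -> correlated regional families.
const RELATED_FAMILIES = {
  "edge-ring": ["ring"],
  "edge-local": ["sector", "quadrant"],
  scratch: ["sector", "quadrant"],
};

// Hypothetical helper: returns a new findings array in which findings
// correlated with a high/medium-confidence classification become "info".
function downgradeCorrelatedFindings(classification, findings) {
  const confident =
    classification !== null && ["high", "medium"].includes(classification.confidence);
  if (!confident) return findings;
  const related = RELATED_FAMILIES[classification.pattern] ?? [];
  return findings.map((finding) =>
    related.includes(finding.family) ? { ...finding, severity: "info" } : finding
  );
}

const downgraded = downgradeCorrelatedFindings(
  { pattern: "scratch", confidence: "high" },
  [
    { family: "sector", severity: "warning" },
    { family: "cluster", severity: "warning" },
  ]
);
console.log(downgraded.map((f) => `${f.family}:${f.severity}`)); // [ 'sector:info', 'cluster:warning' ]
```

Uncorrelated findings keep their severity, so an unrelated cluster on a scratched wafer is still reported at full weight.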
Donut improvement — not feasible with current features¶
innerOuterRatio was investigated as a co-discriminator for center vs donut.
In the WM-811K dataset, donuts have a higher inner/outer failure-rate ratio
than centers. This is counter-intuitive, but it follows from the donut hole
geometry: the donut ring sits in the outer half of the wafer, and the outer
half contains far more dies than the inner circle, so the outer failure rate
is diluted. The feature is therefore not a reliable discriminator. The 15%
donut recall ceiling is intrinsic to the p25DistNorm overlap between the two
classes.