Weather Edge

Daily briefing

All value bets across every open market, ranked by edge. Status tells you whether to bet now or wait for fresher models.

Model freshness — current time: — UTC

Min edge %

Bankroll $

Exclude US

Analyse market

Paste a Polymarket URL to load buckets automatically, then run the forecast analysis.

Step 1 — Load from Polymarket

Paste a URL above — city, date and buckets will all fill in automatically.

Step 2 — Confirm details

City — auto-set from URL

Market type

Date — auto-set from URL

Bankroll ($) ⓘ

Step 3 — Buckets

Temperature bucket | Market yes %

Results tracker

Paper trades logged automatically from the briefing. Results resolve overnight — check here each morning.

View:

📋 Paper trades

Auto-logged from briefing · resolves overnight from Polymarket

Manual trade log — for real bets or manual paper trades

Market (e.g. London 17°C)

Direction

Resolution date

Model probability (%)

Market odds (%)

Stake ($)

Outcome

Notes (optional)

Bet analysis

AI-powered statistical report on your paper trade history. Filters by bet type, outcome, stars, and city. Meaningful conclusions only appear once sufficient trades are logged.

Filters

Bet type

Outcome

Min stars

Region

City (blank = all)

Min trades to analyse

Report to Claude

Copy diagnostic blocks and paste into Claude for analysis. Each block computes live from all resolved trades in D1.

1 · NO losses by confidence bucket

NO bets only · model prob buckets · n / wins / HR / Brier / Wilson CI. Red flag: HR much lower than bucket midpoint = model overconfident on NO bets.

2 · YES HR by signal type

YES bets split by signal (GFS-only vs multi-model) · n / wins / HR / P&L / Wilson CI. Key question: does multi-signal YES outperform GFS-only YES?

3 · INT / US split by direction

INT·YES / INT·NO / US·YES / US·NO · n / HR / Brier / P&L / Wilson CI. Go-live decision: if the lower bound of the INT·NO Wilson CI is >45% at n≥30, that is the live-betting slice.

4 · Brier by lead time

0–24h / 24–48h / 48–96h / 4d+ · n / HR / Brier / P&L. Expected: Brier rises with lead time. Flag if D+2/3 worse than D+1 by >0.02.

5 · Reliability / calibration — highest diagnostic priority

All resolved trades · probability buckets · n / actual win rate / Wilson CI / vs midpoint. Shown for ALL trades, YES only, and NO only. Perfect calibration = dev of 0pp. Large negative dev on YES = structural YES model failure.

6 · Logistic recalibration coefficients

Fitted logistic regression coefficients (YES and NO separately), sample sizes, and per-bucket table: raw model prob vs calibrated prob vs actual HR. Gate: needs 30+ resolved trades per direction.

📊 Report Everything — full diagnostic bundle

Compiles all 6 diagnostic blocks plus a summary header into a single clipboard payload. Paste directly into Claude for a complete picture in one shot.
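The Wilson CI column used throughout the diagnostic blocks can be reproduced by hand. A minimal sketch, assuming a 95% interval (z = 1.96; the confidence level is not stated here) and hypothetical function names:

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a hit rate. z = 1.96 (~95%) is an assumption."""
    if n == 0:
        return (0.0, 1.0)
    p_hat = wins / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return (centre - half, centre + half)

def passes_go_live(wins: int, n: int, min_n: int = 30, min_lower: float = 0.45) -> bool:
    """Go-live rule from block 3: lower bound above 45% with at least 30 trades."""
    lower, _ = wilson_interval(wins, n)
    return n >= min_n and lower > min_lower
```

With 20 wins from 30 trades the lower bound is roughly 49%, so the slice passes; with 16 wins from 30 it does not.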

Weather Edge — User Guide

The definitive reference: how the app works, how to use it, and the betting strategy behind it.

1 · What this app does and why it works

Polymarket runs daily weather markets where traders bet on whether the temperature in a given city will land in a specific range. The market price reflects collective human judgment about the probability. Weather Edge replaces that human judgment with something better: a 31-member numerical weather prediction ensemble that produces a genuine probability distribution across outcomes.

The edge comes from a structural asymmetry that will not close: processing 31 ensemble forecasts and comparing them to market prices requires System 2 thinking — slow, deliberate, computational. The people pricing these markets use System 1 — fast, intuitive, heuristic. System 1 is incapable of running the calculation. This is not a temporary inefficiency. It is permanent, because it is cognitive.

The app automates the entire System 2 process: fetch models → compute probabilities → compare to market → size the bet → log and track. Your job is to check it twice a day and press the button.

2 · The four data sources

GFS ensemble (31 members)

The American Global Forecast System run 31 times with slightly different initial conditions. Each run produces a complete temperature forecast. We count what fraction of the 31 runs land in each bucket after rounding to whole degrees — matching Wunderground's resolution, which is how markets resolve. This is the primary betting signal. If 14 of 31 members show a daily high of 19°C, model probability = 45%.
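The member-counting step can be sketched in a few lines. The function names are hypothetical, and rounding halves upward is an assumption about how the resolution source rounds:

```python
import math
from collections import Counter

def round_half_up(x: float) -> int:
    """Round to the nearest whole degree, halves upward (assumed convention)."""
    return math.floor(x + 0.5)

def raw_bucket_probabilities(member_highs: list[float]) -> dict[int, float]:
    """Fraction of ensemble members landing in each whole-degree bucket."""
    counts = Counter(round_half_up(t) for t in member_highs)
    n = len(member_highs)
    return {degree: c / n for degree, c in sorted(counts.items())}

# 14 of 31 members round to 19°C
members = [19.2] * 14 + [18.4] * 10 + [20.3] * 7
probs = raw_bucket_probabilities(members)
```

Here probs[19] is 14/31, which displays as 45%.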

Dispersion correction. Raw member-counting has a known flaw: ensembles are under-dispersive — their spread is too narrow, so they are overconfident and systematically under-price the tail buckets. The open-source bots that count raw members inherit this exactly. Instead of placing each member in one hard bucket, the app treats each as a small Gaussian (kernel dressing) and integrates its mass across buckets. This fattens the tails to the right width and smooths the noise of having only 31 members. The kernel width grows with lead time. It is shown as the "Dispersion fix" chip, and per-bucket tooltips show raw→corrected. This is a genuine mathematical edge over naive member-counting, not just better packaging.
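A minimal sketch of kernel dressing, assuming whole-degree buckets that each cover [d − 0.5, d + 0.5) and a kernel width passed in by the caller. The guide says the width grows with lead time but does not give values, so sigma here is purely illustrative:

```python
import math

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative distribution of a Gaussian with mean mu and std dev sigma."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def dressed_bucket_probabilities(members: list[float], buckets: list[int],
                                 sigma: float) -> dict[int, float]:
    """Spread each member as a Gaussian and integrate its mass over each bucket."""
    probs = {}
    for d in buckets:
        mass = sum(
            normal_cdf(d + 0.5, m, sigma) - normal_cdf(d - 0.5, m, sigma)
            for m in members
        )
        probs[d] = mass / len(members)
    return probs

# Every member at exactly 19.0°C: raw counting would give 100% to the 19° bucket
dressed = dressed_bucket_probabilities([19.0] * 31, list(range(15, 24)), sigma=1.0)
```

With this illustrative width the 19° bucket keeps about 38% and the neighbouring and tail buckets receive the rest, which is the tail-fattening effect described above. Open-ended edge buckets (for example "23° or above") would integrate to infinity instead of d + 0.5.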

ECMWF IFS (deterministic + ensemble)

The European Centre for Medium-Range Weather Forecasts — generally considered the world's most accurate NWP system. Used as a cross-check against GFS. When both agree, confidence is high. When they diverge by more than 1.5°, caution is warranted.

Relative-skill blend. GFS and ECMWF are combined by their expected skill for the specific lead time, not pooled or averaged blindly. ECMWF gets more weight (≈55% at 1 day, rising toward ≈72% at 7 days, reflecting its accuracy advantage that widens with lead time), and the model that is more confident for this particular day — the one with the tighter ensemble spread — gets a further modest nudge. The blend ratio is shown as a chip on the results.
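One way to picture the blend, as a sketch only. The two anchor weights (≈55% at day 1, ≈72% at day 7) come from the text; the straight-line interpolation between them and the size of the spread nudge (max_nudge) are assumptions, not the app's actual curve:

```python
def ecmwf_weight(days_ahead: int, gfs_spread: float, ecmwf_spread: float,
                 max_nudge: float = 0.05) -> float:
    """ECMWF share of the blend. Linear ramp and nudge size are assumed."""
    d = min(max(days_ahead, 1), 7)
    weight = 0.55 + (0.72 - 0.55) * (d - 1) / 6
    # Nudge toward whichever model has the tighter ensemble spread today
    if ecmwf_spread < gfs_spread:
        weight += max_nudge
    elif gfs_spread < ecmwf_spread:
        weight -= max_nudge
    return min(max(weight, 0.0), 1.0)

def blend_probability(p_gfs: float, p_ecmwf: float, w_ecmwf: float) -> float:
    """Skill-weighted average of the two models' bucket probabilities."""
    return w_ecmwf * p_ecmwf + (1 - w_ecmwf) * p_gfs
```

For a day-1 market with equal spreads, a GFS probability of 40% and an ECMWF probability of 50% blend to 45.5%.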

GFS deterministic

The single best-estimate GFS forecast (as opposed to the ensemble). Used as a sanity check — it should be close to the ensemble mean. Large divergence between deterministic and ensemble mean is a flag.

METAR (live station) + model-bias nowcast

Current observed temperature at the exact airport station the market resolves against. Beyond just displaying it, the app now compares the model's forecast for the current hour against the actual station reading to measure live bias: if the model reads 2° warmer than the station right now, it is running warm today, in this airmass. For same-day and next-day markets this bias (damped to half its measured size) is subtracted from the ensemble before computing the edge — a today-specific correction independent of the long-run city-bias tracker. Shown as the "Live model bias" chip; "applied" means it is feeding the edge, "info only" means the market is too far ahead for live conditions to matter. Requires a free CheckWX API key in your Cloudflare environment variables.
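The live-bias correction reduces to a small amount of arithmetic. A sketch with hypothetical names, treating "same-day and next-day" as days_ahead of 0 or 1:

```python
def apply_live_bias(members: list[float], model_now: float, station_now: float,
                    days_ahead: int, damping: float = 0.5):
    """Return (corrected members, measured bias, applied flag)."""
    bias = model_now - station_now        # positive means the model is running warm
    if days_ahead > 1:
        return list(members), bias, False  # "info only": too far ahead to apply
    correction = damping * bias            # damped to half its measured size
    return [m - correction for m in members], bias, True

# Model reads 15°C for the current hour, station reads 13°C: bias is +2°
corrected, bias, applied = apply_live_bias([20.0, 21.0], 15.0, 13.0, days_ahead=0)
```

In this example each member is shifted down by 1°, half of the measured 2° warm bias.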

3 · Your daily workflow

The 80/20 answer: set one alarm for 07:05 BST. That single session captures the majority of available edge.

Primary — 07:05 BST daily

GFS 00Z and ECMWF 00Z both available ~06:00 BST. Both models fresh simultaneously. Markets dormant all night: maximum anchoring gap. European markets not yet repriced. US traders asleep. Run briefing, click Log all BET NOW, done in 15 minutes.

Secondary — 19:00 BST daily

GFS 12Z and ECMWF 12Z both available ~18:00 BST. Good for US markets — evening repricing often incomplete. Best window for Asian cities.

Opportunistic — ~00:30 BST

GFS 18Z available ~midnight BST. ECMWF stale. BET NOW fires on GFS alone if spread ≤1.5° and edge ≥12pp — shown with blue GFS only badge. Do not set an alarm for this.

4 · Model freshness — the three-tier status system

Tier 1 — Both GFS and ECMWF fresh

Maximum confidence. BET NOW fires when edge ≥10pp, spread tight, members sufficient. Standard case at 07:05 and 19:00 BST.

Tier 2 — GFS fresh, ECMWF stale (blue GFS only badge)

BET NOW fires only if GFS spread ≤1.5° AND edge ≥12pp. Applies at the midnight window.

Tier 3 — GFS stale

Always WAIT regardless of ECMWF freshness. GFS ensemble is the primary signal.

GFS runs 00Z/06Z/12Z/18Z + ~5h lag. ECMWF runs 00Z/12Z + ~5h lag. In BST (UTC+1): GFS available ~06:00/12:00/18:00/00:00. ECMWF ~06:00/18:00.

5 · Reading the daily briefing table

Each row is the single best opportunity in that market — the bucket with the largest model-vs-market divergence.

Dir — Direction

YES — model thinks this bucket is more likely than market implies. Buy YES shares. NO — market has overpriced this bucket. Buy NO shares.

Model% — Model probability

Fraction of GFS ensemble members landing in this bucket after rounding to whole degrees. 14 of 31 = 45%.

Mkt% — Market implied probability

Current YES price as a percentage. Edge figures are overstated by roughly 2-4pp due to fees and spread. Never bet on edges below 8pp gross.

Edge — The opportunity

Model% minus Mkt%. Green (+) = underpriced, bet YES. Red (−) = overpriced, bet NO.

Kelly — Suggested bet size

Quarter-Kelly from your bankroll, scaled by lead time and confidence stars. Always treat Kelly as a ceiling, not a target.

Spread — GFS internal uncertainty

Standard deviation of the 31 GFS members. ±0.8° = confident. ±2.5° = uncertain — consider halving Kelly.

Stars — Combined confidence rating

★★★ all conditions favourable. ★★☆ one condition marginal. ★☆☆ multiple conditions weak — speculative only.

6 · The status signals

BET NOW: Edge ≥10pp · GFS fresh · spread acceptable · members sufficient.

BET NOW (GFS only): Tier 2. GFS fresh, ECMWF stale, spread ≤1.5° and edge ≥12pp.

WAIT: Edge exists but one or more conditions unmet. Check again after the next model run.

PASS: Edge below threshold. Not worth trading after fees.
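The tier rules and status signals can be read as a single decision function. This is a sketch: the 10pp, 12pp and 1.5° thresholds come from the guide, while the Tier 1 spread limit (max_spread) and the member minimum (min_members) are not specified and are placeholders here:

```python
def status(edge_pp: float, gfs_fresh: bool, ecmwf_fresh: bool, spread: float,
           n_members: int, max_spread: float = 2.0, min_members: int = 20) -> str:
    """Map one briefing row to a status. max_spread and min_members are assumed."""
    edge = abs(edge_pp)
    if edge < 10:
        return "PASS"                      # edge below threshold
    if not gfs_fresh:
        return "WAIT"                      # Tier 3: GFS stale, always wait
    if n_members < min_members:
        return "WAIT"
    if ecmwf_fresh:                        # Tier 1: both models fresh
        return "BET NOW" if spread <= max_spread else "WAIT"
    if spread <= 1.5 and edge >= 12:       # Tier 2: GFS fresh, ECMWF stale
        return "BET NOW (GFS only)"
    return "WAIT"
```

For example, an 11pp edge with ECMWF stale returns WAIT, because Tier 2 needs at least 12pp.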

7 · Betting strategy — from conventional edge to Kahneman/Taleb

The people pricing these markets are not irrational. They are human. Human brains run decision-making shortcuts that create predictable, systematic, exploitable errors.

Daniel Kahneman — Thinking, Fast and Slow

System 1 (fast, intuitive) prices the market. System 2 (slow, deliberate) runs the 31-member ensemble calculation. System 1 can't do that. It will never do that. This structural gap is the permanent, non-arbitrageable core of your edge.

Nassim Taleb — The Black Swan

Markets built on human intuition systematically underprice tail events. The tail buckets in temperature markets face a double discount: statistically underpriced (Taleb) and psychologically avoided (Kahneman).

WYSIATI. The market prices what it can see: yesterday's weather, the BBC headline, the season. It cannot see 31 ensemble runs, the spread across members, or the ECMWF divergence. You can.

Anchoring. The first prices set on a market are highly sticky. The largest edges appear in the 1–2 hours after model updates, before the market has repriced.

Availability bias. After a cold spell, cold feels probable. The ensemble has no memory of last week. After unusual weather, the opposite tail is systematically underpriced.

Overconfidence — the rule for you. Never override the model based on personal weather intuition. The moment you do, you have become the market you are trying to beat.

The four opportunity types

Type A — Fresh model · stale market (WYSIATI + Anchoring)

The bread and butter. By 07:05 BST each day a fresh GFS run is available, while the market price was set by humans yesterday. The gap between them is your edge, and it closes within 1–2 hours as traders reprice. The majority of your BET NOW rows will be Type A.

Type B — Recency play · check opposite tail (Availability bias)

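After unusual weather (a heatwave, a cold snap) the market overweights continuation. When the model starts showing reversion the market hasn't priced, the opposite tail is systematically underpriced. Flagged automatically when the ERA5 base rate and the market strongly disagree (one above 70%, the other below 40%) and the model sides with ERA5; still worth checking manually after any sustained run of unusual weather.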

Type C — Tail underpriced · barbell bet (Loss aversion + Fat tails)

A tail bucket trading at very low odds (≤10%) where the ensemble shows meaningful support. Small stake, high payout. Auto-detected when market odds ≤10% and edge ≥8pp. Individually speculative — collectively exploiting a structural inefficiency that will never close.

Type D — History + model agree · highest confidence (Base rate neglect)

Where GFS model, ERA5 historical base rate, and edge all point in the same direction. Two independent signals disagreeing with the market. The D tag fires in the Analyse tab when Hist%, Model%, and edge all agree direction and edge ≥8pp.

8 · Kelly sizing — what it is and how to use it

The Kelly criterion answers: given an edge, what fraction of bankroll should you stake to maximise long-run growth without risking ruin?

The formula

f = (p×b − q) / b, where p = model probability, q = 1−p, and b = net odds received on a win (decimal odds − 1). For a share bought at price m (as a fraction), b = (1−m)/m. The result f is the optimal fraction of bankroll.
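A worked instance of the formula, taking b as the net odds on a win, which for a share bought at price m is (1 − m)/m. The function name is hypothetical:

```python
def kelly_fraction(p: float, price: float) -> float:
    """Full-Kelly fraction for a binary share bought at `price` (0-1)."""
    b = (1 - price) / price      # net odds received on a win
    q = 1 - p
    return max(0.0, (p * b - q) / b)

# Model says 45%, market YES price is 30%
f_full = kelly_fraction(0.45, 0.30)
```

Here f_full is about 0.214, so full Kelly would stake roughly 21% of bankroll before any of the reductions described below. For a NO bet, pass the NO-side values: p = 1 − model probability and price = 1 − YES price.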

Quarter-Kelly baseline. The app uses 25% of full Kelly. Accounts for the fact that our probability estimates are uncertain.

Continuous lead-time decay. max(0.25, 1 − 0.12×(daysAhead−1)). Day 1: 100%. Day 3: 76%. Day 5: 52%. Day 7: 28%.

Divergence reducer. When GFS and ECMWF disagree: 0–1°: 100%, 1–2°: 85%, 2–3°: 70%, 3–4°: 50%, >4°: 25%.

Star multiplier. ★★★ = 100%, ★★☆ = 50%, ★☆☆ = 25% of scaled Kelly.

5% bankroll cap. Hard ceiling per trade regardless of formula output.

25% portfolio cap. Across all open paper trades combined, total stake is capped at 25% of bankroll. Logging is blocked once that ceiling is reached, so several simultaneous BET NOW rows can't quietly stack into a large fraction of the bankroll.

Selection haircut (single-bucket). The single largest-edge bucket is the one most likely inflated by estimation noise (winner's curse). Until the multi-bucket engine ships, the stake on the single best bucket is sized on 80% of the measured edge — the displayed edge stays honest, only the stake is trimmed.

Live odds recheck. When you log a paper trade, the current Polymarket price is re-fetched. If the edge has closed below 8pp or flipped, the trade is not logged and you're told the price moved.

The honest caveat. All multipliers are derived from judgment, not empirical calibration. Until you have 200+ resolved trades and Brier score below 0.20, start real-money bets at half the displayed amount.
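The per-trade multipliers above chain together roughly as follows. This is a sketch: it assumes the factors are simply multiplied, assumes how exact boundary values (such as a divergence of exactly 1°) are assigned, and leaves out the portfolio cap, the selection haircut and the live odds recheck:

```python
def lead_time_factor(days_ahead: int) -> float:
    return max(0.25, 1 - 0.12 * (days_ahead - 1))

def divergence_factor(divergence: float) -> float:
    """GFS vs ECMWF disagreement in degrees. Boundary handling is assumed."""
    if divergence <= 1:
        return 1.0
    if divergence <= 2:
        return 0.85
    if divergence <= 3:
        return 0.70
    if divergence <= 4:
        return 0.50
    return 0.25

STAR_FACTOR = {3: 1.0, 2: 0.5, 1: 0.25}

def stake_usd(bankroll: float, full_kelly: float, days_ahead: int,
              divergence: float, stars: int) -> float:
    fraction = (0.25 * full_kelly            # quarter-Kelly baseline
                * lead_time_factor(days_ahead)
                * divergence_factor(divergence)
                * STAR_FACTOR[stars])
    return bankroll * min(fraction, 0.05)    # hard 5% cap per trade
```

For example, a $1,000 bankroll, full Kelly of 0.20, a day-3 market, 1.5° divergence and ★★☆ gives 0.25 × 0.20 × 0.76 × 0.85 × 0.5 ≈ 1.6% of bankroll, or about $16.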

8a · Multi-bucket Kelly portfolio (in development, not yet live)

The briefing currently bets only the single bucket with the largest edge in each market. That leaves value on the table and introduces a subtle bias. This section documents the planned fix: staking multiple buckets in the same market as one jointly-sized portfolio.

The problem — winner's-curse selection bias

Picking only the single max-edge bucket systematically selects the bucket whose edge is most inflated by estimation noise. The largest measured edge is disproportionately likely to be the one where our probability estimate is too high by chance. Betting only that bucket concentrates the position on the noisiest signal in the market.

The fix — exploit negative correlation. Within one market the buckets are mutually exclusive: at most one can win, so their payoffs are negatively correlated. A portfolio of negatively-correlated positive-edge bets has lower variance than any single leg, which is exactly the condition under which joint Kelly permits a larger total stake than sizing one bucket alone — not a smaller one.

The algorithm. Rank candidate buckets by model/market probability ratio. Add buckets to the portfolio one at a time, highest ratio first, while the marginal contribution to expected log-wealth stays positive. Stop when the next bucket would reduce it. Then solve the stakes jointly across the selected set rather than sizing each in isolation.
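Since the engine is not built, the following is only an illustration of the idea, not the planned implementation. It uses a known closed-form result for full-Kelly staking on mutually exclusive outcomes: rank by probability ratio, keep adding buckets while the ratio beats a running "reserve rate", then set each stake to p − m × reserve. Fees, the edge threshold, quarter-Kelly scaling and the bankroll caps are all ignored, and the function name is hypothetical:

```python
def joint_kelly_yes(model_probs: list[float], market_prices: list[float],
                    eps: float = 1e-9) -> dict[int, float]:
    """Full-Kelly YES stakes (bankroll fractions) across exclusive buckets."""
    order = sorted(range(len(model_probs)),
                   key=lambda i: model_probs[i] / market_prices[i],
                   reverse=True)
    selected = []
    sum_p = 0.0
    sum_m = 0.0
    reserve = 1.0   # wealth multiple kept in every outcome that is not staked
    for i in order:
        ratio = model_probs[i] / market_prices[i]
        # Stop when the next bucket no longer raises expected log-wealth
        if ratio <= reserve + eps or sum_m + market_prices[i] >= 1.0:
            break
        selected.append(i)
        sum_p += model_probs[i]
        sum_m += market_prices[i]
        reserve = (1.0 - sum_p) / (1.0 - sum_m)
    return {i: model_probs[i] - market_prices[i] * reserve for i in selected}

# Three buckets: model 50/30/20%, market 30/30/40%
stakes = joint_kelly_yes([0.5, 0.3, 0.2], [0.3, 0.3, 0.4])
```

In this example the result stakes 35% on the first bucket and 15% on the second, even though the second has zero edge on its own: it acts as a hedge for the first. That is the joint-sizing effect, and it is also why the per-leg edge threshold in the guardrails would change the selection in practice.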

YES-only for v1. The first version stakes YES legs exclusively. This keeps the maths clean and unit-testable with no double-counting. A NO position overlaps YES-on-the-complement of its bucket, so mixing NO legs in requires a netting layer to avoid counting the same outcome twice — that's deferred to v1.1.

Guardrails. Per-leg fees still apply, so each bucket must individually clear the edge threshold before it can join the portfolio. Total stakes are capped so that even an un-bet bucket resolving against the whole portfolio cannot breach the 5% bankroll limit on the market.

Status. Documented here for sign-off only. The engine itself — algorithm, test cases, and schema — is not built yet and nothing in the live app sizes multi-bucket positions today.

9 · Reading the reliability diagram

The reliability diagram answers: when the model says 40% probability, does it actually win 40% of the time? It appears in the tracker once you have 15 resolved trades.

How to read it. X axis = model predicted probability. Y axis = actual win rate. Dashed diagonal = perfect calibration. Dots sized by trade count per bucket.

Points above the diagonal — underconfident. Model predicts 40% but you win 55%. True edge is larger than calculated.

Points below the diagonal — overconfident. Model predicts 60% but you win 45%. Kelly stakes are too large. Reduce until the diagram corrects.

Brier score

Mean squared error of probability forecasts. 0 = perfect. 0.25 = uninformative coin flip. A well-calibrated weather model achieves 0.15–0.20 for day-1 forecasts.
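A short sketch of the calculation, with outcomes coded as 1 for a win and 0 for a loss (the function name is hypothetical):

```python
def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

coin_flip = brier_score([0.5, 0.5], [1, 0])      # always forecasting 50%
perfect = brier_score([1.0, 0.0], [1, 0])        # always certain and right
```

Forecasting 50% every time scores exactly 0.25, which is why that value is the coin-flip benchmark.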

9b · The strategic framework — why this edge exists

Kahneman — availability heuristic as the primary exploitable bias

Human traders anchor to what has happened recently and systematically underprice reversion to the long-run climatological mean. A week of hot weather makes the market overconfident that tomorrow will be hot too — not because of any rational model, but because recent events feel more likely. This is Kahneman's availability heuristic operating at scale and it is measurable: your ERA5 historical base rates reveal exactly when the market has drifted too far from the long-run mean. Type B bets (reversion plays) are the direct exploitation of this bias. International markets show this effect more strongly than US markets because fewer sophisticated operators are correcting for it.

Taleb — tail underpricing as a structural market feature

The market cannot price tail weather events correctly because human traders are loss-averse and will not sell a 3% contract at fair value — the potential loss feels disproportionate relative to the premium received. This creates systematic cheap optionality on extreme weather outcomes. Type C bets (tail underpricing, market ≤10%) are a deliberate black swan strategy: you pay a small repeated cost waiting for a rare event that pays 10-30× stake. The key insight from Taleb is that tail wins are not anomalies to be stripped from P&L — they are the point. A strategy that is breakeven ex-tails but strongly positive including tails has genuine structural edge, not luck. The three P&L figures on the dashboard are designed to make this visible.

The Dixon-Coles analogy — structural mispricing not random noise

Dixon and Coles found that football bookmakers systematically mispriced low-scoring games because naive Poisson models ignored score correlation. The weather equivalent is not the same distributional anomaly but the same underlying mechanism: a structural feature of how markets price outcomes that a simple model gets wrong. Here, the structural feature is the market overweighting recent conditions (the Kahneman effect) combined with an inability to price tails (the Taleb effect). Both are persistent because they arise from human cognitive architecture, not from a correctable data or model error. They will not be arbitraged away quickly by competing bots — especially on international markets where bot coverage is thin.

9a · Reading the performance dashboard — what each number means

Hit rate

Simply the percentage of bets that won. 41% means 25 wins from 61 resolved trades. The target is 55%+ — not because you need to win more than half of bets, but because the markets you should be betting on are priced below 50%, so winning 55% means you are consistently finding genuine mispricing rather than just getting lucky. Below 50% over many trades is a clear signal to review the model or filtering. The cumulative trend chart matters more than the snapshot — a rising line from left to right means quality is improving as you accumulate experience and tighten the filters.

Model EV (Expected Value)

Every time you place a bet, the model says "I think the true probability of this outcome is X%, but the market is only offering Y%." The gap is your edge. Model EV multiplies that edge by your stake for every trade and adds them all up — regardless of whether the bet actually won or lost. So if you bet $25 with a 20pp edge, that trade contributes $5 to model EV. Do it 100 times and model EV is +$500, even if you happened to lose 60 of them through bad luck. It is the purest measure of whether the model is genuinely finding mispricing in a consistent direction. Variance (luck) cannot touch it. If model EV is strongly positive and growing steadily, the edges are real and profit will follow with enough volume. If model EV is flat or declining, the model is not finding genuine disagreements — it is just noise.
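Following the definition above literally (edge multiplied by stake, summed over every trade), model EV is a one-line sum. The field names are hypothetical, and the absolute value is taken so that NO bets, whose edge is shown as negative, count in the bet's favour:

```python
def model_ev(trades: list[dict]) -> float:
    """Sum of stake x |edge| over all trades, ignoring whether each one won."""
    return sum(t["stake"] * abs(t["edge_pp"]) / 100 for t in trades)

one_trade = model_ev([{"stake": 25, "edge_pp": 20}])
hundred_trades = model_ev([{"stake": 25, "edge_pp": 20}] * 100)
```

This reproduces the figures in the text: $5 for the single trade and $500 for one hundred of them.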

The three P&L figures

Total P&L incl. tails — every win counted, including rare outsized payouts where the market massively underpriced a tail event and it came in. This is the Taleb number: black swan wins are not anomalies to be stripped out, they are the point. The tail win count (🦢) tells you how many are in the figure.

Ex-tails P&L — strips out wins paying more than 5× stake. This is the conservative baseline: does the strategy make money without relying on lightning striking? If this is deeply negative over many trades, the edge is coming from variance rather than skill.

Model EV — see above. This is the most important number at low trade counts. A positive and growing model EV with negative ex-tail P&L simply means you need more volume — the edges are there but variance is dominating.

The KPI trend charts

Three charts running from trade 1 to now. Each has two lines: a solid bright line showing the cumulative figure (all trades so far), and a faint dashed line showing the rolling 20-trade window (recent performance). The cumulative line is the main signal — it starts noisy at low trade counts and gradually stabilises. The rolling line shows whether recent trades are performing differently from the long-run average: if it drops below the cumulative line, you have had a bad recent patch; if it rises above, recent trades are outperforming. Reference lines on each chart show targets: 55% hit rate, 0.20 Brier (good), 0.25 Brier (coin flip), and zero P&L. Available at 20+ resolved trades.

Avg |edge| and direction split

The average absolute distance between model probability and market price, across all trades. Displayed as a positive number regardless of YES/NO direction — a NO bet at −39pp and a YES bet at +39pp both represent the same amount of disagreement with the market. The direction split (e.g. 40% YES · 60% NO) shows whether your book is balanced. A heavy NO skew (flagged in amber above 70%) means most of your bets are against highly-priced markets — lower payouts per win, and more sensitive to the tail filter.

10 · City bias analysis

The city analysis panel in the tracker diagnoses whether GFS has a systematic warm or cold bias for each location. Cards appear at 5+ resolved trades per city.

The YES/NO split. The key diagnostic. If GFS runs warm, YES bets on high-temperature buckets lose more than expected while NO bets win more.

Confidence gates. Below 10 trades: no diagnosis. 10–19: tentative. 20–29: emerging, 10% stake reduction. 30+: 25% reduction.

Non-stationarity. GFS bias varies by season and is reset by model upgrades. Treat city bias as a rolling signal, not a fixed correction.

10a · Historical base rates — ERA5

The Hist% column in the Analyse tab shows the historical frequency of each temperature bucket for that city and month, drawn from 10 years of ERA5 reanalysis data.

What it's good for. Catching structural market mispricing. Detecting availability bias. Identifying Type D opportunities.

What it cannot do. Predict tomorrow. GFS is far better at that.

Recency weighting. The app applies a linearly decaying weight to the 10-year ERA5 data: the most recent year gets weight 2.0, the oldest 0.2. This partially corrects the cold bias that the warming trend introduces into older years.
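The weighting can be sketched as below, assuming evenly spaced weights between 0.2 and 2.0 and a weighted frequency computed per bucket. Function and argument names are hypothetical:

```python
def recency_weights(n_years: int = 10, newest: float = 2.0,
                    oldest: float = 0.2) -> list[float]:
    """Linearly spaced weights, oldest year first."""
    step = (newest - oldest) / (n_years - 1)
    return [oldest + step * i for i in range(n_years)]

def weighted_frequency(hits_by_year: list[int], days_by_year: list[int],
                       weights: list[float]) -> float:
    """Recency-weighted share of days that landed in the bucket."""
    weighted_hits = sum(w * h for w, h in zip(weights, hits_by_year))
    weighted_days = sum(w * d for w, d in zip(weights, days_by_year))
    return weighted_hits / weighted_days

weights = recency_weights()
# A bucket hit on every day of the newest year only, out of 30 days per year
hist = weighted_frequency([0] * 9 + [30], [30] * 10, weights)
```

Unweighted, that bucket would show 10%. With recency weighting it shows about 18% (2.0 out of a total weight of 11), because the newest year counts ten times as much as the oldest.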

Sample size note. ~21 observations per bucket — treat Hist% as having ±5-8pp uncertainty. A gap of 2-3pp is not meaningful. A gap of 15pp+ is.

11 · Known limitations

31 members is not a large ensemble. Probability estimates are granular to ~3pp. True probability could differ from our estimate by 10pp in either direction.

Members are not independent. Effective sample size for capturing true atmospheric uncertainty is considerably less than 31.

Edge overstated by ~2-4pp. We compare to the displayed market price, not the true breakeven price after fees.

Kelly multipliers are arbitrary. Not derived from empirical data. Reasonable starting points, nothing more.

Selection bias in hit rate. We only bet when edge exceeds a threshold. Hit rate is not an unbiased estimator of model accuracy.

One season is not calibration. 50 trades in June tells you almost nothing about winter.

11a · Sample sizes — when can you trust the data?

To detect 60% hit rate vs 50% (large effect): 50 trades at 90% confidence.

To detect 55% hit rate vs 50% (moderate effect): 193 trades at 90% confidence.

For city-level bias direction: 100 trades per city.

For auto-betting go/no-go: 500 total trades, 50+ per major city, Brier score below 0.20.

The autocorrelation problem

Sequential trades during the same weather regime are not statistically independent. Ten trades during a June heatwave may have an effective sample size of 2-3. Real-world requirements are roughly double the thresholds above.

12 · Development history — the logic of each improvement

Each build added something specific to close a gap between what the model knew and what the app could act on. This is the sequence and the reasoning behind it.

Stage 1 — Core engine: GFS ensemble probabilities + kernel dressing

The foundation. Instead of taking a single raw GFS forecast at face value, the app uses the 31 GFS ensemble members and kernel-dresses them, smoothing the probability distribution to account for the fact that no model is perfectly precise. This produces better-calibrated probabilities than any single forecast. The core insight: Polymarket prices are set by humans using yesterday's intuition, while a fresh model run is ready by the 07:05 BST briefing. That gap is the edge.

Stage 2 — ECMWF as a second opinion + model disagreement signal

GFS alone can be confidently wrong. Adding ECMWF (a different model from a different institution) gives a genuine second opinion. When the two models agree to within 1°C, confidence rises and the stake keeps its full size. Once they diverge by more than 1°C, the stake is cut in steps, not because either is wrong, but because disagreement signals genuine uncertainty the market may have already priced. This is the divergence reducer in the Kelly formula.

Stage 3 — ERA5 historical base rates: the long-run anchor

Ten years of hourly historical weather data (ERA5 reanalysis) gives the true base rate for each temperature bucket, per city, per month. This surfaces two things: (1) Type D bets — where both the model and history agree against the market, the highest-confidence signal; and (2) the Type B signal introduced later — where history and the market are pulling in completely opposite directions, suggesting the market is overreacting to recent weather.

Stage 4 — Kelly sizing: how much to stake, not just whether to bet

Having an edge is one thing; sizing the bet correctly is another. The Kelly criterion gives the mathematically optimal fraction of bankroll to stake. The app uses quarter-Kelly (25% of full Kelly) as a safety margin, then applies four further adjustments: decay for bets further ahead in time (model less reliable at day 3 vs day 1); the divergence reducer above; a star multiplier for overall signal quality; and a hard 5% bankroll cap per trade. Without correct sizing, even a genuine edge can blow up a bankroll through variance.

Stage 5 — Daily briefing: automation of the morning scan

Before the briefing tab, every bet required manually loading each city and market. The briefing automates this: at 07:05 BST it scans all cities, all dates, finds every BET NOW signal, and ranks them by edge. The 07:05 timing is deliberate — GFS has just updated, Polymarket prices haven't yet responded. This is when the edge is widest. The briefing also introduced BET ALL NOW — log every qualifying signal in one click.

Stage 6 — GFS sanity gate: filtering false positives

A specific problem emerged: when running on GFS alone (ECMWF unavailable), some signals were being generated for temperatures that were physically implausible for that city and month, e.g. Miami in June showing cold. The sanity gate checks the GFS ensemble mean against the known seasonal range (from ERA5) for that city and month, and silently rejects any signal where the ensemble mean falls outside that range. The gate only fires on Tier 2 (GFS-only) runs to avoid false positive trades being logged.

Stage 7 — Performance dashboard: measuring what matters

As trades accumulated, the need to measure performance properly became critical. The dashboard tracks three things that matter: hit rate (are you winning more than you lose?), Brier score (are your probabilities well-calibrated, not just right/wrong?), and P&L (are you actually making money?). The Brier benchmark bar contextualises the score — 0.25 is a coin flip, 0.20 is good, 0.15 is excellent. The reliability diagram shows whether your 60% predictions actually win 60% of the time.

Stage 8 — CLV tracking: did the market agree with you after the fact?

Closing Line Value (CLV) is a concept from sports betting: if you back something at 40% odds and the market closes at 55%, the market itself validated your entry — you got in cheap before others spotted the same signal. Positive mean CLV over many trades is stronger evidence of genuine edge than win rate alone, because it measures whether the market moved in your direction after you entered. Negative CLV means the market consistently knows better than you at entry time.
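In percentage points, CLV is the move in the price of the side you bought between entry and close. A minimal sketch with hypothetical names, using the YES price for both readings:

```python
def clv_pp(direction: str, entry_yes_pct: float, closing_yes_pct: float) -> float:
    """Closing line value in pp, positive when the market moved your way."""
    move = closing_yes_pct - entry_yes_pct
    return move if direction == "YES" else -move

example = clv_pp("YES", 40, 55)
```

The example from the text (backed at 40%, closed at 55%) gives +15pp. A NO bet on the same market would show −15pp.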

Stage 9 — P&L bug fix: correct payout for NO bets

A material bug was discovered: the P&L formula was using the YES market price to calculate winnings on NO bets. A NO bet on a 10% market (i.e. you think it won't happen) costs 90¢ per share and pays only ~11% if you win, but the code was treating it as paying roughly 900%. This produced wildly inflated P&L figures that made the dashboard look profitable when it wasn't. Fixed by making the payout formula direction-aware: YES bets use the YES price, NO bets use (100 minus the YES price).
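The direction-aware payout can be sketched as follows, ignoring fees. The function name and arguments are hypothetical:

```python
def trade_pnl(stake: float, direction: str, yes_price_pct: float, won: bool) -> float:
    """Profit or loss on one trade, using the price of the side actually bought."""
    entry = yes_price_pct if direction == "YES" else 100 - yes_price_pct
    if not won:
        return -stake
    return stake * (100 - entry) / entry

no_win = trade_pnl(90, "NO", 10, True)    # NO shares cost 90c each
yes_win = trade_pnl(10, "YES", 10, True)  # YES shares cost 10c each
```

A winning $90 NO bet on a 10% market earns $10 (about 11%), while a winning $10 YES bet on the same market earns $90 (900%). Pricing the NO bet with the YES price is what inflated the old figures.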

Stage 10 — Bet type as a first-class field + Type B ERA5 signal (current build)

Previously the bet type (A/B/C/D) was only stored as text inside a notes string, making it fragile to read back. It's now a proper data field on every trade. Type B (reversion play) was previously manual-only — no signal. A data-driven trigger was added: when ERA5 historical base rate strongly disagrees with the market (one says >70%, the other says <40%), this is the statistical fingerprint of the market overweighting a recent unusual weather run. The model agreeing with ERA5 makes it actionable.

13 · What comes next

Historical backtest tab. Fetch resolved Polymarket markets + ERA5 actuals, simulate Kelly bets, test market mispricing thesis without waiting for live trade volume.

Automatic bias compensation. Once city bias is confirmed at 30+ trades, Kelly auto-reduces on the biased direction.

Auto-betting infrastructure. Python backend, ★★★ only, Tier 1 only, daily loss limit 5%, kill switch. Not until 500+ calibrated trades.

Bootstrap confidence intervals. Error bars on all dashboard statistics.

Precipitation markets. Infrastructure exists, awaiting Polymarket daily markets.