Polymarket runs daily weather markets where traders bet on whether the temperature in a given city will land in a specific range. The market price reflects collective human judgment about the probability. Weather Edge replaces that human judgment with something better: a 31-member numerical weather prediction ensemble that produces a genuine probability distribution across outcomes.
The edge comes from a structural asymmetry that will not close: processing 31 ensemble forecasts and comparing them to market prices requires System 2 thinking — slow, deliberate, computational. The people pricing these markets use System 1 — fast, intuitive, heuristic. System 1 is incapable of running the calculation. This is not a temporary inefficiency. It is permanent, because it is cognitive.
The app automates the entire System 2 process: fetch models → compute probabilities → compare to market → size the bet → log and track. Your job is to check it twice a day and press the button.
The American Global Forecast System (GFS) is run 31 times with slightly different initial conditions. Each run produces a complete temperature forecast. We count what fraction of the 31 runs land in each bucket after rounding to whole degrees (matching Wunderground's resolution, which is how markets resolve). This is the primary betting signal. If 14 of 31 members show a daily high of 19°C, model probability = 45%.
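As a minimal sketch of the member-counting step, assuming the member highs arrive as a plain list of floats in °C and that rounding is half-up (the function name and the rounding rule are illustrative assumptions, not taken from the app):

```python
import math
from collections import Counter

def raw_bucket_probabilities(member_highs_c):
    """Share of ensemble members landing on each whole-degree value."""
    # Assumption: round half up to whole degrees. The resolution source's
    # exact rounding rule should be checked before relying on this.
    rounded = [math.floor(t + 0.5) for t in member_highs_c]
    counts = Counter(rounded)
    total = len(rounded)
    return {degree: n / total for degree, n in sorted(counts.items())}

# Hypothetical 31-member run in which 14 members round to 19°C.
members = [19.2] * 14 + [18.1] * 10 + [20.3] * 7
probs = raw_bucket_probabilities(members)
```

Here `probs[19]` is 14/31, which displays as 45%.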
Dispersion correction. Raw member-counting has a known flaw: ensembles are under-dispersive — their spread is too narrow, so they are overconfident and systematically under-price the tail buckets. The open-source bots that count raw members inherit this exactly. Instead of placing each member in one hard bucket, the app treats each as a small Gaussian (kernel dressing) and integrates its mass across buckets. This fattens the tails to the right width and smooths the noise of having only 31 members. The kernel width grows with lead time. It is shown as the "Dispersion fix" chip, and per-bucket tooltips show raw→corrected. This is a genuine mathematical edge over naive member-counting, not just better packaging.
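Kernel dressing can be sketched as follows. This is an illustration under stated assumptions: each bucket is taken to cover half a degree either side of its whole-degree value, and `kernel_sigma` with its `base` and `per_day` values is a hypothetical width schedule (the text above says only that the width grows with lead time).

```python
import math

def normal_cdf(x, mean, sigma):
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def kernel_sigma(days_ahead, base=0.6, per_day=0.15):
    # Hypothetical schedule: wider kernels at longer lead times.
    return base + per_day * max(0, days_ahead - 1)

def dressed_bucket_probabilities(member_highs_c, degrees, sigma):
    """Average Gaussian mass falling in each bucket [d - 0.5, d + 0.5)."""
    probs = {}
    for d in degrees:
        mass = sum(
            normal_cdf(d + 0.5, t, sigma) - normal_cdf(d - 0.5, t, sigma)
            for t in member_highs_c
        )
        probs[d] = mass / len(member_highs_c)
    return probs

# Every member at exactly 19.0°C: raw counting would give 100% to 19°C.
members = [19.0] * 31
dressed = dressed_bucket_probabilities(members, range(15, 24), kernel_sigma(1))
```

With dressing, the 19°C bucket keeps most of the mass but the neighbouring buckets receive a non-zero share, which is the tail-fattening effect described above.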
The European Centre for Medium-Range Weather Forecasts — generally considered the world's most accurate NWP system. Used as a cross-check against GFS. When both agree, confidence is high. When they diverge by more than 1.5°, caution is warranted.
Relative-skill blend. GFS and ECMWF are combined by their expected skill for the specific lead time, not pooled or averaged blindly. ECMWF gets more weight (≈55% at 1 day, rising toward ≈72% at 7 days, reflecting its accuracy advantage that widens with lead time), and the model that is more confident for this particular day — the one with the tighter ensemble spread — gets a further modest nudge. The blend ratio is shown as a chip on the results.
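A rough sketch of the blend, assuming linear interpolation between the two quoted weights and a fixed `nudge` of 3 percentage points for the tighter ensemble (both are assumptions; the text gives only the endpoints and says the nudge is modest):

```python
def ecmwf_weight(days_ahead, gfs_spread_c, ecmwf_spread_c, nudge=0.03):
    """ECMWF share of the blend: ~0.55 at day 1 rising to ~0.72 at day 7."""
    d = min(max(days_ahead, 1), 7)
    weight = 0.55 + (0.72 - 0.55) * (d - 1) / 6   # assumed linear ramp
    if ecmwf_spread_c < gfs_spread_c:
        weight += nudge                            # ECMWF is the tighter model
    elif gfs_spread_c < ecmwf_spread_c:
        weight -= nudge                            # GFS is the tighter model
    return min(max(weight, 0.0), 1.0)

def blend(gfs_probs, ecmwf_probs, w_ecmwf):
    """Weighted mix of two bucket-probability dicts keyed by degree."""
    keys = sorted(set(gfs_probs) | set(ecmwf_probs))
    return {
        k: (1.0 - w_ecmwf) * gfs_probs.get(k, 0.0) + w_ecmwf * ecmwf_probs.get(k, 0.0)
        for k in keys
    }

blended = blend({19: 0.6, 20: 0.4}, {19: 0.4, 20: 0.6}, ecmwf_weight(1, 1.0, 1.0))
```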
The single best-estimate GFS forecast (as opposed to the ensemble). Used as a sanity check — it should be close to the ensemble mean. Large divergence between deterministic and ensemble mean is a flag.
Current observed temperature at the exact airport station the market resolves against. Beyond just displaying it, the app now compares the model's forecast for the current hour against the actual station reading to measure live bias: if the model reads 2° warmer than the station right now, it is running warm today, in this airmass. For same-day and next-day markets this bias (damped to half its measured size) is subtracted from the ensemble before computing the edge — a today-specific correction independent of the long-run city-bias tracker. Shown as the "Live model bias" chip; "applied" means it is feeding the edge, "info only" means the market is too far ahead for live conditions to matter. Requires a free CheckWX API key in your Cloudflare environment variables.
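The live-bias step might look like the sketch below. The function names are hypothetical, and "same-day and next-day" is read here as `days_ahead` of 0 or 1.

```python
def live_bias_correction(model_now_c, station_now_c, days_ahead, damping=0.5):
    """Return (measured bias, correction to subtract, whether it is applied)."""
    bias = model_now_c - station_now_c       # positive = model running warm
    applied = days_ahead <= 1                # same-day and next-day markets only
    correction = damping * bias if applied else 0.0
    return bias, correction, applied

def correct_members(member_highs_c, correction):
    """Shift every ensemble member by the damped bias before bucketing."""
    return [t - correction for t in member_highs_c]

# Model reads 21.0°C for the current hour, the station reports 19.0°C.
bias, correction, applied = live_bias_correction(21.0, 19.0, days_ahead=1)
shifted = correct_members([20.0, 21.5], correction)
```

In this example the model is 2° warm, so 1° is subtracted from each member; for a market three days out the same call would return "info only" behaviour with a zero correction.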
The 80/20 answer: set one alarm for 07:05 BST. That single session captures the majority of available edge.
GFS 00Z and ECMWF 00Z both available ~06:00 BST. Both models fresh simultaneously. Markets dormant all night, so the anchoring gap is at its widest. European markets not yet repriced. US traders asleep. Run briefing, click Log all BET NOW, done in 15 minutes.
GFS 12Z and ECMWF 12Z both available ~18:00 BST. Good for US markets — evening repricing often incomplete. Best window for Asian cities.
GFS 18Z available ~midnight BST. ECMWF stale. BET NOW fires on GFS alone if spread ≤1.5° and edge ≥12pp — shown with blue GFS only badge. Do not set an alarm for this.
Both models fresh. Maximum confidence. BET NOW fires when edge ≥10pp, spread tight, members sufficient. Standard case at 07:05 and 19:00 BST.
GFS fresh, ECMWF stale. BET NOW fires only if GFS spread ≤1.5° AND edge ≥12pp. Applies at the midnight window.
GFS stale. Always WAIT regardless of ECMWF freshness. GFS ensemble is the primary signal.
GFS runs 00Z/06Z/12Z/18Z + ~5h lag. ECMWF runs 00Z/12Z + ~5h lag. In BST (UTC+1): GFS available ~06:00/12:00/18:00/00:00. ECMWF ~06:00/18:00.
Each row is the single best opportunity in that market — the bucket with the largest model-vs-market divergence.
YES — model thinks this bucket is more likely than market implies. Buy YES shares. NO — market has overpriced this bucket. Buy NO shares.
Fraction of GFS ensemble members landing in this bucket after rounding to whole degrees. 14 of 31 = 45%.
Current YES price as a percentage. Edge figures are overstated by roughly 2-4pp due to fees and spread. Never bet on edges below 8pp gross.
Model% minus Mkt%. Green (+) = underpriced, bet YES. Red (−) = overpriced, bet NO.
Quarter-Kelly from your bankroll, scaled by lead time and confidence stars. Always treat Kelly as a ceiling, not a target.
Standard deviation of the 31 GFS members. ±0.8° = confident. ±2.5° = uncertain — consider halving Kelly.
★★★ all conditions favourable. ★★☆ one condition marginal. ★☆☆ multiple conditions weak — speculative only.
BET NOW. Edge ≥10pp · GFS fresh · spread acceptable · members sufficient.
BET NOW GFS only. Tier 2: GFS fresh, ECMWF stale, spread ≤1.5° and edge ≥12pp.
WAIT. Edge exists but one or more conditions unmet. Check again after the next model run.
PASS. Edge below threshold. Not worth trading after fees.
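The badge logic can be sketched as a single function. Several values here are assumptions: the text does not define "spread acceptable" or "members sufficient", so `max_spread_c` and `min_members` are placeholders, and `pass_below_pp` assumes PASS means a gross edge under 8pp.

```python
def classify_signal(edge_pp, gfs_fresh, ecmwf_fresh, spread_c, members,
                    max_spread_c=2.5, min_members=25, pass_below_pp=8.0):
    """Map one market's best-bucket stats to a badge (edge_pp is absolute)."""
    if edge_pp < pass_below_pp:
        return "PASS"                      # not worth trading after fees
    if not gfs_fresh:
        return "WAIT"                      # GFS is the primary signal
    enough_members = members >= min_members
    if ecmwf_fresh:
        ok = edge_pp >= 10.0 and spread_c <= max_spread_c and enough_members
        return "BET NOW" if ok else "WAIT"
    # Tier 2: GFS fresh, ECMWF stale, so the thresholds tighten.
    ok = edge_pp >= 12.0 and spread_c <= 1.5 and enough_members
    return "BET NOW (GFS only)" if ok else "WAIT"
```

For example, an 11pp edge with both models fresh returns `"BET NOW"`, while the same edge with ECMWF stale returns `"WAIT"` because it misses the 12pp Tier 2 bar.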
The people pricing these markets are not irrational. They are human. Human brains run decision-making shortcuts that create predictable, systematic, exploitable errors.
System 1 (fast, intuitive) prices the market. System 2 (slow, deliberate) runs the 31-member ensemble calculation. System 1 can't do that. It will never do that. This structural gap is the permanent, non-arbitrageable core of your edge.
Markets built on human intuition systematically underprice tail events. The tail buckets in temperature markets face a double discount: statistically underpriced (Taleb) and psychologically avoided (Kahneman).
Anchoring. The first prices set on a market are highly sticky. The largest edges appear in the 1–2 hours after model updates, before the market has repriced.
Availability bias. After a cold spell, cold feels probable. The ensemble has no memory of last week. After unusual weather, the opposite tail is systematically underpriced.
Overconfidence — the rule for you. Never override the model based on personal weather intuition. The moment you do, you have become the market you are trying to beat.
The bread and butter. By 07:05 BST each day the overnight GFS run has landed, while the market price was set by humans yesterday. The gap between them is your edge, and it closes within 1–2 hours as traders reprice. The majority of your BET NOW rows will be Type A.
After unusual weather — a heatwave, cold snap — the market overweights continuation. When the model starts showing reversion the market hasn't priced, the opposite tail is systematically underpriced. Not auto-detected yet; spot manually after any sustained unusual weather run.
A tail bucket trading at very low odds (≤10%) where the ensemble shows meaningful support. Small stake, high payout. Auto-detected when market odds ≤10% and edge ≥8pp. Individually speculative — collectively exploiting a structural inefficiency that will never close.
Where GFS model, ERA5 historical base rate, and edge all point in the same direction. Two independent signals disagreeing with the market. The D tag fires in the Analyse tab when Hist%, Model%, and edge all agree direction and edge ≥8pp.
The Kelly criterion answers: given an edge, what fraction of bankroll should you stake to maximise long-run growth without risking ruin?
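f = (p×b − q) / b, where p = model probability, q = 1 − p, and b = net odds (decimal odds minus 1, i.e. the profit per unit staked on a win). For a contract bought at price m, b = (1 − m) / m, so the formula reduces to f = (p − m) / (1 − m). The result f is the optimal fraction of bankroll.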
Continuous lead-time decay. max(0.25, 1 − 0.12×(daysAhead−1)). Day 1: 100%. Day 3: 76%. Day 5: 52%. Day 7: 28%.
Divergence reducer. When GFS and ECMWF disagree: 0–1°: 100%, 1–2°: 85%, 2–3°: 70%, 3–4°: 50%, >4°: 25%.
Star multiplier. ★★★ = 100%, ★★☆ = 50%, ★☆☆ = 25% of scaled Kelly.
5% bankroll cap. Hard ceiling per trade regardless of formula output.
25% portfolio cap. Across all open paper trades combined, total stake is capped at 25% of bankroll. Logging is blocked once that ceiling is reached, so several simultaneous BET NOW rows can't quietly stack into a large fraction of the bankroll.
Selection haircut (single-bucket). The single largest-edge bucket is the one most likely inflated by estimation noise (winner's curse). Until the multi-bucket engine ships, the stake on the single best bucket is sized on 80% of the measured edge — the displayed edge stays honest, only the stake is trimmed.
Live odds recheck. When you log a paper trade, the current Polymarket price is re-fetched. If the edge has closed below 8pp or flipped, the trade is not logged and you're told the price moved.
The honest caveat. All multipliers are derived from judgment, not empirical calibration. Until you have 200+ resolved trades and Brier score below 0.20, start real-money bets at half the displayed amount.
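The sizing chain can be sketched as below. It covers quarter-Kelly, the three multipliers, and the per-trade cap only; the portfolio cap, selection haircut, and odds recheck are left out. The order in which the multipliers are applied and the treatment of band edges as inclusive upper bounds are assumptions, and `price` is the cost of the side being bought (for a NO bet, 1 minus the YES price).

```python
def kelly_fraction(p, price):
    """Full-Kelly fraction for a binary contract bought at `price` (0 to 1)."""
    b = (1.0 - price) / price              # net profit per unit staked on a win
    return max(0.0, (p * b - (1.0 - p)) / b)

def lead_time_multiplier(days_ahead):
    return max(0.25, 1.0 - 0.12 * (days_ahead - 1))

def divergence_multiplier(divergence_c):
    # GFS vs ECMWF disagreement in degrees; upper edges inclusive (assumption).
    for upper, mult in ((1.0, 1.00), (2.0, 0.85), (3.0, 0.70), (4.0, 0.50)):
        if divergence_c <= upper:
            return mult
    return 0.25

STAR_MULTIPLIER = {3: 1.00, 2: 0.50, 1: 0.25}

def stake_fraction(p, price, days_ahead, divergence_c, stars, cap=0.05):
    """Fraction of bankroll to stake after all adjustments and the hard cap."""
    f = 0.25 * kelly_fraction(p, price)    # quarter-Kelly
    f *= lead_time_multiplier(days_ahead)
    f *= divergence_multiplier(divergence_c)
    f *= STAR_MULTIPLIER[stars]
    return min(f, cap)
```

With a model probability of 45% against a 30% market price, full Kelly is about 21% of bankroll and quarter-Kelly about 5.4%, so a day-1, three-star, low-divergence trade hits the 5% cap. The same trade at day 3 with two stars and 1.5° of divergence comes out at roughly 1.7%.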
The briefing currently bets only the single bucket with the largest edge in each market. That leaves value on the table and introduces a subtle bias. This section documents the planned fix: staking multiple buckets in the same market as one jointly-sized portfolio.
Picking only the single max-edge bucket systematically selects the bucket whose edge is most inflated by estimation noise. The largest measured edge is disproportionately likely to be the one where our probability estimate is too high by chance. Betting only that bucket concentrates the position on the noisiest signal in the market.
The algorithm. Rank candidate buckets by model/market probability ratio. Add buckets to the portfolio one at a time, highest ratio first, while the marginal contribution to expected log-wealth stays positive. Stop when the next bucket would reduce it. Then solve the stakes jointly across the selected set rather than sizing each in isolation.
YES-only for v1. The first version stakes YES legs exclusively. This keeps the maths clean and unit-testable with no double-counting. A NO position overlaps YES-on-the-complement of its bucket, so mixing NO legs in requires a netting layer to avoid counting the same outcome twice — that's deferred to v1.1.
Guardrails. Per-leg fees still apply, so each bucket must individually clear the edge threshold before it can join the portfolio. Total stakes are capped so that even an un-bet bucket resolving against the whole portfolio cannot breach the 5% bankroll limit on the market.
Status. Documented here for sign-off only. The engine itself — algorithm, test cases, and schema — is not built yet and nothing in the live app sizes multi-bucket positions today.
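For sign-off discussion, the ranking-and-stopping rule can be illustrated with a closed form for Kelly staking on mutually exclusive outcomes. This is a swapped-in textbook technique, not the planned engine: it ignores fees, returns full-Kelly fractions that would still need quarter-Kelly scaling and the caps above, and `joint_kelly_yes`, `min_edge`, and the `reserve` variable are illustrative names.

```python
def joint_kelly_yes(model_probs, market_prices, min_edge=0.08):
    """Full-Kelly YES stakes (fractions of bankroll) across one market's buckets."""
    # Guardrail: each bucket must individually clear the edge threshold.
    candidates = [
        i for i, (p, m) in enumerate(zip(model_probs, market_prices))
        if 0.0 < m < 1.0 and p - m >= min_edge
    ]
    # Rank by model/market probability ratio, highest first.
    candidates.sort(key=lambda i: model_probs[i] / market_prices[i], reverse=True)

    selected, sum_p, sum_m = [], 0.0, 0.0
    reserve = 1.0   # value of keeping cash relative to the staked buckets
    for i in candidates:
        p, m = model_probs[i], market_prices[i]
        # Stop once the next bucket no longer adds to expected log-wealth.
        if p / m <= reserve or sum_m + m >= 1.0:
            break
        selected.append(i)
        sum_p += p
        sum_m += m
        reserve = (1.0 - sum_p) / (1.0 - sum_m)

    # Stakes are solved jointly: each depends on the whole selected set.
    return {i: model_probs[i] - market_prices[i] * reserve for i in selected}

stakes = joint_kelly_yes([0.50, 0.30, 0.20], [0.40, 0.20, 0.45])
```

In this made-up three-bucket market the first two buckets are selected and the third is skipped. With a single qualifying bucket the result collapses to the ordinary single-bet Kelly fraction, which makes the function easy to unit-test against the existing sizing.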
The reliability diagram answers: when the model says 40% probability, does it actually win 40% of the time? It appears in the tracker once you have 15 resolved trades.
Points above the diagonal — underconfident. Model predicts 40% but you win 55%. True edge is larger than calculated.
Points below the diagonal — overconfident. Model predicts 60% but you win 45%. Kelly stakes are too large. Reduce until the diagram corrects.
Mean squared error of probability forecasts. 0 = perfect. 0.25 = uninformative coin flip. A well-calibrated weather model achieves 0.15–0.20 for day-1 forecasts.
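Both calibration measures are simple to compute from resolved trades. A minimal sketch, assuming each trade is reduced to the probability assigned to the side that was bet and a 1/0 outcome for whether it won (the ten-bin layout is an assumption):

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def reliability_table(forecasts, outcomes, n_bins=10):
    """Rows of (mean forecast, observed win rate, count) per probability bin."""
    bins = {}
    for f, o in zip(forecasts, outcomes):
        idx = min(int(f * n_bins), n_bins - 1)
        bins.setdefault(idx, []).append((f, o))
    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_forecast = sum(f for f, _ in pairs) / len(pairs)
        win_rate = sum(o for _, o in pairs) / len(pairs)
        rows.append((mean_forecast, win_rate, len(pairs)))
    return rows

coin_flip = brier_score([0.5, 0.5], [1, 0])
rows = reliability_table([0.4] * 5, [1, 1, 0, 0, 0])
```

A row whose win rate exceeds its mean forecast sits above the diagonal (underconfident); a row whose win rate falls short sits below it (overconfident).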
Human traders anchor to what has happened recently and systematically underprice reversion to the long-run climatological mean. A week of hot weather makes the market overconfident that tomorrow will be hot too — not because of any rational model, but because recent events feel more likely. This is Kahneman's availability heuristic operating at scale and it is measurable: your ERA5 historical base rates reveal exactly when the market has drifted too far from the long-run mean. Type B bets (reversion plays) are the direct exploitation of this bias. International markets show this effect more strongly than US markets because fewer sophisticated operators are correcting for it.
The market cannot price tail weather events correctly because human traders are loss-averse and will not sell a 3% contract at fair value — the potential loss feels disproportionate relative to the premium received. This creates systematic cheap optionality on extreme weather outcomes. Type C bets (tail underpricing, market ≤10%) are a deliberate black swan strategy: you pay a small repeated cost waiting for a rare event that pays 10-30× stake. The key insight from Taleb is that tail wins are not anomalies to be stripped from P&L — they are the point. A strategy that is breakeven ex-tails but strongly positive including tails has genuine structural edge, not luck. The three P&L figures on the dashboard are designed to make this visible.
Dixon and Coles found that football bookmakers systematically mispriced low-scoring games because naive Poisson models ignored score correlation. The weather equivalent is not the same distributional anomaly but the same underlying mechanism: a structural feature of how markets price outcomes that a simple model gets wrong. Here, the structural feature is the market overweighting recent conditions (the Kahneman effect) combined with an inability to price tails (the Taleb effect). Both are persistent because they arise from human cognitive architecture, not from a correctable data or model error. They will not be arbitraged away quickly by competing bots — especially on international markets where bot coverage is thin.
Simply the percentage of bets that won. 41% means 25 wins from 61 resolved trades. The target is 55%+ — not because you need to win more than half of bets, but because the markets you should be betting on are priced below 50%, so winning 55% means you are consistently finding genuine mispricing rather than just getting lucky. Below 50% over many trades is a clear signal to review the model or filtering. The cumulative trend chart matters more than the snapshot — a rising line from left to right means quality is improving as you accumulate experience and tighten the filters.
Every time you place a bet, the model says "I think the true probability of this outcome is X%, but the market is only offering Y%." The gap is your edge. Model EV multiplies that edge by your stake for every trade and adds them all up — regardless of whether the bet actually won or lost. So if you bet $25 with a 20pp edge, that trade contributes $5 to model EV. Do it 100 times and model EV is +$500, even if you happened to lose 60 of them through bad luck. It is the purest measure of whether the model is genuinely finding mispricing in a consistent direction. Variance (luck) cannot touch it. If model EV is strongly positive and growing steadily, the edges are real and profit will follow with enough volume. If model EV is flat or declining, the model is not finding genuine disagreements — it is just noise.
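The arithmetic, following the definition above (stake times edge, summed over every trade regardless of result), as a small sketch with a hypothetical `(stake, edge_pp)` tuple per trade:

```python
def model_ev(trades):
    """Sum of stake x edge over all trades; edge is in percentage points."""
    return sum(stake * edge_pp / 100 for stake, edge_pp in trades)

one_trade = model_ev([(25, 20)])          # $25 staked at a 20pp edge
hundred_trades = model_ev([(25, 20)] * 100)
```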
Total P&L incl. tails — every win counted, including rare outsized payouts where the market massively underpriced a tail event and it came in. This is the Taleb number: black swan wins are not anomalies to be stripped out, they are the point. The tail win count (🦢) tells you how many are in the figure.
Ex-tails P&L — strips out wins paying more than 5× stake. This is the conservative baseline: does the strategy make money without relying on lightning striking? If this is deeply negative over many trades, the edge is coming from variance rather than skill.
Model EV — see above. This is the most important number at low trade counts. A positive and growing model EV with negative ex-tail P&L simply means you need more volume — the edges are there but variance is dominating.
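The two realised P&L figures can be derived from the same trade list. In this sketch a tail win is read as a trade whose profit exceeds 5× its stake, and "strips out" is read as dropping that trade's profit from the total; both readings are assumptions, as are the dict keys.

```python
def pnl_summary(trades, tail_multiple=5.0):
    """Total P&L, P&L with tail wins removed, and the tail-win count."""
    total = sum(t["pnl"] for t in trades)
    tail_wins = [t for t in trades if t["pnl"] > tail_multiple * t["stake"]]
    ex_tails = total - sum(t["pnl"] for t in tail_wins)
    return {"total": total, "ex_tails": ex_tails, "tail_wins": len(tail_wins)}

summary = pnl_summary([
    {"stake": 10, "pnl": 90},    # tail win: 9x stake
    {"stake": 10, "pnl": -10},   # loss
    {"stake": 10, "pnl": 8},     # ordinary win
])
```

Here the total is +88 but the ex-tails figure is −2, the pattern the dashboard is meant to expose.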
Three charts running from trade 1 to now. Each has two lines: a solid bright line showing the cumulative figure (all trades so far), and a faint dashed line showing the rolling 20-trade window (recent performance). The cumulative line is the main signal — it starts noisy at low trade counts and gradually stabilises. The rolling line shows whether recent trades are performing differently from the long-run average: if it drops below the cumulative line, you have had a bad recent patch; if it rises above, recent trades are outperforming. Reference lines on each chart show targets: 55% hit rate, 0.20 Brier (good), 0.25 Brier (coin flip), and zero P&L. Available at 20+ resolved trades.
The average absolute distance between model probability and market price, across all trades. Displayed as a positive number regardless of YES/NO direction — a NO bet at −39pp and a YES bet at +39pp both represent the same amount of disagreement with the market. The direction split (e.g. 40% YES · 60% NO) shows whether your book is balanced. A heavy NO skew (flagged in amber above 70%) means most of your bets are against highly-priced markets — lower payouts per win, and more sensitive to the tail filter.
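A short sketch of this statistic, assuming edges are stored as signed percentage points (positive for YES bets, negative for NO bets):

```python
def edge_summary(edges_pp, amber_above=0.70):
    """Mean absolute edge plus the share of NO bets and the amber flag."""
    n = len(edges_pp)
    mean_abs = sum(abs(e) for e in edges_pp) / n
    no_share = sum(1 for e in edges_pp if e < 0) / n
    return {"mean_abs_edge_pp": mean_abs, "no_share": no_share,
            "amber": no_share > amber_above}

balanced = edge_summary([39, -39])
skewed = edge_summary([-10, -12, -15, 20])
```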
The city analysis panel in the tracker diagnoses whether GFS has a systematic warm or cold bias for each location. Cards appear at 5+ resolved trades per city.
Confidence gates. Below 10 trades: no diagnosis. 10–19: tentative. 20–29: emerging, 10% stake reduction. 30+: 25% reduction.
Non-stationarity. GFS bias varies by season and is reset by model upgrades. Treat city bias as a rolling signal, not a fixed correction.
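The confidence gates map directly to a lookup. In this sketch the "confirmed" label for 30+ trades is an assumed name, and the returned reduction is just the percentage quoted above; how it is applied to the stake is not specified here.

```python
def city_bias_gate(n_trades):
    """Return (diagnosis label, stake reduction) for a city's resolved-trade count."""
    if n_trades < 10:
        return ("no diagnosis", 0.0)
    if n_trades < 20:
        return ("tentative", 0.0)
    if n_trades < 30:
        return ("emerging", 0.10)
    return ("confirmed", 0.25)   # label assumed; the text gives only "30+"
```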
The Hist% column in the Analyse tab shows the historical frequency of each temperature bucket for that city and month, drawn from 10 years of ERA5 reanalysis data.
What it cannot do. Predict tomorrow. GFS is far better at that.
Recency weighting. The app applies linear decay weight to the 10-year ERA5 data. Most recent year gets weight 2.0, oldest year 0.2. This partially corrects for the warming trend cold bias.
Sample size note. ~21 observations per bucket — treat Hist% as having ±5-8pp uncertainty. A gap of 2-3pp is not meaningful. A gap of 15pp+ is.
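The recency weighting can be sketched as a weighted frequency. The inputs are assumed to be per-year counts ordered oldest to newest: days in the month that fell in the bucket, and total days observed.

```python
def recency_weights(n_years=10, newest=2.0, oldest=0.2):
    """Linear weights from `oldest` (index 0) up to `newest` (last index)."""
    if n_years == 1:
        return [newest]
    step = (newest - oldest) / (n_years - 1)
    return [oldest + step * i for i in range(n_years)]

def weighted_base_rate(hits_by_year, days_by_year):
    """Recency-weighted share of days landing in the bucket."""
    weights = recency_weights(len(hits_by_year))
    num = sum(w * h for w, h in zip(weights, hits_by_year))
    den = sum(w * d for w, d in zip(weights, days_by_year))
    return num / den

flat = weighted_base_rate([3] * 10, [30] * 10)
recent_only = weighted_base_rate([0] * 9 + [6], [30] * 10)
```

When the bucket occurs equally often every year the weighting changes nothing. When all occurrences fall in the most recent year, the weighted rate is noticeably higher than the unweighted 6/300.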
Members are not independent. Effective sample size for capturing true atmospheric uncertainty is considerably less than 31.
Edge overstated by ~2-4pp. We compare to the displayed market price, not the true breakeven price after fees.
Kelly multipliers are arbitrary. Not derived from empirical data. Reasonable starting points, nothing more.
Selection bias in hit rate. We only bet when edge exceeds a threshold. Hit rate is not an unbiased estimator of model accuracy.
One season is not calibration. 50 trades in June tells you almost nothing about winter.
To detect 55% hit rate vs 50% (moderate effect): 193 trades at 90% confidence.
For city-level bias direction: 100 trades per city.
For auto-betting go/no-go: 500 total trades, 50+ per major city, Brier score below 0.20.
Sequential trades during the same weather regime are not statistically independent. Ten trades during a June heatwave may have an effective sample size of 2-3. Real-world requirements are roughly double the thresholds above.
The foundation. Instead of taking a single raw GFS forecast at face value, the app takes the 31-member GFS ensemble and kernel-dresses it, smoothing the probability distribution to account for the fact that no model is perfectly precise. This produces better-calibrated probabilities than any single forecast. The core insight: Polymarket prices are set by humans using yesterday's intuition, while fresh model output is on hand by the 07:05 BST check. That gap is the edge.
GFS alone can be confidently wrong. Adding ECMWF (a different model from a different institution) gives a genuine second opinion. When the two models agree, confidence rises and the Kelly stake is left uncut. When they diverge by more than 1°, the stake is cut on a sliding scale, not because either is wrong, but because disagreement signals genuine uncertainty the market may have already priced. This is the divergence reducer in the Kelly formula.
Ten years of hourly historical weather data (ERA5 reanalysis) gives the true base rate for each temperature bucket, per city, per month. This surfaces two things: (1) Type D bets — where both the model and history agree against the market, the highest-confidence signal; and (2) the Type B signal introduced later — where history and the market are pulling in completely opposite directions, suggesting the market is overreacting to recent weather.
Having an edge is one thing; sizing the bet correctly is another. The Kelly criterion gives the mathematically optimal fraction of bankroll to stake. The app uses quarter-Kelly (25% of full Kelly) as a safety margin, then applies four further adjustments: decay for bets further ahead in time (model less reliable at day 3 vs day 1); the divergence reducer above; a star multiplier for overall signal quality; and a hard 5% bankroll cap per trade. Without correct sizing, even a genuine edge can blow up a bankroll through variance.
Before the briefing tab, every bet required manually loading each city and market. The briefing automates this: at 07:05 BST it scans all cities, all dates, finds every BET NOW signal, and ranks them by edge. The 07:05 timing is deliberate — GFS has just updated, Polymarket prices haven't yet responded. This is when the edge is widest. The briefing also introduced BET ALL NOW — log every qualifying signal in one click.
A specific problem emerged: when running on GFS alone (ECMWF unavailable), some signals were being generated for temperatures that were physically implausible for that city and month (for example, Miami in June showing cold). The sanity gate checks the GFS ensemble mean against the known seasonal range (from ERA5) for that city and month, and silently rejects any signal where the model sits outside that range. The gate fires only on Tier 2 (GFS-only) runs, to keep false-positive trades from being logged.
As trades accumulated, the need to measure performance properly became critical. The dashboard tracks three things that matter: hit rate (are you winning more than you lose?), Brier score (are your probabilities well-calibrated, not just right/wrong?), and P&L (are you actually making money?). The Brier benchmark bar contextualises the score — 0.25 is a coin flip, 0.20 is good, 0.15 is excellent. The reliability diagram shows whether your 60% predictions actually win 60% of the time.
Closing Line Value (CLV) is a concept from sports betting: if you back something at 40% odds and the market closes at 55%, the market itself validated your entry — you got in cheap before others spotted the same signal. Positive mean CLV over many trades is stronger evidence of genuine edge than win rate alone, because it measures whether the market moved in your direction after you entered. Negative CLV means the market consistently knows better than you at entry time.
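CLV per trade reduces to a signed price move. A minimal sketch, assuming prices are YES prices on a 0 to 1 scale and that a NO position benefits when the YES price falls:

```python
def clv_pp(entry_yes_price, closing_yes_price, direction):
    """Closing line value in percentage points, signed in the bettor's favour."""
    if direction not in ("YES", "NO"):
        raise ValueError("direction must be 'YES' or 'NO'")
    move = (closing_yes_price - entry_yes_price) * 100
    return move if direction == "YES" else -move

def mean_clv_pp(trades):
    """Average CLV over (entry, close, direction) tuples."""
    return sum(clv_pp(*t) for t in trades) / len(trades)

example = clv_pp(0.40, 0.55, "YES")   # backed at 40%, market closed at 55%
```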
A material bug was discovered: the P&L formula was using the YES market price to calculate winnings on NO bets. A NO bet on a bucket where YES trades at 10% costs 90¢ per share and returns only ~11% profit if it wins, but the code was treating it as paying about 10× the stake. This produced wildly inflated P&L figures that made the dashboard look profitable when it wasn't. Fixed by making the payout formula direction-aware: YES bets use the YES price, NO bets use (100 minus the YES price).
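A direction-aware payout can be sketched as follows, with prices on a 0 to 1 scale and fees ignored (the function name and arguments are illustrative, not the app's own):

```python
def trade_pnl(stake, yes_price, direction, yes_resolved_true):
    """Profit or loss on a binary trade, priced on the side actually bought."""
    if direction == "YES":
        entry, won = yes_price, yes_resolved_true
    elif direction == "NO":
        entry, won = 1.0 - yes_price, not yes_resolved_true
    else:
        raise ValueError("direction must be 'YES' or 'NO'")
    # Each share costs `entry` and pays 1.0 on a win.
    return stake * (1.0 - entry) / entry if won else -stake

no_win = trade_pnl(10, 0.10, "NO", yes_resolved_true=False)
yes_win = trade_pnl(10, 0.10, "YES", yes_resolved_true=True)
```

With YES at 10%, a winning $10 NO bet earns about $1.11, while a winning $10 YES bet earns about $90. Pricing the NO bet off the YES price would have credited it with the larger figure.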
Previously the bet type (A/B/C/D) was only stored as text inside a notes string, making it fragile to read back. It's now a proper data field on every trade. Type B (reversion play) was previously manual-only — no signal. A data-driven trigger was added: when ERA5 historical base rate strongly disagrees with the market (one says >70%, the other says <40%), this is the statistical fingerprint of the market overweighting a recent unusual weather run. The model agreeing with ERA5 makes it actionable.
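The Type B trigger might be expressed as below. The 70%/40% thresholds come from the text; reading "the model agreeing with ERA5" as the model sitting on the same side of the market price as the historical rate is an assumption.

```python
def is_type_b(hist_p, market_p, model_p, high=0.70, low=0.40):
    """True when ERA5 and the market strongly disagree and the model sides with ERA5."""
    strong_disagreement = (
        (hist_p > high and market_p < low) or (market_p > high and hist_p < low)
    )
    if not strong_disagreement:
        return False
    # Model and history must both point the same way relative to the market.
    return (model_p - market_p) * (hist_p - market_p) > 0
```

For instance, a 75% historical rate against a 30% market price triggers the tag when the model says 60%, but not when the model says 20%.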
Automatic bias compensation. Once city bias is confirmed at 30+ trades, Kelly auto-reduces on the biased direction.
Auto-betting infrastructure. Python backend, ★★★ only, Tier 1 only, daily loss limit 5%, kill switch. Not until 500+ calibrated trades.
Bootstrap confidence intervals. Error bars on all dashboard statistics.
Precipitation markets. Infrastructure exists, awaiting Polymarket daily markets.