POLYEDGE

System Active
Overview
Calibration
Trade History
Positions
System Logs
Claude Supervisor
Docs

Unrealized P&L

$--
across -- positions

Capital Deployed

$--
Exposure: --

Model Status

--
ΔR² --

Edge Quality

--
avg decay: --

Top Positions

Market P&L
Loading...

OOS Validation

--
Significance: --

Bias Detection by Event Type

Current Opportunities

Market Type Market Price True Prob Edge Kelly Size Action

Recent System Activity

Loading system activity...

Total Markets

--
Active: --

Resolved Markets

--
Snapshots: --

Training Data

--
--
Reactive: -- Tactical: -- Strategic: -- Reflective: --

Actions

Stage 1 - Fundamentals

R² = --
Brier Score: --

Stage 2 - Bias Correction

R² = --
Brier Score: --

ΔR² Improvement

--
LR Test: --

OOS McFadden ΔR²

--
Walk-forward backtest

OOS Brier Improvement

--
vs market prices

Test Samples

--
-- folds

In-Sample vs OOS

--
ΔR² gap (lower = less overfit)

Expected Calibration Error (ECE)

--
Model ECE (lower is better)
Market ECE: --

Tail Calibration

Loading...

Edge Decay Signal

--
--

Reliability Diagram

Model Performance Over Time

Walk-Forward Backtest Folds

Fold Train End Test Size Brier S2 Brier Mkt Improvement McFadden ΔR²

OOS ΔR² Trend Over Time

Edge Decay by Trade

Trade Side Original Edge Current Edge Decay Ratio Hours Held Direction

Trade History

Time Market Type Side Price Model Prob Edge Kelly P&L Cumulative P&L Status

Realized P&L

--
-- trades resolved

Win Rate

--
W: -- / L: --

Unrealized P&L

--
-- open positions

Capital Deployed

--
ROI: --

Edge Accuracy

--
MAE: --

Best / Worst

--
--

Cumulative P&L Over Time

P&L by Event Type

Predicted vs Realized Edge

Trade P&L Distribution

Resolved Trades - Feedback Loop

ID Market Type Side Entry Size Predicted Edge Realized Edge Edge Error P&L Resolved
Loading...

Open Limit Orders

Loading limit orders...

Position Management

Loading position data...

Resolved Trades

--
win rate: --

Edge Hit Rate

--
predicted sign = realized sign

Edge Correlation

--
predicted vs realized edge

Mean Edge Error

--
bias: --

Edge Accuracy by Event Type

Type Trades Hit Rate Correlation MAE P&L
No resolved trades yet

Predicted vs Realized Edge

Shows whether predicted edges match actual outcomes
Waiting for resolved trades...

Cycle History

Time Markets Opportunities Trades Unrealized P&L ΔR² Positions Bankroll
Loading...

Recent Alerts

Loading alerts...

Supervisor Status

Checking...
Model: --

Model

--

System Confidence

--
Run supervisor review to get score

Ask the Supervisor

Ask Claude any question about your system's strategy, risk management, or improvements.

Foundational Basics

Before diving into how PolyEdge works, here are the key concepts explained in plain language.

Prediction Market Probability
The price IS the crowd's guess at the chance of something happening.
On Polymarket, if "Will X happen?" trades at $0.70, the crowd thinks there's a 70% chance. If the event actually happens, the contract pays $1.00. If not, it pays $0.00. Our job is to find cases where the crowd's guess is wrong.
ΔR² (Delta R-Squared)
How much better our model is compared to just using market prices alone.
Think of it like a test score improvement. If the market price alone gets a 75% on predicting outcomes, and our model gets 77%, the ΔR² is that 2% improvement. Even small improvements matter because they compound over hundreds of trades. A ΔR² above 0% means our model adds real value.
A ΔR² of 2% might sound small, but across 1,000 trades it can mean the difference between losing money and consistent profits.
p-value
How confident we are that our results aren't just luck.
The p-value answers: "If our model had zero skill, what's the chance we'd see results this good by random luck?" A p-value of 0.01 means there's only a 1% chance the results are just luck. We typically need p < 0.05 (less than 5% chance of luck) to trust the model.
Flipping a coin and getting 7 heads in a row has a p-value of about 0.008 — unlikely enough that you'd suspect the coin is rigged. That's the same logic we apply to model performance.
Brier Score
A report card for probability predictions — lower is better.
The Brier score measures how close our probability predictions are to what actually happens. If we say there's a 90% chance of something and it happens, that's a good prediction. If we say 90% and it doesn't happen, that's a bad one. The score ranges from 0 (perfect) to 1 (terrible). A weather forecast that's always right gets 0. One that always says 50% gets 0.25.
Our model's Brier score: ~0.06 (very accurate). The raw market: ~0.08. That gap is where our edge comes from.
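The definition above boils down to a mean squared error between predicted probabilities and 0/1 outcomes. A minimal sketch in Python (the function name and sample numbers are illustrative, not PolyEdge's actual code):

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# A forecaster who always says 50% scores 0.25 no matter what happens.
always_half = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 1])  # 0.25
# Confident, mostly-correct forecasts score near 0.
sharp = brier_score([0.9, 0.1, 0.8], [1, 0, 1])  # 0.02
```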
McFadden R²
How well our model explains why outcomes happen.
Similar to a regular R², but designed for yes/no outcomes like prediction markets. Values above 0.2 are considered good, above 0.4 is excellent. Our two-stage model typically achieves R² of 0.70–0.80, meaning it captures most of the factors that drive outcomes.
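McFadden's pseudo-R² compares the model's log-likelihood against a null model that always predicts the base rate. A sketch with illustrative inputs (not fitted model output):

```python
import math

def log_likelihood(probs, outcomes):
    """Bernoulli log-likelihood of observed 0/1 outcomes under predicted probs."""
    return sum(o * math.log(p) + (1 - o) * math.log(1 - p)
               for p, o in zip(probs, outcomes))

def mcfadden_r2(model_probs, outcomes):
    """1 - lnL(model) / lnL(null), where the null model predicts the base rate."""
    base_rate = sum(outcomes) / len(outcomes)
    ll_model = log_likelihood(model_probs, outcomes)
    ll_null = log_likelihood([base_rate] * len(outcomes), outcomes)
    return 1 - ll_model / ll_null

# Near-perfect predictions score close to 1; base-rate predictions score 0.
good = mcfadden_r2([0.99, 0.01, 0.99], [1, 0, 1])
```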
Kelly Criterion
A formula that tells you the optimal amount to bet based on your edge.
If you have a 60% chance of winning an even-money coin flip, Kelly says bet 20% of your bankroll. Bet too much and a losing streak wipes you out. Bet too little and you leave money on the table. We use "fractional Kelly" (15% of the full Kelly amount) to be extra conservative — giving up some profit for much more safety.
Full Kelly says "bet $100." We bet $15 instead. We grow slower but survive the inevitable bad streaks.
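For a binary contract bought at `price` that pays $1 if the event happens, full Kelly works out to (p − price) / (1 − price). A sketch of fractional sizing using the 15% fraction mentioned above (the function name is ours, not the system's):

```python
def kelly_fraction(true_prob, price, fraction=0.15):
    """Fractional Kelly stake (as a share of bankroll) for a binary
    contract bought at `price` that pays $1 if the event happens.
    Full Kelly for this payoff is (p - price) / (1 - price)."""
    full_kelly = (true_prob - price) / (1 - price)
    return max(0.0, full_kelly * fraction)

# Model says 70%, market says 60 cents: full Kelly is 25% of bankroll,
# fractional (15%) Kelly is 3.75%.
stake = kelly_fraction(0.70, 0.60)  # 0.0375
```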
Out-of-Sample (OOS) Testing
Testing the model on data it has never seen before, to prove it works for real.
It's easy to build a model that "predicts" the past perfectly — that's like memorizing the answers to a test you've already taken. OOS testing is like taking a brand new exam. We train the model on old data, then test it on newer data it's never touched. Only if it passes OOS testing do we allow it to trade.
Expected Calibration Error (ECE)
How honest our probability predictions are — when we say 70%, does it really happen 70% of the time?
ECE groups all predictions into buckets (e.g., all the times we said 60–70%) and checks if the actual rate matches. An ECE near 0 means perfectly calibrated. An ECE of 0.05 means predictions are off by about 5 percentage points on average.
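The bucketing described above can be sketched as follows (bin count and function name are illustrative):

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Bucket predictions by predicted probability, compare each bucket's
    mean prediction to its observed frequency, and average the gaps
    weighted by bucket size."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))
    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_p = sum(p for p, _ in bucket) / len(bucket)
        freq = sum(o for _, o in bucket) / len(bucket)
        ece += len(bucket) / n * abs(avg_p - freq)
    return ece
```

A model that says 25% and is right a quarter of the time scores 0; one that says 90% for events that never happen scores 0.9.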

How the Model Finds Edge

PolyEdge uses a two-stage approach inspired by academic research on horse racing markets (Benter 1994). Here's how each piece fits together:

1
Data Collection

Every hour, the system pulls data from 200+ active Polymarket markets and 500+ resolved markets. It records prices, order book depth, volume, and timing information. Resolved markets (where we know the outcome) become our training data.

2
Stage 1: Baseline Model

First, we build a model using just the market price and basic features (volume, liquidity, time to close). This represents what the crowd already knows. Think of it as "the market is mostly right, but how right?"

3
Stage 2: L2-Regularized Bias Correction

This is where the edge comes from. Stage 2 uses L2-regularized logistic regression (sklearn LogisticRegression, C=0.1) with 6 core features: stage1_logit, price_stage1_diff, depth_imbalance, price_uncertainty, log_time, and flb_correction. The L2 penalty prevents overfitting by shrinking coefficients toward zero, and the reduced feature set (down from 13) avoids multicollinearity. The ΔR² between Stage 1 and Stage 2 tells us exactly how much value these bias corrections add.
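A minimal sketch of the Stage 2 setup as described, using sklearn with C=0.1 and six feature columns; the synthetic data here merely stands in for the real feature matrix:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data standing in for the 6 Stage 2 features:
# stage1_logit, price_stage1_diff, depth_imbalance,
# price_uncertainty, log_time, flb_correction
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 6))
logits = 1.5 * X[:, 0] + 0.3 * X[:, 5]  # outcome driven mostly by stage1_logit
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# C is the INVERSE regularization strength, so C=0.1 is a strong L2 penalty
# that shrinks coefficients toward zero.
stage2 = LogisticRegression(penalty="l2", C=0.1, max_iter=1000).fit(X, y)
probs = stage2.predict_proba(X)[:, 1]
```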

4
Prediction Shrinkage

Final predictions blend 60% model probability + 40% market price (shrinkage_factor=0.6). This conservative blending acknowledges that markets are mostly right and prevents the model from making extreme predictions on thin evidence. Shrinkage improves out-of-sample stability.
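The blend itself is a one-liner; a sketch with the stated shrinkage_factor=0.6 (function name is ours):

```python
def shrink_prediction(p_model, p_market, shrinkage_factor=0.6):
    """Blend the model probability toward the market price."""
    return shrinkage_factor * p_model + (1 - shrinkage_factor) * p_market

# Model says 80%, market says 60 cents: final prediction is 72%.
p = shrink_prediction(0.80, 0.60)  # 0.72
```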

5
Walk-Forward Validation

We don't just test once. We use "walk-forward" testing with 5 expanding windows: train on months 1–2, test on month 3. Then train on months 1–3, test on month 4. And so on. Expanding-window temporal splits prevent data leakage by ensuring the model never sees future data during training. This simulates real trading conditions where you only know the past.
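The expanding-window splits can be sketched as an index generator (chunk sizes are illustrative; the real system splits chronologically by month):

```python
def expanding_window_folds(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) pairs with a growing training window
    and the next chronological chunk as the test set, so the model
    never sees future data during training."""
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        test_end = min((k + 1) * fold_size, n_samples)
        yield list(range(train_end)), list(range(train_end, test_end))

folds = list(expanding_window_folds(600, n_folds=5))
# Fold 1 trains on samples 0-99 and tests on 100-199;
# fold 5 trains on 0-499 and tests on 500-599.
```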

6
OOS Trading Gate

The system will not trade unless three conditions are met: (1) OOS ΔR² > 0, meaning the model beats market prices on unseen data; (2) OOS Brier improvement > 0, meaning predictions are more accurate; (3) at least 20 test observations, so the results are statistically meaningful. All three must pass.
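The three-condition gate maps directly onto a boolean check (the function name is ours, not the codebase's):

```python
def oos_gate(delta_r2, brier_improvement, n_test, min_test=20):
    """All three OOS conditions must hold before any trading is allowed."""
    return delta_r2 > 0 and brier_improvement > 0 and n_test >= min_test

ok = oos_gate(0.011, 0.002, 3620)   # True: model validated on unseen data
blocked = oos_gate(0.011, 0.002, 12)  # False: too few test observations
```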

7
Edge Detection & Sizing

For each live market, we compare our model's probability to the market price. The difference is the "edge." We then use Kelly Criterion (at 15% strength) to size positions proportionally to the edge, and apply risk limits to prevent overconcentration.
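Putting the pieces together, one plausible sketch of edge detection and sizing, including the 20% edge cap described under Risk Controls (a real system would also size NO-side trades; this only handles YES):

```python
def size_position(model_prob, market_price, bankroll,
                  kelly_fraction=0.15, edge_cap=0.20):
    """Edge = model probability minus market price, capped at 20%;
    stake = fractional Kelly on the capped edge."""
    edge = model_prob - market_price
    edge = max(-edge_cap, min(edge_cap, edge))
    if edge <= 0:
        return 0.0  # sketch only: NO-side sizing omitted
    full_kelly = edge / (1 - market_price)
    return bankroll * full_kelly * kelly_fraction

# 10-point edge on a 60-cent market with a $1,000 bankroll -> $37.50 stake.
stake = size_position(0.70, 0.60, 1000)  # 37.5
```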

The full cycle runs automatically every hour: ingest data, retrain the model, validate OOS, detect biases, scan for edges, size positions, and run Claude's strategic review.

Bias Types Explained

Markets are mostly efficient, but crowds make systematic errors. These errors are our opportunity.

Favorite-Longshot Bias (FLB)
People overpay for longshots and underpay for favorites.
Just like lottery tickets, people are drawn to low-probability, high-payout bets. A contract trading at $0.05 (5% chance) might really only have a 2% chance. Meanwhile, contracts at $0.95 (95% chance) might actually be worth $0.97. This creates a systematic tilt we can exploit. When b1 > 1.5, this bias is strong.
If a market says "Event X" has a 5% chance and the true probability is 2%, that's a 3-cent overvaluation on every contract — the FLB at work.
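A standard way to model the FLB (in the Benter tradition) is a logit-scale calibration, logit(p_true) = b0 + b1 · logit(p_market), where b1 > 1 pushes longshots down and favorites up. A sketch with illustrative, unfitted coefficients:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def flb_corrected_prob(market_price, b0=0.0, b1=1.8):
    """Benter-style calibration: logit(p_true) = b0 + b1 * logit(p_market).
    b0 and b1 here are illustrative, not fitted values."""
    return sigmoid(b0 + b1 * logit(market_price))

longshot = flb_corrected_prob(0.05)  # a 5-cent longshot is worth less than 5%
favorite = flb_corrected_prob(0.95)  # a 95-cent favorite is worth more than 95%
```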
Optimistic Bias
People tend to think good things are more likely than they really are.
When b0 (the intercept) is significantly different from 0, it means the crowd systematically overestimates or underestimates probabilities across the board, regardless of whether the event is likely or unlikely. A positive b0 means general overconfidence. We detect this per event type, so political markets might show different optimism than crypto markets.
Event-Specific Detection
Different types of events have different biases — we track each one separately.
Crypto markets behave differently from political markets. Sports bettors have different biases than economics watchers. PolyEdge classifies every market into a category (political, crypto, sports, entertainment, economics, geopolitical, science, other) and estimates biases for each type independently.
Bayesian Shrinkage
When we don't have enough data for a category, we blend its estimate with the overall average to stay safe.
Say we only have 15 resolved sports markets — too few to be confident. Bayesian shrinkage automatically blends the sports-specific bias estimate with the overall (all-market) estimate. The fewer observations we have, the more we lean on the overall average. With 1,000+ observations, we trust the category-specific estimate almost entirely. This prevents us from overreacting to small, noisy samples.
Sports bias with only 15 samples: 95% overall average + 5% sports-specific. Sports bias with 500 samples: 4% overall average + 96% sports-specific.
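The blending can be sketched as a sample-size-dependent weight; the prior_strength constant below is illustrative and will not reproduce the exact percentages above:

```python
def shrunk_bias(category_estimate, overall_estimate, n_category,
                prior_strength=300):
    """Weight on the category-specific estimate grows with sample size:
    w = n / (n + prior_strength).  prior_strength is an assumption."""
    w = n_category / (n_category + prior_strength)
    return w * category_estimate + (1 - w) * overall_estimate
```

With zero category observations the estimate is entirely the overall average; as observations accumulate, the category-specific estimate takes over.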

Behavioral Features

Beyond the classic biases, we track 5 behavioral signals rooted in psychology research that create predictable mispricing.

Volume Momentum (Anchoring Bias)
When prices move away from where most of the trading happened, people anchored to the old price create an opportunity.
We compute a Volume-Weighted Average Price (VWAP) for each market, which represents the "consensus price" weighted by how much was traded at each level. When the current price diverges from VWAP, it often signals that the market has moved but participants are still anchored to old levels.
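VWAP and its divergence from the current price can be sketched as:

```python
def vwap(prices, volumes):
    """Volume-weighted average price over a window of trades."""
    return sum(p * v for p, v in zip(prices, volumes)) / sum(volumes)

def vwap_divergence(current_price, prices, volumes):
    """Positive when the market has moved above where most volume traded."""
    return current_price - vwap(prices, volumes)

# Most volume traded near 40 cents; price has since moved to 55 cents.
div = vwap_divergence(0.55, [0.38, 0.40, 0.42, 0.50], [500, 800, 600, 100])
```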
Deadline Effect (Certainty Illusion)
As events approach resolution, uncertainty increases but people act as if they're more certain.
This combines price uncertainty (how volatile the price is) with time pressure (how close to expiry). Markets near their resolution date show a specific kind of mispricing where participants overreact to late-breaking information or freeze up and stop updating. We capture this by multiplying uncertainty by an exponential time-pressure function.
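One way to sketch the feature as described, multiplying volatility by an exponential time-pressure term (the half-life constant is an assumption, not the system's actual value):

```python
import math

def deadline_feature(price_volatility, hours_to_expiry, half_life=48.0):
    """Uncertainty amplified by exponential time pressure near resolution.
    Time pressure approaches 1 at expiry and decays for distant markets."""
    time_pressure = math.exp(-hours_to_expiry / half_life)
    return price_volatility * time_pressure
```

A market expiring in an hour carries nearly its full volatility signal; one a month out carries almost none.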
Cluster Divergence (Narrow Framing)
People treat each market in isolation instead of comparing it to similar events.
If 10 crypto markets all predict different prices for similar events, the ones that diverge most from the group average may be mispriced. Narrow framing means people evaluate each market individually without considering how similar markets are priced. We measure how far each market's price deviates from its event-type average.
Nonlinear FLB (Extreme Price Interaction)
The Favorite-Longshot bias gets dramatically worse at extreme prices.
At moderate prices (30–70 cents), the FLB is mild. But at extreme prices (below 10 cents or above 90 cents), the bias amplifies nonlinearly. A 5-cent contract doesn't just have proportionally more FLB than a 30-cent contract — it has dramatically more. We capture this by interacting the FLB correction with how extreme the price is.
Informed Trading Pressure (Adverse Selection)
Detecting when traders with inside knowledge are pushing the price.
When the order book is imbalanced (more bids than asks, or vice versa) and the bid-ask spread is wide, it often signals that informed traders are positioning. Wide spreads mean market makers are nervous about being picked off by someone who knows more. We combine order book imbalance with spread tightness to detect these situations and figure out which direction the informed money is flowing.

Risk Controls

PolyEdge has multiple layers of protection to prevent catastrophic losses.

Dry-Run Mode (Default)
The system simulates trades without risking real money until you're ready.
By default, all trades are paper trades. The system records what it would have done, tracks simulated P&L, and lets you evaluate performance before going live. You need to provide a private key and explicitly disable dry-run mode to trade for real.
Fractional Kelly (15%)
We bet 15% of what the math says is optimal, trading growth for safety.
Full Kelly sizing maximizes long-run growth but creates stomach-churning swings. At 15% Kelly (kelly_fraction=0.15), we capture significant expected growth with dramatically lower risk of drawdowns. This conservative approach is appropriate given model uncertainty in prediction markets, where edges are typically small and noisy.
Exposure Limits
Hard caps on how much is at risk at any time, overall and per category.
The system enforces maximum total exposure (% of bankroll at risk) and per-event-type limits. Even if the model finds 50 great opportunities in crypto, it won't load up all in one category. This prevents a single sector shock from devastating the portfolio.
Edge Cap (20% Maximum)
If the model claims an edge above 20%, we cap it — edges that big are usually errors.
In prediction markets, a 20% edge is enormous. If the model says "the true probability is 95% but the market says 60%," something is probably wrong with the model, not the market. Capping edges prevents overconfident predictions from creating outsized positions.
Overfitting Detection
If the model fits the training data too perfectly, we reject it — it's memorizing, not learning.
Two automatic triggers: R² > 0.95 (model explains 95%+ of variation, which is suspicious for noisy financial data) or Brier score < 0.001 (predictions are almost perfect, which shouldn't happen in uncertain markets). Either one causes the Stage 2 corrections to be thrown out for that cycle.
OOS Trading Gate
The model must prove it works on unseen data before any trades are allowed.
Three conditions must all be met: ΔR² > 0 (model beats market on new data), Brier improvement > 0 (more accurate predictions), and at least 20 test observations (enough data to be meaningful). This is the strictest safeguard — the system will sit idle rather than trade on an unvalidated model.

Recent Performance

Validated out-of-sample results demonstrating the model's genuine predictive power.

OOS McFadden ΔR² is +1.10% across 5 walk-forward folds with 3,620 test samples. This proves the model adds value beyond market prices on completely unseen data.
Walk-Forward OOS Results
Model tested on unseen data across 5 expanding time windows, proving genuine edge.
The walk-forward validation uses expanding-window temporal splits to prevent data leakage. Results by fold:
Fold 1: Near zero ΔR² (small training set, expected)
Fold 2: Near zero ΔR² (still building training data)
Fold 3: +1.35% ΔR² (model begins to show edge)
Fold 4: +2.52% ΔR² (strongest fold, large training set)
Fold 5: +1.91% ΔR² (consistent positive performance)

Average across all 5 folds: +1.10% with 3,620 total test observations.
L2-Regularized Stage 2
Stage 2 uses penalized logistic regression to prevent overfitting on noisy market data.
The switch from unregularized statsmodels to sklearn's LogisticRegression with C=0.1 (strong L2 penalty) dramatically improved OOS performance. The regularization shrinks coefficients toward zero, preventing the model from fitting to noise in the training data. Combined with reducing from 13 features to 6 core features (stage1_logit, price_stage1_diff, depth_imbalance, price_uncertainty, log_time, flb_correction), this eliminates multicollinearity and produces more stable predictions.
Prediction Shrinkage
Final predictions blend 60% model + 40% market price for conservative, stable estimates.
With shrinkage_factor=0.6, the final probability is: p_final = 0.6 * p_model + 0.4 * p_market. This acknowledges that market prices contain valuable information and prevents the model from making extreme predictions. Shrinkage is a standard technique in statistical forecasting that trades a small amount of in-sample fit for significantly better out-of-sample stability.
Folds 1-2 showing near-zero ΔR² is expected and healthy — it confirms the model needs sufficient training data before it can beat the market. The consistent improvement in Folds 3-5 demonstrates genuine learning, not overfitting.

Reading the Dashboard

A quick guide to each tab and what to look for.

Overview Tab

Your command center. The four cards at the top show bankroll, total P&L, open positions, and OOS validation status. The OOS card is the most important — if it shows "VALIDATED" in green, the model has proven itself on unseen data. Below that, you'll see bias detection by event type and action buttons to trigger manual cycles.

Current Opportunities

Shown on the Overview tab, this table lists markets where the model detects mispricing of at least 2%. The "Edge" column is the percentage difference between our model's probability and the market price. "Kelly" shows the optimal bet fraction. "Size" shows the dollar amount. Opportunities are refreshed automatically every 5 minutes by the tactical scheduler.

Calibration Tab

Model performance and calibration diagnostics combined. Stage 1/Stage 2 R² and ΔR² show the model improvement from bias correction. OOS (out-of-sample) metrics validate on unseen data. ECE should be under 0.05. The reliability diagram shows calibration visually. Edge decay tracks whether open trades' edges are holding up over time.

Trade History Tab

Every trade the system has made (or simulated in dry-run). Check the "Model Prob" vs "Price" columns to see the predicted edge. Resolved trades show realized P&L. Invalidated trades (dimmed rows) had edges so extreme they were likely model errors.

Model Selector

Choose which AI model powers the supervisor. Available models include Claude (Opus, Sonnet, Haiku) and GPT-4o variants. Flagship models give deeper analysis, fast models are quicker and cheaper.

Claude Supervisor Tab

Deep strategic review by the selected AI model. The confidence score (0–100) summarizes overall system health. Green (70+) means healthy. Yellow (40–69) means needs attention. Red (<40) means critical issues. You can also ask specific strategy questions using the text box.

Quick Reference Glossary

AIC: Model quality score (lower = better fit with fewer variables)
Bankroll: Total simulated capital available for trading
b0 (intercept): Overall optimism/pessimism bias level
b1 (slope): Favorite-Longshot bias strength (>1.5 = strong FLB)
Brier Score: Prediction accuracy (0 = perfect, 0.25 = coin flip)
CLOB: Central Limit Order Book, Polymarket's trading engine
Converging: Edge is getting smaller over time (market agreeing with you)
Cross-Fitting: Training and predicting on different data splits to avoid overfitting
Diverging: Edge is growing over time (market moving against you)
ΔR²: Improvement from Stage 1 to Stage 2 (>0 = model adds value)
Dry Run: Simulated trading without real money
ECE: Expected Calibration Error (<0.05 = well-calibrated)
Edge: Difference between model probability and market price
Expanding Window: Cross-validation where the training set grows over time, never shrinks
Exposure: Total capital at risk across all open positions
FLB: Favorite-Longshot Bias (longshots overpriced)
Kelly %: Optimal bet size as a fraction of bankroll (15% fractional)
L2 Regularization: A penalty that shrinks model coefficients toward zero to prevent overfitting
LR Test: Likelihood Ratio test, which checks whether Stage 2 is a real improvement
Mark-to-Market: Updating position values to current market prices
McFadden Pseudo-R²: Measures how well a logit model explains outcomes compared to a null model
OOS: Out-of-Sample, i.e. tested on data the model hasn't seen
P&L: Profit and Loss (realized = closed, unrealized = open)
Prediction Shrinkage: Blending model predictions with market prices (60/40) for stability
Shrinkage Factor: How much we trust the category-specific vs the overall estimate (0–1)
VWAP: Volume-Weighted Average Price (consensus trading level)
Walk-Forward Validation: Testing method using expanding time windows to simulate real trading
Win Rate: Percentage of resolved trades that were profitable

Ask about...

Trade Thesis

Analyzing trade...