
Survivorship bias killed my 1H bot (part 2)

Part 2 of the build-in-public bot series. A holdout test on 11 unseen coins revealed the 1H strategy did no better than a coin flip. The 4H version survived.


Affiliate disclosure: This post may contain affiliate links. If you sign up through a link on this page I may earn a commission at no extra cost to you. This does not affect my ratings.

Most traders who backtest a strategy on 10 coins and get great results have not found an edge. They've found a memory of those 10 coins.

I almost shipped a 1-hour bot to live trading. It had excellent headline numbers, a clean equity curve, and I'd convinced myself it was ready. Then I ran one extra test before going live, and the whole picture collapsed. This is Part 2 of me documenting this project in public. Part 1 covered the first month live, the bugs, and going down 13%. This one is about the mistake I caught before it cost me even more.

What the 1-hour version looked like on paper

The strategy I'm running is a trend ensemble: UT Bot signals and EMA 20/50 crossovers, both voting per symbol, with an ADX regime filter that stands completely flat when the market is choppy. I described it in detail in Part 1. The short version is: it trend-follows on a specific timeframe with hard exits and equal risk per symbol.
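The voting-plus-filter idea can be sketched in a few lines. This is only an illustration: the post does not say how the two votes are combined or where the ADX cutoff sits, so the agreement rule, the `adx_threshold` of 20, and the function name `ensemble_position` are all assumptions of mine.

```python
def ensemble_position(ut_vote, ema_vote, adx, adx_threshold=20.0):
    """Combine two directional votes into one position for a symbol.

    Votes are +1 (long), -1 (short) or 0 (no opinion).
    Returns +1, -1 or 0 (flat).
    ASSUMPTIONS: both signals must agree, and ADX below 20 means "choppy".
    """
    for vote in (ut_vote, ema_vote):
        if vote not in (-1, 0, 1):
            raise ValueError("votes must be -1, 0 or +1")
    if adx < adx_threshold:
        return 0  # regime filter: stand flat in a ranging market
    if ut_vote == ema_vote:
        return ut_vote  # both signals agree (0 if neither has an opinion)
    return 0  # signals disagree: no position


# Hypothetical inputs for illustration
trending_long = ensemble_position(ut_vote=1, ema_vote=1, adx=31.0)
choppy = ensemble_position(ut_vote=1, ema_vote=1, adx=14.0)
disagree = ensemble_position(ut_vote=1, ema_vote=-1, adx=31.0)
```

The point of the structure is that the regime filter is checked first, so no combination of votes can open a trade in a market the filter classifies as choppy.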

When I built the first version on the 1-hour timeframe, the numbers looked genuinely promising. Backtested across the same handful of coins I'd been watching, it showed a clean edge. Return over multiple years, manageable drawdowns, decent signal frequency. I started getting excited.

The thing about that kind of excitement is it's the most dangerous feeling in quant work. You're looking at a backtest that trained itself to the exact history it was shown, and you're calling that "results."

The holdout test

Before going live, I ran what I'd call an out-of-universe holdout: I took the exact same strategy config (no changes, same parameters) and applied it to roughly 11 coins that were never part of choosing anything. Symbols I'd never touched in the design process. They weren't cherry-picked to look bad either, just a group I'd deliberately kept separate from day one for this exact check.

If a strategy has a real edge, it should show up on coins it's never seen before. That's the only honest test. Everything else is curve-fitting with extra steps.
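A minimal sketch of setting a holdout universe aside before any design work starts. I'm assuming a seeded random split here; the post only says the group was kept separate from day one, not how it was chosen. The tickers are placeholders, not the coins from this post.

```python
import random


def split_universe(symbols, n_holdout, seed=0):
    """Split symbols into a design set and a holdout set, reproducibly.

    The holdout set should not be touched until the strategy config is frozen.
    """
    if not 0 < n_holdout < len(symbols):
        raise ValueError("n_holdout must be between 1 and len(symbols) - 1")
    shuffled = sorted(symbols)  # sort first so the input order does not matter
    random.Random(seed).shuffle(shuffled)
    holdout = sorted(shuffled[:n_holdout])
    design = sorted(shuffled[n_holdout:])
    return design, holdout


# Hypothetical placeholder tickers
universe = [f"COIN{i:02d}" for i in range(1, 22)]
design, holdout = split_universe(universe, n_holdout=11, seed=42)
```

Fixing the seed and writing the holdout list to a file on day one makes it harder to quietly reshuffle the split after seeing results.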

Here's what happened.

1-hour holdout

About 41% of the period-slots came in positive, and the mean was roughly breakeven.

That's a coin flip at best. A fair coin comes up heads 50% of the time; my 1H strategy, on unseen coins, finished positive in only about 41% of period-slots, with a mean near zero. The headline returns from the original backtest were almost entirely explained by the specific symbols I'd built it on. The "edge" was those coins' quirks during those years, not a transferable pattern.
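For reference, here is roughly how the two holdout numbers can be computed. I'm assuming a "period-slot" means one symbol's return over one test period; the symbols and returns below are made up for illustration and are not my results.

```python
from statistics import mean


def summarize_slots(slot_returns):
    """Summarize holdout results.

    slot_returns maps (symbol, period) -> return in percent.
    Returns (share of slots with a positive return, mean return).
    """
    values = list(slot_returns.values())
    if not values:
        raise ValueError("no period-slots to summarize")
    positive_share = sum(1 for r in values if r > 0) / len(values)
    return positive_share, mean(values)


# Made-up numbers, illustration only
example = {
    ("AAA", 2022): -12.0,
    ("AAA", 2023): 8.0,
    ("BBB", 2022): -5.0,
    ("BBB", 2023): 9.0,
    ("CCC", 2022): -3.0,
}
share, avg = summarize_slots(example)
```

Looking at both numbers together matters: a high positive share with a negative mean, or the reverse, would tell a different story than either figure alone.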

There's a secondary reason too. The 1H timeframe generates around 230 trades per symbol per year. At that frequency, fees eat you alive in crypto futures. Funding rates, taker fills, spreads: they add up to a constant bleed. An edge that looks like 1-2% per trade on paper can easily flip negative once real friction is applied.
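A back-of-envelope version of that bleed, using the trade counts from this post. The fee and slippage rates are illustrative assumptions, not any exchange's actual schedule, and funding is left out entirely.

```python
def annual_cost_drag(trades_per_year, fee_per_side_pct, slippage_per_side_pct=0.0):
    """Yearly friction as a percent of traded notional.

    Each trade pays the per-side cost twice: once on entry, once on exit.
    """
    round_trip_pct = 2 * (fee_per_side_pct + slippage_per_side_pct)
    return trades_per_year * round_trip_pct


# ASSUMED costs, for illustration only
ASSUMED_TAKER_FEE_PCT = 0.055
ASSUMED_SLIPPAGE_PCT = 0.02

drag_1h = annual_cost_drag(230, ASSUMED_TAKER_FEE_PCT, ASSUMED_SLIPPAGE_PCT)
drag_4h = annual_cost_drag(58, ASSUMED_TAKER_FEE_PCT, ASSUMED_SLIPPAGE_PCT)
```

Under these assumed costs, 230 trades a year cost roughly 34.5% of traded notional in friction, against roughly 8.7% for 58 trades. The exact figures depend on your fee tier and leverage, but the ratio between the two cadences does not.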

4-hour holdout

The 4H version told a completely different story. 70% of period-slots came in positive, with a mean of roughly +13.8% per symbol.

Same strategy logic, same ADX filter, same UT Bot and EMA rules. Just zoomed out one full level. The edge persisted out-of-sample. That's the test I was looking for: a pattern that survives on coins that had no vote in the parameter choices.

The 4H also trades about 58 times per symbol per year, versus the 230 on 1H. Fewer trades means less fee drag, a better signal-to-noise ratio, and a higher bar for each trade to clear. That's actually a feature of trend-following: you're trying to catch meaningful moves, not every wiggle.

About 41% of holdout period-slots positive on 1H versus 70% on 4H. That 29-point gap is not market regime. It's survivorship bias, visible.

Why this happens so often

The Wikipedia page on survivorship bias uses the classic WWII airplane example, but the trading version is subtler and more expensive. You backtest on coins you follow. You follow coins that moved a lot in the last few years. Coins that moved a lot were often trending strongly. A trend-following system on coins that happened to trend? It's going to look great. That's not signal, that's selection.

The fix isn't complicated but it requires actually doing it. You hold out a set of assets before you start any parameter work, you run zero tests on them until you have a finished strategy, and then you run one test and look at the number. Not ten tests, one. The moment you start tweaking based on holdout results, the holdout is contaminated and you're back to curve-fitting.

I nearly skipped this step. I had a working bot, I was excited, and I talked myself halfway into "I'll validate it on more coins after I go live." That would have been expensive. The whole point of the holdout is to run it when it still hurts to be wrong.

What I actually shipped

The live bot runs on 4H. ETH, NEAR, SOL. Win rate somewhere in the 30-40% range, which sounds bad until you remember the system wins because the winners are roughly twice the size of the losers, not because it rarely loses. An ATR bracket (stop at 3x ATR, take-profit at 6x ATR) handles that ratio automatically.
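The arithmetic behind that is worth spelling out. This sketch assumes every trade exits at exactly the stop or the target and ignores fees and funding; the function names are mine, not the bot's.

```python
def bracket_levels(entry, atr, side, stop_mult=3.0, target_mult=6.0):
    """Return (stop, target) prices for an ATR bracket around an entry."""
    if side not in ("long", "short"):
        raise ValueError("side must be 'long' or 'short'")
    direction = 1 if side == "long" else -1
    stop = entry - direction * stop_mult * atr
    target = entry + direction * target_mult * atr
    return stop, target


def breakeven_win_rate(reward_to_risk):
    """Win rate at which expected profit per trade is zero, before costs."""
    return 1.0 / (1.0 + reward_to_risk)


def expectancy_r(win_rate, reward_to_risk):
    """Expected profit per trade in units of risk (R), before costs."""
    return win_rate * reward_to_risk - (1.0 - win_rate)


stop, target = bracket_levels(entry=100.0, atr=2.0, side="long")
breakeven = breakeven_win_rate(6.0 / 3.0)
edge_at_40 = expectancy_r(0.40, 2.0)
edge_at_30 = expectancy_r(0.30, 2.0)
```

With a 2:1 payoff the breakeven win rate is one in three. Under these simplifying assumptions a 40% win rate earns about 0.2R per trade, while 30% loses about 0.1R, so the bottom of that 30-40% range only works if the realized winners run larger than 2:1 on average.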

The ADX filter stands the whole thing flat when markets are ranging. That's the core of why the 4H holdout numbers looked reasonable: the regime detection actually works and shows up on coins it's never seen before. An edge that only shows up in-sample is not an edge. An edge that replicates on unseen data is at least worth taking seriously.

I'll be honest about the expectations. The backtest shows around +35% a year. I do not expect +35% a year live. Even against a corrected, honest backtest (Part 3 will get into a particularly nasty backtest artifact that cost me a lot of false confidence), live returns tend to run well below the backtested figure. My central expectation is low-double-digit annual returns, with wide variance, and drawdowns of 25-35% being completely normal.

Anyone selling you a strategy that "passed out-of-sample testing" is using a phrase that can mean almost anything. The test I described above is what it should mean: same config, unseen coins held out from day one, one shot.

If you're at the stage where you're trying to find a strategy before any of this, the bot-match quiz is a reasonable place to start. It'll tell you whether building custom is even the right move for your situation versus using an exchange-native bot like Bybit's built-in suite.

Next up I'll write Part 3, which is about a backtest artifact that made one coin look like it returned +5,000% in a year. That one stings more than the 1H debacle, honestly.

Follow the whole series on the journey page.

Hung Phu
DCA Bots, Grid Bots, Python, Crypto Futures, Backtesting

Python algo trader since 2019. I build and test trading bots with real capital on Bybit and Binance. AlgoGrade is my lab notebook.
