On this day, exactly 12 years ago (9:30 EDT 1 Aug 2012), was the most expensive software bug ever, in terms of both dollars per second and total lost. The company managed to pare losses through the heroics of Goldman Sachs, and “only” lost $457 million (which led to its dissolution).
Devs were tasked with porting their HFT bot to an upcoming NYSE API service that was announced to go live less than 33 days in the future. So they started a death march sprint of 80-hour weeks. The HFT bot was written in C++. Because they didn’t want to have to recompile once, the lead architect decided to keep the exact same class and method signature for their PowerPeg::trade() method, which was their automated testing bot that they had been using since 2003. This also meant that they did not have to update the WSDL for the clients that used the bot, either.
They ripped out the old dead code and put in the new code: code that actually called real logic instead of the test code, which was designed, by default, to buy the highest offer given to it.
They tested it, they wrote unit tests, everything looked good. So they decided to deploy it at 8 AM EDT, 90 minutes before market open. QA testers tested it in prod, gave the all clear. Everyone was really happy. They’d done it. They’d made the tight deadline and deployed with just 90 minutes to spare…
They immediately went to a sprint standup and then sprint retro meeting. Per their office policy, they left their phones (on mute) at their desks.
During the retro, the markets opened at 9:30 EDT, and the new bot went WILD (!!). It just started buying the highest offer offered for all of the stocks in its buy list. The markets didn’t react very abnormally, because it just looked like they were bullish. But they were buying about $5 million of shares per second… Within 2 minutes, the warning alarms were going off in their internal banking sector… a huge percentage of their $2.5 billion in operating cash was being depleted, and fast!
So many people tried to contact the devs, but they were in a remote office in Hoboken due to the high price of real estate in Manhattan. And their phones were off and no one was at their computer.
The CEO was seen getting people to run through the halls of the building, yelling, and finally the devs noticed. 11 minutes had gone by and the bots had bought over $3 billion of stock. The total cash reserves were depleted. The company was in SERIOUS trouble…
None of the devs could find the source of the bug. The CEO, desperate, asked for solutions. “KILL THE SERVERS!!” one of the devs shouted!!
They got techs @ the datacenter next to the NYSE building to find all 8 servers that ran the bots and DESTROYED them with fire axes. Just ripping the wires out… And finally, after 37 minutes, the bots stopped trading. Total paper loss: $10.8 billion.
The SEC + NYSE refused to rewind the trades for all but 6 stocks, so the on-paper losses were still at $8 billion. No way they could pay. Goldman Sachs stepped in and offered to buy all the stocks @ a for-profit price of $457 million, which they agreed to. All in all, the company lost close to $500 million, all of its corporate clients left, and it went out of business a few weeks later.
Now what was the cause of the bug? Fat fingering human error during release.
The sysop had declined to implement CI/CD, which was still in its infancy, probably because that was his full-time job and he was making like $300,000 in 2012 dollars ($500k today). There were 8 servers that housed the bot and a few clients on the same servers.
The sysop had correctly typed out and pasted the correct rsync commands to get the new C++ binary onto the servers, except for server 5 of 8. In the 5th instance, he had an extra 5 in the server name. The rsync failed, but because he pasted all of the commands at once, he didn’t notice…
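A failed copy like this is the kind of thing a post-deploy check can catch. Here is a minimal sketch in Python, assuming the deploy tooling can ask each server for a checksum of the binary it is actually running. The helper names (`sha256_of`, `find_stale_hosts`), the `fetch_digest` callback, and the in-memory "fleet" are all hypothetical stand-ins, not anything from Knight's real setup.

```python
import hashlib
from typing import Callable, Iterable, List


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of a binary blob."""
    return hashlib.sha256(data).hexdigest()


def find_stale_hosts(
    expected_digest: str,
    hosts: Iterable[str],
    fetch_digest: Callable[[str], str],
) -> List[str]:
    """Return every host whose deployed binary does not match the new build."""
    stale = []
    for host in hosts:
        try:
            actual = fetch_digest(host)
        except Exception:
            actual = None  # an unreachable host counts as not deployed
        if actual != expected_digest:
            stale.append(host)
    return stale


# Hypothetical fleet: 8 servers, with the copy to server5 having silently failed.
new_binary = b"new build"
old_binary = b"old build"
fleet = {f"server{i}": new_binary for i in range(1, 9)}
fleet["server5"] = old_binary

stale = find_stale_hosts(
    sha256_of(new_binary),
    fleet,
    lambda host: sha256_of(fleet[host]),  # stand-in for a real remote check
)
print(stale)  # ['server5']
```

The point is only that the release is not "done" until every host reports the expected build; any non-empty result should block the go-live.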
Because the code used the exact same method signature for the trade() method, server 5 was happy to buy up the most expensive offer it was given, because it was running the Sad Path test trading software. If they had changed the method signature, it wouldn’t have run and the bug wouldn’t have happened.
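The signature argument can be shown with a toy dispatcher. This is a sketch in Python rather than the firm's C++, and every name in it (`legacy_test_trade`, `new_trade`, `trade_v2`, the build tables) is made up for illustration. It only shows why reusing a name lets a stale build answer silently, while a new name makes the stale build fail loudly.

```python
def legacy_test_trade(order: str) -> str:
    # Stand-in for the old test behaviour: buy at the highest offer.
    return f"TEST-BUY {order} at highest offer"


def new_trade(order: str) -> str:
    # Stand-in for the new production logic.
    return f"ROUTE {order} via new logic"


# Method tables for three hypothetical builds.
OLD_BUILD = {"trade": legacy_test_trade}       # the server that never got updated
NEW_BUILD_SAME_NAME = {"trade": new_trade}     # new code behind the old name
NEW_BUILD_RENAMED = {"trade_v2": new_trade}    # new code behind a new name


def dispatch(build: dict, method: str, order: str) -> str:
    """Call `method` on a build, failing loudly if the build lacks it."""
    handler = build.get(method)
    if handler is None:
        raise LookupError(f"build has no method {method!r}")
    return handler(order)


# Same name: the stale server happily runs the old test logic.
print(dispatch(OLD_BUILD, "trade", "100 XYZ"))

# New name: the stale server rejects the call instead of trading.
try:
    dispatch(OLD_BUILD, "trade_v2", "100 XYZ")
except LookupError as err:
    print("rejected:", err)
```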
At 9:43 EDT, the devs decided collectively to do a “rollback” to the previous release. This was the worst possible mistake, because they added in the Power Peg dead code to the other 7 servers, causing the problems to grow exponentially. Although, it took about 3 minutes for anyone in Finance to actually inform them. At that point, more than $50 million per second was being lost due to the bug.
It wasn’t until 9:58 EDT, when the servers had all been destroyed, that the trading stopped.
Here is a description of the aftermath:
It was not until 9:58 a.m. that Knight engineers identified the root cause and shut down SMARS on all the servers; however, the damage had been done. Knight had executed over 4 million trades in 154 stocks totaling more than 397 million shares; it assumed a net long position in 80 stocks of approximately $3.5 billion as well as a net short position in 74 stocks of approximately $3.15 billion.
28 minutes. $8.65 billion inappropriately purchased. ~1680 seconds. ~$5.15 million/second.
But after the rollback at 9:43, about $4.4 billion was lost. ~900 seconds. ~$4.9 million/second.
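The per-second figures are just a dollar total divided by elapsed time. A quick sketch in Python, taking the post's dollar totals and timestamps at face value (the helper and variable names are made up for illustration):

```python
from datetime import datetime


def seconds_between(start: str, end: str) -> int:
    """Elapsed seconds between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


whole_window = seconds_between("09:30", "09:58")    # market open to shutdown
after_rollback = seconds_between("09:43", "09:58")  # rollback to shutdown

overall_rate = 8.65e9 / whole_window     # dollars per second, whole incident
rollback_rate = 4.4e9 / after_rollback   # dollars per second, after rollback

print(f"{whole_window} s, ${overall_rate / 1e6:.2f} million/second")
print(f"{after_rollback} s, ${rollback_rate / 1e6:.2f} million/second")
```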
That was the story of how a bad software decision and fat-fingered manual production release destroyed the most profitable stock trading firm of the time, and was the most expensive software bug in human history.
Holy shit that’s wild
CI/CD in 2012? Incredible. That didn’t become the norm for me until 2017.
This would make an excellent short film. The fire axes scene would be epic.
The actual SEC report is relatively short - and surprisingly accessible.
That is a good read, thank you. Didn’t have procedures, had two different brokerage systems running at once because they’d no procedures to follow, lost a fortune.
I’m thinking it’s the “most expensive bug in history so far” - haven’t seen an accurate total for CrowdStrike’s little faux pas, yet.
We can argue on whether it’s a “bug” outright (since it is technically a correct implementation of a faulty design), but Boeing’s MCAS pitching the plane based on the input of a singular faulty sensor has probably caused billions in direct damages, and billions more in reputational damage.
NULL references (which CrowdStrike is an instance of) are often referred to as “the billion dollar mistake”, but the actual cost of “historical” languages skimping out on optionally-nullable types is certainly in the trillions.
“At 9:43 EDT, the devs decided collectively to do a “rollback” to the previous release. This was the worst possible mistake,”
No, the WORST POSSIBLE MISTAKE was doing a major roll out, then NO ONE STICKING AROUND TO WATCH WHAT HAPPENED! Seriously, who does this?? It’s like lighting the fuse on your firework show, then having an all hands staff meeting in a sound proofed trailer with blackout curtains.
Why does hft even exist? Does it have any value?
Short answer:
It gave us Compiler Explorer; now that it has served its purpose we should stop doing it.
Long answer:
Why does hft even exist?
Hft can exist because most stock markets react to requests as fast as possible and have no noticeable fees for certain use cases. This means algorithms that do simple trades, like “if Google goes up, buy other tech companies” or “buy any stock that goes up in Europe on the NY market”, can make small profits if they are faster than everyone else.
Does it have any value?
There is one exchange that imposes a delay on every request, effectively inhibiting hft, and its opening actually improved market conditions on all exchanges. This implies it has negative value.
They also spend millions on hardware, tools and developers to skim small sums off many transactions on the stock market. They are effectively a (very inefficient) tax on the stock market that goes to improving C++ compilers and funding hardware startups.
Quora just sucks big time
May I ask why? Edit: A genuine question. Sometimes Quora answers pop up in my search, but I didn’t see any particular problem with it. Other than the AI answers.
Not OP but I’ve noticed a lot of racists, extreme right nationalists, religious chauvinists, etc. Which sucks because the concept is quite cool and I’d love to be able to jump from one question to the next, but pretty quickly I get irritated and have to close it.
Mhmm, okay then. So that means not enough or good enough moderation. I will keep an eye if I see such search results again.
Do yourself a favor and get a Browser extension to block domains.
Quora is absolutely on my blocklist.
My problem is the lack of real answers from professionals.
As someone in the tech industry, I get curious and look around. And the top rated answers (usually there’s like 3 answers max) are often wrong.
There was a hilarious Quora answer that asked about web development, and the person who answered was a guy whose experience was “I’ve been on the internet” who gave a solution using technology from 2010.
It’s hilariously bad and Quora don’t care.
I spent countless hours writing quality content there. Fuck quora. I’ll never make any content for companies again.
Weird that they used Quora, of all the places this news was reported.
Now what was the cause of the bug? Fat fingering human error during release.
There isn’t a singular “the” cause usually, and if we do want to press for it, I’d say an aggressive deadline for a major product that needs engineers to slave away was the cause. At that point bugs become statistically inevitable. Whoever decided on that promised deadline was the first responsible person.
And nothing of value was lost