- cross-posted to:
- sysadmin@lemmy.world
- cross-posted to:
- sysadmin@lemmy.world
All our servers and company laptops went down at pretty much the same time. Laptops have been bootlooping to blue screen of death. It’s all very exciting, personally, as someone not responsible for fixing it.
Apparently caused by a bad CrowdStrike update.
Edit: now being told we (who almost all generally work from home) need to come into the office Monday as they can only apply the fix in-person. We’ll see if that changes over the weekend…
Reading into the updates some more… I’m starting to think this might just destroy CloudStrike as a company altogether. Between the mountain of lawsuits almost certainly incoming and the total destruction of any public trust in the company, I don’t see how they survive this. Just absolutely catastrophic on all fronts.
If all the computers stuck in boot loop can’t be recovered… yeah, that’s a lot of cost for a lot of businesses. Add to that all the immediate impact of missed flights and who knows what happening at the hospitals. Nightmare scenario if you’re responsible for it.
This sort of thing is exactly why you push updates to groups in stages, not to everything all at once.
Looks like the laptops are able to be recovered with a bit of finagling, so fortunately they haven’t bricked everything.
And yeah staged updates or even just… some testing? Not sure how this one slipped through.
Not sure how this one slipped through.
I’d bet my ass this was caused by terrible practices brought on by suits demanding more “efficient” releases.
“Why do we do so much testing before releases? Have we ever had any problems before? We’re wasting so much time that I might not even be able to buy another yacht this year”
At least nothing like this happens in the airline industry
Certainly not! Or other industries for that matter. It’s a good thing executives everywhere aren’t just concentrating on squeezing the maximum amount of money out of their companies and funneling it to themselves and their buddies on the board.
Sure, let’s “rightsize” the company by firing 20% of our workforce (but not management!) and raise prices 30%, and demand that the remaining employees maintain productivity at the level it used to be before we fucked things up. Oh and no raises for the plebs, we can’t afford it. Maybe a pizza party? One slice per employee though.
Agreed, this will probably kill them over the next few years unless they can really magic up something.
They probably don’t get sued - their contracts will have indemnity clauses against exactly this kind of thing, so unless they seriously misrepresented what their product does, this probably isn’t a contract breach.
If you are running crowdstrike, it’s probably because you have some regulatory obligations and an auditor to appease - you aren’t going to be able to just turn it off overnight, but I’m sure there are going to be some pretty awkward meetings when it comes to contract renewals in the next year, and I can’t imagine them seeing much growth
Nah. This has happened with every major corporate antivirus product. Multiple times. And the top IT people advising on purchasing decisions know this.
Yep. This is just uninformed people thinking this doesn’t happen. It’s been happening since av was born. It’s not new and this will not kill CS they’re still king.
At my old shop we still had people giving money to checkpoint and splunk, despite numerous problems and a huge cost, because they had favourites.
Don’t most indemnity clauses have exceptions for gross negligence? Pushing out an update this destructive without it getting caught by any quality control checks sure seems grossly negligent.
deleted by creator
explain to the project manager with crayons why you shouldn’t do this
Can’t; the project manager ate all the crayons
Why is it bad to do on a Friday? Based on your last paragraph, I would have thought Friday is probably the best week day to do it.
Most companies, mine included, try to roll out updates during the middle or start of a week. That way if there are issues the full team is available to address them.
deleted by creator
And hence the term read-only Friday.
rolling out an update to production that there was clearly no testing
Or someone selected “env2” instead of “env1” (#cattleNotPets names) and tested in prod by mistake.
Look, it’s a gaffe and someone’s fired. But it doesn’t mean fuck ups are endemic.
Was it not possible for MS to design their safe mode to still “work” when Bitlocker was enabled? Seems strange.
I’m not sure what you’d expect to be able to do in a safe mode with no disk access.
I think you’re on the nose, here. I laughed at the headline, but the more I read the more I see how fucked they are. Airlines. Industrial plants. Fucking governments. This one is big in a way that will likely get used as a case study.
The London Stock Exchange went down. They’re fukd.
Yeah saw that several steel mills have been bricked by this, that’s months and millions to restart
Got a link? I find it hard to believe that a process like that would stop because of a few windows machines not booting.
a few windows machines with controller application installed
That’s the real kicker.
Those machines should be airgapped and no need to run Crowdstrike on them. If the process controller machines of a steel mill are connected to the internet and installing auto updates then there really is no hope for this world.
But daddy microshoft says i gotta connect the system to the internet uwu
I work in an environment where the workstations aren’t on the Internet there’s a separate network, there’s still a need for antivirus and we were hit with bsod yesterday
then there really is no hope for this world.
I don’t know how to tell you this, but…
There are a lot of heavy manufacturing tools that are controlled and have their interface handled by Windows under the hood.
They’re not all networked, and some are super old, but a more modernized facility could easily be using a more modern version of Windows and be networked to have flow of materials, etc more tightly integrated into their systems.
The higher precision your operation, the more useful having much more advanced logs, networked to a central system, becomes in tracking quality control.
Imagine if after the fact, you could track a set of .1% of batches that are failing more often and look at the per second logs of temperature they were at during the process, and see that there’s 1° temperature variance between the 30th to 40th minute that wasn’t experienced by the rest of your batches. (Obviously that’s nonsense because I don’t know anything about the actual process of steel manufacturing. But I do know that there’s a lot of industrial manufacturing tooling that’s an application on top of windows, and the higher precision your output needs to be, the more useful it is to have high quality data every step of the way.)
Testing in production will do that
Not everyone is fortunate enough to have a seperate testing environment, you know? Manglement has to cut cost somewhere.
Manglement is the good term lmao
Don’t we blame MS at least as much? How does MS let an update like this push through their Windows Update system? How does an application update make the whole OS unable to boot? Blue screens on Windows have been around for decades, why don’t we have a better recovery system?
Crowdstrike runs at ring 0, effectively as part of the kernel. Like a device driver. There are no safeguards at that level. Extreme testing and diligence is required, because these are the consequences for getting it wrong. This is entirely on crowdstrike.
What lawsuits do you think are going to happen?
They can have all the clauses they like but pulling something like this off requires a certain amount of gross negligence that they can almost certainly be held liable for.
Whatever you say my man. It’s not like they go through very specific SLA conversations and negotiations to cover this or anything like that.
I forgot that only people you have agreements with can sue you. This is why Boeing hasn’t been sued once recently for their own criminal negligence.
👌👍
😔💦🦅🥰🥳
Forget lawsuits, they’re going to be in front of congress for this one
For what? At best it would be a hearing on the challenges of national security with industry.
The amount of servers running Windows out there is depressing to me
The four multinational corporations I worked at were almost entirely Windows servers with the exception of vendor specific stuff running Linux. Companies REALLY want that support clause in their infrastructure agreement.
I’ve worked as an IT architect at various companies in my career and you can definitely get support contracts for engineering support of RHEL, Ubuntu, SUSE, etc. That isn’t the issue. The issue is that there are a lot of system administrators with “15 years experience in Linux” that have no real experience in Linux. They have experience googling for guides and tutorials while having cobbled together documents of doing various things without understanding what they are really doing.
I can’t tell you how many times I’ve seen an enterprise patch their Linux solutions (if they patched them at all with some ridiculous rubberstamped PO&AM) manually without deploying a repo and updating the repo treating it as you would a WSUS. Hell, I’m pleasantly surprised if I see them joined to a Windows domain (a few times) or an LDAP (once but they didn’t have a trust with the Domain Forest or use sudoer rules…sigh).
The issue is that there are a lot of system administrators with “15 years experience in Linux” that have no real experience in Linux.
Reminds me of this guy I helped a few years ago. His name was Bob, and he was a sysadmin at a predominantly Windows company. The software I was supporting, however, only ran on Linux. So since Bob had been a UNIX admin back in the 80s they picked him to install the software.
But it had been 30 years since he ever touched a CLI. Every time I got on a call with him, I’d have to give him every keystroke one by one, all while listening to him complain about how much he hated it. After three or four calls I just gave up and used the screenshare to do everything myself.
AFAIK he’s still the only Linux “sysadmin” there.
Companies REALLY want that support clause in their infrastructure agreement.
RedHat, Ubuntu, SUSE - they all exist on support contracts.
I dunno, but doesn’t like a quarter of the internet kinda run on Azure?
so 40% of azure crashes a quarter of the internet…
I guess Spotify was running on the other 40%, as many other services
doesn’t like a quarter of the internet kinda run on Azure?
Said another way, 3/4 of the internet isn’t on Unsure cloud blah-blah.
And azure is - shhh - at least partially backed by Linux hosts. Didn’t they buy an AWS clone and forcibly inject it with money like Bobby Brown on a date in the hopes of building AWS better than AWS like they did with nokia? MS could be more protectively diverse than many of its best customers.
I’ve had my PC shut down for updates three times now, while using it as a Jellyfin server from another room. And I’ve only been using it for this purpose for six months or so.
I can’t imagine running anything critical on it.
Windows server, the OS, runs differently from desktop windows. So if you’re using desktop windows and expecting it to run like a server, well, that’s on you. However, I ran windows server 2016 and then 2019 for quite a few years just doing general homelab stuff and it is really a pain compared to Linux which I switched to on my server about a year ago. Server stuff is just way easier on Linux in my experience.
It doesn’t have to, though. Linux manages to do both just fine, with relatively minor compromises.
Expecting an OS to handle keeping software running is not a big ask.
Yup, I use Linux to run a Jellyfin server, as well as a few others things. The only problem is that the CPU I’m using (Ryzen 1st gen) will crash every couple weeks or so (known hardware fault, I never bothered to RMA), but that’s honestly not that bad since I can just walk over and restart it. Before that, it ran happily on an old Phenom II from 2009 for something like 10 years (old PC), and I mostly replaced it because the Ryzen uses a bit less electricity (enough that I used to turn the old PC off at night; this one runs 24/7 as is way more convenient).
So aside from this hardware issue, Linux has been extremely solid. I have a VPS that tunnels traffic into my Jellyfin and other services from outside, and it pretty much never goes down (I guess the host reboots it once a year or something for hardware maintenance). I run updates when I want to (when I remember, which is about monthly), and it only goes down for like 30 sec to reboot after updates are applied.
So yeah, Linux FTW, once it’s set up, it just runs.
not that bad since I can just walk over and restart it.
You can try to use watchdog to automatically restart on crashes. Or go through RMA.
I could, but it’s a pretty rare nuisance. I’d rather just replace the CPU than go through RMA, a newer gen CPU is quite inexpensive, I could probably get by with a <$100 CPU since anything AM4 should work (I have an X370 with support for 5XXX series CPUs).
I’m personally looking at replacing it with a much lower power chip, like maybe something ARM. I just haven’t found something that would fit well since I need 2-4 SATA (PCIe card could work), 16GB+ RAM, and a relatively strong CPU. I’m hopeful that with ARM Snapdragon chips making their way to laptops and RISC-V getting more available, I’ll find something that’ll fit that niche well. Otherwise, I’ll just upgrade when my wife or I upgrade, which is what I usually do.
I just haven’t found something that would fit well since I need 2-4 SATA (PCIe card could work), 16GB+ RAM, and a relatively strong CPU.
4 SATA, 8GB RAM is easy to find. What do you need 16 gigs for? Compiling Gentoo?
Star64 for ARM and Quartz64 for RV.
big ask.
Off the car lot, we say ‘request’. But good on you for changing careers.
I really have no idea why you think your choice of wording would be relevant to the discussion in any way, but OK…
Removed by mod
Wow dude you’re so cool. I bet that made you feel so superior. Everyone on here thinks you are so badass.
I do as well!
Wow and the most predictable reply too? Poor guy. Better luck next time.
Where did you think Microsoft was getting all (hyperbole) of their money from?
I know i was really surprised how many there are. But honestly think of how many companies are using active directory and azure
Yeah my plans of going to sleep last night were thoroughly dashed as every single windows server across every datacenter I manage between two countries all cried out at the same time lmao
I always wondered who even used windows server given how marginal its marketshare is. Now i know from the news.
Marginal? You must be joking. A vast amount of servers run on Windows Server. Where I work alone we have several hundred and many companies have a similar setup. Statista put the Windows Server OS market share over 70% in 2019. While I find it hard to believe it would be that high, it does clearly indicate it’s most certainly not a marginal percentage.
I’m not getting an account on Statista, and I agree that its marketshare isn’t “marginal” in practice, but something is up with those figures, since overwhelmingly internet hosted services are on top of Linux. Internal servers may be a bit different, but “servers” I’d expect to count internet servers…
Most servers aren’t Internet-facing.
There are a ton of Internet facing servers, vast majority of cloud instances, and every cloud provider except Microsoft (and their in house “windows” for azure hosting is somehow different, though they aren’t public about it).
In terms of on premise servers, I’d even say the HPC groups may outnumber internal windows servers. While relatively fewer instances, they all represent racks and racks of servers, and that market is 100% Linux.
I know a couple of retailers and at least two game studios are keeping at scale windows a thing, but Linux mostly dominates my experience of large scale deployment in on premise scale out.
It just seems like Linux is just so likely for scenarios that also have lots of horizontal scaling, it is hard to imagine that despite that windows still being a majority share of the market when all is said and done, when it’s usually deployed in smaller quantities in any given place.
It’s stated in the synopsis, below where it says you need to pay for the article. Anyway, it might be true as the hosting servers themselves often host up to hundreds of Windows machines. But it really depends on what is measured and the method used, which we don’t know because who the hell has a statista account anyway.
since overwhelmingly internet hosted services are on top of Linux
This is a common misconception. Most internet hosted services are behind a Linux box, but that doesn’t mean those services actually run on Linux.
Well, I’ve seen some, but they usually don’t have automatic updates and generally do not have access to the Internet.
This is a crowdstrike issue specifically related to the falcon sensor. Happens to affect only windows hosts.
It’s only marginal for running custom code. Every large organization has at least a few of them running important out-of-the-box services.
Not too long ago, a lot of Customer Relationship Management (CRM) software ran on MS SQL Server. Businesses made significant investments in software and training, and some of them don’t have the technical, financial, or logistical resources to adapt - momentum keeps them using Windows Server.
For example, small businesses that are physically located in rural areas can’t use cloud based services because rural internet is too slow and unreliable. Its not quite the case that there’s no amount of money you can pay for a good internet connection in rural America, but last time I looked into it, Verizon wanted to charge me $20,000 per mile to run a fiber optic cable from the nearest town to my client’s farm.
Almost everyone, because the Windows server market share isn’t marginal at all.
My current company does and I hate it so much. Who even got that idea in the first place? Linux always dominated server-side stuff, no?
You should read the saga of when MS bought Hotmail. The work they had to do to be able to run it on Windows was incredible. It actually helped MS improve their server OS, and it still wasn’t as performance when they switched over.
No, Linux doesn’t now nor has it ever dominated the server space.
In university computer science, in the states, MS server was the main server OS that they taught my class during our education.
Microsoft loses money to let the universities and students use and learn MS server for free, or at least they did at the time. This had the effect of making a lot of fresh grad developers more comfortable with using MS server, and I’m sure it led to MS server being used in cases where there were better options.
Yes, but the developers learned on Windows, so they wrote software for Windows.
How many coffee cups have you drank in the last 12 hours?
I work in a data center
I lost count
What was Dracula doing in your data centre?
Because he’s Dracula. He’s twelve million years old.
THE WORMS
Surely Dracula doesn’t use windows.
I work in a datacenter, but no Windows. I slept so well.
Though a couple years back some ransomware that also impacted Linux ran through, but I got to sleep well because it only bit people with easily guessed root passwords. It bit a lot of other departments at the company though.
This time even the Windows folks were spared, because CrowdStrike wasn’t the solution they infested themselves with (they use other providers, who I fully expect to screw up the same way one day).
There was a point where words lost all meaning and I think my heart was one continuous beat for a good hour.
Did you feel a great disturbance in the force?
Oh yeah I felt a great disturbance (900 alarms) in the force (Opsgenie)
How’s it going, Obi-Wan?
This is going to be a Big Deal for a whole lot of people. I don’t know all the companies and industries that use Crowdstrike but I might guess it will result in airline delays, banking outages, and hospital computer systems failing. Hopefully nobody gets hurt because of it.
Big chunk of New Zealands banks apparently run it, cos 3 of the big ones can’t do credit card transactions right now
It was mayhem at PakNSave a bit ago.
In my experience it’s always mayhem at PakNSave.
If anything, it’s probably calmed P’n’S down a bit…
cos 3 of the big ones can’t do credit card transactions right now
Bitcoin still up and running perhaps people can use that
Bitcoin Cash maybe. Didn’t they bork Bitcoin (Core) so you have to wait for confirmations in the next block?
Several 911 systems were affected or completely down too
CrowdStrike: It’s Friday, let’s throw it over the wall to production. See you all on Monday!
Wow, I didn’t realize CrowdStrike was widespread enough to be a single point of failure for so much infrastructure. Lot of airports and hospitals offline.
The Federal Aviation Administration (FAA) imposed the global ground stop for airlines including United, Delta, American, and Frontier.
Flights grounded in the US.
deleted by creator
Ironic. They did what they are there to protect against. Fucking up everyone’s shit
Maybe centralizing everything onto one company’s shoulders wasn’t such a great idea after all…
Wait, monopolies are bad? This is the first I’ve ever heard of this concept. So much so that I actually coined the term “monopoly” just now to describe it.
Someone should invent a game, that while playing demonstrates how much monopolies suck for everyone involved (except the monopolist)
And make it so you lose friends and family over the course of the 4+ hour game. Also make a thimble to fight over, that would be dope.
Get your filthy fucking paws off my thimble!
I’m sure a game that’s so on the nose with its message could never become a commercialised marketing gimmick that perversely promotes existing monopolies. Capitalists wouldn’t dare.
Crowdstrike is not a monopoly. The problem here was having a single point of failure, using a piece of software that can access the kernel and autoupdate running on every machine in the organization.
At the very least, you should stagger updates. Any change done to a business critical server should be validated first. Automatic updates are a bad idea.
Obviously, crowdstrike messed up, but so did IT departments in every organization that allowed this to happen.
Monopolies aren’t absolute, ever, but having nearly 25% market share is a problem, and is a sign of an oligopoly. Crowdstrike has outsized power and has posted article after article boasting of its dominant market position for many years running.
I think monopoly-like conditions have become so normalised that people don’t even recognise them for what they are.
Yes, it’s almost as if there should be laws to prevent that sort of thing. Hmm
Well now that I’ve invented the concept for the first time, we should invent laws about it. We’ll get in early, develop a monopoly on monopoly legislation and steer it so it benefits us.
Wow, monopolies rule!
I mean, I’m sure those companies that have them don’t think so—when they aren’t the cause of muti-industry collapses.
The too big to fail philosophy at its finest.
Since when has any antivirus ever had the intent of actually protecting against viruses? The entire antivirus market is a scam.
CrowdStrike has a new meaning… literally Crowd Strike.
They virtually blew up airports
An offline server is a secure server!
Honestly my philosophy these days, when it comes to anything proprietary. They just can’t keep their grubby little fingers off of working software.
At least this time it was an accident.
Clownstrike
Crowdshite haha gotem
CrowdCollapse
The thought of a local computer being unable to boot because some remote server somewhere is unavailable makes me laugh and sad at the same time.
I don’t think that’s what’s happening here. As far as I know it’s an issue with a driver installed on the computers, not with anything trying to reach out to an external server. If that were the case you’d expect it to fail to boot any time you don’t have an Internet connection.
Windows is bad but it’s not that bad yet.
It’s just a fun coincidence that the azure outage was around the same time.
expect it to fail to boot any time you don’t have an Internet connection.
So, like the UbiSoft umbilical but for OSes.
Edit: name of publisher not developer.
A remote server that you pay some serious money to that pushes a garbage driver that prevents yours from booting
Not only does it (possibly) prevent booting, but it will also bsod it first so you’ll have to see how lucky you get.
Goddamn I hate crowdstrike. Between this and them fucking up and letting malware back into a system, I have nothing nice to say about them.
It’s bsod on boot
And anything encrypted with bitlocker can’t even go into safe mode to fix it
It doesn’t consistently bsod on boot, about half of affected machines did in our environment, but all of them did experience a bsod while running. A good amount of ours just took the bad update, bsod’d and came back up.
yeah so you can’t get Chinese government spyware installed.
https://www.theregister.com/ has a series of articles on what’s going on technically.
Latest advice…
There is a faulty channel file, so not quite an update. There is a workaround…
-
Boot Windows into Safe Mode or WRE.
-
Go to C:\Windows\System32\drivers\CrowdStrike
-
Locate and delete file matching “C-00000291*.sys”
-
Boot normally.
-
Yep, this is the stupid timeline. Y2K happening to to the nuances of calendar systems might have sounded dumb at the time, but it doesn’t now. Y2K happening because of some unknown contractor’s YOLO Friday update definitely is.
A few years ago when my org got the ask to deploy the CS agent in linux production servers and I also saw it getting deployed in thousands of windows and mac desktops all across, the first thought that came to mind was “massive single point of failure and security threat”, as we were putting all the trust in a single relatively small company that will (has?) become the favorite target of all the bad actors across the planet. How long before it gets into trouble, either because if it’s own doing or due to others?
I guess that we now know
No bad actors did this, and security goes in fads. Crowdstrike is king right now, just as McAfee/Trellix was in the past. If you want to run around without edr/xdr software be my guest.
If you want to run around without edr/xdr software be my guest.
I don’t think anyone is saying that… But picking programs that your company has visibility into is a good idea. We use Wazuh. I get to control when updates are rolled out. It’s not a massive shit show when the vendor rolls out the update globally without sufficient internal testing. I can stagger the rollout as I see fit.
You can do this with CS as well, but the dumbasses where pushing major updates via channel files which aren’t for that. They tried to squeak by without putting out a major update via the sensor updates which you can control. Basically they fucked up their own structure because a bunch of people where complaining and more than likely management got involved and overwrote best practices.
Hmm. Is it safer to have a potentially exploitable agent running as root and listening on a port, than to not have EDR running on a well-secured low-churn enterprise OS - sit down, Ubuntu - adhering to best practice for least access and least-services and good role-sep?
It’s a pickle. I’m gonna go with “maybe don’t lock down your enterprise Linux hard and then open a yawning garage door of a hole right into it” but YMMV.
Reality is, if your users are educated, then your more secure than any edr with dumbass users. But we all know this is a pipe dream.
All of the security vendors do it over enough time. McAfee used to be the king of them.
https://www.zdnet.com/article/defective-mcafee-update-causes-worldwide-meltdown-of-xp-pcs/
I’m so exhausted… This is madness. As a Linux user I’ve busy all day telling people with bricked PCs that Linux is better but there are just so many. It never ends. I think this is outage is going to keep me busy all weekend.