The overnight shift consisted of security and an unaccompanied technician who had only been on the job for a week.
That poor bastard.
This is interesting. What I’m hearing is they didn’t have proper anti-affinity rules in place, or backups for mission-critical equipment.
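For anyone unfamiliar with the term: anti-affinity just means "don't put two replicas of the same thing on the same host, so one host failure can't take them all out." A rough sketch of the placement check, with made-up host and service names:

```python
# Rough sketch of an anti-affinity placement check: refuse to put two
# replicas of the same service on the same host. Host and service
# names here are hypothetical.

def place(replicas, hosts, placements=None):
    """Greedily assign each replica to a host no sibling already uses."""
    placements = dict(placements or {})
    for replica in replicas:
        used = set(placements.values())
        free = [h for h in hosts if h not in used]
        if not free:
            raise RuntimeError("anti-affinity unsatisfiable: not enough hosts")
        placements[replica] = free[0]
    return placements


# Two replicas, two hosts: each lands on its own host.
print(place(["db-0", "db-1"], ["host-a", "host-b"]))
```

Real schedulers (Kubernetes, VMware DRS, etc.) express this as declarative rules rather than code, but the constraint being enforced is the same.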
The data center did some dumb stuff, but that shouldn’t matter if you set up your application failover properly. The architecture, and the failure to test failovers, are the real issues here.
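The application-level failover this comment is talking about can be sketched in a few lines: probe the primary, and route to a replica when it's down. All the endpoint names below are hypothetical; a real implementation would probe over the network and handle partial failures, but the core routing logic looks like this:

```python
# Minimal sketch of application-level failover: try the primary
# endpoint first, fall back to a replica when it is unhealthy.
# Endpoint names are made up for illustration.

class Endpoint:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def is_healthy(self):
        # A real check would probe the service over the network
        # with a timeout; here we just read a flag.
        return self.healthy


def pick_endpoint(endpoints):
    """Return the first healthy endpoint, or raise if none are up."""
    for ep in endpoints:
        if ep.is_healthy():
            return ep
    raise RuntimeError("all endpoints down: failover exhausted")


primary = Endpoint("dc1-primary", healthy=False)  # simulated outage
replica = Endpoint("dc2-replica", healthy=True)

chosen = pick_endpoint([primary, replica])
print(chosen.name)  # routes around the dead primary
```

The point the comment makes stands: if you never actually exercise this path (e.g. by periodically killing the primary on purpose), you have no idea whether it works.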
Surprised a company of their scale, and with such a reliance on stability, isn’t running their own data centres. I guess they trusted their failover process enough not to care.
It was poor design. Poor design caused a two-day outage. When you’ve got an H/A control plane designed, deployed in production, and running services, and you are NOT actively using it for new services, let alone porting old services to it, you’ve got piss-poor management with no understanding of risk.
Mr Magoo is the CEO.