Around 10:00 AM PST, we started noticing a connectivity problem between our web servers and our database servers. We began investigating and got on the phone with Rackspace to get a better idea of what was going on. After about an hour on the phone with them, we figured out the issue, and the problem is now resolved.
So what happened? The short version.
A configuration change was made to our environment that started routing traffic through an untested load balancer instead of the dedicated firewall. Rackspace is still figuring out how and why this happened.
The long version.
We’ve been growing like crazy, and with that growth has come the need for architectural changes. Some of the changes we’ve made in the past involved switching to a hybrid cloud environment: away from clusters built on cloud servers and onto real iron. The first phase of this happened 5 months ago. We had a minor hiccup during that period, but it wasn’t as bad as this one. A few weeks ago, we noticed another need to upgrade, and we started talking with Rackspace about upgrading our infrastructure to what they called “Intensive”. This meant adding a dedicated hardware load balancer in place of the cloud load balancers (Zeus).
This Sunday, May 13th, 2012, the load balancer was added to our configuration. It took less than 10 seconds, and everything seemed to work fine. The next step in this load balancer switch was to coordinate a switch from the Cisco firewall to this load balancer for something called “RackConnect”, which allows our cloud servers to talk to our dedicated hardware. Somehow, the prep work being done for this connection got ahead of itself, and the environment changed. As of right now, no one knows why. Rackspace is investigating this.
We had plans to ditch cloud servers completely in favor of a custom cloud built on OpenStack. I’m actually in California right now at ChefConf 2012, and I was talking to someone from Dell about their OpenStack implementation on bare metal. We may speed that up now, so we have more control.
If anything, all these experiences have given me some great insight into public and private clouds, and how and when to use them.
So, I apologize for the downtime. This is what happened, and we’re working on a way to prevent it from happening again.