When things start to go wrong it can sometimes be impossible to contain the unravelling - as if the problematic situation quickly gains momentum and begins to 'snowball' into an even worse situation.
This happened to me recently. And since much of what went wrong could have been prevented with good process controls, I believe I have some valuable lessons to share with you.
At the very least this post will be entertaining, as I assume many of you reading this will think: "yep, been there, done that".
I'll start by mentioning how I met the folks at Load Impact and started working with their product and writing for them.
I was doing some shopping for a hosting provider for my personal and business website, and ran across someone else's blog post that tested the performance of all the major players in the 'affordable web hosting' segment. We are talking the $8/month type deals here - the bare bones.
This author used Load Impact to quantify the performance of all these providers and provided great insight into how they fared from a performance and scalability perspective.
My first thought was: awesome! - I'll use that same tool to test a few out myself, and then compare them to the performance of a self-hosted site. I already had a bunch of VMs running on an ESXi server, so adding a turnkey WordPress site would be super easy.
It turns out that my self-hosted site was much faster and scaled as well as I needed (verified with Load Impact), so in the end I decided to just self-host.
I'm not making any money from the sites - no ecommerce or ads - so it doesn't really matter from a business perspective. It's also easier to manage backups and control security when you manage the whole environment.
But it's also much more likely that the whole thing will get screwed up in a major time-consuming way.
I imagine there are many SMBs out there that self host as well, for a variety of reasons. It could be that you like having control of your company assets, it was faster and cheaper, or you just like doing everything yourself.
It's often very difficult for smart people to avoid doing things they can do but probably shouldn't, because it might not be the best use of their time.
In this blog post I'll demonstrate how quickly a situation like this can go wrong and then go from bad to worse:
Problem #1: my ISP screwed me!
If you are in business long enough, your ISP will screw you too. I made a change to my service plan (adding a phone line) the week before we went out of town.
For some reason nothing happened, so I decided to call my provider while 300 miles away from my house. Of course, this is exactly when things started to unravel.
Instead of provisioning my modem correctly, they removed my internet service and added phone. No internet. To make matters worse, I wasn't at home, so I couldn't troubleshoot.
Lesson #1 - don't make changes with your ISP unless you can be onsite quickly to troubleshoot.
It was nearly impossible for me to troubleshoot this issue because I couldn't VPN into my network; there was no connection at all.
I even had a neighbor come in and manually reboot both my firewall and modem. That didn't work, so my only recourse was a dreaded call to customer support.
The first time I called was a total waste of time: the customer support agent had no idea what was going on, so that call went nowhere.
Call number two the next day was slightly more productive in that it ended 45 minutes later and a level 2 support ticket was opened.
Finally, upon getting a level 2 engineer on the line (I was home at this point), they immediately recognized that my modem was mis-provisioned and set up for phone only! It took only minutes to properly provision the modem and get it back online.
Lesson #2 - if you are technically savvy, then immediately demand a level 2 support engineer. Time spent with first line support is usually a totally frustrating time suck.
Problem #2: Some things start working again and others mysteriously don't
After the final problem-resolving phone call was complete, I was tired, hot (the AC had been off while we were out of town) and irritated. So when the internet connection finally came back up, I wasn't exactly in an "I'm making great decisions" mindset.
The internet was back, but my webserver VM still wasn't serving my sites, and I had no idea what was going on.
Lesson #3 - Don't start making significant changes to things when tired, hot and irritated. It won't go well.
This is exactly the point at which I should have made a copy of the VM in its current state to make sure I didn't make things worse. Instead, I immediately went to my backup server (Veeam) and tried to restore the VM in question.
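On a standalone ESXi host, that pre-change safety copy is a one-liner. A sketch of what I should have run (the VM ID `12` is hypothetical; `vim-cmd vmsvc/getallvms` lists the real ones on your host):

```shell
# List registered VMs and their numeric IDs
vim-cmd vmsvc/getallvms

# Snapshot VM 12 before touching anything
# (args: name, description, include-memory=0, quiesce=0)
vim-cmd vmsvc/snapshot.create 12 pre-restore "before Veeam restore attempt" 0 0
```

With that snapshot in place, any botched restore attempt is a one-command rollback instead of a disaster.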
Well, guess what? That didn't work either: there was some sort of problem with the Veeam storage repository, and some of the backup data turned out to be corrupt.
I ended up with a partially restored but completely unusable webserver VM.
Lesson #4 - Test your backups regularly and make sure you have more than one copy of mission critical backups.
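Lesson #4 is easy to automate. Here's a minimal sketch in Python - the file names and the simulated "corruption" are made up for the demo - that verifies a backup copy against its source with checksums. The same idea scales to a nightly cron job walking your real backup sets:

```python
import hashlib
import os
import shutil
import tempfile

def sha256(path):
    """Stream a file through SHA-256 so large backups don't eat RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_backup(source, backup):
    """Return True if the backup matches the source byte-for-byte."""
    return sha256(source) == sha256(backup)

# Demo: one source file, one good copy, one silently corrupted copy
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "site.sql")
with open(src, "w") as f:
    f.write("CREATE TABLE posts (id INT);")

good = os.path.join(tmp, "backup1.sql")
bad = os.path.join(tmp, "backup2.sql")
shutil.copy(src, good)
shutil.copy(src, bad)
with open(bad, "a") as f:
    f.write("corruption")  # simulate silent bit rot in one copy

print(verify_backup(src, good))  # True
print(verify_backup(src, bad))   # False
```

Had something like this been running against my Veeam repository, the corrupt backup would have shown up weeks before I actually needed it.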
At some point in this whole fiasco, I remembered what the package sitting on my desk was. It was a replacement hard drive for my ZFS array, because one of the four drives in the RAIDZ1 array was "failing".
I figured that now would be the perfect time to swap that drive out and allow the array to heal itself.
Under normal circumstances this is a trivial operation, no big deal. Not this time!
This time, instead of replacing the failing hard drive, I accidentally replaced a perfectly good one!
So now I had a truly tenuous situation: a degraded array containing a failing hard drive and no redundancy whatsoever.
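A couple of minutes at the console prevents exactly this mistake. A sketch of the safe sequence on a ZFS box (the pool name `tank` and device names `ada2`/`ada4` are hypothetical placeholders for your own):

```shell
# 1. See which device ZFS actually reports as failing
zpool status tank

# 2. Map that device name to a physical serial number
#    (smartctl ships with smartmontools)
smartctl -i /dev/ada2 | grep -i serial

# 3. Match the serial against the label on the drive you are
#    about to pull - only then issue the replace
zpool replace tank ada2 ada4

# 4. Watch the resilver complete before touching anything else
zpool status -v tank
```

The serial-number check in step 2 is the part I skipped, and it's the only reliable link between what ZFS reports and the physical drive in your hand.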
Fortunately there wasn't any real data loss and eventually I was able to restore the VM from a good backup source.
Finally back online!
Lesson #5 - Be extra diligent when working on your storage systems and refer to Lesson #3.
The overall message here is that most, if not all, of these issues could have been easily avoided. But that is the case 99% of the time in IT - people make mistakes, there is a lack of good, well-documented processes to handle outages, and of course hardware will fail.
It's also worth noting that in large enterprises mechanisms for change control are usually in place - preventing staff from really messing things up or making changes during business hours.
Unfortunately, many smaller businesses don't have those constraints.
So what does this have to do with Load Impact? Nothing directly...but I think it's important for people to be aware of the impact that load and performance testing can have on the infrastructure that runs your business and plan accordingly when executing test plans.
Just like you wouldn't do something stupid like changing network configs, ISP settings or storage without thoroughly thinking it through, you also shouldn't unleash a worldwide load test with 10,000 concurrent users without thinking about when to execute the test (hint - schedule it) and what the impact will be on production systems.
Hopefully there is a test/dev or pre-production environment where testing can take place continuously, but don't forget that shared resources like firewalls and routers may still be affected even if the web/app tiers are not.
And always remember Murphy's law: Anything that can go wrong will go wrong.
This post was written by Peter Cannell. Peter has been a sales and engineering professional in the IT industry for over 15 years. His experience spans multiple disciplines including networking, security, virtualization and applications. He enjoys writing about technology and offering a practical perspective on new technologies and how they can be deployed. Follow Peter on his blog or connect with him on LinkedIn.
Don't miss Peter's next post, subscribe to the Load Impact blog by clicking the "follow" button below.