Vigilant.IT recently went through a rebranding.
New culture, renewed products, new logo and, of course, a brand-new website. Just like our old website, it was a no-brainer to host the new one on Azure too.
Everything was going according to plan. We got to our launch date, swapped the old for the new, general applause, let’s go to business as usual, aka BAU…
That’s until we decided to clean up and archive our old website *insert terrifying music here*!
On a bright and sunny Friday morning (have you noticed that Fridays are always bright and sunny, even when it rains?), I got the casual, yet dreaded, question from one of our team members: “Hey Steph, is there something wrong with our website? I can’t access it.”
Anyone in the IT industry knows the cold sweat feeling that follows. That “Crap! I pressed the wrong button” or “pulled the wrong plug” moment, because it simply cannot be a coincidence that the website goes offline while your team is cleaning up the old one.
But then experience kicks in, and I calmly go through some immediate possible scenarios and troubleshooting steps in my head:
- The website could have been stopped by mistake > go to the portal to check the status
- Something might have happened with DNS (It’s always DNS!) > go to the portal to check the status
- Our web developer could have rolled out a new version during business hours > go to the portal to check the status
- Our engineering team might have started emergency maintenance > go to the portal to check the status… and so on.
So, you guessed it, my next move was to log onto the Microsoft Azure portal to check on the website’s status…
I think my brain went blank while I was desperately trying to find the WebApp resource in Azure that was hosting our brand spanking new site!
Could I have been stripped of security rights? Was I looking in the wrong subscription, or even tenancy?
I went through a range of questions and emotions until, in disbelief, I had to resign myself to the fact that both our old and new websites had been deleted.
After the initial shock dissipated, I thought “no problem, we have backups, so let’s start the restore process”. Think again Steph… Unfortunately, the storage account that contained (note the past tense here) the backups had been deleted as well!
What happened? Why had the new website been deleted? Why did the backup storage account get deleted too? Was it malicious?
So many questions at this stage run through your mind, but you have to set them aside and focus on the primary task: get the website back up and running, and deal with the rest later.
Luckily, we hosted our site on Azure Cloud.
Why lucky? Because Azure doesn’t immediately and permanently delete resources like Web Apps, MySQL databases or Storage Accounts; they remain recoverable for a limited retention period after deletion.
Can you hear the sigh of relief echoing through the internet?!? I still can!
With two PowerShell commands, our website was recovered: Restore deleted apps – Azure App Service | Microsoft Docs
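For reference, the recovery looks roughly like this (a sketch based on the Az.Websites cmdlets; the resource group, app and plan names below are illustrative placeholders, not our actual ones):

```powershell
# List web apps that were recently deleted in this subscription (Az.Websites module)
Get-AzDeletedWebApp

# Restore a deleted app into an existing App Service plan.
# All names here are placeholders.
Restore-AzDeletedWebApp -ResourceGroupName "rg-website-prod" `
                        -Name "company-website" `
                        -TargetAppServicePlanName "asp-website-prod"
```

The first command tells you what Azure still holds on to; the second brings it back.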
Within a few clicks on the portal, the backup storage account was recovered: Recover a deleted storage account – Azure Storage | Microsoft Docs
Finally, within a few more clicks, a couple of copy/pastes, and an API call, the MySQL database was recovered too: Restore a deleted Azure Database for MySQL server | Microsoft Docs
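That API call goes through the Azure Resource Manager REST API. A hedged sketch of what it looks like from PowerShell, assuming a point-in-time restore from the dropped server; the subscription ID, resource group, server names, region, timestamp and api-version are all placeholders you'd replace with your own:

```powershell
# Restore a dropped Azure Database for MySQL server via the ARM REST API.
# Every identifier below is an illustrative placeholder.
$body = @{
    location   = "australiaeast"
    properties = @{
        createMode         = "PointInTimeRestore"
        restorePointInTime = "2021-06-11T08:00:00Z"   # a moment just before the deletion
        sourceServerId     = "/subscriptions/<sub-id>/resourceGroups/rg-website-prod/providers/Microsoft.DBforMySQL/servers/company-mysql"
    }
} | ConvertTo-Json -Depth 5

Invoke-AzRestMethod -Method PUT `
    -Path "/subscriptions/<sub-id>/resourceGroups/rg-website-prod/providers/Microsoft.DBforMySQL/servers/company-mysql-restored?api-version=2017-12-01" `
    -Payload $body
```

The linked Microsoft Docs page has the authoritative request shape for your server type.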
Of course, it took DNS a little while to propagate our new IP address across the internet (it’s always DNS!), but if it weren’t for DNS, the entire site would have been back up and running within 15 minutes…
So, how did this happen anyway?
While one might be tempted to think it all happened because an engineer deleted the entire Resource Group rather than the individual resources within it, I think it was really a series of smaller mistakes that led to this specific incident.
Firstly, a Resource Lock with appropriate comments should have been used, so that these vital production resources couldn’t be deleted without first getting the engineer’s attention.
Secondly, the backup Storage Account should have never been created within the same Resource Group.
Thirdly, the new website was created in the same Resource Group as the old website, with an ambiguous naming convention and no tags, making it very hard for anyone to differentiate between the two.
And finally, assumptions were made that the engineer would know and understand the environment without prior exposure, without tags/notes to describe the resources, and under time pressure to “get it done”.
From the above, it seems pretty clear that this incident was the culmination of “a series of unfortunate events” and that mistakes can be made without anyone (including the person who made the mistake) ever knowing. These become time bombs that could either “sleep” forever, or create incidents within your production environment years later with a simple, innocuous click of a button.
What have we learned from this incident?
- Make sure that your Production Resources are appropriately tagged, so it is immediately evident, even to someone who has never been involved in deploying or managing the environment, what each resource does. In the “olden days” we used to do this through naming conventions, but in Azure you can now easily use tags to extend the description and categorisation:
Tag resources, resource groups, and subscriptions for logical organization – Azure Resource Manager | Microsoft Docs
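A minimal sketch of tagging with the Az PowerShell module; the resource names and tag values are made up for illustration:

```powershell
# Tag a resource so its purpose is obvious to anyone browsing the portal.
# Resource names and tag values are illustrative placeholders.
$resource = Get-AzResource -Name "company-website" -ResourceGroupName "rg-website-prod"

New-AzTag -ResourceId $resource.ResourceId -Tag @{
    Environment = "Production"
    Purpose     = "Company website (current live site)"
    Owner       = "infrastructure-team"
}
```

Tags like `Environment` and `Owner` would have made it obvious which of the two look-alike websites was the live one.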
- Running resources in Azure does not mean they suddenly become immutable, or will stay up and running. Always ensure you have good backups, redundancy and a recovery plan.
- Try to put backup-related items (Storage Accounts, Backup Vaults, etc.) in a separate Resource Group. This ensures that, if the application’s entire Resource Group gets deleted, your backups are still available. (You should also create the backup resources in another region for extra safety.)
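As a sketch of that idea, assuming Az PowerShell and placeholder names/regions of my choosing:

```powershell
# Keep backup storage in its own Resource Group, in a different region
# to the application it protects. All names and regions are placeholders.
New-AzResourceGroup -Name "rg-website-backup" -Location "australiasoutheast"

New-AzStorageAccount -ResourceGroupName "rg-website-backup" `
                     -Name "companywebsitebackup" `
                     -Location "australiasoutheast" `
                     -SkuName "Standard_GRS"   # geo-redundant, for extra safety
```

Deleting the application’s Resource Group now leaves the backups untouched.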
- Make good use of Resource Locks in Azure. Locks on Resources, Resource Groups and Subscriptions are a great feature that ensures things don’t get deleted inadvertently:
Lock resources to prevent changes – Azure Resource Manager | Microsoft Docs
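A one-liner sketch of the lock that would have saved us a Friday morning (lock name, resource group and note text are illustrative):

```powershell
# A CanNotDelete lock on the production Resource Group; the note tells
# whoever hits the lock why it exists. Names here are placeholders.
New-AzResourceLock -LockName "protect-prod-website" `
                   -LockLevel CanNotDelete `
                   -ResourceGroupName "rg-website-prod" `
                   -LockNotes "Production website. Check with the infra team before removing this lock."
```

With this in place, the delete attempt fails until someone consciously removes the lock, which is exactly the pause-and-think moment this incident was missing.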
- We all make mistakes, even the most highly skilled individuals (we’re not computers after all), but it’s very important to be open and honest when mistakes are made, so we can all learn from them.
Finally, it’s equally important to create a company culture where people are comfortable being open about their mistakes, but that’s a longer story for another blog post!
This article written by Stephane Budo – Vigilant.IT Director