At 8:16 this morning, a long-time client sent me an email saying that one of his customers was getting a strange bug in their scheduling application. They could see everything except for the tabbed UI control they needed to use. In other words, there was a hole in the screen where the data entry should have been.
Here's how the rest of the day went around this issue. It's the kind of thing that makes me proud to be an engineer, in the same way the guys who built Galloping Gertie were proud.
It all started when I updated a Windows Azure cloud service from the no-longer-supported SDK 1.7 running on Windows Server 2008 to the current SDK (2.2) and operating system (Windows Server 2012 R2). I also upgraded the framework from .NET 4.0 to .NET 4.5.1, which is only possible on WS2012R2.
This upgrade started months ago, and proceeded slowly because both I and the clients had other priorities. I mean, who wants to spend a lot of money upgrading a platform without upgrading the application running on it? So the last build of the application went to production in October, and I haven't touched it since. I mean, it worked fine, why mess with it? Other than the fact that the operating system and Azure SDK are no longer supported.
Before pushing the update, I thoroughly tested the application. I mean, unit tests up the ying, with a tens-of-steps-long regression test on my local machine, and on an Azure test instance, before even looking askance at the Production instance. When I had tested everything I could imagine, I did this:
- Stopped the application, to ensure no one changed any data during the upgrade.
- Made a full copy of the production database ("CREATE DATABASE productioncopy AS COPY OF production")
- Once the data was fully copied, I uploaded the new bits to the Staging slot of the application.
- I updated the configuration info to the current standards.
- VIP swap! (I swapped the staging and production instances, so the old production instance was now in the staging slot.)
- And....it's running just fine. All that planning and testing worked!
So what happened? Well, it turns out there's one thing I didn't anticipate: Internet Explorer 8, released five years ago Thursday, and known to have difficulties with JavaScript. Plus, the controls we used when we originally deployed in January 2008, made by Infragistics, have known incompatibilities with IE8, but again: the application has worked just fine the whole time.
Since everything worked just fine on earlier versions of the application, and since this update didn't directly change the UI, and since IE8 hasn't been supported in quite some time, I figured there wouldn't be any problems.
It turns out that a sizable portion of my client's customers use IE8, because they're big hospitals with big IT departments and little budgets for updates.
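A tally of User-Agent strings from the web logs is one way to spot this kind of thing before an upgrade. Here's a minimal sketch in Python, assuming you can already pull the User-Agent values out of your logs. The function names are made up for illustration, and the classification is deliberately coarse: it relies only on the `MSIE` and `Trident/4.0` tokens that IE8 sends.

```python
import re
from collections import Counter

# IE8 normally reports "MSIE 8.0". In Compatibility View it reports
# "MSIE 7.0" but still carries the "Trident/4.0" engine token.
_MSIE = re.compile(r"MSIE (\d+)\.\d+")


def classify_user_agent(ua):
    """Return a coarse browser label for one User-Agent string."""
    if "Trident/4.0" in ua:
        return "IE8"
    match = _MSIE.search(ua)
    if match:
        return "IE" + match.group(1)
    # Everything without an MSIE token lands here, including IE11.
    return "other"


def browser_share(user_agents):
    """Return {label: fraction of requests} for an iterable of UA strings."""
    counts = Counter(classify_user_agent(ua) for ua in user_agents)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}
```

Feed `browser_share` the User-Agent column from a week of logs and look at the `"IE8"` entry. If it's anything but tiny, IE8 belongs in the regression test.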
Once I realized with abject horror that the application was simply broken for most of the people using it, I resigned myself to rolling back to the previous release, which had worked just fine. When I got home, I started this task, and the following things happened:
- Once again, I stopped the application.
- The actual database restore went fine, as did the VIP swap putting the previous version back in the Production slot and the new version in the Staging slot.
- When the application started up, I realized I'd forgotten to roll back the configuration information for the logging and messaging component. So the application failed to start.
- I rolled back the config.
- The application again failed to start. Only now, because the logging and messaging component is the part that's failing, I can't see any diagnostics.
- Fortunately, I deployed the application with Remote Desktop enabled, so I tried connecting to the virtual machine directly.
- The Remote Desktop user account had expired.
- Fortunately I use great source control. In Mercurial, I updated to the last production build before the update, and loaded it into Visual Studio.
- Tried to load into Visual Studio, and failed. See, I no longer have the Azure SDK v1.7. I never installed it on this machine, in fact. I'm running SDK 2.2, and I have no easy way of running an older version.
So, as far as I knew at this point, there was simply no way to get into the application, and no way for me to re-upload the old version.
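The snags in that list (the stale configuration, the expired Remote Desktop account, the missing SDK) are all things that can be checked before a rollback is ever needed. Here's a minimal sketch of such a check in Python. Everything in it is hypothetical: the `RollbackPlan` fields and the `rollback_problems` function are illustrative names, not part of any Azure tooling, and you'd have to gather the values yourself.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class RollbackPlan:
    """Hypothetical description of what a rollback depends on."""
    rdp_account_expires: date          # expiry date of the Remote Desktop user
    build_sdk_version: str             # SDK the previous release was built with
    installed_sdk_versions: List[str]  # SDKs available on the build machine
    old_config_saved: bool             # is the previous release's config on hand?


def rollback_problems(plan, today):
    """Return a list of human-readable reasons the rollback might fail."""
    problems = []
    if plan.rdp_account_expires <= today:
        problems.append("Remote Desktop account has expired")
    if plan.build_sdk_version not in plan.installed_sdk_versions:
        problems.append(
            "SDK %s is not installed on this machine" % plan.build_sdk_version
        )
    if not plan.old_config_saved:
        problems.append("previous configuration is not saved")
    return problems
```

An empty list means none of these particular traps are waiting. It says nothing about the ones nobody has thought of yet.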
I decided to try a different tack. I rolled back the rollback and restarted the new version. I also started trying to get my last remaining Windows XP machine running so that I could confirm the bug, and start testing fixes on a Test instance running Windows Server 2012 R2.
Getting a 10-year-old laptop to boot, let me log in, stop wasting time with all the detritus it acquired in its years of service, connect to my network, and open up IE8, took 45 minutes.
Some time in there I walked Parker.
So now, I can see that the error exists in IE8, and I also have found an article on how to reset the RDP password expiration date. Only, I'm really tired, and I am worried I'll make stupid errors if I keep trying to debug this right now.
So I have two approaches I will try first thing in the morning: first, roll back to the October release, and manually update the RDP expiration date so I can remote in and debug the configuration problem. Then I'll have to re-create all the data my client added yesterday, which will take me at least an hour. If I'm supremely lucky I'll have this done by 8am. Since I've had no luck at all so far on this upgrade, I am not optimistic.
Second, I'll start removing the outdated Infragistics code. Believe it or not, jQuery works fine on IE8, despite it being pretty much the latest thing in user interface libraries. It's the custom crap Infragistics pushed out years ago that fails. Unfortunately I won't be able to deploy this before leaving on Thursday morning. Fortunately the application isn't going to stop working suddenly; the OS and SDK are no longer supported, but they won't actually turn the OS off until June.
And there's the irony in a nutshell. I thought I did everything right in the deployment cycle, especially the part where I got three months ahead of the due date. The things that went wrong to get me to this state of frustration and exhaustion were numerous and tiny, kind of like the things that go wrong to cause an aviation accident. That said, the client has suffered no data loss, and I preserved a whole catalog of options to fix the problem (relatively) quickly. This isn't the disaster it would have been without the deployment tools you get with Azure.
Plus, I've learned to test everything on IE8 whenever health care companies are involved. Sheesh.