The Daily Parker

Politics, Weather, Photography, and the Dog

Scary software deployment

Jez Humble, who wrote the book on continuous delivery, believes deployments should be boring. I totally agree; it's one of the biggest reasons I like working with Microsoft Azure.

Occasionally, however, deploying software is not at all boring. Today, for example.

Because Microsoft ends support for Windows Server 2008 next week, I've upgraded an old application that I first released to Azure in August 2012. Well, actually, I updated it back in March, so I could get ahead of the game, and the boring deployment turned horrifying when half of my client's customers couldn't use the application because the OS upgrade broke their Windows XP/IE8 user experience. Seriously.

All of my client's customers have now upgraded to Chrome, IE11, or Firefox, and I've tested the app on all three browsers. Everything works. But now I have to redeploy the upgrade, and I've got a real feeling of being once-bitten.

The hard part, the part that makes this a one-way upgrade, is a significant change to the database schema. All the application's lookup lists, event logging, auditing, and a few other data structures are incompatible with the current Production version. Even if there weren't an OS upgrade involved, the database changes are overdue, so there is no going back after this.

Here are the steps that I will take for this deployment:

  1. Copy current Production database to new MigrationTest database
  2. Upgrade MigrationTest database
  3. Verify Test settings, connection strings, and storage keys
  4. Deploy Web project to Test instance (production slot)
  5. Validate Test instance
  6. Deploy Worker project to Test instance (production slot)
  7. Validate Worker instance
  8. Shut down Production instance
  9. Back up Production database to bacpac
  10. Copy Production database within SQL instance
  11. Upgrade Production database
  12. Verify Production settings, connection strings, and storage keys
  13. Deploy solution to Production instance (staging slot)
  14. Validate Production Web instance
  15. Validate Production Worker instance
  16. VIP swap to Production

Step 1 is already complete. Step 2 will be delayed for a moment while I apply a patch to Visual Studio over my painfully-slow internet connection (thanks, AT&T!). And I expect to be done with all of this in time for Game of Thrones.
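
A checklist like this can also be written as a small runbook script, so that a failed step stops everything after it. What follows is only a sketch: the step names come from the list above, but `run_runbook` and the placeholder actions are hypothetical and don't call any real Azure tooling.

```python
from typing import Callable, List, Tuple

# A step is a human-readable name plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that reports failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, action in steps:
        if not action():
            print(f"STOP: '{name}' failed; later steps were not run")
            break
        completed.append(name)
        print(f"ok: {name}")
    return completed


# Placeholder actions: each lambda stands in for a real check or deployment command.
# The third step is made to fail on purpose to show the halt behavior.
steps = [
    ("Copy current Production database to new MigrationTest database", lambda: True),
    ("Upgrade MigrationTest database", lambda: True),
    ("Verify Test settings, connection strings, and storage keys", lambda: False),
    ("Deploy Web project to Test instance (production slot)", lambda: True),
]

done = run_runbook(steps)
```

Here the third step fails, so the Web project never gets deployed. For a one-way upgrade, that's the behavior you want.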

Snopes on the Million Atari Cartridge Burial legend

Snopes just republished the legend of the E.T. game cartridges in light of the actual burial site being dug up recently. Forgetting for a moment the legend itself, the background story was a description of how Warner management killed Atari:

In 1982, Warner Communications could honestly claim to own a goose that laid golden eggs. Its money-producing fowl was called Atari, a video game company it purchased for $28 million in 1976 which had since burgeoned into a $2 billion concern. In the early 1980s Atari owned 80% of the video game market, it accounted for 70% of Warner's operating profits, and in the fourth quarter of 1982 the Wall Street "whisper number" concerning Atari's expected earnings predicted a 50% increase over the previous year.

The goose died at 3:04 P.M. EST on 7 December 1982, when Atari reported only a 10% to 15% increase in expected earnings, not the 50% figure so many people had been counting on. By the end of the following day Warner stock had plummeted to two-thirds of its previous value, and Warner closed out the quarter with its profits down a mind-boggling 56%. (Even worse, a minor scandal erupted when it was revealed that Atari's president and CEO had sold 5,000 shares of Warner stock a mere 23 minutes before announcing Atari's disappointing sales figures.) Atari racked up over half a billion dollars ($536 million) in losses in 1983, and by the end of 1984 Warner had sold the company.

What accounted for the sudden death of Warner's prized goose? A number of interrelated factors brought about its fatal illness...

The factors Snopes summarized highlight how acquisitions by incompatible companies can go wrong, among other things.

Waiting for software to deploy...

I'm uploading a couple of fixes to Inner-Drive.com right now, so I have a few minutes to read things people have sent me. It takes a while to deploy the site fully, because the Inner Drive Extensible Architecture™ documentation (reg.req.) is quite large—about 3,000 HTML pages. I'd like to web-deploy the changes, but the way Azure cloud services work, any changes deployed that way get overwritten as soon as the instance reboots.

All of the changes to Inner-Drive.com are under the hood. In fact, I didn't change anything at all in the website. But I made a bunch of changes to the Azure support classes, including a much better approach to logging inspired by a conversation I had with my colleague Igor Popirov a couple of weeks ago. I'll go into more details later, but suffice it to say, there are some people who can give you more ideas in one sentence than you can get in a year of reading blogs, and he's one of them.

So, while sitting here at my remote office waiting for bits to upload, I encountered these things:

  • The bartender's iPod played "Bette Davis Eyes" which immediately sent me back to this.
  • Andrew Sullivan pointed me (and everyone else who reads his blog) towards the ultimate Boomer fantasy, the live-foreverists. (At some point in the near future I'm going to write about how much X-ers hate picking up after both Boomers and Millennials, and how this fits right in. Just, not right now.)
  • Slate's Jamelle Bouie believes Wisconsin's voter rights decision is a win for our cause. ("Our" in this case includes those who believe retail voter fraud is so rare as to be a laughable excuse for denying a sizable portion of the population their voting rights, especially when the people denied voting rights tend to be the exact people who Republicans would prefer not to vote.)

OK, the software is deployed, and I need to walk Parker now. Maybe I'll read all these things after Game of Thrones.

Lunchtime reads

I may come back to these again:

Publishing the Inner Drive Extensible Architecture™ to NuGet is still coming up...just not this weekend.

Not embracing open source so much as shaking hands

The Inner Drive Extensible Architecture™ is about to get wider distribution.

After 11 years of development, I think it's finally ready for wider distribution. And, who knows, maybe I'll make a couple of bucks.

I've updated the pricing structure and the license agreement, and in the next week or so (after some additional testing), I'm going to release it to NuGet.

That doesn't make it free; that makes it available. (Actually, I am making it free for development and testing, but I'm charging for commercial production use.)

I'll have more to say on this once it's released.

Chicago-based Bswift gets $51 million in funding

Crain's reports this morning that a local Chicago technology start-up (not mine) has just gotten a ton of money:

Bswift LLC, a healthcare-benefits software firm, has received $51 million from a private-equity fund to keep up with torrid growth.

The Chicago-based company has been growing at more than 40 percent annually for the past four years and is enjoying a surge in demand, in part because of the Affordable Care Act. Bswift's technology is used by companies to provide comparison shopping, enrollment and administration of health insurance benefits. It also operates insurance exchanges for private and public markets.

Bswift's business is exploding. Headcount at the firm, which is based in the West Loop, has soared to more than 300 from 165 a year ago. A year ago, the company had expected to hire 100 people over three years.

“We've added 45 people since the beginning of the year,” [Bswift CEO Rich] Gallun said. “We'll be over 400 by the end of the year.”

Wow. And whoa. And woe.

That's really good news for Bswift's owners and stakeholders. I'm concerned what it's like to work there, though. Managing any growth taxes the abilities of any manager or business owner. Growing staff by 5% every month—they've added 45 employees this year alone—has to be a strain.
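
For anyone who wants to check the 5% figure, here is a quick back-of-the-envelope calculation. It uses only the headcount numbers quoted above and assumes growth compounded evenly over 12 months, which is a simplification.

```python
# Headcount figures quoted in the Crain's article.
headcount_year_ago = 165
headcount_now = 300
months = 12  # assumption: "a year ago" means exactly 12 months

# Compound monthly growth rate: (end / start) ** (1 / months) - 1
monthly_rate = (headcount_now / headcount_year_ago) ** (1 / months) - 1

print(f"Average monthly growth: {monthly_rate:.1%}")
```

That works out to a little over 5% a month, compounded.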

I'm curious what it's like over there right now, and how they're managing the growth. With this infusion of cash, they're going to have a lot of pressure to grow even faster. How will they maintain their culture? How will they manage quality and delivery? What do their clients think?

Microsoft Azure TV ads

Microsoft has partnered with Lotus Formula 1 Racing to create a series of ads about Microsoft Azure:

Neowin reports:

The new ad, which has been running for the past few days on many U.S. TV networks and has been posted on YouTube, attempts to show how the Lotus Formula 1 racing team uses a number of Microsoft cloud services such as Azure, Office 365 and Dynamics to collect and analyze data from 200 sensors on the car. The ad's main theme is that the cloud products offer the Lotus team a way to better understand how the F1 vehicle runs on each track and, therefore, give them an edge in winning races.

The new TV commercial comes even as rumors hit the Internet that Microsoft is planning to rebrand its Windows Azure cloud website hosting service to Microsoft Azure, in order to better reflect the fact that it can use software not made by the company like Linux.

The "rumors" are true, by the way. The service is now called Microsoft Azure.

It's not the good times they care about, it's the bad

The repercussions from Monday's data-recovery debacle continued through yesterday.

By the time business started Tuesday morning, I had restored the client's application and database to the state it had at the moment of the upgrade, and I'd entered most of their appointments, including all of them through tomorrow (Thursday). When the client started their day, everything seemed to be all right, except for one thing I also didn't know about their business: some of their customers pay them based on the appointment ID, which is nothing more than a SQL IDENTITY column in the database.

If you know how databases work, you know that IDENTITY columns are officially non-deterministic. In this specific case, the column increments by one every time it adds a row, but also in this specific case, I didn't re-enter the data in the same order it was originally entered, since I prioritized the earlier appointments.
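
To make that concrete, here is a minimal sketch of how re-entering the same rows in a different order hands out different IDs. It uses SQLite's AUTOINCREMENT as a stand-in for a SQL Server IDENTITY column (the two are not identical, but they behave the same way for this purpose), and the customer names are made up.

```python
import sqlite3


def insert_in_order(names):
    """Insert appointments into a fresh table and return {customer: generated id}."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE appointments ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, customer TEXT NOT NULL)"
    )
    for name in names:
        conn.execute("INSERT INTO appointments (customer) VALUES (?)", (name,))
    ids = dict(conn.execute("SELECT customer, id FROM appointments").fetchall())
    conn.close()
    return ids


# Original entry order versus re-entry with the most urgent appointment first.
original = insert_in_order(["Alice", "Bob", "Carol"])
reentered = insert_in_order(["Carol", "Alice", "Bob"])

for customer in ("Alice", "Bob", "Carol"):
    print(customer, original[customer], "->", reentered[customer])
```

Every customer ends up with a different appointment ID than before, which is exactly what breaks anyone who pays by that number.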

We've gotten through the problem now, and the client no longer wants to put my head on a spike, so I will now take a moment for an after-action review that might help other software developers in the future.

First, the things I did right:

  • When I deployed the upgrade Saturday, I preserved the state of the database and application at exactly that moment.
  • All of the data in the system, every field of it, was audited. It was trivially easy to produce a report of every change made to the system from roll-out Saturday afternoon through roll-back Monday night.
  • When I rolled back the upgrade Monday night, I preserved the state of the upgraded database and application at exactly that moment.
  • When the client first noticed the problem, I dropped everything else and worked out a plan with them. The plan centered around getting their business back up first, and then dealing with the technology.
  • Their customers were completely back to normal at the start of business Tuesday.
  • The application runs on Windows Azure, which made preserving the old application state not only easy, but possible.

So what should I have done better?

  • My biggest error was overconfidence in my ability to roll back the upgrade. No matter what other errors I made, this was the root of all of them.
  • The second major error was not testing the UI on Internet Explorer 8. Mitigating this was the fact that neither I nor my client was aware that the bulk of their customers used IE8. However, given that people using IE8 were totally unable to use the application, even if the number of customers using IE8 was very small, the large impact should have put IE8 near the top of my regression test checklist.
  • Instead of spending a couple of hours re-entering data, I should have written a script to do it.
  • I have always regretted (though never more than today) publicizing the appointments IDENTITY column to the end user, because it's normal they'd use this ID for business purposes. This illustrates the danger—not just the sloppy design—of using a single database field for two purposes. Any future version of the application will have an OrderID field that is not a database plumbing field.
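
The last two points can be sketched together: keep the customer-facing number in its own column, and let a script re-enter the audited rows in whatever order is most urgent. This is an illustration only, again using SQLite in place of SQL Server; the table layout, the `order_id` column name, and the sample rows are assumptions, not the real application's schema.

```python
import sqlite3

# Hypothetical audit rows: (order_id the customer already knows, customer, date).
audited = [
    (1003, "Carol", "2014-03-27"),
    (1001, "Alice", "2014-03-28"),
    (1002, "Bob", "2014-03-31"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE appointments (
        id INTEGER PRIMARY KEY AUTOINCREMENT,  -- database plumbing, never shown to users
        order_id INTEGER NOT NULL UNIQUE,      -- the number customers see and pay against
        customer TEXT NOT NULL,
        starts_on TEXT NOT NULL
    )
    """
)

# Re-enter the earliest appointments first; order_id travels with each row.
for order_id, customer, starts_on in sorted(audited, key=lambda row: row[2]):
    conn.execute(
        "INSERT INTO appointments (order_id, customer, starts_on) VALUES (?, ?, ?)",
        (order_id, customer, starts_on),
    )

restored = dict(conn.execute("SELECT customer, order_id FROM appointments").fetchall())
plumbing = dict(conn.execute("SELECT customer, id FROM appointments").fetchall())
print(restored)
```

The plumbing `id` values follow insertion order, but nobody outside the database ever sees them, so re-entry order stops mattering.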

All in all, the good things outweighed the bad, and I may get back in my client's good graces when I roll out the next update. You know, the one that works on IE8, but still solves the looming problem of the platform's age.

And the day started so well...

At 8:16 this morning, a long-time client sent me an email saying that one of his customers was getting a strange bug in their scheduling application. They could see everything except for the tabbed UI control they needed to use. In other words, there was a hole in the screen where the data entry should have been.

Here's how the rest of the day went around this issue. It's the kind of thing that makes me proud to be an engineer, in the same way the guys who built Galloping Gertie were proud.

It all started when I updated a Windows Azure cloud service from the no-longer-supported SDK 1.7 running on Windows Server 2008 to the current SDK (2.2) and operating system (Windows Server 2012 R2). I also upgraded the framework from .NET 4.0 to .NET 4.5.1, which is only possible on WS2012R2.

This upgrade started months ago, and proceeded slowly because both I and the clients had other priorities. I mean, who wants to spend a lot of money upgrading a platform without upgrading the application running on it? So the last build of the application went to production in October, and I haven't touched it since. I mean, it worked fine, why mess with it? Other than the fact that the operating system and Azure SDK are no longer supported.

Before pushing the update, I thoroughly tested the application. I mean, unit tests up the ying, with a tens-of-steps-long regression test on my local, and on an Azure test instance, before even looking askance at the Production instance. When I had tested everything I could imagine, I did this:

  1. Stopped the application, to ensure no one changed any data during the upgrade.
  2. Made a full copy of the production database ("CREATE DATABASE productioncopy AS COPY OF production")
  3. Once the data was fully copied, I uploaded the new bits to the Staging slot of the application.
  4. I updated the configuration info to the current standards.
  5. VIP swap! (I swapped the staging and production instances, so the old production instance was now in the staging slot.)
  6. And....it's running just fine. All that planning and testing worked!

So what happened? Well, it turns out there's one thing I didn't anticipate: Internet Explorer 8, released five years ago Thursday, and known to have difficulties with JavaScript. Plus, the controls we used when we originally deployed in January 2008, made by Infragistics, have known incompatibilities with IE8, but again: the application has worked just fine the whole time.

Since everything worked just fine on earlier versions of the application, and since this update didn't directly change the UI, and since IE8 hasn't been supported in quite some time, I figured there wouldn't be any problems.

It turns out that a sizable portion of my client's customers use IE8, because they're big hospitals with big IT departments and little budgets for updates.

Once I realized with abject horror that the application was simply broken for most of the people using it, I resigned myself to rolling back to the previous release, which had worked just fine. When I got home, I started this task, and the following things happened:

  1. Once again, I stopped the application.
  2. The actual database restore went fine, as did the VIP swap putting the previous version back in the Production slot and the new version in the Staging slot.
  3. When the application started up, I realized I'd forgotten to roll back the configuration information for the logging and messaging component. So the application failed to start.
  4. I rolled back the config.
  5. The application again failed to start. Only now, because the logging and messaging component is the part that's failing, I can't see any diagnostics.
  6. Fortunately, I deployed the application with Remote Desktop enabled, so I tried connecting to the virtual machine directly.
  7. The Remote Desktop user account had expired.
  8. Fortunately I use great source control. In Mercurial, I updated to the last production build before the update, and loaded it into Visual Studio.
  9. Tried to load into Visual Studio, and failed. See, I no longer have the Azure SDK v1.7. I never installed it on this machine, in fact. I'm running SDK 2.2, and I have no easy way of running an older version.

So, as far as I knew at this point, there was simply no way to get into the application, and no way for me to re-upload the old version.

I decided to try a different tack. I rolled back the rollback and restarted the new version. I also started trying to get my last remaining Windows XP machine running so that I could confirm the bug, and start testing fixes on a Test instance running Windows Server 2012 R2.

Getting a 10-year-old laptop to boot, let me log in, stop wasting time with all the detritus it acquired in its years of service, connect to my network, and open up IE8, took 45 minutes.

Some time in there I walked Parker.

So now, I can see that the error exists in IE8, and I also have found an article on how to reset the RDP password expiration date. Only, I'm really tired, and I am worried I'll make stupid errors if I keep trying to debug this right now.

So I have two approaches I will try first thing in the morning: first, roll back to the October release, and manually update the RDP expiration date so I can remote in and debug the configuration problem. Then I'll have to re-create all the data my client added yesterday, which will take me at least an hour. If I'm supremely lucky I'll have this done by 8am. Since I've had no luck at all so far on this upgrade, I am not optimistic.

Second, I'll start removing the outdated Infragistics code. Believe it or not, jQuery works fine on IE8, despite it being pretty much the latest thing in user interface languages. It's the custom crap Infragistics pushed out years ago that fails. Unfortunately I won't be able to deploy this before leaving on Thursday morning. Fortunately the application isn't going to stop working suddenly; the OS and SDK are no longer supported, but they won't actually turn the OS off until June.

And there's the irony in a nutshell. I thought I did everything right in the deployment cycle, especially the part where I got three months ahead of the due date. The things that went wrong to get me to this state of frustration and exhaustion were numerous and tiny, kind of like the things that go wrong to cause an aviation accident. That said, the client has suffered no data loss, and I preserved a whole catalog of options to fix the problem (relatively) quickly. This isn't the disaster it would have been without the deployment tools you get with Azure.

Plus, I've learned to test everything on IE8 whenever health care companies are involved. Sheesh.
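
The small failures in this story (an expired Remote Desktop account, a missing SDK version, a forgotten configuration rollback) are the kind of thing a check run before the upgrade could catch. Here is a minimal sketch of such a check. Every name in it is hypothetical, the dates are made up, and nothing here queries Azure; the values would have to be filled in by hand or by other tooling.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Set


@dataclass
class RollbackPlan:
    """Facts that need to be true before a rollback can actually work."""
    rdp_account_expires: date
    sdk_needed_for_old_build: str
    sdks_installed: Set[str]
    old_config_saved: bool


def rollback_blockers(plan: RollbackPlan, today: date) -> List[str]:
    """Return a list of reasons the rollback would get stuck (empty means none found)."""
    problems = []
    if plan.rdp_account_expires <= today:
        problems.append("Remote Desktop account has expired")
    if plan.sdk_needed_for_old_build not in plan.sdks_installed:
        problems.append(
            f"SDK {plan.sdk_needed_for_old_build} is not installed, "
            "so the old build cannot be opened or repackaged"
        )
    if not plan.old_config_saved:
        problems.append("previous configuration was not saved")
    return problems


# Example values loosely modeled on the situation described above.
plan = RollbackPlan(
    rdp_account_expires=date(2014, 1, 1),   # made-up date in the past
    sdk_needed_for_old_build="1.7",
    sdks_installed={"2.2"},
    old_config_saved=True,
)

blockers = rollback_blockers(plan, today=date(2014, 3, 24))
for problem in blockers:
    print("BLOCKER:", problem)
```

It's a checklist, not a guarantee, and it only catches the problems someone thought to list.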

Doomed to repeat it

The news recently and Krugman this morning have brought Tennyson to mind:

Theirs not to make reply,
Theirs not to reason why,
Theirs but to do and die:
Into the valley of Death
  Rode the six hundred.

Heroism has its place, but not when it takes everyone else through hell.