The repercussions from Monday's data-recovery debacle continued through yesterday.
By the time business started Tuesday morning, I had restored the client's application and database to the state they were in at the moment of the upgrade, and I'd re-entered most of their appointments, including all of them through tomorrow (Thursday). When the client started their day, everything seemed to be all right, except for one more thing I didn't know about their business: some of their customers pay them based on the appointment ID, which is nothing more than a SQL IDENTITY column in the database.
If you know how databases work, you know that IDENTITY columns are officially non-deterministic. In this specific case, the column increments by one every time a row is added, but I didn't re-enter the data in the order it was originally entered, since I prioritized the earlier appointments. The re-entered appointments therefore got different IDs from the ones the customers already had.
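To make the effect concrete, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite's `AUTOINCREMENT` stands in for the SQL IDENTITY column, and the `appointments` table and customer names are hypothetical, not the client's actual schema or data.

```python
import sqlite3


def insert_in_order(customers):
    """Insert one appointment per customer, in the given order.

    Returns a dict mapping each customer to the auto-generated ID.
    The table and column names are illustrative assumptions.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE appointments ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " customer TEXT NOT NULL)"
    )
    for customer in customers:
        conn.execute(
            "INSERT INTO appointments (customer) VALUES (?)", (customer,)
        )
    rows = conn.execute("SELECT customer, id FROM appointments").fetchall()
    conn.close()
    return dict(rows)


# The order in which the appointments were first created.
original = insert_in_order(["Avery", "Blake", "Casey"])

# The same appointments, re-entered in a different (prioritized) order.
reentered = insert_in_order(["Casey", "Avery", "Blake"])

# Customers whose appointment ID no longer matches the original.
changed = sorted(c for c in original if original[c] != reentered[c])
print(original, reentered, changed)
```

The data is identical in both runs; only the insertion order differs, and that alone is enough to give every appointment a new ID.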
We've gotten through the problem now, and the client no longer wants to put my head on a spike, so I will now take a moment for an after-action review that might help other software developers in the future.
First, the things I did right:
- When I deployed the upgrade Saturday, I preserved the state of the database and application at exactly that moment.
- All of the data in the system, every field of it, was audited. It was trivially easy to produce a report of every change made to the system from roll-out Saturday afternoon through roll-back Monday night.
- When I rolled back the upgrade Monday night, I preserved the state of the upgraded database and application at exactly that moment.
- When the client first noticed the problem, I dropped everything else and worked out a plan with them. The plan centered around getting their business back up first, and then dealing with the technology.
- Their customers were completely back to normal at the start of business Tuesday.
- The application runs on Windows Azure, which made preserving the old application state not only easy, but possible.
So what should I have done better?
- My biggest error was overconfidence in my ability to roll back the upgrade. No matter what other errors I made, this was the root of all of them.
- The second major error was not testing the UI on Internet Explorer 8. Mitigating this was the fact that neither I nor my client was aware that the bulk of their customers used IE8. However, given that people using IE8 were totally unable to use the application, the size of the impact should have put IE8 near the top of my regression test checklist even if the number of customers using it had been very small.
- Instead of spending a couple of hours re-entering data, I should have written a script to do it.
- I have always regretted (though never more than today) exposing the appointments IDENTITY column to the end user, because it's only natural that they'd use this ID for business purposes. This illustrates the danger, not just the sloppy design, of using a single database field for two purposes. Any future version of the application will have an OrderID field that is not a database plumbing field.
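The last two points can be sketched together. This is a rough illustration under stated assumptions, not the application's real code: the `AUDITED` records, the `order_number` column, and the `restore` helper are all hypothetical, and SQLite again stands in for the production database. The script replays audited records instead of retyping them, and the customer-facing number lives in its own column, so the order of re-entry no longer matters.

```python
import sqlite3

# Hypothetical rows pulled from an audit report:
# (order_number, customer, starts_on). All values are made up.
AUDITED = [
    (1003, "Casey", "2013-03-15"),
    (1001, "Avery", "2013-03-20"),
    (1002, "Blake", "2013-03-13"),
]


def restore(conn, audited):
    """Recreate the appointments from audit records, earliest first."""
    conn.execute(
        "CREATE TABLE appointments ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"   # plumbing, never shown
        " order_number INTEGER NOT NULL UNIQUE,"   # what customers see
        " customer TEXT NOT NULL,"
        " starts_on TEXT NOT NULL)"
    )
    # Restore the most urgent (earliest) appointments first.
    for order_number, customer, starts_on in sorted(
        audited, key=lambda record: record[2]
    ):
        conn.execute(
            "INSERT INTO appointments (order_number, customer, starts_on)"
            " VALUES (?, ?, ?)",
            (order_number, customer, starts_on),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
restore(conn, AUDITED)
restored = conn.execute(
    "SELECT order_number, customer, id FROM appointments"
    " ORDER BY order_number"
).fetchall()
print(restored)
```

The internal `id` values come out in a different order than the order numbers, and nobody outside the database has any reason to care. For a schema that already exposes the identity value, the equivalent script would have to insert the original IDs explicitly, which in SQL Server means wrapping the inserts in `SET IDENTITY_INSERT ... ON`.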
All in all, the good things outweighed the bad, and I may get back in my client's good graces when I roll out the next update. You know, the one that works on IE8, but still solves the looming problem of the platform's age.