T-SQL Tuesday #44 Second Chances
July 9, 2013 1 Comment
Bradley Ball (b|t) who also happens to be my moderator for the upcoming 24 Hours of PASS is the leader for this month’s T-SQL Tuesday, which is all about second chances. When I think about things I’ve screwed up in my career, and there are many, I always fall back to one, and it doesn’t even involve SQL Server, but another RDBMS and it gives a couple of lessons on how to be good DBAs.
It was late on a Friday afternoon, 1522 to be exact (here is lesson #1—never do anything involving production on a Friday afternoon unless you have to), and I got a call from a user, asking me to refresh the QA database with production data. This system was an environmental monitoring system, that monitored the atmosphere and surfaces for the biopharmaceutical manufacturing plant that I worked in, it wasn’t exactly mission critical, but it was still pretty important. Since the user was a friend of mine I jumped right on it. In the database I was working on, as opposed to what I’d probably do in SQL Server (which would be to restore a backup) there is a very easy to use import/export feature, that allows for easy logical restoration of a specific objects/schemas, etc. So this was my standard methodology for doing a refresh with prod data, but I had yet to script it (lesson #2—be lazy, if you have to do something more than once, script and automate it).
Anyway, I go ahead and take my export of production, and start to import back into the QA environment. Typically, my process would be to log into QA, drop the user that owned the objects, and then run the import. For whatever reason (and in this case it was probably a good thing), I didn’t do that. I started my import and noticed I was getting some errors—once again, not something I’d ordinarily do, but I cancelled the job and reran it with error suppression on (lesson #3—always read the errors, and never turn error suppression on). The job completed without error, I emailed the user back telling him that it was complete, and I went on with my Friday afternoon. About five minutes later, I got a phone call from the same user:
There’s this classic moment that happens in IT (and probably happens in each of these blog posts), that I like to call “the bead of sweat moment”, it’s that moment you realize you #$%ed up badly, but no one else has quite realized that you are responsible yet. I asked the user to get everyone out of the system. In ordinary companies an error and outage like this would not necessarily be a big deal, but this a control system in a pharmaceutical plant, so it was heavily regulated. What had happened was that I (accidentally) did an import from prod onto itself, those errors I suppressed were duplicate record errors.
So now, I had a production database filled with duplicate data. So I go tell me boss—I screwed up, and we’re going to be down for about 10-15 minutes. Fortunately, I knew from the log of the job, the exact time the records were inserted. Also, the database was in the equivalent of full recovery mode, so I was able to do a point in time restore to the second before I started the import job. This leads us to lesson #4—always have a backup, and even better if it’s near to the system (in this case it was on local disk, before it was zipped off to tape).
The lesson I learned were several, but the biggest is that if you have good backups (and regularly test your restoration process) you are protected from a lot of evils.