Oh those deployments… Sometimes, it’s like life’s irony; just when everything seems to be going smoothly, the most unexpected thing tests you.
Recently, we experienced a “Wrong Deployment” incident that still makes me laugh when I recall it. We released a new feature to production, right? Everything was tested, approved, and almost deemed “flawless” before going live.
But, you know, this crazy world of technology sometimes kicks your perfect plan so hard that you can’t even understand what hit you.
As you can imagine, the first reaction in such situations is always panic. Production! Live! Customers are affected! But our team managed to stay calm this time, or at least appeared to from the outside. Our rollback plan was ready, covering every scenario. The worst case was to revert to the previous stable version. Isn’t that a wonderful idea?
Shortly after the deployment completed, the first alarm bells started ringing. User reports showed that things were very different from what we expected. “What is going on?” I remember us looking at each other. It was as if the code we deployed hadn’t even touched the Production environment. Everything seemed to work as it should, yet nothing behaved as expected. It was quite a strange situation.
Right then, after the initial shock, the famous question came to mind: “Should we roll back?” Everything was ready; pressing the button would have been enough. But at that moment, uncertainty took over, and we hesitated for about 15-20 minutes. Was the issue in our code, or was it an infrastructural problem? Or… was it something else?
And that’s when the real comedy of the situation emerged. Turns out, our business side colleagues had changed a database table in Production simultaneously with our deployment. Isn’t that nice? Our code was actually working perfectly, but because of the changed table structure on the other side, everything got mixed up. Our deployment wasn’t actually ‘failed’; it just wasn’t in the right place at the right time.
This situation reminded me of a story from many years ago. I was developing a Windows application in C#. I was quite inexperienced then, trying to think of everything down to the finest detail. I had written a reporting module, which the users liked very much. One day, someone came and said, “Hey, why is this report like this?” I looked, and the report was actually working correctly, but the Excel file it generated… Oh my God! A user had changed the page layout in Excel, which completely messed up the report.
Why am I telling this story? Because sometimes, the problem isn’t in the code we write but in external factors interacting with it. Or sometimes, what we call a ‘faulty deploy’ actually stems from changes made simultaneously by different teams. Therefore, it’s crucial to ensure that all changes, especially critical ones like database schema modifications, are synchronized with the deployment process.
Now, let’s get into the technical part. Before performing a rollback, what steps did we take to quickly identify the issue? First, we scrutinized the Production logs. We noted which service failed, when it happened, and what users were reporting. Then we checked which database tables our deployed code touched. During this process, we also immediately contacted the team that had made changes to that database table.
When we finally identified the root cause, we all took a deep breath. The database structure in Production was different from what we expected. Our code was trying to write data to a non-existent column, causing errors. Instead of rolling back, we reversed the database change and redeployed our code. Everything worked fine afterward. Isn’t that great?
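A cheap way to catch this class of failure before it hits users is to compare the columns the application expects against what the live database actually has. Here is a minimal sketch of that idea; the table name, column names, and the use of SQLite are all illustrative assumptions, not details from our actual incident.

```python
import sqlite3

# Hypothetical expected schema for one table the app writes to.
EXPECTED_COLUMNS = {"id", "customer_name", "order_total"}

def missing_columns(conn: sqlite3.Connection, table: str, expected: set[str]) -> set[str]:
    """Return expected columns that are absent from the live table."""
    cur = conn.execute(f"PRAGMA table_info({table})")
    live = {row[1] for row in cur.fetchall()}  # row[1] is the column name
    return expected - live

# Simulate the incident: another team altered the table and a column vanished.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_name TEXT)")

missing = missing_columns(conn, "orders", EXPECTED_COLUMNS)
if missing:
    print(f"Schema mismatch, aborting deploy: missing columns {sorted(missing)}")
```

Running a check like this as a pre-deploy gate turns a confusing runtime error into an explicit, actionable failure.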
After overcoming this incident, I also thought of another important point: using ‘Feature Flags’. If you have read about this before, you’ll understand what I mean. Feature Flags allow you to deploy new features to Production but keep them inactive. If an issue occurs, you only need to change a setting to disable the feature. It’s much faster and safer than doing a rollback. For more detailed info, you can search on Google.
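To make the idea concrete, here is a minimal feature-flag sketch. Real systems usually back this with a flag service or config store so flags can flip without a redeploy; the flag names and the in-memory dictionary here are made up for illustration.

```python
# In-memory flag store; a real setup would read this from a flag service
# or configuration system so it can change at runtime.
FLAGS = {
    "new_reporting_module": False,  # deployed to Production, but switched off
}

def is_enabled(flag: str) -> bool:
    """Look up a flag; unknown flags default to off (fail safe)."""
    return FLAGS.get(flag, False)

def generate_report() -> str:
    # New code path ships dark; flipping the flag enables it instantly,
    # and flipping it back is far faster than a rollback.
    if is_enabled("new_reporting_module"):
        return "report from the new module"
    return "report from the stable legacy path"

print(generate_report())
```

The key design point is the safe default: an unknown or unset flag disables the feature, so a misconfiguration degrades to old behavior instead of breaking Production.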
Such incidents truly broaden your horizons. On one hand, you think, “Oh, why weren’t we more careful?” and on the other, you realize, “Fortunately, nothing major happened.” Every mistake in production is a learning opportunity to prevent bigger issues in the future. So, these “wrong deploy” scenarios should be viewed not as nightmares but as chances to learn.
We also held a team meeting following this incident. We decided that database changes and application deployments should never happen simultaneously and should proceed in a synchronized manner. We even considered automating this process with a pipeline. For more examples of CI/CD pipelines, check on YouTube.
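The ordering rule we agreed on can be sketched as a tiny orchestration script: migrations always run before the application deploy, and a failed migration blocks the deploy entirely. The function bodies are stand-ins for real tooling (e.g. a migration runner and a deploy command), not our actual pipeline.

```python
def run_migrations() -> bool:
    # Stand-in for a real migration step, e.g. `alembic upgrade head`.
    print("applying database migrations...")
    return True

def deploy_application() -> bool:
    # Stand-in for the real deploy step.
    print("deploying application...")
    return True

def pipeline() -> str:
    # Enforce the ordering: schema changes land before the code that
    # depends on them, so no one can change a table "simultaneously".
    if not run_migrations():
        return "aborted: migration failed"
    if not deploy_application():
        return "failed: deploy did not complete"
    return "success"

print(pipeline())
```

Encoding the order in one pipeline means the “who changed the table mid-deploy?” question can no longer arise: there is a single path, and it is sequential.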
Ultimately, deploying something to production is always exciting but requires patience. Even the best plans can encounter unexpected surprises. The key is to be prepared for these surprises, adapt quickly, and learn from each event. Of course, having a solid team is the most important part. 🙂
Sometimes I get lost in these technical stories. By the way, the weather in Bursa is very nice today, sunny. We plan to go to the seaside with my wife in the evening. Maybe we’ll grab a coffee and have a nice chat. Life is about writing code and spending quality time with family.
Did I say this was a technical fail story? Yes. Recently, during a circuit design, I chose the wrong transistor, and the entire circuit’s performance dropped. I had worked so hard on it, and when I finally found the mistake, I felt embarrassed but also learned a lesson. That day, I realized once again how important it is to read datasheets carefully. I spent about 5 hours just finding that mistake.