
My First Software Rollback

Whoopsy!
September 04, 2025

What is a rollback? A rollback is when a software update is released and then, "whoopsy," a serious bug is discovered: you have to quickly reverse the update and revert to a previous software version. When a rollback happens, it almost always means the problem is both serious and urgent. If it weren't serious or urgent, you could take your time, figure out what went wrong, and push another update later. But if the bug is too big, if every second of its existence is an unbearable financial pain, then you have to roll back quickly to eliminate the bug you just created.

Because a rollback usually means a severe bug, to software engineers it is like a failing grade on a final exam. You regret not taking more time and not checking your work one last time before turning it in, and you think that maybe things wouldn't have turned out so badly if you had. This awful final exam happened to me during my first year with YouTube Shopping.

YouTube Shopping is incredibly popular in Korea. One company, a massive e-commerce platform often called the "Korean Amazon," generates a huge amount of sales on YouTube. Many people watch product reviews on YouTube and then purchase the products directly from the Korean Amazon. In a way, the Korean Amazon is the "sugar daddy" for many South Korean influencers. In fact, their sales were so high, and the number of participating videos and products was so massive, that they were the only merchant we couldn't properly calculate metrics for. Other merchants could go to our merchant center and see their sales, click-through rates, and conversion rates, but the Korean Amazon couldn't.

Why not? There was too much data, and we didn't have enough memory to process it. So why not just add more memory? Because this problem existed only in our product, and it is hard to imagine that our data volume was the largest of all Google products, so more hardware was unlikely to be the right fix.

Shortly after I joined, this "memory optimization" task was handed to me. At first, I thought it was a simple, entry-level assignment, but the more I worked on it, the more I felt that something was fishy. A month passed, and I hadn't made any progress at all. The pressure was immense. Finally, just within the deadline, I managed to push an algorithm that traded extra CPU computation for lower memory usage. It launched in early 2025.

After the launch, the Korean Amazon could finally use the website, although it was very slow. The memory crashes were gone, but the other merchants, who had never had a problem, suddenly found that the website had slowed down by a factor of 10. In fixing one bug, I had accidentally created a huge one for every merchant.

I immediately called an emergency meeting with my manager and the tech leads of my team to explain what happened, which merchants were affected, how we were going to roll back, and when the Korean Amazon would be able to use the website again. Although I didn't know how I was going to fix it, I told them I needed another month to get the Korean Amazon fully online. I had essentially dug myself into a hole; if I failed to deliver again or had to roll back a second time, it would be a disaster for me.

In the following month, I completely refactored the database and the codebase. I'm not sure if the tech lead had confidence in me or was just desperate since there was no turning back. He quickly reviewed my code, and in a single month, I suddenly evolved into a 10x engineer. A month later, I really did fix it. The memory problem was resolved, and every merchant could now use our Merchant Center website without any noticeable speed issues. I sent out the announcement that the Korea launch was official; my teammates all replied with congratulations, and my boss was thrilled. The task was finally complete.

It was only much later, a few months after the fact, that I learned this memory problem had been an unsolved issue within the team for a long time. It was a ticking time bomb that had been passed around until it landed on my desk when I joined. The reason I was able to join YouTube Shopping was that a previous team member had left in less than a year. I wonder if he had also received a similar ticking time bomb? And if I hadn't been able to solve it, would I have also left in less than a year?


For those who are interested: the memory issue was solved by replacing ARRAY_LENGTH with COUNT DISTINCT in the SQL scripts that count unique items. Counting items with an array is memory intensive because every item has to be stored in the array. It is also compute intensive because every time you merge with another array, you have to make sure there are no duplicate items. The benefit is that you can pass the array around in your query and join it with anything at any point, and the result will still be correct.
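To make the array approach concrete, here is a minimal sketch in Python rather than SQL. The product IDs, variable names, and helper functions are all made up for illustration; this is not the actual pipeline, only the idea of carrying deduplicated arrays around and taking their length at the end.

```python
from typing import Iterable


def to_unique_array(items: Iterable[str]) -> list[str]:
    """Build a deduplicated array. Every unique item stays in memory."""
    return sorted(set(items))


def merge_unique_arrays(a: list[str], b: list[str]) -> list[str]:
    """Merge two arrays. Duplicates must be removed on every merge."""
    return sorted(set(a) | set(b))


# Hypothetical product IDs sold through three videos of one merchant.
video_a = to_unique_array(["p1", "p2", "p2", "p3"])
video_b = to_unique_array(["p3", "p4"])
video_c = to_unique_array(["p1", "p5"])

# The arrays can be merged in any order and the result is the same,
# which is why they are convenient to pass through many joins.
merged_ab_then_c = merge_unique_arrays(merge_unique_arrays(video_a, video_b), video_c)
merged_a_then_bc = merge_unique_arrays(video_a, merge_unique_arrays(video_b, video_c))

# The analogue of ARRAY_LENGTH: the count is just the array's size.
unique_count = len(merged_ab_then_c)
```

The convenience is visible in the two merge orders giving the same array. The cost is also visible: the full list of IDs has to exist at every step, which is what hurts when one merchant has an enormous number of items.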

COUNT DISTINCT, on the other hand, is only correct if it runs once, after all the joins are done, which makes it harder to use. In exchange, it is friendlier to memory and compute than big arrays, because SQL engines are efficient at joining and handling big tables. To migrate from ARRAY_LENGTH to COUNT DISTINCT, I ended up overhauling and refactoring the SQL scripts and schemas.
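The "only once, after all the joins" rule can be shown with a small sketch. It uses SQLite through Python's standard library purely because it is easy to run; the real scripts ran on a different system, and the table names, column names, and rows below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: which merchant owns a video, and which
    -- products each video features.
    CREATE TABLE videos (video_id TEXT, merchant_id TEXT);
    CREATE TABLE video_products (video_id TEXT, product_id TEXT);
    INSERT INTO videos VALUES ('v1', 'm1'), ('v2', 'm1');
    INSERT INTO video_products VALUES
        ('v1', 'p1'), ('v1', 'p2'),
        ('v2', 'p2'), ('v2', 'p3');
""")

# Correct: join first, then count distinct products exactly once.
count_after_joins = conn.execute("""
    SELECT COUNT(DISTINCT vp.product_id)
    FROM videos AS v
    JOIN video_products AS vp ON vp.video_id = v.video_id
    WHERE v.merchant_id = 'm1'
""").fetchone()[0]

# Wrong: count distinct per video, then combine the counts later.
# Product p2 appears in both videos, so it is counted twice.
count_too_early = conn.execute("""
    SELECT SUM(per_video.n)
    FROM videos AS v
    JOIN (
        SELECT video_id, COUNT(DISTINCT product_id) AS n
        FROM video_products
        GROUP BY video_id
    ) AS per_video ON per_video.video_id = v.video_id
    WHERE v.merchant_id = 'm1'
""").fetchone()[0]

conn.close()
```

Here the merchant has three unique products, and only the query that counts after the join gets that. A plain number cannot be deduplicated once it has been computed, whereas an array still can, which is why moving to COUNT DISTINCT meant restructuring the queries so the count happens last.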
