Boosting Efficiency: Technical Improvements for Boomset – Part 3

I’m sorry to say that this part, just like part 1, is about a small mistake that caused big reactions. It won’t take much time to explain, so I’m hoping this one will be a very short post.

Most websites these days run an HTTP server that serves responses to requests coming from various clients. Faster responses translate directly into a better user experience, so everyone wants to minimize the work done in the request-response cycle. To achieve this, time-consuming tasks are handed off to a background worker to be processed asynchronously. Examples include sending emails, generating activity streams, generating reports and so on. When your primary programming language is Python, a very popular choice is Celery.
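
To make this concrete, here is a minimal sketch of what that looks like with Celery. This is not Boomset’s code; the broker address and the task are made up for illustration:

    from celery import Celery

    # Hypothetical app: a Redis broker on localhost, not a real production setup.
    app = Celery("tasks", broker="redis://localhost:6379/0")

    @app.task
    def send_welcome_email(user_id):
        # Placeholder for slow work (rendering templates, talking to an
        # email provider, etc.).
        print(f"Sending welcome email to user {user_id}")

    # The web view only enqueues the job and returns immediately; a worker
    # process started with `celery -A tasks worker` picks it up later.
    send_welcome_email.delay(42)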

When I joined Boomset the system had around 9 background worker instances. One of them was consuming close to 100% CPU all the time. This was not perceived as a problem back then; everyone, including me, assumed there were a lot of jobs to process and that this was probably normal.

After many months, as my responsibilities grew, I decided to take a deep dive into the workers and see if I could optimize anything.

The situation was something like this:

  • 1 general worker consuming many queues, at max CPU most of the time.
  • 8 separate worker instances consuming different specialized queues.
  • Whenever we needed a new background process, the practice was to create a new EC2 instance and boot up a new worker on it.

The problem: Multiple Celery Beats

When I dug into the general worker instance, the first thing I did was check the running processes with top (or htop). It took me only a couple of seconds to notice that something was starting and dying in an infinite loop. We were using Supervisor (managed via supervisorctl) to boot the worker processes on the instance, and it kept trying to start celery beat, getting an error, and trying again instantly. The error was that a celery beat was already running, so the new one couldn’t start, and Supervisor kept retrying forever. This restart loop was what was driving the CPU crazy.
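
To illustrate one way this kind of situation can arise, here is a simplified supervisord sketch. This is not our actual configuration; the program names, project name and settings are made up:

    [program:worker]
    command=celery -A proj worker -Q default,email,reports
    autorestart=true

    [program:beat]
    command=celery -A proj beat
    autorestart=true

    ; A leftover duplicate entry: this second beat sees the pidfile of the
    ; first one, prints an error and exits. Because startsecs=0 tells
    ; supervisord the start "succeeded", autorestart kicks in immediately,
    ; over and over.
    [program:beat_duplicate]
    command=celery -A proj beat
    autorestart=true
    startsecs=0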

The solution: Don’t try to boot multiple Celery Beats

Well, as I said, both the problem and the solution are pretty easy and straightforward. I fixed the Supervisor configuration so that only one celery beat would be started, and everything went back to normal in an instant. CPU usage dropped from ~100% to 15-20%, just like that.
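
After the fix, the relevant part of the configuration boiled down to something like this (again a simplified sketch with made-up names): exactly one beat entry for the whole system, and only workers everywhere else.

    [program:worker]
    command=celery -A proj worker -Q default,email,reports
    autorestart=true

    ; Exactly one scheduler. Celery beat must run as a single instance,
    ; otherwise periodic tasks get scheduled more than once (or, as in our
    ; case, the extra copies fail to start and burn CPU in a restart loop).
    [program:beat]
    command=celery -A proj beat
    autorestart=true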

The high CPU had led everyone to believe the worker was genuinely busy, so the roadmap was always to create new instances for new background tasks. Like a bad chain reaction, this caused the company to pay for more and more instances.

The next thing I did was to remove all the other instances and consolidate everything onto a single, bigger worker instance. These changes resulted in roughly 33% savings on our AWS bill, and they helped me get a better bonus and a raise.
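
Consolidation mostly meant pointing one worker process on that bigger instance at all of the queues, instead of dedicating an instance to each queue. Something along these lines, with hypothetical queue names and concurrency:

    [program:worker_all]
    ; One worker consuming every queue; the queue names and the
    ; concurrency value here are hypothetical.
    command=celery -A proj worker -Q default,email,reports,activity_streams --concurrency=16
    autorestart=true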

Conclusion

This was another small mistake that lived under the radar for a very long time. I believe our software engineering ecosystem is full of such things. We should be careful to do things right so that we don’t waste money and resources. My way of thinking in such cases is not primarily about saving someone’s money but about saving resources. In this example, the small mistake caused hundreds, maybe thousands, of hours of unnecessary EC2 instance time; many hours of software engineering time went into configuring environments and deployments for new worker instances; and, lastly, it caused the company to waste a lot of money. The amount spent would have been enough to pay for an extra engineer every month for our small team. So, please be careful and don’t shoot yourself in the foot 🙂
