Boosting Efficiency: Technical Improvements for Boomset – Part 2
In the event industry, one of the most important functions of your system is registering new guests. Sometimes guests register at the event entrance on an iPad, other times they register online through the event page. Compared to a simple registration page, the complexity of this process is medium to high: an event can have many custom questions and/or sessions, and besides what is visible on the registration form, other records are created or updated on the back end, invisible to the end user.
The Problem: Deadlocks
So, if there's a problem with any of the many steps in this flow, we want to abort all the previous database operations so our data integrity stays intact.
The early engineering team's approach to this flow was to create a single huge database transaction and guarantee that everything was either created or aborted in one easy step. That's probably how I would have done it too if I had been the one developing that part for the first time. What this causes, though, is one of the worst problems a software engineer might encounter: deadlocks.
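To make the problem concrete, here is a minimal sketch of that pattern, assuming a Django-style `transaction.atomic` and hypothetical helper functions (the real flow has many more steps):

```python
# Minimal sketch of the original approach (helper names are hypothetical).
# Every step of the registration flow runs inside one giant transaction,
# so the locks it takes are held until the very last step finishes.
from django.db import transaction

def register_guest(event, form_data):
    with transaction.atomic():
        guest = create_guest(event, form_data)    # insert the guest row
        save_custom_answers(guest, form_data)     # one insert per custom question
        assign_sessions(guest, form_data)         # update shared session/capacity rows
        update_event_counters(event)              # update shared counter rows
    return guest
```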
It turns out that the seemingly innocent thing called a transaction is not so innocent after all. It might work very well when you are the only person using the project, but under just a little bit of production load it crumbles and turns your life into a nightmare: hundreds of errors raining down on Sentry, plus it's a Sunday, plus you are outside chilling by the seaside 🙂
This problem was detected at live events, but it was hard for the team to replicate, so it just stayed there. Part of the reason is how the team was trying to reproduce it:
- Write a simple script that makes a request to the endpoint.
- Share the script with the team.
- Tell the team to run the script at the exact same time on 3, 2, 1 and go!
- See what happens.
Unfortunately, this process is not enough to trigger a deadlock unless your team is hundreds of people. In our case the team was six people, and after the test everything seemed fine when it actually was not.
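For context, the shared script was essentially a single request, something like the sketch below (the endpoint and payload are placeholders, not the real API). Six people launching it "at the same time" still spreads the requests over a comfortable stretch of time, nowhere near the concurrency of a real event entrance.

```python
# Roughly the kind of one-shot script the team shared (URL and payload are
# illustrative placeholders): it fires a single registration request.
import requests

payload = {"first_name": "Test", "last_name": "Guest", "email": "test@example.com"}
response = requests.post("https://example.com/api/events/123/register/", json=payload)
print(response.status_code, response.text)
```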
The Solution
From this point onward I decided to spend a weekend on the issue, because it was not in our current sprint and I got that urge to fix stuff. I modified my local setup to make it similar to our production environment and created a JMeter configuration to trigger the deadlock. Once everything was ready, I got failed requests on the first try.
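I won't paste the JMeter configuration here, but the idea it implements can be sketched in a few lines of Python: fire many registration requests truly concurrently instead of one by one. The endpoint and payload below are placeholders.

```python
# Rough Python equivalent of the load test: many concurrent requests
# against the registration endpoint (placeholders, not the real Boomset API).
import concurrent.futures
import requests

URL = "https://example.com/api/events/123/register/"

def register(i):
    payload = {"first_name": f"Guest{i}", "email": f"guest{i}@example.com"}
    return requests.post(URL, json=payload).status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(register, range(200)))

# Count responses per status code; deadlocks show up as 5xx errors.
print({code: results.count(code) for code in sorted(set(results))})
```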
Okay, the problem of triggering the deadlocks was solved. Time to fix the actual issue. After many Stack Overflow threads and Google searches, I had some ideas about the deadlock problem. Here are the possible causes:
- The order of the database queries might be causing the issue.
- The size of the transaction might be a problem because of the time it blocks other requests.
Well, the order wasn't the issue in our case; it was obvious that the huge transaction block was the problem. I decided to break the huge transaction into smaller transactions. Actually, I moved most of the code out of the transaction and only created transactions around a few crucial spots.
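Continuing the earlier sketch, the refactor looks roughly like this (same assumptions: Django-style transactions, hypothetical helpers). Only the steps that really must succeed or fail together keep a transaction; everything else runs as plain queries that commit immediately.

```python
# Sketch of the refactor (helper names are hypothetical): most steps run
# outside any transaction, and only the crucial spot gets a small atomic
# block, so locks are held for a fraction of the original time.
from django.db import transaction

def register_guest(event, form_data):
    guest = create_guest(event, form_data)     # plain insert, commits right away
    save_custom_answers(guest, form_data)      # plain inserts, commit right away

    with transaction.atomic():                 # small transaction only where it matters
        assign_sessions(guest, form_data)      # shared session/capacity rows
        update_event_counters(event)           # shared counter rows

    return guest
```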
All of this creates a new problem, though. Now, when a step fails, you end up with a little bit of corrupt data: you do a couple of inserts, something fails, and you immediately return an error to the user. There is no longer a transaction to take care of the wrongfully inserted rows. This is not the end of the world, and you have a couple of options here too:
- Manually clean things up.
- Just leave it there and don’t bother.
Manually cleaning things up might sound like the way to go, but that's not what I did :). To decide, I went with the second option at first and just watched what happened. You know what? We never had anything fail in that process again. That statement is probably not 100% accurate, but I think it's at least 99.99% accurate. The problems were so few and so unrelated that going for the first option (manually cleaning things up) would have been a waste of time anyway.
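If the failures had turned out to be frequent, a middle ground would have been to keep "watching" in a structured way, for example by logging exactly which guest was left half-created so the rows could be cleaned up later. This is only a sketch of that idea, not something the production code ended up needing:

```python
# Hypothetical middle ground: don't roll back, but record which guest was
# left with partial data so it could be cleaned up later if it ever mattered.
import logging

from django.db import transaction

logger = logging.getLogger(__name__)

def register_guest(event, form_data):
    guest = create_guest(event, form_data)
    try:
        save_custom_answers(guest, form_data)
        with transaction.atomic():
            assign_sessions(guest, form_data)
            update_event_counters(event)
    except Exception:
        logger.exception("Registration left partial data for guest %s", guest.pk)
        raise
    return guest
```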
Conclusion
I was too lazy to write a conclusion, so I asked ChatGPT and here it is:
In conclusion, the process of registering new guests at an event can be complex. The early engineering team attempted to solve this complexity by creating a single large database transaction, but this led to the problem of deadlocks under production load. After extensive testing and research, it was determined that breaking the transaction into smaller ones and moving code out of the transaction resolved the issue. While this led to the problem of potentially corrupt data in the event of a failing step, it was found that the amount of problems was so small that manually cleaning up was not necessary. This experience highlights the importance of thoroughly testing and refining systems to ensure data integrity and minimize potential problems.