15 Mar 2015 by dryobates
It is common to use queues to handle spike load and spread queries in time. On the other hand sometimes you want to do reverse: gather queries spread in time to run them in one batch request. Celery queues gives us both of it.
If many operations appear at short period your system might not be able to handle all of them. If this operations can be spread in time then it is wise to use queue system. Of course you have to adjust queue size to your needs and powers. E.g. if you're bottleneck is database then even if you can set up many workers you have to set up limits on how many queries will be run on database.
Not so typical spike load
In my work we had a little unusual variation on spike load problem. At some point in time there appeared about 70 requests. Some of this requests start long running transactions on MySQL database. MySQL isn't well suited to handle long running transactions, but it's the other story. The worst thing was that early in transactions there were made locks which were kept to the end of transaction. The first thing I have done was to try split this long transaction to smaller ones. I have managed to shorten it by a factor of 2 but it was still too long. If all of this requests have start at the same time and waited for a lock it could take half a day to finish all of them. OK, we can accept with pain that long time for finishing all of them. But so long transactions hanging on database make memory and/or disk usage to grow greatly and whole database slows down. We can't allow for this.
Limiting task rate
Here comes queue system for a rescue. We can use it to start queries with some delay and spread load over time. I have checked that with locks requests took 10 minutes on average. So let's start with rate limit 6 tasks per hour (60 minutes / 10 minutes = 6 / hour).
@app.task(rate_limit='6/h') def update_product_statistics(product_id): <task body>
After running this on live system I have found that when requests did not lock anymore tasks run 2 minutes on average. So we could safely change rate limit to 30 per hour (60 minutes / 2 minutes = 30 / hour).
@app.task(rate_limit='30/h') def update_product_statistics(product_id): <task body>
We have managed to go for less then 3 hours in total for all tasks.
Then appeared other problem. When our editors save changes to products then automatically are started the same tasks as above to update all data connected with that product. Everything would be OK if products were saved once after change. But we have found out our editor's usage pattern: they saved product after every fields change or even more often. That started much more tasks then was really needed. First think I have tried was to convince editors that we can give them two buttons: one for saving product and one for updating all connected data, which they will press after finishing all editing. They didn't want that so I have dig a little and found celery batches.  With batches we can wait a little before updating data connected to product and then do only one update for every product.
CELERYD_PREFETCH_MULTIPLIER = 0
from celery.contrib.batches import Batches @task(base=Batches, flush_every=100, flush_interval=5 * 60) def update_product_statistics_from_frontend(requests): product_ids = set(request.kwargs['product_id'] for request in requests) for product_id in product_ids: <task body>
In the above example we are waiting for 5 minutes or 100 requests and then run only one update for every product.
There's one caveat with above setup. It works only when you have one worker. If you have many workers then with CELERYD_PREFETCH_MULTIPLIER = 0 all requests can go to one worker and the other would be unused. If you set CELERYD_PREFETCH_MULTIPLIER to some small value requests can be spread across different workers which would work in parallel on same task. We have dedicated worker only for some tasks so it suits us enough. You can route task only to selected worker but that single worker become SPOF.  For our needs it is acceptable. We can easily set up now worker if other fail down for some reason. Even few hours of delay isn't disaster. That's a price we pay for using Batches.
|||Celery batches http://docs.celeryproject.org/en/latest/reference/celery.contrib.batches.html|
|||Single Point of Failure https://en.wikipedia.org/wiki/SPOF|