This is also a story about how the worker tier affected the database which affected the web tier.
Vinosmith is a Rails-based stack. There are 3 primary tiers:
- Web - Unicorn/nginx
- Resque - asynchronous workers
- Database - Postgres
The Dashboard screen features a rollup data set which is calculated and stored in Redis for quick lookup. There is a scheduled job which runs at 6AM to calculate this data and insert it into Redis.
The background job also clears out all rows and re-populates a database table as part of the re-calculation process.
Because I was doing an unqualified `DELETE` on the table, I thought that a `TRUNCATE` would be superior to a raw `DELETE` -- so the `TRUNCATE` has been in place for over 4 years now.
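For context, the two statements produce the same end result -- an empty table -- just by different mechanisms (the table name here is hypothetical, not the actual schema):

```sql
-- Original approach: an unqualified DELETE removes every row
DELETE FROM dashboard_rollups;

-- The "improved" version I switched to
TRUNCATE dashboard_rollups;
```

`TRUNCATE` is indeed faster -- it skips per-row processing and reclaims disk space immediately -- which is why it looked strictly superior at the time.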
Most of our customers are based on the West Coast, in California, and work a normal 9AM - 5PM schedule.
Sporadically over the last few months I have gotten notifications from the AWS Elastic Load Balancer CloudWatch alarm that no healthy hosts were available -- thus bringing down the site.
All of these notifications came in around 6AM and subsided 2-3 minutes later, but during that window the site was basically unusable. It happened rarely enough that I kind of forgot about it and chalked it up to EC2 or ELB flakiness.
I also assumed that no one really noticed the issue because it was at 6AM ... and I know how few (if any) of our customers are using the product that early in the morning...
But of course, eventually a Customer did notice and thus I prioritized finding the cause and a solution.
Looking in the nginx logs I could see entries where nginx could not find an upstream Unicorn worker:
nginx.error.log:2017/12/29 14:02:11 [error] 1099#0: *13760186 upstream timed out (110: Connection timed out) while reading response header from upstream, client: www.xxx.yyy.zzz, server: *.vinosmith.com, request: "GET /health-check HTTP/1.1", upstream: "http://unix:/tmp/unicorn.vinosmith.sock/health-check", host: "www.xxx.yyy.zzz"
There are multiple web boxes in the cluster -- thus multiple Unicorn/nginx instances -- and looking in the logs on each instance for that same time frame I could see similar log lines. So that rules out one machine getting overloaded and becoming unable to serve requests -- which sounds crazy, but possible. This was affecting all instances ...
I then looked in New Relic and found that during that time frame -- shortly after 6:00AM -- the Ruby VM garbage collector started going crazy. That is fine in and of itself, but looking at the New Relic report across all instances (web + worker) I traced it down to a particular Resque background job which happens to run at 6AM.
Looking at that job it became clear what the problem was:
- The job opens a DB transaction
- The job then executes a `TRUNCATE` statement to clear out the database table.
- The job then proceeds to run lots of `INSERT`s -- doing its thing, which is great.
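In SQL terms, the job's transaction looks something like this (table and column names are illustrative, not the actual schema):

```sql
BEGIN;

-- ACCESS EXCLUSIVE lock is acquired here and held until COMMIT
TRUNCATE dashboard_rollups;

-- Long recalculation: every statement extends the window
-- during which other sessions touching this table are blocked
INSERT INTO dashboard_rollups (account_id, metric, value)
SELECT account_id, 'case_total', SUM(cases)
FROM orders
GROUP BY account_id;

COMMIT;
```

Because the `TRUNCATE` runs inside the transaction, its lock is held not just for the truncate itself but for the entire recalculation.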
The problem was the `TRUNCATE` statement. Looking at the Postgres documentation I found the crux, down in the Notes section:

> TRUNCATE acquires an ACCESS EXCLUSIVE lock on each table it operates on, which blocks all other concurrent operations on the table. When RESTART IDENTITY is specified, any sequences that are to be restarted are likewise locked exclusively. If concurrent access to a table is required, then the DELETE command should be used instead.
The key line being: If concurrent access to a table is required, then the DELETE command should be used instead.
So what was happening: the `TRUNCATE` was requiring an exclusive lock on the table; the web clients were hitting the Dashboard, which in turn triggers a request-from-cache-or-recalculate routine that also happens to hit that DB table. These requests were getting blocked waiting on the lock and tying up that Unicorn instance. Customers were spawning new browser tabs and hitting the same screen -- thinking their busy request was a fluke -- tying up even more workers. The load balancer was unable to hit the Rails endpoint for a health-check and thus eventually it pulled all instances from the cluster, bringing down the site.
Once I identified the issue the solution was clear:
- Switch to a `DELETE` instead of the `TRUNCATE`.
- Not that it matters much at this point, but move the job to a quieter time slot like 3AM instead of 6AM.
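The fixed transaction, sketched with the same hypothetical names as above: `DELETE` takes only a ROW EXCLUSIVE lock on the table, which does not conflict with the ACCESS SHARE lock a `SELECT` takes, so Dashboard reads keep working against the MVCC snapshot while the job runs:

```sql
BEGIN;

-- ROW EXCLUSIVE lock only; concurrent SELECTs are not blocked --
-- they simply see the pre-transaction rows until COMMIT
DELETE FROM dashboard_rollups;

INSERT INTO dashboard_rollups (account_id, metric, value)
SELECT account_id, 'case_total', SUM(cases)
FROM orders
GROUP BY account_id;

COMMIT;
```

The trade-off is that `DELETE` is slower and leaves dead tuples for vacuum to clean up -- an acceptable cost for a table that must stay readable around the clock.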