b5media uses a handful of web servers fronted by a load balancer. We’re getting quite busy in terms of traffic, so I finally got around to putting a reverse proxy in front of the farm. A reverse proxy accepts the request from the user and checks its cache for the result. If no answer is found, the proxy asks a back-end web server for the same page the user asked for, saves it, and passes it back to the client. Squid is one such piece of software; it is used at Wikipedia and other sites as both a forward proxy (caching the sites your users visit) and a reverse proxy.
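To make the check-cache, ask-backend, save, return cycle concrete, here is a minimal Python sketch. It is not how squid is implemented; the class name, the `fetch_from_backend` callback, and the 300-second lifetime are all made up for illustration.

```python
import time


class TinyReverseProxyCache:
    """Toy model of a caching reverse proxy (hypothetical, not squid's design)."""

    def __init__(self, fetch_from_backend, ttl_seconds=300, clock=time.monotonic):
        self.fetch_from_backend = fetch_from_backend  # callable: url -> body
        self.ttl_seconds = ttl_seconds                # how long a saved copy stays fresh
        self.clock = clock                            # injectable so it can be tested
        self.store = {}                               # url -> (expires_at, body)
        self.hits = 0
        self.misses = 0

    def get(self, url):
        now = self.clock()
        entry = self.store.get(url)
        if entry is not None and entry[0] > now:
            # Fresh copy in the cache: answer without touching a backend.
            self.hits += 1
            return entry[1]
        # Nothing cached (or it expired): ask the backend, save, pass it back.
        self.misses += 1
        body = self.fetch_from_backend(url)
        self.store[url] = (now + self.ttl_seconds, body)
        return body
```

Every hit is a request the backend servers never see, which is where the savings come from.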
After swinging the traffic over to the proxies I was ecstatic. Just by caching images, CSS, and JavaScript for a few minutes each I was able to serve 40% of the hits from the proxies instead of going to a backend server, and knock 20% off the CPU usage of the servers. Then one of our celeb blogs got a huge wave of traffic that killed the proxies. I ended up pulling them out of service and going back to the old method.
Squid is an amazing piece of software used on much bigger sites than ours, so I knew the problem had to be the way I had it configured.
I’m pretty sure squid crapped out on us because it ran out of file descriptors. For some strange reason, squid decides how many FDs it’s going to use *at compile time*. When I rebuilt the Fedora RPMs I didn’t do anything special, so it was using the default of 1024. At two FDs per connection (one for the inbound, one for the outbound) plus whatever is needed for pulling cache files from disk, we ran out in a hurry once we got busy and couldn’t make any more socket connections.
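The arithmetic is worth spelling out. Here is a back-of-the-envelope Python sketch; the function name and the `reserved_fds` figure (for log files, cache files on disk, and so on) are my own illustrative assumptions, not numbers measured on our boxes.

```python
def max_concurrent_requests(fd_limit, fds_per_request=2, reserved_fds=0):
    """Upper bound on requests in flight before the FD table is full.

    fds_per_request=2 models one inbound client socket plus one outbound
    backend socket; reserved_fds is a guess at FDs held for other things.
    """
    usable = max(fd_limit - reserved_fds, 0)
    return usable // fds_per_request


# With the compiled-in default of 1024 and nothing reserved:
print(max_concurrent_requests(1024))                   # 512
# Setting aside a hypothetical 64 FDs for logs and disk cache files:
print(max_concurrent_requests(1024, reserved_fds=64))  # 480
```

So even in the best case the proxy tops out at roughly 500 simultaneous cache misses, which a traffic spike blows through quickly.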
The big reason I missed it earlier was that I had “cache_log none” set, so squid’s error messages weren’t being logged anywhere. It turns out cache_log is roughly the same as apache’s error_log, not the info on cache hits/misses I thought it was (that’s cache_store_log).
So the new RPM I’ve put together has squid built for 200K FDs and has a ulimit command in all the startup scripts. I’ve also linked it against Google’s tcmalloc, which apparently made a big difference at Wikipedia when they tried it.
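The ulimit matters because the compiled-in number is no use if the process is started with a lower per-process limit. As a rough sketch of that check, here is how a Python process can read its own limit using the standard `resource` module (Unix only). The helper names are mine, and this is only an illustration of the limit that ulimit controls, not something squid runs.

```python
import resource


def nofile_limits():
    """Return (soft, hard) limits on open files for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_is_sufficient(needed_fds):
    """True if the soft limit allows at least needed_fds open descriptors."""
    soft, _hard = nofile_limits()
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= needed_fds


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft limit:", soft, "hard limit:", hard)
    print("enough for 1024 FDs?", limit_is_sufficient(1024))
```

The soft limit is what the process actually runs into; the hard limit is the ceiling the soft limit can be raised to without extra privileges.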
Some useful links that might also cure insomnia:
Six Things First-Time Squid Administrators Should Know: http://www.onlamp.com/pub/a/onlamp/2004/02/12/squid.html
tcmalloc: http://goog-perftools.sourceforge.net/doc/tcmalloc.html
About Wikipedia’s problems: http://wiki.wikked.net/wiki/Squid_memory_fragmentation_problem
MRTG is also graphing the FD usage for the next round.
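For anyone wanting to do the same, one way to feed MRTG is an external script that prints four lines: the first value, the second value, an uptime string, and a target name. The sketch below is an assumption about how that could be wired up, not the script I'm actually running. It counts entries under `/proc/<pid>/fd`, so it is Linux-only and needs permission to read the squid process's directory; the function names and the "squid-fds" label are made up.

```python
import os


def count_open_fds(pid):
    """Count open descriptors by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def mrtg_report(in_use, limit, uptime="unknown", name="squid-fds"):
    """Format the four lines MRTG reads from an external script."""
    return "\n".join([str(in_use), str(limit), uptime, name])


if __name__ == "__main__":
    # Demo against this script's own PID; in practice you would pass
    # the squid PID and the FD limit squid was built with.
    pid = os.getpid()
    print(mrtg_report(count_open_fds(pid), 1024))
```

Graphing usage against the limit on the same chart makes it obvious how close to the ceiling things are getting.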