Killing me softly: Keeping dispatchers alive

This is an interim post ahead of my long-promised in-depth review of my attempts to scale a site to a million dynamic page impressions a day on Rails.

When the site in question finally stabilized somewhat, a new problem crept up that I’ve been unable to fully resolve over the past weeks. The net effect is that my FastCGI dispatchers become unresponsive after a while, possibly following a huge traffic spike. They sit there doing nothing, and lighttpd is unable to talk to them.

The site is powered by 4 application servers running 7 dispatchers each and a dedicated lighttpd proxy. After a while, half of those dispatchers are unresponsive and no longer serve any requests. Page load times slow to a crawl.

Currently, I’m on Ruby 1.8.4, lighttpd 1.4.10 and Rails 1.0 on Linux 2.6.14.

I’ve tried everything from upgrading Ruby and all gems to debugging potentially exceeded TCP connection limits on my servers to even talking to weigon, the brains behind lighttpd. To no avail.

The weird thing is that it doesn’t matter which end I restart: be it the dispatcher *or* lighttpd, everything goes back to normal. As a result, I cannot even tell for sure whether Ruby or my application is to blame. It could just as well be lighttpd or my local machine configuration.

Since I was in desperate need of an operational site, I whipped up a script to probe all the available dispatchers for responsiveness and kill them by brute force if they aren’t responding. I’m using the process scripts, namely the spinner/spawner duo that comes with Rails. As such, a killed dispatcher is restarted immediately and becomes available to lighttpd again within a couple of seconds.

As this is obviously more of a band-aid than anything else, this script is provided as-is, with no claims being made about being functional for anyone else, being pretty, well documented or not eating your cat. You absolutely need Net::SSH installed in order to be able to kill dispatchers not running on localhost. I’m running the script inside of a screen session in order to keep an eye on what’s happening with my dispatchers and how often they get killed. Your mileage may vary.

In case you’re having similar issues with your Rails application, feel free to leave a comment. The script only takes care of dispatchers that are already hung. It is by no means meant as a final cure and I’m more than eager to find out what’s causing the freezes in the first place.

The script is available in the body of this article or as a download here.


#!/usr/bin/env ruby
#
# watch-listener.rb by Patrick Lenz 
# THIS SCRIPT IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND

require 'socket'
require 'timeout'
require 'rubygems'
require 'net/ssh'

HOSTS = %w{ 10.10.1.10 10.10.1.11 10.10.1.12 10.10.1.13 }

class WatchListener
  attr_accessor :host

  def initialize(host)
    self.host = host
    probe_ports
  end

  def probe_ports
    7000.upto(7006) do |port|
      begin
        socket = TCPSocket.new @host, port
        socket.puts "dummytext\n\ndummytext"

        begin
          Timeout.timeout(15) { socket.gets }
        rescue Timeout::Error
          log "%d IS HUNG! RESTARTING..." % port
          restart_listener_on port
        else
          log "%d working fine" % port if ENV['DEBUG']
        ensure
          # Close the probe socket so repeated runs do not leak descriptors
          socket.close
        end
      rescue Errno::ECONNREFUSED
        log "%d refuses connection" % port
      end
    end
  end

  def restart_listener_on(port)
    exec %{
      PID=`netstat -a -n -p | grep ":#{port} " | grep LISTEN | \
        awk '{print $7}' | cut -d'/' -f1`
      echo "killing $PID" && kill -9 $PID
    }
  end

  def exec(command)
    log "connecting.."
    Net::SSH.start(host) do |session|
      input, output, error = session.process.popen3(command)
      Timeout.timeout(20) { log output.read } rescue nil
      input.puts "quit"
    end
    log "done"
  end

  def log(text)
    puts "[%s] (%s) %s" % [ Time.now.strftime("%H:%M:%S"), @host, text ]
  end
end

# Main loop
while true do
  HOSTS.each do |host|
    begin
      WatchListener.new host
    rescue => error
      puts "Exception raised: #{error}"
    end
  end

  print "Sleeping 300 seconds: "
  5.times do |i|
    print "%d .. " % (300 - i * 60)
    STDOUT.flush
    sleep 60
  end
  puts
end
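
For anyone who only wants the probing part without the SSH kill step, the check can be pulled out into a small standalone method. This is a minimal sketch, not part of the script above: the name probe_port and the :ok / :hung / :refused return values are made up for illustration, and it assumes, like the script does, that a healthy dispatcher answers or closes the connection when sent garbage.

```ruby
require 'socket'
require 'timeout'

# Hypothetical helper (not in the original script): classifies a single
# dispatcher port as :ok, :hung or :refused.
def probe_port(host, port, wait = 15)
  socket = TCPSocket.new(host, port)
  begin
    # Send garbage; a live listener should answer or drop the connection.
    socket.puts "dummytext\n\ndummytext"
    # A hung listener never replies, so gets blocks until the timeout fires.
    Timeout.timeout(wait) { socket.gets }
    :ok
  rescue Timeout::Error
    :hung
  ensure
    socket.close
  end
rescue Errno::ECONNREFUSED
  # Nothing is listening on that port at all.
  :refused
end

if __FILE__ == $0
  7000.upto(7006) do |port|
    puts "%d: %s" % [port, probe_port("127.0.0.1", port)]
  end
end
```

Returning a symbol instead of logging and killing right away makes it easy to count hung dispatchers per host first and decide separately what to do about them.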