Joe Williams home
I was recently introduced to yajl-ruby, ruby bindings to the C based yajl json parsing/encoding libraries. After discovering that it can parse HTTP streams it seemed like it would be a perfect fit for use with CouchDB. A while back I wrote some code to push update notifications to RabbitMQ and a commenter mentioned using the _changes feed instead. Combining the _changes feed and yajl-ruby's HttpStream seemed like a good way to do it. The _changes feed is a running list of all the documents that have changed in a database listed in order by sequence number. This is similar to update notifications but gives more information such as the document IDs and is HTTP based (with multiple feed styles) rather than stdout. Additionally you can create design document filters which can be specified as a query parameter to give you only the parts of the feed you want. All in all _changes is a pretty powerful feature. Now for the fun stuff, the code. There are a few dependencies I used to do this, specifically focused on making it fast. As such I used EventMachine based libraries for AMQP and HTTP requests. The first bit of code takes the _changes feed for the "test" database, parses the feed, uses the document ID to request that document and publish it to the queue. One key item to note is that this code requires the latest yajl-ruby from github to run properly. Additionally, this works nicely with feed=continuous so it grabs the documents as they are changed without a need for polling. Note that there is a variable for since, this allows you to start from a specific sequence number so you can skip over old changes. The next bit of code works from the other side of the queue. It subscribes to the queue, parses the JSON, performs some operations on it and puts the results back into another CouchDB database called "results". What could it be used for? My first thought is some sort of parallel computation, boot up a few dozen EC2 nodes and start dumping data into CouchDB. Have all those nodes pop messages off the queue, process them and dump the results back into Couch. Legitimately one could chain these together to process the results again. The queue ends up being a simple job management system with the EC2 nodes popping new messages as they finish processing them. With a little bit of work, features and the right use case I think could be a pretty powerful system. Check out the code, my other projects and follow me on twitter @williamsjoe. [edit: made a slight improvement to changes_sub.rb on 20100107]
Fork me on GitHub