Fri, 09 Jul 10

Mixing in MongoDB

I have started using MongoDB for specific use cases in production. I don’t see the world in black and white, and neither should you. My goal in writing this post is to note down the rationale and list some gotchas when using it with MongoMapper.

MongoDB provides document-oriented storage, with replication, indexing and rich query support. It blends aspects of distributed key-value stores by keeping most of the data in memory, pulls in fine-grained indexing and a query language reminiscent of relational databases, and provides map-reduce support for data processing similar to Bigtable, all while optimizing for fast in-place modifications. In the near term, it will provide horizontal scaling and sharding.

The tradeoff with MongoDB is the lack of transactions and of full ACID guarantees. In particular, MongoDB does not offer single-server durability, and data is only eventually consistent. Due to its filesystem design choices (fsync vs. append-only writes, commit log, etc.), MongoDB can lose data during a hard server loss.

Given all these aspects, why did I choose to go with MongoDB?

The Problem

I want to track a bunch of data for certain kinds of views and then display custom analytics. The data collected includes a combination of request environment and internal statistics correlated with request parameters. I did not want to write this to a traditional database for every request because a) the data is adjunct to the functionality, b) it involves a select+insert or select+update for each request and c) writes are expensive. Furthermore, the write is not critical enough to hold up the request, and definitely not worth adding a queue infrastructure.

Initially, I wrote a batch-mode logfile analyzer which periodically collated this information into a traditional database. It worked great, except it was not real-time, and every time I needed to track a new statistic I had to run a migration. As the app moved to a clustered environment, some of the data was in the front-end logfiles, and some in the app slices. Running different kinds of batch jobs on multiple servers was getting clunky.

I looked at several alternatives, and quickly eliminated most as not being a good fit for the problem, or sometimes for irrational reasons.

Redis - Persistence seems awkward and schema support is limited.
CouchDB - Requires reading up on map-reduce and being able to write views before you can use it.
Cassandra - Requires mucking with Java VM installs, XML configurations and jar files, and grokking the Thrift interface.
HBase, Voldemort, others - Great, and worth a look someday. But…

At some point, in going down the list, I hit on MongoDB and came across the wealth of documentation and support base that 10gen has created. As a Rubyist, I found in MongoMapper a good balance of ActiveRecord goodness and the MongoDB API characteristics. I also love the MongoDB API, because it is similar to a JavaScript-style database API that I wrote for airdb. MongoDB also provides support for map/reduce operations, so you can use it for aggregate data processing when you need it.

Several others have documented their switch to MongoDB, and their use cases lined up closely with mine, whether for logging, analytics or dealing with high volume, aggregate data - where the loss of a few hundred writes out of millions is not a big deal.

Finally, one aspect that sealed the choice was the notion of modifier operations, which allow fast, atomic, in-place incrementing of counters, something that I need to do 99% of the time for this problem.
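As a rough sketch of what such a modifier operation looks like from Ruby, the snippet below builds the two documents an `$inc` update needs: a selector that finds the counter document and an update that bumps a field in place. The helper name and the field names (`path`, `day`, `hits`) are made up for illustration and are not from my app.

```ruby
# Build the selector and update documents for an atomic counter bump.
# NOTE: hit_counter_update and the path/day/hits fields are hypothetical.
def hit_counter_update(path, day, by = 1)
  selector = { 'path' => path, 'day' => day }   # which document to touch
  update   = { '$inc' => { 'hits' => by } }     # increment in place, server-side
  [selector, update]
end

selector, update = hit_counter_update('/pricing', '2010-07-09')

# With the Mongo Ruby driver these would be handed to the collection, e.g.:
#   collection.update(selector, update)
```

Because the increment happens on the server, there is no select-then-update round trip in the request.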

Installing MongoDB with Ruby and MongoMapper was straightforward. There are good instructions for using it with Rails. I am not doing a complete switch, just mixing it in. The easiest way to do that, I found, was to add the mongo configuration to database.yml.

  production:
    adapter: mysql
    host: m03
    database: myapp_production
    mongo_database: myapp_production
    mongo_host: m04

And then read this information in and use it to set MongoMapper.connection and MongoMapper.database in config/initializers/mongo_connect.rb:

require 'yaml'

# Read the Mongo settings that live alongside the MySQL ones in database.yml.
dbconf = YAML.load(File.read("#{RAILS_ROOT}/config/database.yml"))
env_conf = dbconf[Rails.env] || {}
mongo_db = env_conf['mongo_database']

# Only connect when the current environment defines a Mongo database.
if mongo_db
  mongo_host = env_conf['mongo_host']
  $stderr.puts "MongoDB connecting to #{mongo_db} on #{mongo_host}"
  MongoMapper.connection = Mongo::Connection.new(mongo_host)
  MongoMapper.database = mongo_db
end

Don’t forget to add the per-worker process configuration to your initializer if you are using Phusion Passenger.
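A minimal sketch of that per-worker hook, assuming Passenger's smart spawning, where workers are forked and would otherwise share the parent's socket. The `reconnect_mongo` helper is hypothetical, and the reconnect method name is an assumption that depends on your driver release (`connect_to_master` in older ones, `connect` in later ones), so check the one you have installed.

```ruby
# Re-open the Mongo connection inside a freshly forked worker.
# NOTE: reconnect_mongo is a hypothetical helper; which of the two
# method names exists depends on the installed driver version.
def reconnect_mongo(connection)
  if connection.respond_to?(:connect)
    connection.connect
  else
    connection.connect_to_master
  end
end

# Outside Passenger the constant is undefined and nothing is registered.
if defined?(PhusionPassenger)
  PhusionPassenger.on_event(:starting_worker_process) do |forked|
    # forked is true when the worker was forked from a preloaded app,
    # which is the case where the socket must not be shared.
    reconnect_mongo(MongoMapper.connection) if forked
  end
end
```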

Some other gotchas:

Line up your mongo_mapper, bson and bson_ext gems so they are in sync. One symptom of a mismatch is a non-obvious error message, such as OrderedHash not being found.
If you use Array data fields, then MongoMapper will create and populate the Array field on #create, but only if you provide a lambda default.
If you want to do upserts, you must drop down to the Mongo Ruby Driver. Fortunately, doing this is as easy as invoking methods on the collection returned by MongoMappedClass.collection.
If your sole use-case is to do upserts, avoid Array fields. It is not possible to access a single Array field element using positional or dot-notation when doing upserts.
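To make the last two points concrete, here is a hedged sketch of an upsert done through the driver-level collection. It sidesteps the Array limitation by keeping per-hour counters in an embedded document addressed with dot notation, which is my assumption about a reasonable workaround rather than the only option. The `record_hit` helper, the `PageStat` model and the `path`, `total` and `hours` fields are all hypothetical.

```ruby
# Record one hit for a path and hour with a single upsert.
# `collection` is anything that responds to #update the way the driver's
# Mongo::Collection does, e.g. PageStat.collection (hypothetical model).
def record_hit(collection, path, hour)
  collection.update(
    { 'path' => path },                                    # selector
    { '$inc' => { 'total' => 1, "hours.#{hour}" => 1 } },  # counters to bump
    :upsert => true                                        # insert if missing
  )
end

# Usage inside the app would look something like:
#   record_hit(PageStat.collection, '/pricing', Time.now.hour)
```

If no document matches the selector, the upsert creates one with the counters initialized from the increment values, so there is no separate insert path to maintain.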

In sum, you can start using MongoDB alongside MySQL inside your Rails projects as appropriate and side-step the whole SQL vs. NoSQL wars. Here is to humongous fun while data processing.
