Mixing in MongoDB

MongoDB provides document-oriented storage with replication, indexing and rich query support. It blends aspects of distributed key-value stores by keeping most of the data in memory, pulls in fine-grained indexing and a query language reminiscent of relational databases, and provides map-reduce support for data processing similar to Bigtable, all while optimizing for fast in-place modifications. In the near term, it will provide horizontal scaling and sharding.
The tradeoff with MongoDB is the lack of transactions and of full ACID guarantees. In particular, MongoDB does not offer single-server durability, and data is only eventually consistent. Due to its filesystem-level design choices (fsync vs. append-only writes, commit log, etc.), MongoDB can lose data during a hard server failure.
Knowing all of these tradeoffs, why did I choose to go with MongoDB?

The Problem

I want to track a bunch of data for certain kinds of views and then display custom analytics. The data collected includes a combination of request environment and internal statistics correlated with request parameters. I did not want to write this to a traditional database for every request because a) the data is adjunct to the functionality, b) it involves a select+insert or select+update for each request and c) writes are expensive. Furthermore, the write is not critical enough to hold up the request, and definitely not worth adding a queue infrastructure.
Initially, I wrote a batch-mode logfile analyzer which periodically collated this information into a traditional database. It worked great, except it was not real-time, and every time I needed to track a new statistic I had to run a migration. As the app moved to a clustered environment, some of the data was in the front-end logfiles, and some in the app slices. Running different kinds of batch jobs on multiple servers was getting clunky.
I looked at several alternatives, and quickly eliminated most as not being a good fit for the problem, or sometimes for irrational reasons.
- Redis - Persistence seems awkward and schema support is limited.
- CouchDB - Requires reading up on map-reduce and learning to write views before you can use it.
- Cassandra - Requires mucking with Java VM installs, XML configurations and jar files, and grokking the Thrift interface.
- HBase, Voldemort, others - Great, and worth a look someday. But...
    production:
      adapter: mysql
      host: m03
      database: myapp_production
      mongo_database: myapp_production
      mongo_host: m04

    # Load the MongoDB settings from the same file as the regular database config.
    dbconf = YAML::load(File.read("#{RAILS_ROOT}/config/database.yml"))
    mongo_db = dbconf[Rails.env] ? dbconf[Rails.env]['mongo_database'] : false
    # Only connect when the current environment defines a mongo_database.
    if mongo_db
      mongo_host = dbconf[Rails.env]['mongo_host']
      $stderr.puts "MongoDB connecting to #{mongo_db} on #{mongo_host}"
      MongoMapper.connection = Mongo::Connection.new(mongo_host)
      MongoMapper.database = mongo_db
    end

- Line up your mongo_mapper, bson and bson_ext gems so they are in sync. One symptom of a mismatch is a non-obvious error message, such as OrderedHash not being found.
- If you use Array data fields, then MongoMapper will create and populate the Array field on #create, but only if you provide a lambda default.
- If you want to do upserts, you must drop down to the Mongo Ruby Driver. Fortunately, this is as easy as invoking methods on the object returned by MongoMappedClass.collection.
- If your sole use-case is to do upserts, avoid Array fields. It is not possible to access a single Array field element using positional or dot-notation when doing upserts.
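As a rough illustration of the last two tips, here is a minimal sketch of a counter upsert that keeps per-browser counts in an embedded document addressed with dot-notation rather than in an Array field. The field names (path, day, hits, browsers) and the helper names are hypothetical, the use of the $inc operator is my own choice for the example, and the collection argument is assumed to behave like the 1.x Mongo Ruby Driver's Collection#update(selector, document, opts).

```ruby
# Sketch only: the field names and helper names below are hypothetical.

# Build the selector and update documents for one tracked page view.
def view_upsert_docs(path, browser, day = Time.now.utc.strftime('%Y-%m-%d'))
  # MongoDB field names may not contain "." or start with "$",
  # so replace those characters before using the value as a key.
  browser_key = browser.to_s.gsub(/[.$]/, '_')

  selector = { 'path' => path, 'day' => day }
  update = {
    '$inc' => {
      'hits' => 1,                      # total views for this path and day
      "browsers.#{browser_key}" => 1    # counter inside an embedded document
    }
  }
  [selector, update]
end

# `collection` is assumed to respond to #update(selector, document, opts),
# as the object returned by a MongoMapper model's .collection does with
# the 1.x Mongo Ruby Driver.
def record_view(collection, path, browser, day = Time.now.utc.strftime('%Y-%m-%d'))
  selector, update = view_upsert_docs(path, browser, day)
  # :upsert => true asks the server to insert the document if none matches.
  collection.update(selector, update, :upsert => true)
end
```

With a MongoMapper model, say a hypothetical ViewStat, the call would look like record_view(ViewStat.collection, '/posts/1', 'firefox'). Because the whole write is a single update, there is no select+insert or select+update round trip per request.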
