My experience with using MongoDB for great science.
EDIT: To clarify, because people don’t seem to get it. I experienced silent data corruption, both on 1.3.3, a development version, and 1.4.0. 64-bit. Can you guys please accept that MongoDB ate my data now? Thanks.
For a bit of introduction: last year I enrolled in the Machine Learning course at University College London, and it is now time for me to start my MSc project. The topic is analysis of social networks, so I have to get a large amount of data from a well-known website and analyse it.
To this end, I wrote a small script to use the site’s API to retrieve the data (after getting the necessary permission and everything), and store it in a SQLite database. As you may know, SQLite is sufficiently fantastic, but I thought this would be a good chance to learn something new and read up about NoSQL databases.
Initially, I tried CouchDB, but I discovered it to be a bad fit for my purposes (it’s not really practical to run ad-hoc, single-use queries on it, as that takes a long time, so it’s not the DB to use when you don’t know what you want to do with your data). The good people in #couchdb suggested that I might want to use MongoDB instead, so I installed it.
At first, I was ecstatic. I didn’t need to declare schemas any more, which was great for my ever-changing data. I could just store whatever I wanted and run the simple queries I needed, and everything was very fast. I was really excited about it and decided to use MongoDB for my next project, whatever that may be.
After gathering some data, however, I hit my first snag. I queried for items where “realname” existed and got back all the documents for which realname didn’t exist! I hurried back to #mongodb to ask, and the developers confirmed this was a bug. “No big deal”, I thought, “bugs are bound to happen, especially in a project this new.” I looked at the version, and it was 1.3.3. “Oh well, no matter, it appears older than I thought but bugs are bugs”. I thought nothing of it, rebuilt my indexes and the problem went away.
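For reference, here is a minimal sketch of what an existence query like that is supposed to return. It assumes the query used MongoDB's `$exists` operator (the exact query is not shown above), and it evaluates the filter against plain Python dicts instead of a live server. The helper names `field_exists_filter` and `matches` and the sample documents are hypothetical, made up for illustration.

```python
def field_exists_filter(field, should_exist=True):
    """Build a MongoDB-style filter document for an existence check."""
    return {field: {"$exists": should_exist}}


def matches(doc, query):
    """Evaluate a filter made only of $exists clauses against a plain dict.

    This is a local stand-in for the server-side check, not driver code.
    """
    for field, condition in query.items():
        wanted = condition["$exists"]
        if (field in doc) != wanted:
            return False
    return True


# Hypothetical sample documents; only some of them carry "realname".
docs = [
    {"_id": 1, "realname": "Ada"},
    {"_id": 2},
    {"_id": 3, "realname": None},  # field present, value null
]

query = field_exists_filter("realname")
with_realname = [d["_id"] for d in docs if matches(d, query)]
print(with_realname)
```

A correct result contains documents 1 and 3 (a field set to null still exists) and never document 2, which is the opposite of what I got back.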
A few days later, I went back to check on my script, and noticed that, although the script was spidering data, the document count in the database kept shrinking. “What the hell did I do?”, I thought, and promptly stopped it. I ran some tests and discovered that, whenever I updated a document, the document was deleted with no trace whatsoever.
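In hindsight, a cheap guard in the spider would have caught this days earlier. The sketch below is not what my script did; it is one possible check, assuming the script only ever inserts or updates documents, so the count should never go down. `CountGuard` is a hypothetical helper, and the numbers are invented.

```python
class CountGuard:
    """Tracks a document count and flags any unexpected decrease."""

    def __init__(self, initial_count):
        self.last_count = initial_count

    def check(self, current_count):
        """Record the new count, or raise if documents have gone missing."""
        if current_count < self.last_count:
            raise RuntimeError(
                "document count dropped from %d to %d"
                % (self.last_count, current_count)
            )
        self.last_count = current_count


# Invented counts standing in for what the database would report
# after each batch of spidered documents.
guard = CountGuard(initial_count=1000)
guard.check(1250)  # count grew: fine
guard.check(1250)  # count unchanged (updates only): fine
print(guard.last_count)
```

In a real spider, the value passed to `check` would come from asking the database for its document count after every batch, and the script would stop on the first `RuntimeError` instead of happily running for four days.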
I rushed back to the IRC channel and gave some code to reproduce the behaviour, whereupon I was nonchalantly informed that it “looks like [I] hit a bug”. Gee, thanks, I lost four days’ worth of data because I “hit a bug”. Very helpful. The suggestion was to upgrade to 1.4.0, because “1.3.3 isn’t very stable anyway”. I have no idea why it’s past version 1 or why it’s marked as “stable” on the site…
I decided to cut my losses and upgraded the DB, and the problem went away, so I ran the script again, got more data, and then noticed that the document count had, once again, stopped going up. Cursing the moment I ever decided to use MongoDB, I went to see what was going on. After a few tests and having data mysteriously get lost, I restarted the DB to see if that would fix anything, at which point MongoDB refused to let me connect because “I had reached the data limit for the 32-bit build”.
Seriously? Seriously? MongoDB dies after about 500,000 documents, silently corrupting my data, not issuing any warnings and then refusing to let me even read it? I’ve never seen such broken behaviour in any other piece of software I’ve used. I went back to the channel, seething (I can’t imagine the guys in there were very happy with providing free support to an angry person, but they were helpful nonetheless), and detailed my predicament. Obviously, the solution would be to reformat my server and install a 64-bit OS if I wanted to have more than 500k documents in the database.
My palms having left an indentation in my face, I decided to move back to SQLite or Postgres, which consider anything below millions of rows trivial. When I asked what I could do to read my data (short of reformatting), the devs told me to delete a file and rebuild the database, which would probably let me access the data (minus some that were in that file). I did that, and was able to query and count, and saw that all my documents were there.
I then asked if it was fine or if I was going to find corruption when I tried to read it, and they suggested I run db.repairDatabase() to rebuild everything. No sooner said than done, I ran the command and left it for a few hours to run. At some point I lost my SSH connection to the server but thought that it would probably keep running or, at least, leave the database in some half-consistent state so I could rerun the command. That’s exactly what it did, but it also deleted 90% of my data, which I couldn’t recover.
In summary, I would not let MongoDB near my children, let alone run it in a production environment (or even a testing one). It’s a great piece of software when it works, but that’s not very often. A database that doesn’t really store data is very, very dangerous.
EDIT: As readers have pointed out, 1.3.x was a development branch (odd-numbered minor revisions are development), but that was the first I had heard of it. I didn’t see it mentioned on the site or in the IRC channel, sadly. If I had, I would have used 1.2 (but it wouldn’t have made any difference with the 32-bit issue, in the end).