Nov 16, 2010

Build full-text search capability on top of MongoDB

Recently I have been working on a content management system migration project. We exported the existing system into a folder of 18k XML files organized into a set of subfolders. We need to understand the relationships among the files and folders, manipulate them by adding, removing, and updating content, and then import the result into a new server.

Navigating such a huge number of files is definitely not an easy task, and the embedded base64-encoded content makes it almost a mission impossible. This is why I decided to build a full-text search tool that helps us quickly locate the files containing whatever content we are interested in at any time.

I picked MongoDB because it is a document database that doesn't require a strict schema definition, and I am quite familiar with it as the author of the play-morphia plugin. So what I did was write a script (in Perl) that parses all the XML files into the database. The script parses the tags, attributes, text nodes, and CDATA sections of the XML files, and decodes base64 elements when it finds them. The interesting part is how to set up full-text search capability in MongoDB. Following this article, I created an array field called _keywords for each document. The XML parser script adds to the _keywords array every word found in the tags, attributes, text, CDATA sections, and decoded base64 content.
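
The original script is written in Perl and is not shown here, so the following is only a minimal JavaScript (Node.js) sketch of the keyword-building step. The helper names (extractKeywords, decodeBase64, buildDoc) and the path field are hypothetical, and it assumes the base64 payloads hold UTF-8 text and that splitting on non-word characters is good enough.

```javascript
// Hypothetical helper: split a piece of text into unique keywords.
// Splitting on non-word characters is a simplification (ASCII words only).
function extractKeywords(text) {
  const words = text.split(/\W+/).filter(w => w.length > 0);
  return Array.from(new Set(words)); // Set keeps first-seen order, drops repeats
}

// Hypothetical helper: decode a base64 payload, assumed to hold UTF-8 text.
function decodeBase64(payload) {
  return Buffer.from(payload, 'base64').toString('utf8');
}

// Build the document to store: the file path plus its _keywords array.
// textParts stands for tag names, attribute values, text nodes and CDATA
// already pulled out of the XML; base64Parts are the encoded elements.
function buildDoc(path, textParts, base64Parts) {
  const decoded = base64Parts.map(decodeBase64);
  const all = textParts.concat(decoded).join(' ');
  return { path: path, _keywords: extractKeywords(all) };
}

const encoded = Buffer.from('office in QLD').toString('base64');
console.log(buildDoc('a/b.xml', ['state NSW'], [encoded]));
```

The resulting object is what would be inserted into the collection, one per XML file.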

Next is to implement the full-text search query, which the article mentioned above does not cover. The key to a full-text query is to use the $all operator with regular expressions. For example, to find documents that contain both "NSW" and "QLD", the query in JavaScript could be:

db.docs.find({_keywords:{$all:[/\bNSW\b/, /\bQLD\b/]}});
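
To make the matching rule concrete, here is a small plain-JavaScript sketch that emulates, for a single document, how I understand $all with regular expressions to behave on an array field: every pattern has to match at least one element of the array. The matchesAll function is a hypothetical stand-in for illustration, not part of MongoDB.

```javascript
// Emulates {_keywords: {$all: [re1, re2, ...]}} for one document:
// each pattern must match at least one element of the keywords array.
function matchesAll(keywords, patterns) {
  return patterns.every(re => keywords.some(k => re.test(k)));
}

const patterns = [/\bNSW\b/, /\bQLD\b/];

console.log(matchesAll(['office', 'NSW', 'QLD'], patterns)); // true
console.log(matchesAll(['office', 'NSW'], patterns));        // false
```

Note that the \b word boundaries keep a keyword such as "NSWX" from matching the first pattern.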

In Java, the code snippet could be:

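import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;

// keys_, col_, skip_ and limit_ are defined elsewhere (not shown)
DBObject q = new BasicDBObject();
if (keys_.size() > 0) {
    DBObject keys = new BasicDBObject();
    List<Pattern> pl = new ArrayList<Pattern>();
    for (String w : keys_) {
        // replaceAll (regex), not replace (literal): turn each run of
        // whitespace in the term into the pattern \s+
        w = w.replaceAll("\\s+", "\\\\s+");
        pl.add(Pattern.compile(w));
    }
    // $all: every pattern must match an element of _keywords
    keys.put("$all", pl.toArray(new Pattern[0]));
    q.put("_keywords", keys);
}
DBCursor c = col_.find(q);
// apply paging only when requested
if (0 < skip_) c.skip(skip_);
if (0 < limit_) c.limit(limit_);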
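
For comparison, the same query construction can be sketched in JavaScript for use from the mongo shell. The helper names buildKeywordQuery and escapeRegex are hypothetical, and this sketch makes two assumptions that the Java snippet does not: it escapes regex metacharacters in the user's terms and wraps each term in \b word boundaries, as in the NSW/QLD example.

```javascript
// Hypothetical helper: escape characters that have a special meaning in a regex.
function escapeRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: turn user-entered terms into a $all query on _keywords.
function buildKeywordQuery(terms) {
  if (terms.length === 0) return {}; // no terms: match every document
  const patterns = terms.map(t => {
    // Any run of whitespace inside a term becomes \s+ in the pattern.
    const parts = t.trim().split(/\s+/).map(escapeRegex);
    return new RegExp('\\b' + parts.join('\\s+') + '\\b');
  });
  return { _keywords: { $all: patterns } };
}

const query = buildKeywordQuery(['NSW', 'QLD']);
console.log(query);

// In the mongo shell, paging is chained onto the cursor, for example:
//   db.docs.find(query).skip(20).limit(10);
```

The skip and limit values in the trailing comment are placeholders; they play the role of skip_ and limit_ in the Java snippet.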