Shahzad Bhatti

October 14, 2009

Querying and Indexing CouchDB documents using Lucene

Filed under: Uncategorized — admin @ 9:32 pm

I have been playing with CouchDB lately and was looking for a way to index the documents stored in it. So I started an open source project, DocuSearch. It includes both POJO-based and REST-based services for indexing and searching, which are hosted in a Jetty server.

Getting Started

To get started, download the source using:

 svn checkout http://docusearch.googlecode.com/svn/trunk/ docusearch-read-only  
 or
 git clone git://github.com/bhatti/DocuSearch.git
 

You will need to install Java 1.6, Maven 2.0+ and CouchDB before you start using the services. On a Mac, you can install CouchDB via MacPorts:

 sudo port install couchdb
 

Then start CouchDB manually using:

 sudo /opt/local/bin/couchdb
 

You can verify that CouchDB is running by opening http://localhost:5984/_utils/index.html in your browser.

Building

Type “mvn” to build the project. Maven will download a number of dependencies, which may take a few minutes, and cache them locally. It will then compile, test and build the war file.

Populating Database

You are free to choose your favorite way to add or import data into CouchDB, though DocuSearch includes some ETL programs for loading comma- or tab-delimited data into CouchDB. For example, let’s say you want to find authorized e-file providers for the IRS, so you download some data from the IRS that has the following format:

 business_name,street_address_1,street_address_2,city,state,zip,zip_4,contact_first_name,contact_middle_name,contact_last_name,phone,flag1,flag2,flag3,flag4
 

You can import it into CouchDB using:

 mvn exec:java -Dexec.mainClass="com.plexobject.docusearch.etl.DocumentLoader" \
 -Dexec.args="efile_providers data/wa.txt none business_name,street_address_1,street_address_2,city,state,zip,zip_4,contact_first_name,contact_middle_name,contact_last_name,phone"
 

The loader takes the following arguments:

  • name-of-database, e.g. efile_providers
  • name of comma delimited file, e.g. data/wa.txt
  • id-column or none if database ids will automatically be generated
  • comma-delimited list of fields to be imported
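
To make these arguments concrete, here is a minimal Python sketch of what a loader like this might do with them. It is not the DocuSearch DocumentLoader (which is Java); the function name rows_to_documents is hypothetical, and the sketch assumes the data file starts with a header row naming the columns.

```python
import csv
import io


def rows_to_documents(text, id_column, fields, delimiter=","):
    """Turn delimited text (with a header row) into CouchDB-style documents.

    id_column is a column name, or "none" to let CouchDB generate the ids.
    Only the columns listed in fields are copied into each document.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    documents = []
    for row in reader:
        # Missing or empty columns become empty strings.
        doc = {name: (row.get(name) or "") for name in fields}
        if id_column != "none":
            doc["_id"] = row[id_column]
        documents.append(doc)
    return documents


# Made-up sample with a subset of the columns; flag1 is deliberately not imported.
sample = (
    "business_name,city,state,flag1\n"
    "Mr Tax Man,LYNNWOOD,WA,Y\n"
    "Liberty Tax Service,PASCO,WA,N\n"
)
docs = rows_to_documents(sample, "none", ["business_name", "city", "state"])
print(docs[0])
```

Passing "none" as the id column leaves out "_id", so CouchDB assigns one when the document is stored.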

Once the data is loaded, you can create the Lucene index, but first you have to specify the index policy, which is just another CouchDB document. The index policy specifies which fields are indexed, whether they are stored in the index, and their score and boost values. These policy configurations are stored in the the_config database, and you can add a policy using:

 curl -X PUT http://127.0.0.1:5984/the_config/index_policy_for_efile_providers -d \
 '{"_id":"index_policy_for_efile_providers","dbname":"the_config","score":0,"boost":0,"fields":[{"name":"business_name", "storeInIndex":"true"},{"name":"street_address_1"},{"name":"city"},{"name":"zip"},{"name":"contact_first_name"},{"name":"contact_last_name"}]}'
 

It will return

 {"ok":true,"id":"index_policy_for_efile_providers","rev":"1-0fd2f5b2e2012f898df677c68daf4592"}
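
If you have many fields, writing the policy JSON by hand gets tedious. Below is a minimal Python sketch that generates a document with the same shape as the one in the curl command above. The helper name build_index_policy is hypothetical, and the keys are simply copied from that example rather than taken from any DocuSearch schema.

```python
import json


def build_index_policy(database, fields, stored=()):
    """Build an index policy document shaped like the curl example above."""
    policy_fields = []
    for name in fields:
        entry = {"name": name}
        if name in stored:
            # The example passes this flag as the string "true".
            entry["storeInIndex"] = "true"
        policy_fields.append(entry)
    return {
        "_id": "index_policy_for_" + database,
        "dbname": "the_config",
        "score": 0,
        "boost": 0,
        "fields": policy_fields,
    }


policy = build_index_policy(
    "efile_providers",
    ["business_name", "street_address_1", "city", "zip",
     "contact_first_name", "contact_last_name"],
    stored=["business_name"],
)
# This string can be used as the -d argument of the PUT request.
print(json.dumps(policy))
```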
 

Note that you will need to pass the “_rev” parameter if you need to update the index policy. Later, you can retrieve the policy using:

 curl http://localhost:5984/the_config/index_policy_for_efile_providers
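
CouchDB rejects an update that does not carry the current revision of the document. One way to prepare an update is to copy the "rev" value from the earlier PUT response into the document as "_rev", as in this minimal sketch (the function name prepare_update is hypothetical):

```python
import json


def prepare_update(policy, put_response_text):
    """Return a copy of policy carrying the revision from an earlier PUT response."""
    response = json.loads(put_response_text)
    if not response.get("ok"):
        raise ValueError("CouchDB did not accept the previous write")
    updated = dict(policy)  # shallow copy, the original stays untouched
    updated["_rev"] = response["rev"]
    return updated


# Response text copied from the PUT shown earlier in this post.
put_response = ('{"ok":true,"id":"index_policy_for_efile_providers",'
                '"rev":"1-0fd2f5b2e2012f898df677c68daf4592"}')
original = {"_id": "index_policy_for_efile_providers", "score": 0, "boost": 0}
updated = prepare_update(original, put_response)
print(json.dumps(updated))
```

The resulting JSON can be sent with the same PUT request as before. Each successful update returns a new revision, which you need for the next change.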
 

Now you are ready to build the index, but first let’s start Jetty with the REST-based services via:

 mvn jetty:run-war
 

Now open your browser and point it to:

 http://localhost:8080
 

Finally, you can use curl to build the index via:

 curl -vX POST http://localhost:8080/api/index/primary/efile_providers
 

Before you can query, you have to specify a query policy, which is also stored in CouchDB and lists the fields to be searched, e.g.

 curl -X PUT http://127.0.0.1:5984/the_config/query_policy_for_efile_providers -d \
 '{"_id":"query_policy_for_efile_providers","dbname":"the_config","fields":[{"name":"efile_providers.business_name", "boost":2},{"name":"efile_providers.street_address_1"},{"name":"efile_providers.city"},{"name":"efile_providers.zip"},{"name":"efile_providers.contact_first_name"},{"name":"efile_providers.contact_last_name"}]}'
 

Which will return

 {"ok":true,"id":"query_policy_for_efile_providers","rev":"1-618703c1fd66996f23b89c4414dd0842"}
 

Again, you will need to pass the “_rev” parameter when updating the query policy. Next, you can search the contents of the index via:

 curl "http://localhost:8080/api/search/efile_providers?keywords=mike"
 

Which will return

 {"suggestions":[],"keywords":"mike","start":0,"limit":0,"totalHits":7,"docs":[{"_id":"0352d18145532a05714bfec2e1e649dd","dbname":"efile_providers","indexDate":"20091121","doc":"53","score":"0.0","owner":"*","efile_providers.business_name":"Mr Tax Man"},{"_id":"062d548eb394db3534782c5b6ded0529","dbname":"efile_providers","indexDate":"20091121","doc":"96","score":"0.0","owner":"*","efile_providers.business_name":"Liberty Tax Service"},{"_id":"1ddc6006a2315dd0b0119c0dbc22c1a7","dbname":"efile_providers","indexDate":"20091121","doc":"450","score":"0.0","owner":"*","efile_providers.business_name":"1040 PLUS INC"},{"_id":"3621cc7edde5f191bcc5f3a41160f61e","dbname":"efile_providers","indexDate":"20091121","doc":"793","score":"0.0","owner":"*","efile_providers.business_name":"MIKE A PASSECK CPA"},{"_id":"37a2a152ff120ac293ea67daac1a11aa","dbname":"efile_providers","indexDate":"20091121","doc":"811","score":"0.0","owner":"*","efile_providers.business_name":"Liberty Tax Service"},{"_id":"be0fd60800b9eed6d418601f8cba06f3","dbname":"efile_providers","indexDate":"20091121","doc":"2856","score":"0.0","owner":"*","efile_providers.business_name":"Liberty Tax Service"},{"_id":"dfa948236e87d0c6ba90c612cb166635","dbname":"efile_providers","indexDate":"20091121","doc":"3395","score":"0.0","owner":"*","efile_providers.business_name":"MIKE FOLEYS TAX SERVICE"}]}
 

You can also test the query functionality through a simple HTML-based interface by pointing your browser to http://localhost:8080/.

The index stores the id of each indexed document, so you can also retrieve the details of each result using:

 http://localhost:8080/api/storage/efile_providers/0352d18145532a05714bfec2e1e649dd
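
To go from a search response to the detail URL of each hit, a small script is enough. The sketch below is only an illustration: the function name detail_urls is hypothetical, and the sample is trimmed by hand from the search response shown earlier.

```python
import json
from urllib.parse import quote

BASE_URL = "http://localhost:8080"  # Jetty address used throughout this post


def detail_urls(search_response_text, database, label_field="business_name"):
    """Map each hit in a search response to (label, /api/storage URL)."""
    response = json.loads(search_response_text)
    results = []
    for hit in response["docs"]:
        # Stored fields come back prefixed with the database name.
        label = hit.get(database + "." + label_field, "")
        url = "%s/api/storage/%s/%s" % (
            BASE_URL, quote(database, safe=""), quote(hit["_id"], safe=""))
        results.append((label, url))
    return results


# Two hits trimmed from the response to keywords=mike.
sample = json.dumps({
    "keywords": "mike",
    "totalHits": 7,
    "docs": [
        {"_id": "0352d18145532a05714bfec2e1e649dd", "doc": "53",
         "efile_providers.business_name": "Mr Tax Man"},
        {"_id": "3621cc7edde5f191bcc5f3a41160f61e", "doc": "793",
         "efile_providers.business_name": "MIKE A PASSECK CPA"},
    ],
})
urls = detail_urls(sample, "efile_providers")
for label, url in urls:
    print(label, url)
```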
 

You can also view these details from the HTML interface by clicking on the details link.

You can also debug why certain results are showing up using the following API:

 http://localhost:8080/api/search/explain/efile_providers?keywords=mike
 

The same explanation is available from the HTML interface by clicking on the explain button.

Next, you can also find the top terms used in the index using:

 http://localhost:8080/api/search/rank/efile_providers?limit=1000
 

Again, the top terms can be viewed from the HTML interface by clicking on the top terms button.

You can also find documents similar to a particular search result using:

 http://localhost:8080/api/search/similar/efile_providers?externalId=37a2a152ff120ac293ea67daac1a11aa&luceneId=811&detailedResults=true
 

Which will return

 {"externalId":"37a2a152ff120ac293ea67daac1a11aa","luceneId":811,"start":0,"limit":0,"totalHits":973,"docs":[{"zip":"98107","phone":"206\/782-2772","contact_first_name":"TOR","street_address_2":"","street_address_1":"5919 NW 15TH AVE","state":"WA","city":"SEATTLE","_rev":"1-6f14e2e9d2092e63173002cd95785963","business_name":"LIBERTY TAX SERVICE","_id":"00684037657ef8960ede2f155339420e","contact_middle_name":"","zip_4":"","dbname":"efile_providers","contact_last_name":"SLINNING"},{"zip":"98118","phone":"206\/850-0505","contact_first_name":"ANDREW","street_address_2":"","street_address_1":"5021 SOUTH BARTON","state":"WA","city":"SEATTLE","_rev":"1-f910f4736db188d05f24751a68070b86","business_name":"H&A TAX PREPARATION SVCS","_id":"00b6c05dd24c30c4740b7aa1257ef308","contact_middle_name":"H","zip_4":"5336","dbname":"efile_providers","contact_last_name":"HODGE"},{"zip":"98682","phone":"360\/891-6701","contact_first_name":"MARILYN","street_address_2":"","street_address_1":"5101 NE 121ST AVE #50","state":"WA","city":"VANCOUVER","_rev":"1-6ab3f77b03eee2c529f910c559236eb3","business_name":"AFFORDABLE BOOKKEEPING & TAX SERVIC","_id":"00e35b6bbfbfe68db8e962bc41ec6c99","contact_middle_name":"C","zip_4":"","dbname":"efile_providers","contact_last_name":"BOON"},{"zip":"98406","phone":"206\/322-2226","contact_first_name":"MAN","street_address_2":"","street_address_1":"602 6TH AVE","state":"WA","city":"TACOMA","_rev":"1-4efeefbdaadaf9dde2a49f7246f884b5","business_name":"INSTANT TAX PRO","_id":"00fe1df22fe5731e01515cada787efd2","contact_middle_name":"V","zip_4":"","dbname":"efile_providers","contact_last_name":"SAM"},{"zip":"98208","phone":"425\/338-0118","contact_first_name":"STEPHEN","street_address_2":"","street_address_1":"3615 100TH ST SE","state":"WA","city":"EVERETT","_rev":"1-381ea9171f405bf40f78597a91730588","business_name":"ADSUM TAX & BOOKKEEPING LLC","_id":"014548ee3e23e5d56d4521b76de8434a","contact_middle_name":"D","zip_4":"","dbname":"efile_providers","contact_last_name":"TANGEN"},{"zip":"99116","phone":"509\/633-3829","contact_first_name":"RICHARD","street_address_2":"","street_address_1":"102 STEVENS","state":"WA","city":"COULEE DAM","_rev":"1-1b767048def4829db756f04014733681","business_name":"MEYER TAX SERVICE","_id":"016a700252b54fc170ffc0f69c60ce93","contact_middle_name":"W","zip_4":"","dbname":"efile_providers","contact_last_name":"AVEY"},{"zip":"98391","phone":"253\/862-5573","contact_first_name":"Tim","street_address_2":"","street_address_1":"20616 SR 410 E","state":"WA","city":"Bonney Lake","_rev":"1-5b6ee0d167743b1679c8c3f84f16d78b","business_name":"Barrans Tax Service","_id":"017a48b555806eda9f7999b426b00d14","contact_middle_name":"","zip_4":"","dbname":"efile_providers","contact_last_name":"Barrans"},{"zip":"98503","phone":"360\/456-5084","contact_first_name":"THOMAS","street_address_2":"","street_address_1":"4440 PACIFIC AVE SE","state":"WA","city":"LACEY","_rev":"1-dcbfcb5c112e3ef1e109f7bbfd410e9a","business_name":"TAX CENTERS OF AMERICA","_id":"01a024fe186a0df2f313191a951dbb1c","contact_middle_name":"B","zip_4":"","dbname":"efile_providers","contact_last_name":"OTT"},{"zip":"WA","phone":"Stevenson","contact_first_name":"","street_address_2":"924 West S Circle","street_address_1":"LLC","state":"Washougal","city":"","_rev":"1-9a7678e5c46651998fb7c0c83c9018b1","business_name":"Columbia Tax","_id":"01aa9791195e915093ee207518e6bf34","contact_middle_name":"Gina","zip_4":"98671","dbname":"efile_providers","contact_last_name":"A"},{"zip":"98188","phone":"303\/888-1040","contact_first_name":"CARL","street_address_2":"","street_address_1":"17600 PACIFIC HWY S","state":"WA","city":"SEATTLE","_rev":"1-976ac1c5ba46f59a42b57786af76e9b2","business_name":"NEXT DAY TAX CASH","_id":"01b21a8653b83789b561040887be7a28","contact_middle_name":"","zip_4":"","dbname":"efile_providers","contact_last_name":"PALMER"},{"zip":"98032","phone":"253\/852-6182","contact_first_name":"TOM","street_address_2":"# A-148","street_address_1":"1819 CENTRAL AVE S","state":"WA","city":"KENT","_rev":"1-1d0a9dbc409944bfc4618f598541a97f","business_name":"TAX GALLERY\/ TOM COKE ASSOCIATES","_id":"02051caa9f2faa9ca8386792c9653ff6","contact_middle_name":"C","zip_4":"","dbname":"efile_providers","contact_last_name":"ARMON"},{"zip":"98686","phone":"702\/320-0727","contact_first_name":"ARMOGAST","street_address_2":"","street_address_1":"14605 NE 20TH AVE","state":"WA","city":"VANCOUVER","_rev":"1-adf4be610b7fa43794d4d7dd3f8dc7de","business_name":"SUPREME BOOKKEEPING & TAX LLC.","_id":"0220b0609b83dd5621a90d9f7fe342ca","contact_middle_name":"J","zip_4":"","dbname":"efile_providers","contact_last_name":"MWASHIGHADI"},{"zip":"98665","phone":"360\/896-9897","contact_first_name":"GERALD","street_address_2":"","street_address_1":"7700 HWY 99","state":"WA","city":"VANCOUVER","_rev":"1-78234fab75e3e44d79648dab756a7791","business_name":"JACKSON HEWITT TAX SERVICE","_id":"02c2c5983ca8b4a00173dca208cc86de","contact_middle_name":"D","zip_4":"","dbname":"efile_providers","contact_last_name":"BREUNIG"},{"zip":"98531","phone":"360\/556-4906","contact_first_name":"David","street_address_2":"SUITE A","street_address_1":"417 W. MAIN ST.","state":"WA","city":"CENTRALIA","_rev":"1-fcb886c533f0a223f835397a0d5cf773","business_name":"Liberty Tax Service","_id":"02c466c8097ee00a1ae2d27aafd808aa","contact_middle_name":"C","zip_4":"","dbname":"efile_providers","contact_last_name":"Dunsmore"},{"zip":"98626","phone":"909\/849-1174","contact_first_name":"CINDY","street_address_2":"","street_address_1":"2640 ROBERT CT","state":"WA","city":"Kelso","_rev":"1-5513b1057c7882a7657a79a3e888b21d","business_name":"THE TAX WARD","_id":"032a14eaca19a1548364609fd480a1b9","contact_middle_name":"J","zip_4":"","dbname":"efile_providers","contact_last_name":"WARD"},{"zip":"98036","phone":"425\/774-6633","contact_first_name":"Mike","street_address_2":"","street_address_1":"20015 HIGHWAY 99","state":"WA","city":"LYNNWOOD","_rev":"1-d7f17073757afa70e869e543099a7bf5","business_name":"Mr Tax Man","_id":"0352d18145532a05714bfec2e1e649dd","contact_middle_name":"C","zip_4":"6073","dbname":"efile_providers","contact_last_name":"McKinnon"},{"zip":"98284","phone":"360\/595-9138","contact_first_name":"LAURA","street_address_2":"","street_address_1":"765 SUMERSET WAY","state":"WA","city":"SEDRO WOOLLEY","_rev":"1-b2ebc210bbf481c2ed37751a45a2249e","business_name":"CAIN LAKE TAX SERVICE","_id":"03cc2bcb150f981519a8d93093a015ca","contact_middle_name":"L","zip_4":"","dbname":"efile_providers","contact_last_name":"COZZA"},{"zip":"99350","phone":"509\/786-1269","contact_first_name":"ERNEST","street_address_2":"","street_address_1":"1002 LILLIAN","state":"WA","city":"PROSSER","_rev":"1-cd626358950c19bd2519859fbd50bbce","business_name":"E & R TAX SERVICE","_id":"03f620e0ae8ac148af79cb7848c3bf41","contact_middle_name":"W","zip_4":"","dbname":"efile_providers","contact_last_name":"TROEMEL"},{"zip":"99301","phone":"509\/851-8808","contact_first_name":"Aaron","street_address_2":"SUITE E","street_address_1":"5024 NORTH ROAD 68","state":"WA","city":"PASCO","_rev":"1-e9c9843292f9233f46e07628105ee72c","business_name":"Liberty Tax Service of West Pasco","_id":"03fe45d715110d0aa2d3022bbe7325e7","contact_middle_name":"J","zip_4":"","dbname":"efile_providers","contact_last_name":"Welles"},{"zip":"98329","phone":"253\/884-3566","contact_first_name":"ROY","street_address_2":"","street_address_1":"13215 139TH AVE KPN","state":"WA","city":"GIG HARBOR","_rev":"1-70d070d37c72fb13b9162ba4523ab70f","business_name":"MYR-MAR ACCOUNTING SERVICE INC","_id":"040ab87550b5a4a200c97d2e6a6b96a7","contact_middle_name":"M","zip_4":"","dbname":"efile_providers","contact_last_name":"KEIZUR"}]}
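
The externalId and luceneId parameters in the request above match the "_id" and "doc" values of one of the hits in the earlier search response, so the similar-documents URL can apparently be derived from a search hit. Here is a minimal Python sketch under that assumption; the function name similar_url is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080"  # Jetty address used throughout this post


def similar_url(hit, database, detailed=True):
    """Build the /api/search/similar URL for one search hit."""
    params = [
        ("externalId", hit["_id"]),   # CouchDB document id
        ("luceneId", hit["doc"]),     # document number reported in the search hit
        ("detailedResults", "true" if detailed else "false"),
    ]
    return "%s/api/search/similar/%s?%s" % (BASE_URL, database, urlencode(params))


# Values taken from the "Liberty Tax Service" hit in the search response.
hit = {"_id": "37a2a152ff120ac293ea67daac1a11aa", "doc": "811"}
url = similar_url(hit, "efile_providers")
print(url)
```

urlencode also escapes any characters in the ids that are not safe in a query string.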
 

Similar documents can also be found from the HTML interface by clicking on the similar link.

Conclusion

DocuSearch makes it easy to query documents stored in CouchDB. I have also started adding support for Berkeley DB, in case you prefer to use it: I found that CouchDB wastes a lot of space and is a bit slow, so Berkeley DB may be a better option for some. I also plan to add n-gram and stemming analyzers to create a better search experience. You are welcome to join the project: add yourself at http://code.google.com/p/docusearch/ or http://github.com/bhatti/DocuSearch/ and start contributing.
