Sunday, January 3, 2010

Developed an Apache Lucene component in Camel to perform indexed searches on a route

I have just submitted a new Lucene Component to the Apache Camel community that:
  • builds a searchable index of documents when payloads are sent to the Lucene Endpoint
  • performs indexed searches in Camel when the message carries a QUERY header.

http://issues.apache.org/activemq/browse/CAMEL-1472

The component works as follows.

Creating a Searchable Document Index in Lucene using Camel
context.addRoutes(new RouteBuilder() {
    public void configure() {
        from("direct:start")
            .to("lucene:stdQuotesIndex:insert?"
                + "analyzer=#stdAnalyzer"
                + "&indexDir=#std&srcDir=#load_dir")
            .to("mock:result");
    }
});


where the URI parameters do the following:

  • analyzer: any valid implementation of the Lucene Analyzer class (StandardAnalyzer, WhitespaceAnalyzer, StopAnalyzer, etc.)
  • indexDir: the directory that holds the Lucene index (bound to target/stdindexDir in the registry example)
  • srcDir: an optional directory from which text or XML documents are loaded when the endpoint (and its Lucene index) is created. Once created, the index can take any exchange body and store its contents.


Important Note: Lucene stipulates that the index be created up front and then used in read-only mode for querying, so the index cannot be in flux while queries are being processed. The Lucene Producer must therefore have received its payloads and built the index before any queries are run against it.
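
The ordering described in this note (insert everything first, then query) can be sketched with a toy in-memory index. This is only an illustration of the build-then-query sequence. It does not use Lucene or the Camel component, and the class name TwoPhaseIndexSketch and its methods are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index (NOT Lucene): inserts are allowed only before seal(), queries only after. */
final class TwoPhaseIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private boolean sealed = false;
    private int nextDocId = 0;

    /** Build phase: tokenize the text and record which document each token appears in. */
    int add(String text) {
        if (sealed) {
            throw new IllegalStateException("index is sealed; no further inserts");
        }
        int id = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    /** Ends the build phase; from here on the index is read-only. */
    void seal() {
        sealed = true;
    }

    /** Query phase: returns the ids of documents containing the term (case-insensitive). */
    Set<Integer> query(String term) {
        if (!sealed) {
            throw new IllegalStateException("seal the index before querying");
        }
        Set<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null
            ? Collections.<Integer>emptySet()
            : Collections.unmodifiableSet(ids);
    }
}
```

In route terms, add() stands in for the payloads sent to the insert endpoint and query() for a later message carrying a QUERY header; calling them out of order fails fast instead of reading a half-built index.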

Because these URI settings cannot be passed inline (they are object references, or values that would break the URI format), I pass them through the JNDI registry associated with the Default Component (example shown below).

Providing URI values for Analyzer and Initial Load Directory

@Override
protected JndiRegistry createRegistry() throws Exception {
    JndiRegistry registry = new JndiRegistry(createJndiContext());
    registry.bind("std", new File("target/stdindexDir"));
    registry.bind("load_dir", new File("src/test/resources/sources"));
    registry.bind("stdAnalyzer", new StandardAnalyzer(Version.LUCENE_CURRENT));

    return registry;
}
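
To make the "#name" convention concrete, here is a minimal sketch of how a URI value such as #std can be swapped for the object bound under that name. It uses a plain Map as a stand-in registry; the class RegistryRefSketch and its methods are hypothetical and are not Camel's actual lookup code.

```java
import java.io.File;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy illustration of "#name" lookups; a Map stands in for the JNDI registry. */
final class RegistryRefSketch {

    /** Splits "k1=v1&k2=v2" into an ordered map of raw string values. */
    static Map<String, String> parseQuery(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    /** Replaces each value of the form "#name" with the object bound under "name". */
    static Map<String, Object> resolve(Map<String, String> params,
                                       Map<String, Object> registry) {
        Map<String, Object> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            String value = entry.getValue();
            if (value.startsWith("#")) {
                String name = value.substring(1);
                Object bound = registry.get(name);
                if (bound == null) {
                    throw new IllegalArgumentException("No registry entry for " + name);
                }
                resolved.put(entry.getKey(), bound);
            } else {
                // Plain values (for example maxHits=20) are kept as strings.
                resolved.put(entry.getKey(), value);
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, Object> registry = new HashMap<>();
        registry.put("std", new File("target/stdindexDir"));
        registry.put("load_dir", new File("src/test/resources/sources"));

        Map<String, Object> resolved =
            resolve(parseQuery("indexDir=#std&srcDir=#load_dir&maxHits=20"), registry);
        System.out.println(resolved.get("indexDir") instanceof File); // true
        System.out.println(resolved.get("maxHits"));                  // 20
    }
}
```

The point of the indirection is that the URI stays a plain string while File and Analyzer instances travel through the registry.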


I have also added a QueryEndpoint and a Query Processor that can run queries (including wildcard queries) against a Lucene document index and present the results in a serialized Hits object (see the example below).

Performing searches using a Query Endpoint

context.addRoutes(new RouteBuilder() {
    public void configure() {

        // Run the query held in the QUERY header against the index
        from("direct:start")
            .setHeader("QUERY", constant("Seinfeld"))
            .to("lucene:searchIndex:query?"
                + "analyzer=#whitespaceAnalyzer"
                + "&indexDir=#whitespace"
                + "&maxHits=20")
            .to("direct:next");

        // Unpack the Hits object and log each result
        from("direct:next")
            .process(new Processor() {
                public void process(Exchange exchange) throws Exception {
                    Hits hits = exchange.getIn().getBody(Hits.class);
                    printResults(hits);
                }

                private void printResults(Hits hits) {
                    LOG.debug("Number of hits: " + hits.getNumberOfHits());
                    for (int i = 0; i < hits.getNumberOfHits(); i++) {
                        LOG.debug("Hit " + i + " Index Location:"
                            + hits.getHit().get(i).getHitLocation());
                        LOG.debug("Hit " + i + " Score:"
                            + hits.getHit().get(i).getScore());
                        LOG.debug("Hit " + i + " Data:"
                            + hits.getHit().get(i).getData());
                    }
                }
            })
            .to("mock:searchResult");
    }
});
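
The result-printing logic can be tried without a Lucene index by using stand-in classes that mirror the accessors called in the route (getNumberOfHits, getHit, getHitLocation, getScore, getData). The sketch below is an assumption-laden illustration: HitSketch, HitsSketch and HitsReport are hypothetical names, and the field types (int, float, String) are guesses rather than the component's real signatures.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a single hit; the field types are assumptions. */
final class HitSketch {
    private final int hitLocation;
    private final float score;
    private final String data;

    HitSketch(int hitLocation, float score, String data) {
        this.hitLocation = hitLocation;
        this.score = score;
        this.data = data;
    }

    int getHitLocation() { return hitLocation; }
    float getScore() { return score; }
    String getData() { return data; }
}

/** Hypothetical stand-in for the Hits result: a hit count plus a list of hits. */
final class HitsSketch {
    private final List<HitSketch> hit = new ArrayList<>();

    void add(HitSketch h) { hit.add(h); }
    int getNumberOfHits() { return hit.size(); }
    List<HitSketch> getHit() { return hit; }
}

/** Builds the same lines the route logs, but returns them so they can be inspected. */
final class HitsReport {
    static List<String> describe(HitsSketch hits) {
        List<String> lines = new ArrayList<>();
        lines.add("Number of hits: " + hits.getNumberOfHits());
        for (int i = 0; i < hits.getNumberOfHits(); i++) {
            HitSketch h = hits.getHit().get(i);
            lines.add("Hit " + i + " Index Location:" + h.getHitLocation());
            lines.add("Hit " + i + " Score:" + h.getScore());
            lines.add("Hit " + i + " Data:" + h.getData());
        }
        return lines;
    }
}
```

Returning the lines instead of logging them keeps the formatting testable; in the route itself each line would go to LOG.debug.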

5 comments:

Unknown said...

Are you sure about:
"Lucene stipulates that the index be created upfront and then used in a read only mode later for any querying. Hence the index cannot be in flux during query processing."

Ashwin Karpe said...
This comment has been removed by the author.
Ashwin Karpe said...

Hi Otis,

Pretty sure. I am attaching a link for you to check out.

Lucene Directory JavaDoc

It states the following with regard to Lucene index directories "Files may be written once, when they are created. Once a file is created it may only be opened for read, or deleted."

My guess is that indexing of structures is an expensive operation performed on the entire data set consisting of multiple documents and may interfere with ongoing reads and writes.

Hope this helps.

Epnk said...

Thanks for the Lucene component!

One question: I'm more familiar with Solr paired with Lucene rather than Lucene by itself. How does this component "connect", if you will, to the Lucene index? Solr presents a server environment (IP:port, and a protocol like HTTP). I'm not sure how Lucene by itself works with this.

Anonymous said...

Hi Ashwin,

I know it's been a while since you posted this. Is the entire payload indexed into one field? Is there a way to have an exchange indexed with multiple fields?