Open Source of Knowledge: January 2010

I have just submitted a new Lucene component to the Apache Camel community. The component

builds a searchable index of documents when payloads are sent to the Lucene endpoint, and
performs indexed searches in Camel when the message header contains a QUERY.

http://issues.apache.org/activemq/browse/CAMEL-1472

The component works as follows.

Creating a Searchable Document Index in Lucene using Camel

context.addRoutes(new RouteBuilder() {
    public void configure() {
        from("direct:start")
            .to("lucene:stdQuotesIndex:insert?"
               + "analyzer=#stdAnalyzer"
               + "&indexDir=#std&srcDir=#load_dir")
            .to("mock:result");

    }
});

Each URI parameter does the following:

analyzer: any valid implementation of the Lucene Analyzer class (StandardAnalyzer, WhitespaceAnalyzer, StopAnalyzer, etc.)
indexDir: the directory in which the index is stored (target/stdindexDir in the registry example further down)
srcDir: an optional directory from which text or XML documents are loaded when the endpoint, and with it the Lucene index, is created. Once created, the index can take any exchange body and store its contents.

Important Note: Lucene stipulates that the index be created upfront and then used in read-only mode for any later querying, so the index cannot be in flux while queries are processed. This requires the Lucene producer to have received its payloads and created the index before any queries are run against it.
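To illustrate the build-first, query-later workflow described in the note, here is a minimal plain-Java sketch. It does not use Lucene or Camel: the ToyIndex class and its insert, seal and query methods are hypothetical names that only model the constraint as stated above, with a tiny in-memory inverted index standing in for the real one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy stand-in for an index; it does not use Lucene and all names are hypothetical.
final class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> documents = new ArrayList<>();
    private boolean sealed = false;

    // Phase 1: add documents while the index is still writable.
    void insert(String body) {
        if (sealed) {
            throw new IllegalStateException("Index is sealed; no further inserts");
        }
        int docId = documents.size();
        documents.add(body);
        for (String term : body.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Switch from the write phase to the read-only phase.
    void seal() {
        sealed = true;
    }

    // Phase 2: look up a single term; only allowed once the index is sealed.
    List<String> query(String term) {
        if (!sealed) {
            throw new IllegalStateException("Seal the index before querying");
        }
        List<String> hits = new ArrayList<>();
        Set<Integer> ids = postings.getOrDefault(
            term.toLowerCase(Locale.ROOT), Collections.emptySet());
        for (int docId : ids) {
            hits.add(documents.get(docId));
        }
        return hits;
    }
}
```

In this sketch, inserting after seal() or querying before it throws an IllegalStateException, which mirrors the requirement that all payloads reach the producer before the first query.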

Since these URI settings cannot be passed directly (they are object references or would break the URI format), I pass them through the JNDI registry associated with the Default Component (example shown below).

Providing URI values for Analyzer and Initial Load Directory


@Override
protected JndiRegistry createRegistry() 
   throws Exception {
   JndiRegistry registry =
       new JndiRegistry(createJndiContext());
   registry.bind("std", new File("target/stdindexDir"));
   registry.bind("load_dir",
       new File("src/test/resources/sources"));
   registry.bind("stdAnalyzer",
       new StandardAnalyzer(Version.LUCENE_CURRENT));

   return registry;
}
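The idea behind the #name values in the endpoint URI can be sketched in plain Java: an option value that starts with # is treated as a name and looked up in the registry, and anything else is kept as a literal string. The RegistryRefSketch class and its resolve method are hypothetical and exist only to illustrate the lookup; this is not Camel's own resolution code.

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not Camel's own resolution code.
final class RegistryRefSketch {

    // Splits "a=1&b=#x" into options and swaps "#x" for the object bound as "x".
    static Map<String, Object> resolve(String query, Map<String, Object> registry) {
        Map<String, Object> options = new LinkedHashMap<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Expected key=value but got: " + pair);
            }
            String key = pair.substring(0, eq);
            String value = pair.substring(eq + 1);
            if (value.startsWith("#")) {
                // Reference: replace the name with the bound object.
                String name = value.substring(1);
                Object bound = registry.get(name);
                if (bound == null) {
                    throw new IllegalArgumentException("Nothing bound in registry as: " + name);
                }
                options.put(key, bound);
            } else {
                // Literal: keep the string value as given.
                options.put(key, value);
            }
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, Object> registry = new LinkedHashMap<>();
        registry.put("std", new File("target/stdindexDir"));
        // indexDir resolves to the File object, maxHits stays the string "20".
        System.out.println(resolve("indexDir=#std&maxHits=20", registry));
    }
}
```

A plain Map stands in for the JNDI registry here, so the sketch runs without any Camel or Lucene classes on the classpath.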

I have also added a QueryEndpoint and a Query Processor that can run any query (including wildcards) against a Lucene document index and present the results in a serializable Hits object (see the example below).

Performing searches using a Query Endpoint


context.addRoutes(new RouteBuilder() {
   public void configure() {
            
     from("direct:start").
        setHeader("QUERY", constant("Seinfeld"))
        .to("lucene:searchIndex:query?"
            + "analyzer=#whitespaceAnalyzer"
            + "&indexDir=#whitespace"
            + "&maxHits=20")
        .to("direct:next");
            
     from("direct:next")
     .process(new Processor() {
        public void process(Exchange exchange)
           throws Exception {
           Hits hits = 
              exchange.getIn().getBody(Hits.class);
           printResults(hits);
        }

        private void printResults(Hits hits) {
           LOG.debug("Number of hits: " 
              + hits.getNumberOfHits());
           for (int i = 0; i < hits.getNumberOfHits(); i++) {
              LOG.debug("Hit " + i + " Index Location:" 
                 + hits.getHit().get(i).getHitLocation());
              LOG.debug("Hit " + i + " Score:"  
                 + hits.getHit().get(i).getScore());
              LOG.debug("Hit " + i + " Data:" 
                 + hits.getHit().get(i).getData());
           }
        }
     })
     .to("mock:searchResult");
   }
});


Sunday, January 3, 2010

Developed an Apache Lucene component in Camel to perform indexed searches on a route