How do I use Lucene to index and search text files?
Author: Deron Eriksson
Description: This Java tutorial shows how to use Lucene to create an index based on text files in a directory and search that index.
Tutorial created using:
Windows XP || JDK 1.5.0_09 || Eclipse Web Tools Platform 2.0 (Eclipse 3.3.0)
If you'd like to add customized search capabilities to an application, Lucene can be a great choice. In fact, Eclipse uses Lucene for its own search capabilities. One good way to become familiar with Lucene is to start with a simple application. In this tutorial, I'll create an index based on text files in a directory, and then I'll perform several searches on that index for various search terms.

What is an index? An index is similar to the index at the back of a book, where you can look up search terms and find their corresponding pages. Likewise, when we create an index based on documents, we can query the index to find out which documents match our search terms. This example will both create an index and perform searches against the index. These are conceptually two different tasks.

The demonstration project's structure is shown here. We have a directory called "filesToIndex" that contains the text files we are going to index, and a directory called "indexDirectory" that will hold the index we create. The project uses the lucene-core JAR file and has one class, LuceneDemo. Since Lucene is a fairly involved API, it can be a good idea to reference the Lucene source code and javadocs in your project build path, as shown here.

Two text files in the "filesToIndex" directory will be indexed. The first one, deron-foods.txt, lists some foods that I like.

deron-foods.txt
Here are some foods that Deron likes:
hamburger
french fries
steak
mushrooms
artichokes

The second text file, nicole-foods.txt, lists some foods that Nicole likes.

nicole-foods.txt
Here are some foods that Nicole likes:
apples
bananas
salad
mushrooms
cheese

The LuceneDemo class is shown below. The first thing it does is create an index via its createIndex() method. An IndexWriter object is used to create and update the index. IndexWriter has several constructors; I used a constructor that takes three arguments. The first argument is the directory location in the file system where the index files should be located. The second argument is a StandardAnalyzer object; an analyzer represents the rules for extracting index terms from text. The third argument is a boolean set to true, which tells the IndexWriter to rebuild the index from scratch if it already exists.

Next, we go through the files in the "filesToIndex" directory. For each file, we create a Lucene Document object, which is a collection of fields that can represent the content, metadata, and other data related to a document. We create two fields and add them to the document. The first field stores the canonical path to each text file in the index. We tell Lucene to store it in the index via the Field.Store.YES argument, and we tell it not to tokenize the path via the Field.Index.UN_TOKENIZED argument, so that the path stays whole and doesn't get chopped up by the Analyzer. The second field represents the contents of the file. The contents get tokenized and indexed, but they do not get stored in the index, because we used a Reader as an argument to the Field constructor. If you'd like to store the contents in the index, you need to use a String rather than a Reader, as sketched below. Each Document object is added to the index via the IndexWriter's addDocument() method. After the documents have been added to the index, the index is optimized and then closed. Our index has been created, and now we can search it!
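As a side note, here is a minimal sketch of that String-based alternative, using the same Lucene 2.x Field API as the listing below. The readFileAsString() helper is hypothetical and simply stands in for reading the file's text into a String.

// Hypothetical variant of the createIndex() loop body that stores the contents
// in the index: pass the text as a String, store it, and let it be tokenized.
String contents = readFileAsString(file); // hypothetical helper: read the whole file into a String
document.add(new Field(FIELD_CONTENTS, contents, Field.Store.YES, Field.Index.TOKENIZED));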
LuceneDemo.java

package avajava;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class LuceneDemo {

    public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
    public static final String INDEX_DIRECTORY = "indexDirectory";

    public static final String FIELD_PATH = "path";
    public static final String FIELD_CONTENTS = "contents";

    public static void main(String[] args) throws Exception {
        createIndex();
        searchIndex("mushrooms");
        searchIndex("steak");
        searchIndex("steak AND cheese");
        searchIndex("steak and cheese");
        searchIndex("bacon OR cheese");
    }

    public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
        Analyzer analyzer = new StandardAnalyzer();
        // Recreate the index from scratch each time this demo runs.
        boolean recreateIndexIfExists = true;
        IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer, recreateIndexIfExists);
        File dir = new File(FILES_TO_INDEX_DIRECTORY);
        File[] files = dir.listFiles();
        for (File file : files) {
            Document document = new Document();

            // Store the file's canonical path in the index, without tokenizing it.
            String path = file.getCanonicalPath();
            document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.UN_TOKENIZED));

            // Tokenize and index the file's contents; a Reader-based field is not stored.
            Reader reader = new FileReader(file);
            document.add(new Field(FIELD_CONTENTS, reader));

            indexWriter.addDocument(document);
        }
        indexWriter.optimize();
        indexWriter.close();
    }

    public static void searchIndex(String searchString) throws IOException, ParseException {
        System.out.println("Searching for '" + searchString + "'");
        // Open the index for reading and searching.
        Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
        IndexReader indexReader = IndexReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        // Parse the search string into a Query against the "contents" field.
        Analyzer analyzer = new StandardAnalyzer();
        QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
        Query query = queryParser.parse(searchString);
        Hits hits = indexSearcher.search(query);
        System.out.println("Number of hits: " + hits.length());

        // Print the path stored with each matching document.
        Iterator<Hit> it = hits.iterator();
        while (it.hasNext()) {
            Hit hit = it.next();
            Document document = hit.getDocument();
            String path = document.get(FIELD_PATH);
            System.out.println("Hit: " + path);
        }
    }

}

Let's look at LuceneDemo's searchIndex() method, which takes a searchString parameter. We get the directory containing the index via a call to FSDirectory.getDirectory(). We get an IndexReader to read the index, and then we create an IndexSearcher object using the IndexReader. The IndexSearcher object is used to, not surprisingly, search the index. Next, we need to take our searchString and create a Query object. To do this, we create a QueryParser, specifying that we'd like to search the "contents" field. We use a StandardAnalyzer to analyze the text in searchString for search terms.
We obtain a Query object by calling the parse() method on our QueryParser object with the searchString as an argument. Now we can perform our search. We do this by calling the search() method of our IndexSearcher with the Query object as an argument. We get back a Hits object, which represents the positive search results. We iterate over the Hits, obtaining each individual Hit object. We get a Document from each Hit object and then get the "path" field from the Document, which is the path to the file in the file system.
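Since the path field was indexed un-tokenized, it can only be matched as a whole value rather than through the QueryParser's analysis. As a side note, here is a minimal sketch of looking a document up by its exact stored path, assuming the same Lucene 2.x API used above; it would additionally need imports for org.apache.lucene.index.Term and org.apache.lucene.search.TermQuery, and the exactPath value is a hypothetical canonical path.

// Hypothetical helper: find the document whose un-tokenized "path" field
// exactly matches the given canonical path. No QueryParser or Analyzer is
// used, because the path was indexed as a single whole term.
public static void searchByPath(String exactPath) throws IOException {
    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
    IndexSearcher indexSearcher = new IndexSearcher(IndexReader.open(directory));
    Query query = new TermQuery(new Term(FIELD_PATH, exactPath));
    Hits hits = indexSearcher.search(query);
    System.out.println("Documents with path '" + exactPath + "': " + hits.length());
}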