Introduction to Lucene.Net

Lucene.NET is a port of Lucene, the well-known indexing and search library developed for the Java platform. Lucene.NET is not a search engine; it is a search engine library. It is a class library, not an executable, so you call Lucene functions from your own code to do the search.

To integrate Lucene, you have to build the following:

1) A searchable index. Lucene doesn't search your SQL Server database directly. Instead, it searches its own "database", its own index.
2) Code that sends the search query to Lucene.
3) Code that displays the results.

Indexing
The Indexer is a simple console application that finds all HTML files in a given directory and adds them to the index.
The index is recreated each time you run the Indexer.
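
The Indexer described above could look roughly like the following. This is a minimal sketch, not the original Indexer: it assumes the same Lucene.Net 2.x API used in the book example later in this post, and the names HtmlIndexer, IndexHtmlFiles, "path" and "contents" are hypothetical. It also indexes the raw file text, so HTML tags are not stripped.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

namespace IndexingHtml
{
    // Hypothetical helper, not part of Lucene.Net
    internal static class HtmlIndexer
    {
        // Indexes every *.html file under sourcePath and returns how many files were added.
        public static int IndexHtmlFiles(string sourcePath, string indexPath)
        {
            Directory.CreateDirectory(indexPath);

            // passing 'true' recreates the index on every run
            var writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
            int count = 0;
            try
            {
                foreach (string file in Directory.GetFiles(sourcePath, "*.html", SearchOption.AllDirectories))
                {
                    var doc = new Document();

                    // the path is stored so results can link to the file; it is indexed as a single term
                    doc.Add(new Field("path", file, Field.Store.YES, Field.Index.UN_TOKENIZED));

                    // the raw file text (markup included) is tokenized for searching but not stored
                    doc.Add(new Field("contents", File.ReadAllText(file), Field.Store.NO, Field.Index.TOKENIZED));

                    writer.AddDocument(doc);
                    count++;
                }
                writer.Optimize();
            }
            finally
            {
                writer.Close();
            }
            return count;
        }
    }
}
```

A console Main would only need to call HtmlIndexer.IndexHtmlFiles with the folder to scan and the index folder, then print the returned count.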

Searching
The Searcher is a sample ASP.NET application (C#) that uses the index generated by the Indexer to search the files.
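
The search side of such a page might be sketched as below. This is an illustration under assumptions, not the original Searcher: it assumes an index whose documents have a tokenized "contents" field and a stored "path" field, it uses the Lucene.Net 2.x API seen in the book example later in this post, and IndexSearchHelper is a hypothetical name.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

namespace SearchingHtml
{
    // Hypothetical helper that an ASP.NET page could call from its code-behind
    public static class IndexSearchHelper
    {
        // Returns the stored "path" value of every document matching queryText.
        public static List<string> Search(string indexPath, string queryText)
        {
            var results = new List<string>();

            // use the same kind of analyzer that was used to build the index
            var parser = new QueryParser("contents", new StandardAnalyzer());
            Query query = parser.Parse(queryText);

            var searcher = new IndexSearcher(indexPath);
            try
            {
                Hits hits = searcher.Search(query);
                for (int i = 0; i < hits.Length(); i++)
                {
                    results.Add(hits.Doc(i).Get("path"));
                }
            }
            finally
            {
                // release the index files
                searcher.Close();
            }
            return results;
        }
    }
}
```

The page would bind the returned list to a repeater or grid. Note that parser.Parse throws on malformed query syntax, so user input should be wrapped in a try/catch.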

The following C# console application shows a basic Lucene indexer that loops through the book nodes of an XML document and then searches the resulting index.
1) books.xml
<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
    <publish_date>2000-10-01</publish_date>
    <description>
      An in-depth look at creating applications
      with XML.
    </description>
  </book>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-12-16</publish_date>
    <description>
      A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.
    </description>
  </book>
  <book id="bk103">
    <author>Corets, Eva</author>
    <title>Maeve Ascendant</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-11-17</publish_date>
    <description>
      After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.
    </description>
  </book>
</catalog>

2) Program.cs
using System;
using System.Globalization;
using System.IO;
using System.Xml;
using System.Xml.XPath;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

namespace IndexingBooks
{
    internal class Program
    {
        private static void Main(string[] args)
        {
            // create a directory to store the index in
            string indexPath = @"c:\LuceneSampleCatalog";
            Directory.CreateDirectory(indexPath);

            // index the books
            IndexBooks(indexPath);

            // the author field is un-tokenized, so KeywordAnalyzer keeps the whole name as one term
            Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.KeywordAnalyzer();
            var parser = new QueryParser("title", analyzer);

            // create a query which searches the author field for an exact name from books.xml
            var customQuery = string.Format("{0}:\"{1}\"", "author", "Ralls, Kim");
            Lucene.Net.Search.Query query = parser.Parse(customQuery);

            // open the index directory (false = do not create a new index)
            Lucene.Net.Store.Directory directory = Lucene.Net.Store.FSDirectory.GetDirectory(new System.IO.FileInfo(indexPath), false);

            // create searcher
            Lucene.Net.Search.Searcher searcher = new Lucene.Net.Search.IndexSearcher(Lucene.Net.Index.IndexReader.Open(directory));

            try
            {
                // perform the search
                var hits = searcher.Search(query);

                Console.WriteLine("Number of hits: " + hits.Length());

                // get the stored fields from each hit and print them
                for (int i = 0; i < hits.Length(); i++)
                {
                    var doc = hits.Doc(i);
                    var author = doc.Get("author");
                    var id = doc.Get("id");
                    Console.WriteLine(id + ": " + author);
                }
            }
            finally
            {
                // release the index files
                searcher.Close();
            }
        }

        private static void IndexBooks(string indexPath)
        {
            DateTime startIndexing = DateTime.Now;
            Console.WriteLine("start indexing at: " + startIndexing);

            // read in the books xml
            var booksXml = new XmlDocument();
            booksXml.Load("books.xml");

            // create the indexer with a standard analyzer; 'true' recreates the index
            var indexWriter = new IndexWriter(indexPath, new StandardAnalyzer(), true);
            int documentCount = 0;

            try
            {
                // loop through all the books in the books.xml
                foreach (XPathNavigator book in booksXml.CreateNavigator().Select("//book"))
                {
                    // create a Lucene document for this book
                    var bookDocument = new Document();

                    // add the ID as a stored, un-tokenized field: indexed as a single term
                    bookDocument.Add(new Field("id", book.GetAttribute("id", string.Empty), Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.NO));

                    // add the author and genre as stored and un-tokenized fields, the value is indexed as is
                    bookDocument.Add(new Field("author", book.SelectSingleNode("author").Value, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.NO));
                    bookDocument.Add(new Field("genre", book.SelectSingleNode("genre").Value, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.NO));

                    // add the title and description as stored and tokenized fields, the analyzer processes the content
                    bookDocument.Add(new Field("title", book.SelectSingleNode("title").Value, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO));
                    bookDocument.Add(new Field("description", book.SelectSingleNode("description").Value, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO));

                    // add the publication date as a stored and un-tokenized field;
                    // DateField.DateToString produces a string that sorts in date order
                    DateTime publicationDate = DateTime.Parse(book.SelectSingleNode("publish_date").Value, CultureInfo.InvariantCulture);
                    bookDocument.Add(new Field("publicationDate", DateField.DateToString(publicationDate), Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.NO));

                    // add the document to the index
                    indexWriter.AddDocument(bookDocument);
                }

                // merge the index segments so that searches run faster
                indexWriter.Optimize();

                // read the document count while the writer is still open
                documentCount = indexWriter.DocCount();
            }
            finally
            {
                // close the index writer
                indexWriter.Close();
            }

            DateTime endIndexing = DateTime.Now;
            Console.WriteLine("end indexing at: " + endIndexing);
            Console.WriteLine("Duration: " + (endIndexing - startIndexing).TotalSeconds + " seconds");
            Console.WriteLine("Number of indexed documents: " + documentCount);
        }
    }
}
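
The publicationDate field is written with DateField.DateToString, which turns a date into a string whose alphabetical order follows date order. That is what makes a date range search possible. The following is a minimal sketch under assumptions: it targets the same Lucene.Net 2.x API as the program above (RangeQuery and DateField were replaced in later versions), it expects the index built by IndexBooks to exist already, and the names DateRangeSearch and CountPublishedBetween are hypothetical.

```csharp
using System;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

namespace IndexingBooks
{
    // Hypothetical helper, not part of the original sample
    internal static class DateRangeSearch
    {
        // Prints and counts the books whose publicationDate lies between start and end (inclusive).
        public static int CountPublishedBetween(string indexPath, DateTime start, DateTime end)
        {
            // encode both bounds exactly as the indexer encoded the field values
            var lower = new Term("publicationDate", DateField.DateToString(start));
            var upper = new Term("publicationDate", DateField.DateToString(end));

            // 'true' makes both bounds inclusive
            Query query = new RangeQuery(lower, upper, true);

            var searcher = new IndexSearcher(indexPath);
            try
            {
                Hits hits = searcher.Search(query);
                for (int i = 0; i < hits.Length(); i++)
                {
                    var doc = hits.Doc(i);
                    Console.WriteLine(doc.Get("id") + ": " + doc.Get("title"));
                }
                return hits.Length();
            }
            finally
            {
                searcher.Close();
            }
        }
    }
}
```

With the three sample books, a call such as DateRangeSearch.CountPublishedBetween(@"c:\LuceneSampleCatalog", new DateTime(2000, 11, 1), new DateTime(2000, 12, 31)) would be expected to match bk102 and bk103 and skip bk101.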

Wednesday, December 11, 2013
