For a project we’re currently working on, we needed to search on several fields that are shown to the end user. We didn’t want to rely fully on the full-text search capabilities of SQL Server. Luckily, I knew the Lucene.NET engine from some previous work with the Umbraco CMS.

We only wanted our search engine to return the IDs of the objects in which the search term was found, rather than complete objects. When working with a database, it just doesn’t feel right to store all your data in two different places (the database and the search index). So we add only the fields we want to search on to the index and let Lucene decide what matches; afterwards we pull the objects from the database using Entity Framework.
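As a rough illustration of that last step, here is a minimal sketch of turning a ranked list of IDs back into objects. The Article entity and the LoadInSearchOrder helper are hypothetical names made up for this example, and an in-memory IQueryable stands in for the Entity Framework set.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity; in a real project this would be an Entity Framework entity.
public class Article
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class SearchHydration
{
    // Takes the IDs returned by the search engine (ordered by relevance)
    // and returns the matching objects in that same order.
    public static List<Article> LoadInSearchOrder(IEnumerable<int> rankedIds, IQueryable<Article> source)
    {
        var ids = rankedIds.ToList();

        // Fetch all matching rows in one query and index them by Id.
        var byId = source.Where(a => ids.Contains(a.Id)).ToDictionary(a => a.Id);

        // Restore the ranking and skip IDs that no longer exist in the
        // data store (stale entries in the search index).
        return ids.Where(byId.ContainsKey).Select(id => byId[id]).ToList();
    }
}
```

The database decides nothing about relevance here: the order comes from the search engine, and IDs that are missing from the data store are simply dropped.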

Setting up Lucene.NET

Setting up Lucene.NET is not that difficult. You can find a lot of blog posts on how to use it (like Introducing Lucene.NET on Code Project), and with a NuGet package available it’s easy to add the engine to your project.

I followed a few of those blog posts. Of course these posts only give you a starting point, and for simplicity all code is written in one or two classes. After the first implementation I started refactoring and decided it would be easier for future implementations if I created my own search DLL or package.

At the bottom you’ll find the link to the GitHub repository where you can browse and download the package.

Documents

If you want to be able to search on a value in Lucene, you have to add it as a field to a Lucene document. To simplify the creation of such documents I’ve created an abstract class, ADocument, that you have to inherit from.

You’ll see that I’ve added the public property “Id”, so every class that inherits from ADocument has that Id. In the setter of the Id I add the field to the Lucene.NET Document with the AddParameterToDocument method.

Next to the private AddParameterToDocument method you’ll find two protected methods: one that stores the value in the Lucene index, and one that only analyzes the value without storing it.

using System.Linq;
using System.Reflection;
using System.Text;
using Lucene.Net.Documents;

namespace LuceneWrapper
{
    /// <summary>
    /// Abstract class as base for all Search Documents
    /// </summary>
    public abstract class ADocument : IDocument
    {
        private int id;

        /// <summary>
        /// The ID of the item
        /// </summary>
        [SearchField]
        public int Id
        {
            set
            {
                id = value;
                // The Id is stored as-is (not analyzed) so it can be matched exactly
                AddParameterToDocument("Id", id, Field.Store.YES, Field.Index.NOT_ANALYZED);
            }
            get { return id; }
        }

        private readonly Document document;

        /// <summary>
        /// The Lucene Document
        /// </summary>
        public Document Document
        {
            get { return document; }
        }

        /// <summary>
        /// Constructor
        /// </summary>
        protected ADocument()
        {
            document = new Document();
        }

        /// <summary>
        /// Method to add parameters to the Lucene Document
        /// Only parameters that are added to the Lucene document can be searched on
        /// </summary>
        /// <param name="name">The name of the parameter</param>
        /// <param name="value">The value of the parameter</param>
        /// <param name="store">The Store setting</param>
        /// <param name="index">The Index setting</param>
        private void AddParameterToDocument(string name, dynamic value, Field.Store store, Field.Index index)
        {
            document.Add(new Field(name, value.ToString(), store, index));
        }

        // Analyzed and stored: searchable, and the value can be read back from the index
        protected void AddParameterToDocumentStoreParameter(string name, dynamic value)
        {
            AddParameterToDocument(name, value, Field.Store.YES, Field.Index.ANALYZED);
        }

        // Analyzed but not stored: searchable only
        protected void AddParameterToDocumentNoStoreParameter(string name, dynamic value)
        {
            AddParameterToDocument(name, value, Field.Store.NO, Field.Index.ANALYZED);
        }
    }
}
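To make the difference between the stored and the non-stored variant more tangible, here is a small standalone sketch. It assumes the Lucene.Net 3.0.3 NuGet package; the StoreDemo class and its FirstHitValues method are made up for this example, and an in-memory RAMDirectory is used instead of the file system.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

public static class StoreDemo
{
    // Indexes one document and returns { Id, LastName } as read back
    // from the first search hit (null when there is no hit).
    public static string[] FirstHitValues()
    {
        var directory = new RAMDirectory(); // in-memory index, only for this demo
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var doc = new Document();
            // Stored and not analyzed, like the Id in ADocument
            doc.Add(new Field("Id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
            // Analyzed but not stored, like AddParameterToDocumentNoStoreParameter
            doc.Add(new Field("LastName", "De Meyer", Field.Store.NO, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }

        using (var searcher = new IndexSearcher(directory, true))
        {
            // StandardAnalyzer lowercases tokens, so the indexed term is "meyer"
            var query = new TermQuery(new Term("LastName", "meyer"));
            var hits = searcher.Search(query, 10).ScoreDocs;
            if (hits.Length == 0)
            {
                return null;
            }

            var found = searcher.Doc(hits[0].Doc);
            // Get returns null for a field that was not stored
            return new[] { found.Get("Id"), found.Get("LastName") };
        }
    }
}
```

The document should be found through its LastName field, yet only the Id can be read back from the hit. That is exactly what we want when the index only has to hand us IDs.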

SearchField

You’ll see in the ADocument class that I’ve decorated the “Id” with the “SearchField” attribute. This attribute was created to simplify the search on multiple fields that you’ll see when implementing the BaseSearcher. You can find the implementation of this attribute below.

/// <summary>
/// Custom attribute to define the field that can be searched on
/// </summary>
[System.AttributeUsage(System.AttributeTargets.Field | System.AttributeTargets.Property)]
public class SearchField : System.Attribute
{
    // Names of other indexed fields that must be searched together with the decorated one
    public string[] CombinedSearchFields;

    public SearchField(params string[] values)
    {
        this.CombinedSearchFields = values;
    }
}

BaseSearch

Both the writer and the searcher need access to the index that is written to the local file system. To avoid duplicating that code (check if the directory exists, load an FSDirectory object, …) I’ve created a BaseSearch class. There’s not much to see in this class, just some basic settings for the folder where the index is stored.

using System.IO;
using Lucene.Net.Store;

namespace LuceneWrapper
{
    /// <summary>
    /// The abstract base class to be implemented by everything that uses the Lucene directory
    /// </summary>
    public abstract class BaseSearch
    {
        private const string LuceneIndexFolder = "LuceneIndex";
        private readonly FSDirectory luceneDirectory;
        private readonly string dataFolder;

        /// <summary>
        /// The App Data folder - or the folder where the lucene folder is placed under
        /// </summary>
        public string DataFolder
        {
            // Return the backing field; returning the property itself would recurse endlessly
            get { return dataFolder; }
        }

        /// <summary>
        /// The App Data folder - or the folder where the lucene folder is placed under as FSDirectory object
        /// </summary>
        public FSDirectory LuceneDirectory
        {
            get { return luceneDirectory; }
        }

        /// <summary>
        /// Constructor that will initialise the LuceneDirectory
        /// </summary>
        /// <param name="dataFolder">The App Data folder - or the folder where the lucene folder is placed under</param>
        protected BaseSearch(string dataFolder)
        {
            this.dataFolder = dataFolder;
            var di = new DirectoryInfo(Path.Combine(dataFolder, LuceneIndexFolder));
            if (!di.Exists)
            {
                di.Create();
            }
            luceneDirectory = FSDirectory.Open(di.FullName);
        }
    }
}

BaseWriter

Before we can search we’ll have to add some items to the search index, of course. Following the DRY principle we’ve created a new abstract base class, the “BaseWriter”. This class contains the methods to add new items of type ADocument or update existing ones, plus the corresponding delete methods. Because we delete everything based on the Id property of the ADocument implementation, it’s important that its value is always set in the deriving class.

using System.Collections.Generic;
using System.Linq;
using log4net;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Util;

namespace LuceneWrapper
{
    /// <summary>
    /// Base abstract class that every Writer should implement
    /// </summary>
    public abstract class BaseWriter : BaseSearch
    {
        private static readonly ILog Log = LogManager.GetLogger(typeof(BaseWriter));

        /// <summary>
        /// Constructor
        /// </summary>
        /// <param name="dataFolder"></param>
        protected BaseWriter(string dataFolder) : base(dataFolder)
        {
            Log.DebugFormat("Initialisation Writer with folder {0}", dataFolder);
        }

        /// <summary>
        /// Private helper to add an item to the Index
        /// </summary>
        /// <param name="doc">A ADocument type, representing the values that have to be added to the index</param>
        /// <param name="writer">The Lucene writer</param>
        private void AddItemToIndex(ADocument doc, IndexWriter writer)
        {
            Log.DebugFormat("Adding document to index: Type {0}", doc.GetType());
            // Remove any existing document with the same Id first, so an add also works as an update
            var query = new TermQuery(new Term("Id", doc.Id.ToString()));
            writer.DeleteDocuments(query);
            writer.AddDocument(doc.Document);
        }

        /// <summary>
        /// Adds or update items in the Lucene index
        /// </summary>
        /// <param name="docs">The documents that have to be updated or added in the database</param>
        protected void AddUpdateItemsToIndex(IEnumerable<ADocument> docs)
        {
            Log.DebugFormat("Adding {0} items to index", docs.Count());
            var standardAnalyzer = new StandardAnalyzer(Version.LUCENE_30);
            using (var writer = new IndexWriter(LuceneDirectory, standardAnalyzer, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                foreach (var doc in docs)
                {
                    Log.DebugFormat("Adding item to index: {0}: ", doc);
                    AddItemToIndex(doc, writer);
                }
            }
            // The using block disposes the writer; close the analyzer once the writer is done
            standardAnalyzer.Close();
        }

        /// <summary>
        /// Private helper to delete an item from the index
        /// </summary>
        /// <param name="doc">The document representing the item that has to be deleted</param>
        /// <param name="writer">The Lucene writer</param>
        private void DeleteItemFromIndex(ADocument doc, IndexWriter writer)
        {
            Log.DebugFormat("Deleting item {0} from index", doc);
            var query = new TermQuery(new Term("Id", doc.Id.ToString()));
            writer.DeleteDocuments(query);
        }

        /// <summary>
        /// Deletes items from the Lucene index
        /// </summary>
        /// <param name="docs"></param>
        protected void DeleteItemsFromIndex(IEnumerable<ADocument> docs)
        {
            Log.DebugFormat("Deleting {0} items from index", docs.Count());
            var standardAnalyzer = new StandardAnalyzer(Version.LUCENE_30);
            using (var writer = new IndexWriter(LuceneDirectory, standardAnalyzer, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                foreach (var doc in docs)
                {
                    Log.DebugFormat("Deleting item from index: {0}", doc);
                    DeleteItemFromIndex(doc, writer);
                }
            }
            standardAnalyzer.Close();
        }

        /// <summary>
        /// Optimizes the Lucene Index
        /// </summary>
        protected void Optimize()
        {
            Log.Debug("optimizing Lucene search index");
            var standardAnalyzer = new StandardAnalyzer(Version.LUCENE_30);
            using (var writer = new IndexWriter(LuceneDirectory, standardAnalyzer, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                writer.Optimize();
            }
            standardAnalyzer.Close();
        }
    }
}

NOTE: the log messages are written with log4net.

BaseSearcher

Adding items to the Lucene index wasn’t that hard. When I first started to implement the search methods I wanted to be able to search on a specific property or on all properties available on that object.

To be able to search on all properties you have to use a MultiFieldQueryParser instead of the default QueryParser. But with the MultiFieldQueryParser you have to pass in all the fields that Lucene has to search on. I really didn’t want to implement the same search code for every type we add to the index.

We have no choice but to turn to reflection to fetch all the properties. But the class could have more properties than the ones we are searching on. To skip these extra properties (and avoid the errors caused by searching on properties that are not indexed) I’ve added the SearchField attribute.

So the first thing we do when searching is fetch all properties from the class and add the decorated ones to a list. By using the T parameter we can easily reuse the same method for different classes, as long as they inherit from ADocument.

protected SearchResult Search<T>(string field, string searchQuery) where T : IDocument
{
    Log.DebugFormat("Searching for Type: {0} with query \"{1}\" for field \"{2}\"", typeof(T), searchQuery, field);

    //Fetch the possible fields to search on
    PropertyInfo[] properties = typeof(T).GetProperties();
    var fields = new List<string>();
    var fieldsToSearchOn = new List<string>();
    foreach (PropertyInfo property in properties)
    {
        var attributes = property.GetCustomAttributes(true);
        foreach (var o in attributes)
        {
            var attr = o as SearchField;
            if (attr != null)
            {
                fields.Add(property.Name);
                if (attr.CombinedSearchFields.Any() && field == property.Name)
                {
                    fieldsToSearchOn.Add(property.Name);
                    for (int i = 0; i < attr.CombinedSearchFields.Count(); i++)
                    {
                        fieldsToSearchOn.Add(attr.CombinedSearchFields[i]);
                    }
                }
                else if (field == property.Name)
                {
                    fieldsToSearchOn.Add(property.Name);
                }
            }
        }
    }

With this list we can now implement the actual searching. If we search on a specific field we use the default QueryParser, and if we are searching on more than one field, the MultiFieldQueryParser.

When you look into the code, you’ll see that even when searching on a specific field, there’s still a possibility that the MultiFieldQueryParser is used. I’ve added the option to search on multiple parameters in the Lucene index if they are related to each other.

For example: you have a registration number for each person that always consists of the year of registration, a number (sequence) and a suffix: 2014-0023-aaa. When you add this complete registration number to the index and you search on 2014 (without wildcards), the Lucene engine will return no results. To spare the end user from having to use wildcards, you can store the registration number in three separate parts. But when you want to search on the field RegistrationNumber, you’ll have to indicate that multiple fields have to be used.

That is why the SearchField attribute accepts an array of strings naming the other search fields that have to be taken into account. (Confused? See the TestApp project on GitHub.)
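To illustrate the splitting idea, here is a minimal sketch of breaking such a registration number into its three parts before indexing. The RegistrationNumber class and the SplitParts method are hypothetical helpers for this example and are not part of the wrapper.

```csharp
using System;

public static class RegistrationNumber
{
    // Splits a value like "2014-0023-aaa" into its year, sequence and suffix parts.
    public static string[] SplitParts(string registration)
    {
        if (registration == null)
        {
            throw new ArgumentNullException("registration");
        }

        var parts = registration.Split('-');
        if (parts.Length != 3)
        {
            throw new FormatException("Expected the form year-sequence-suffix, e.g. 2014-0023-aaa");
        }
        return parts;
    }
}
```

Each part can then be added to the index as its own field, so a plain search on 2014 can match the year part directly.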

    using (var searcher = new IndexSearcher(LuceneDirectory))
    {
        Log.Debug("Starting new IndexSearcher");
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var searchResults = new SearchResult
        {
            SearchTerm = searchQuery,
            SearchResultItems = new List<SearchResultItem>()
        };
        ScoreDoc[] hits;

        if (!string.IsNullOrEmpty(field))
        {
            // Search on one specific field (plus the fields combined with it)
            if (!fields.Contains(field))
            {
                throw new SearchException(string.Format("Field {0} is not a search field for type {1}", field, typeof(T)));
            }
            QueryParser parser = fieldsToSearchOn.Count == 1
                ? new QueryParser(Version.LUCENE_30, fieldsToSearchOn.First(), analyzer)
                : new MultiFieldQueryParser(Version.LUCENE_30, fieldsToSearchOn.ToArray(), analyzer);
            var query = ParseQuery(searchQuery, parser);
            hits = searcher.Search(query, HitsLimit).ScoreDocs;
        }
        else
        {
            // No field given: search on all fields marked with SearchField
            var parser = new MultiFieldQueryParser(Version.LUCENE_30, fields.ToArray(), analyzer);
            var query = ParseQuery(searchQuery, parser);
            hits = searcher.Search(query, null, HitsLimit, Sort.RELEVANCE).ScoreDocs;
        }

        if (hits != null)
        {
            Log.DebugFormat("Hits found: {0}", hits.Count());
            searchResults.Hits = hits.Count();
            foreach (var hit in hits)
            {
                // Only the stored Id is read back from the index
                var doc = searcher.Doc(hit.Doc);
                searchResults.SearchResultItems.Add(new SearchResultItem
                {
                    Id = Convert.ToInt32(doc.Get("Id")),
                    Score = hit.Score,
                });
            }
        }
        else
        {
            Log.DebugFormat("No hits found");
        }

        analyzer.Close();
        // The using block disposes the searcher
        return searchResults;
    }
} // end of Search<T>

ParseQuery

The search term that the end user enters has to be translated to a Lucene.NET Query. There are built-in methods to convert a string to a Query object, but they can throw a ParseException if invalid characters are used. Therefore I added a private ParseQuery method that catches those exceptions and filters out the invalid characters.

/// <summary>
/// Parse the given query string to a Lucene Query object
/// </summary>
/// <param name="searchQuery">The query string</param>
/// <param name="parser">The Lucene QueryParser</param>
/// <returns>A Lucene Query object</returns>
private Query ParseQuery(string searchQuery, QueryParser parser)
{
    parser.AllowLeadingWildcard = true;
    Query q;
    try
    {
        q = parser.Parse(searchQuery);
    }
    catch (ParseException e)
    {
        Log.Error("Query parser exception", e);
        q = null;
    }

    if (q == null || string.IsNullOrEmpty(q.ToString()))
    {
        // Replace everything except word characters, '.', '@' and '-' with a space and try again
        string cooked = Regex.Replace(searchQuery, @"[^\w\.@-]", " ");
        q = parser.Parse(cooked);
    }

    Log.DebugFormat("Parsed query for Lucene: \"{0}\"", q);
    return q;
}
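To see what the fallback in ParseQuery does to a search term, the regular expression can be tried on its own. The sketch below reuses the exact pattern from the method; the QuerySanitizer class and its Cook method are just names made up for this example.

```csharp
using System.Text.RegularExpressions;

public static class QuerySanitizer
{
    // Same pattern as in ParseQuery: every character that is not a word
    // character, '.', '@' or '-' is replaced by a single space.
    public static string Cook(string searchQuery)
    {
        return Regex.Replace(searchQuery, @"[^\w\.@-]", " ");
    }
}
```

An e-mail address such as test@localtest.me or a registration number such as 2014-0023-aaa passes through unchanged, while a term like name:(bart loses its colon and parenthesis. Note that wildcard characters such as * are replaced as well, so the fallback query is always a plain term search.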

SearchResult

Because we only want to return the Ids of the objects we’re searching for, we can use a generic SearchResult class to return the results. The number of hits and the search term are added to the result so they can be shown in the UI.

using System.Collections.Generic;

namespace LuceneWrapper
{
    /// <summary>
    /// Class to represent the Search results
    /// </summary>
    public class SearchResult
    {
        public string SearchTerm { get; set; }
        public List<SearchResultItem> SearchResultItems { get; set; }
        public int Hits { get; set; }
    }

    /// <summary>
    /// Class to represent the Search result item
    /// </summary>
    public class SearchResultItem
    {
        public int Id { get; set; }
        public float Score { get; set; }
    }
}

Test application

The above classes (plus a custom exception) are all the parts we need to start testing our search wrapper. In the GitHub repository you’ll find a TestApp project in the solution where the classes below are implemented.

Person

The example used is that of a Person who registers for a service. For simplicity’s sake, local classes are used as the repository instead of a database, but you’ll get the point.

We’ll start with the Person class that is used in some application. It’s all default stuff, plus an override of the ToString method to print out the Person.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace LuceneWrapper.TestApp
{
    public class Person
    {
        public int Id { get; set; }
        public string FirstName { get; set; }
        public string LastName { get; set; }
        public string EmailAddress { get; set; }
        public List<Language> Languages { get; set; }
        public DateTime RegistrationDate { get; set; }
        public int RegistrationNumber { get; set; }
        public string RegistrationSuffix { get; set; }

        public override string ToString()
        {
            // AppendLineFormat is not a built-in StringBuilder method; it is presumably
            // an extension method defined elsewhere in the TestApp project
            var sb = new StringBuilder();
            sb.AppendLineFormat("Id: {0}", Id);
            sb.AppendLineFormat("FirstName: {0}", FirstName);
            sb.AppendLineFormat("LastName: {0}", LastName);
            sb.AppendLineFormat("EmailAddress: {0}", EmailAddress);
            foreach (var language in Languages)
            {
                sb.AppendLineFormat("Language: {0}", language.LanguageCode);
            }
            sb.AppendLineFormat("Registration: {0}-{1}-{2}", RegistrationDate.Year, RegistrationNumber, RegistrationSuffix);
            return sb.ToString();
        }
    }

    public class Language
    {
        public int Id { get; set; }
        public string Description { get; set; }
        public string LanguageCode { get; set; }
    }
}

PersonDocument

These Person objects are what we want to add to our search index. To do so we have to create a PersonDocument class that inherits from our ADocument abstract class. We’ll add all properties with private backing fields so we can call the ‘AddParameterToDocument’ methods in the setters. Next to adding the property to the index we have to decorate the properties we want to search on. (See the RegistrationString property, where we added multiple SearchFields.)

At the bottom I added an operator method to cast a Person object to a PersonDocument object, so I don’t have to repeat the conversion in the business logic.

using System.Collections.Generic;
using System.Linq;

namespace LuceneWrapper.TestApp
{
    public class PersonDocument : ADocument
    {
        private string lastName;
        private string firstName;
        private IEnumerable<string> languages;
        private string regDate;
        private string regNr;
        private string regSuffix;

        [SearchField]
        public string LastName
        {
            get { return lastName; }
            set
            {
                lastName = value;
                AddParameterToDocumentNoStoreParameter("LastName", lastName);
            }
        }

        [SearchField]
        public string FirstName
        {
            get { return firstName; }
            set
            {
                firstName = value;
                AddParameterToDocumentNoStoreParameter("FirstName", firstName);
            }
        }

        [SearchField]
        public IEnumerable<string> Languages
        {
            get { return languages; }
            set
            {
                languages = value;
                // One index field per language, all under the same name
                foreach (var language in languages)
                {
                    AddParameterToDocumentNoStoreParameter("Languages", language);
                }
            }
        }

        [SearchField]
        public string RegDate
        {
            get { return regDate; }
            set
            {
                regDate = value;
                AddParameterToDocumentNoStoreParameter("RegDate", regDate);
            }
        }

        [SearchField]
        public string RegNr
        {
            get { return regNr; }
            set
            {
                regNr = value;
                AddParameterToDocumentNoStoreParameter("RegNr", regNr);
            }
        }

        [SearchField]
        public string RegSuffix
        {
            get { return regSuffix; }
            set
            {
                regSuffix = value;
                AddParameterToDocumentNoStoreParameter("RegSuffix", regSuffix);
            }
        }

        // Not indexed itself: a search on this field is redirected to the three fields listed
        [SearchField("RegDate", "RegNr", "RegSuffix")]
        public string RegistrationString { get; set; }

        public static explicit operator PersonDocument(Person person)
        {
            return new PersonDocument()
            {
                LastName = person.LastName,
                FirstName = person.FirstName,
                Languages = person.Languages.Select(l => l.LanguageCode),
                RegDate = person.RegistrationDate.Year.ToString(),
                RegNr = person.RegistrationNumber.ToString(),
                // Note: this assigns the backing field, so the suffix is not added to the index.
                // Assigning the RegSuffix property instead would index it, but would need a
                // null check first because value.ToString() fails for a missing suffix.
                regSuffix = person.RegistrationSuffix,
                Id = person.Id
            };
        }
    }
}

PersonWriter

The PersonWriter class is not that difficult to implement now that our PersonDocument class is defined. We just have to add the add/update methods and the delete methods, which call the base class methods.

using System.Collections.Generic;
using System.Linq;

namespace LuceneWrapper.TestApp
{
    public class PersonWriter : BaseWriter
    {
        public PersonWriter(string dataFolder) : base(dataFolder)
        {
        }

        public void AddUpdatePersonToIndex(Person person)
        {
            AddUpdateItemsToIndex(new List<PersonDocument> { (PersonDocument)person });
        }

        public void AddUpdatePeopleToIndex(List<Person> people)
        {
            AddUpdateItemsToIndex(people.Select(p => (PersonDocument)p).ToList());
        }

        public void DeletePersonFromIndex(Person person)
        {
            DeleteItemsFromIndex(new List<PersonDocument> { (PersonDocument)person });
        }

        public void DeletePersonFromIndex(int id)
        {
            DeleteItemsFromIndex(new List<PersonDocument> { new PersonDocument { Id = id } });
        }
    }
}

PersonSearcher

The PersonSearcher class is not that difficult either, with the Search method already implemented in the BaseSearcher class.

namespace LuceneWrapper.TestApp
{
    public class PersonSearcher : BaseSearcher
    {
        public PersonSearcher(string dataFolder) : base(dataFolder)
        {
        }

        public SearchResult SearchPeople(string searchTerm, string field)
        {
            return Search<PersonDocument>(field, searchTerm);
        }
    }
}

Program

In the Program.cs file you’ll find the creation of multiple Person objects and how they are added to the index. Underneath you’ll find the different search calls and the printing of their results.

using System;
using System.Collections.Generic;
using System.Linq;

namespace LuceneWrapper.TestApp
{
    public class Program
    {
        private static List<Person> people;

        static void Main(string[] args)
        {
            string dataFolder = @"C:\Temp\LuceneWrapper";
            LoadPeople();

            var writer = new PersonWriter(dataFolder);
            writer.AddUpdatePeopleToIndex(people);

            var searcher = new PersonSearcher(dataFolder);

            Console.WriteLine("Search on first name Bart in FirstName field");
            var res = searcher.SearchPeople("Bart", "FirstName");
            PrintResult(res);

            Console.WriteLine("Search on 2014 in LastName field");
            res = searcher.SearchPeople("2014", "LastName");
            PrintResult(res);

            Console.WriteLine("Search on 2014 in RegistrationString field");
            res = searcher.SearchPeople("2014", "RegistrationString");
            PrintResult(res);

            Console.WriteLine("Search on nl in all fields");
            res = searcher.SearchPeople("nl", string.Empty);
            PrintResult(res);

            Console.ReadKey();
        }

        private static void PrintResult(SearchResult res)
        {
            Console.WriteLine();
            Console.WriteLine("Results found: {0}", res.Hits);
            foreach (var item in res.SearchResultItems)
            {
                Console.WriteLine("Result with ID: {0}", item.Id);
                // Look the person up by the Id returned from the index
                Console.WriteLine(people.First(p => p.Id == item.Id));
            }
        }

        private static void LoadPeople()
        {
            var lang1 = new Language { Description = "Dutch", Id = 1, LanguageCode = "nl-BE" };
            var lang2 = new Language { Description = "French", Id = 2, LanguageCode = "nl-FR" };
            var lang3 = new Language { Description = "english", Id = 3, LanguageCode = "en-UK" };

            people = new List<Person>
            {
                new Person
                {
                    FirstName = "Bart",
                    LastName = "De Meyer",
                    EmailAddress = "test@localtest.me",
                    Id = 1,
                    RegistrationDate = new DateTime(2014, 1, 10),
                    RegistrationNumber = 1,
                    RegistrationSuffix = "a",
                    Languages = new List<Language> {lang1}
                },
                new Person
                {
                    FirstName = "Eddy",
                    LastName = "Janssens",
                    EmailAddress = "eddy@janssens.me",
                    Id = 2,
                    RegistrationDate = new DateTime(2014, 1, 4),
                    RegistrationNumber = 2,
                    Languages = new List<Language> {lang2, lang3}
                },
                new Person
                {
                    FirstName = "Luc",
                    LastName = "Peeters",
                    EmailAddress = "luc@peeters.me",
                    Id = 3,
                    RegistrationDate = new DateTime(2013, 12, 15),
                    RegistrationNumber = 3,
                    Languages = new List<Language> {lang1, lang3}
                },
                new Person
                {
                    FirstName = "Heike",
                    LastName = "Wouters",
                    EmailAddress = "heike@wouters.me",
                    Id = 4,
                    RegistrationDate = new DateTime(2013, 11, 20),
                    RegistrationNumber = 4,
                    Languages = new List<Language> {lang1, lang2}
                }
            };
        }
    }
}

Conclusion

Although Lucene.NET has far more options than shown in this blog post, the wrapper will at least give you the basic search possibilities. Of course you can extend the base classes to add more search options, index options, etc.

With the basic settings in the Document class that inherits from ADocument, you avoid having to create numerous searchers or indexers.

All source code can be found on GitHub; feel free to fork or download it.

Bart De Meyer – Blog

Searching with a Lucene.NET wrapper