SOLR search component
- Mentor(s)
- Student
- Estimated workload
2-3 months
- Details
The objective of this project is to use the Apache SOLR search engine as the indexing and search engine for XWiki.
XWiki is a very flexible wiki, used on sites both massive and small, with content ranging from highly structured to mostly textual. The SOLR component should offer the same flexibility:
- based on SOLR's schema and complementary information, the indexing process should be customizable so that it indexes and stores only as much information as needed.
- through code customizability (exploiting the possibility of Groovy code in pages), the transformation of a user query into a SOLR query should be adjustable, far beyond simple text parsing (enabling, for example, the prohibition of some spaces, or the conversion to multiple fields based on input parameters).
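The query transformation described in the last point can be sketched in plain Java. This is only an illustration under assumptions: the field names `space`, `title` and `doccontent` and the class `QueryTransformSketch` are hypothetical and not taken from the project's schema or code. `defType`, `q`, `qf` and `fq` are standard Solr request parameters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class QueryTransformSketch {

    /**
     * Turns raw user input into Solr request parameters.
     * The "space" field and the searched field names are assumptions for this sketch.
     */
    public static Map<String, List<String>> toSolrParams(
            String userQuery, List<String> fields, List<String> prohibitedSpaces) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        // Let the DisMax parser spread the user's text over several fields.
        params.put("defType", Collections.singletonList("dismax"));
        params.put("q", Collections.singletonList(userQuery.trim()));
        params.put("qf", Collections.singletonList(String.join(" ", fields)));

        // One negative filter query per prohibited space.
        List<String> filters = new ArrayList<>();
        for (String space : prohibitedSpaces) {
            String escaped = space.replace("\\", "\\\\").replace("\"", "\\\"");
            filters.add("-space:\"" + escaped + "\"");
        }
        if (!filters.isEmpty()) {
            params.put("fq", filters);
        }
        return params;
    }

    public static void main(String[] args) {
        List<String> fields = new ArrayList<>();
        fields.add("title");
        fields.add("doccontent");
        System.out.println(
            toSolrParams("solr facets", fields, Collections.singletonList("Sandbox")));
    }
}
```

Keeping the exclusions in `fq` rather than in `q` means they restrict the result set without influencing the relevancy score.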
Finally, this component should support calibrating the search engine's parameters (such as the DisMax parser's coefficients, the usefulness of the analyzers, ...) by classical quantitative methods such as precision and recall, which any wiki master or collaborator should be able to use and report with.
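As a minimal sketch of the precision-and-recall measurement mentioned above: given the documents a query returned and a hand-made set of documents judged relevant, precision is the fraction of returned documents that are relevant, and recall is the fraction of relevant documents that were returned. The class name and the document identifiers are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

public class PrecisionRecallSketch {

    /** Number of distinct returned documents that are also judged relevant. */
    private static int hits(Set<String> returned, Set<String> relevant) {
        Set<String> common = new HashSet<>(returned);
        common.retainAll(relevant);
        return common.size();
    }

    /** Fraction of returned documents that are relevant (0 if nothing was returned). */
    public static double precision(Collection<String> returned, Set<String> relevant) {
        Set<String> distinct = new HashSet<>(returned);
        return distinct.isEmpty() ? 0.0 : (double) hits(distinct, relevant) / distinct.size();
    }

    /** Fraction of relevant documents that were returned (0 if nothing is relevant). */
    public static double recall(Collection<String> returned, Set<String> relevant) {
        Set<String> distinct = new HashSet<>(returned);
        return relevant.isEmpty() ? 0.0 : (double) hits(distinct, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        Set<String> relevant = new HashSet<>(Arrays.asList("A", "C", "E"));
        Collection<String> returned = Arrays.asList("A", "B", "C", "D");
        // 2 of the 4 returned documents are relevant; 2 of the 3 relevant ones were found.
        System.out.println("precision = " + precision(returned, relevant));
        System.out.println("recall    = " + recall(returned, relevant));
    }
}
```

Running the same set of judged queries before and after changing a parameter gives a comparable pair of numbers for each setting.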
Some groundwork for this project already exists:
- the XWiki Lucene component exists and is widely used, but lacks flexibility
- a SOLR component has been started by Fabio Mancinelli; see his announcement: http://lists.xwiki.org/pipermail/devs/2011-September/thread.html#48372.
- Developer profile
- Java programming
- Understanding of Information Retrieval principles
- Experience with Apache SOLR or Apache Lucene is a plus.
- Active
- Yes
- Year
2012
- Status
Selected
- Progress
Description
The idea of the project is to use Apache SOLR as the search engine for XWiki. XWiki currently uses Lucene as the core component for wiki search. Lucene is somewhat hard to configure and does not offer features such as facet search, hit highlighting, or index boosting to customize search relevancy out of the box. Solr stands out for the minimal configuration it needs to implement a search engine: a few libraries and a couple of XML configuration files are sufficient for a capable engine. Configuring multiple languages is also easier in SOLR than in Lucene. Using SOLR, one can customize the indexing process by applying the required analyzers, with selected tokenizers and sets of filters, to the dataset in order to generate a highly customizable relevancy index. Through the front end, the user can select or configure the fields to be searched and their weights, which contribute to the document score. The link to the Design Page is given below:
http://dev.xwiki.org/xwiki/bin/view/Design/SOLRSearchIntegration
Milestones
Week | Days | Description
(pre-coding) | 25 April - 20 May | Community Bonding period. Get familiar with the XWiki platform and coding practices, come up with a good design proposal for the project and fix some JIRA issues.
Week 1 | 21 May - 27 May | Work on the API by speaking to mentors and the community.
Week 2 | 28 May - 03 June | Work on the Solr embedded server component.
Week 3 & 4 | 04 Jun - 17 Jun | Complete the Solr embedded server component, the basic front-end search GUI and the facet search implementation at the back end.
Week 5 & 6 | 18 Jun - 28 Jun | Customizing fields using the index, hit highlighting and partial indexing of attachments done.
Milestone 1 | June 29 | Share the basic Solr search component and get feedback.
Week 6 & 7 | 29 Jun - 08 Jul | Documentation, refactoring and code optimization. Improve the Solr component based on the feedback. Implement facet search (GUI), implement indexing of comments.
Week 8 | 09 Jul - 15 Jul | Complete the search component with customizable search fields. Integrate analyzers for different languages.
Week 9-10 | July 17 - 28 | Work on the Admin part.
Week 11 | 30 Jul - 08 Aug | Work on search filter, debug mode, sorting based on relevancy, auto-suggest and quick search bar.
Milestone 2 | 09 Aug | Share the Admin part and Advanced search.
Week 12 | 10 Aug - 12 Aug | Documentation, setup file, user guide.
Week 13 | 13 Aug - 18 Aug | Test with some real-time data and calibrate the indexes. Test the quality of the search engine by creating a test suite creator and evaluator. Documentation on calibrating.
Milestone 3 | 19 Aug | Share the work with the community.
Basic Implementation Steps
* Initialize the component and load the solr configuration on server start.
* Do an incremental indexing of the wiki content and make it ready for querying. (In the existing setup, indexing is done when the first search is made. This could be made configurable: indexing at startup, or on the first search, which is good for small wikis.)
* Register for page events using xwiki-platform-observation, to reindex documents on add/delete/edit operations.
* Allow the user to customize which fields a query searches: title only, body, comments, attachments and other metadata.
* Allow the administrator to tweak the weights to boost relevancy on particular fields.
* Write more JUnit tests and follow test-driven development.
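The field selection and weight tweaking steps above could map onto the DisMax `qf` parameter, which accepts `field^boost` pairs. The following is a minimal sketch under assumptions: the class `FieldBoostSketch`, the field names, and the convention that a weight of zero or less means "do not search this field" are hypothetical, not taken from the project's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FieldBoostSketch {

    /**
     * Builds a DisMax "qf" value such as "title^4.0 doccontent^1.0"
     * from per-field weights. Fields with a weight <= 0 are left out
     * (an assumed convention for "field not selected").
     */
    public static String buildQf(Map<String, Double> weights) {
        StringJoiner qf = new StringJoiner(" ");
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            if (entry.getValue() <= 0) {
                continue;
            }
            qf.add(entry.getKey() + "^" + entry.getValue());
        }
        return qf.toString();
    }

    public static void main(String[] args) {
        // Hypothetical field names; insertion order is preserved by LinkedHashMap.
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("title", 4.0);
        weights.put("doccontent", 1.0);
        weights.put("comment", 0.5);
        weights.put("attachment", 0.0); // deselected by the user
        System.out.println(buildQf(weights)); // prints: title^4.0 doccontent^1.0 comment^0.5
    }
}
```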
Recent code
The recent source code is available at:
https://github.com/xwiki-contrib/xwiki-platform-solr
Running Instance
I have configured an XWiki server in the Amazon EC2 cloud to play around with the Solr Search Component. I have included a few documents in the following spaces: Programming, Places, Flora, Fauna. Below is the link to the running instance.
http://savitha.hoplahup.net/xwiki/bin/view/Main/AdvancedSearch
Detailed Progress
The detailed progress of the Solr project can be found here:
https://docs.google.com/spreadsheet/pub?key=0AkC67pvTmc3zdHNaRldQdTVFaTJ1SkhpbVd2UnhOX0E&output=html
API
The API is given as part of the Design page. The link to the Design Page is given below:
http://dev.xwiki.org/xwiki/bin/view/Design/SOLRSearchIntegration
Features of Solr and Set up Documentation
The current features of Solr are explained in the link below:
http://dev.xwiki.org/xwiki/bin/view/GoogleSummerOfCode/SolrSearchApplication