Indexing in Apache Solr
Indexing is the arrangement of documents or other entities in a systematic way. We use an index to locate information in a document.
- Indexing collects, parses, and stores documents.
- It increases the speed and performance of search queries when we look for a required document.
Overview of the Solr indexing process
The indexing process in Apache Solr breaks down into three essential tasks:
- Convert a document from its native format into a format supported by Solr, such as XML or JSON.
- Add the document to Solr using one of several well-defined interfaces, such as HTTP POST.
- Configure Solr to apply transformations to the text in the document during indexing.
The figure below provides a high-level overview of the three steps needed to get your document indexed in Solr.
Solr supports several indexing formats for documents, including JSON, XML, and CSV. In the figure, we selected XML because its self-describing format is easy to understand and adapt. The following example shows how our tweet would look in the Solr XML format.
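As a rough illustration of the Solr XML format, the following Java sketch renders a document as an `<add>` message that wraps one `<doc>` element with one `<field name="...">` element per field. The class name `SolrXmlSketch` and the field names `id`, `screen_name`, and `text` are assumptions chosen for illustration; they are not the fields of the original tweet example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SolrXmlSketch {

    // Escape the characters that cannot appear verbatim in XML text or attributes.
    public static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Render one document as a Solr XML <add> message: one <field> per map entry.
    public static String toAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add>\n  <doc>\n");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            xml.append("    <field name=\"").append(escape(f.getKey())).append("\">")
               .append(escape(f.getValue()))
               .append("</field>\n");
        }
        return xml.append("  </doc>\n</add>").toString();
    }

    public static void main(String[] args) {
        // Hypothetical tweet fields, used only for illustration.
        Map<String, String> tweet = new LinkedHashMap<>();
        tweet.put("id", "1");
        tweet.put("screen_name", "@example_user");
        tweet.put("text", "Learning Solr & indexing");
        System.out.println(toAddXml(tweet));
    }
}
```

Note that the message carries only field names and values; nothing in it says how a field should be analyzed.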
As you can see, each field is expressed in XML and the syntax is straightforward: we only define the field name and value for each field. What you don’t see is anything about text analysis or field types. This is because how fields are analyzed is defined in the schema.xml document shown in the figure.
Solr provides a basic HTTP-based interface to all of its core services, including a document-update service for adding and updating documents. At the top left of the figure, we depict sending the XML for our tweet example to Solr's document-update service with an HTTP POST. We will see how to add specific document types such as JSON, CSV, and XML later in this tutorial.
The document-update service validates the contents of each field in the document and then invokes the text-analysis process. Once all the fields are analyzed, the resulting text is added to the index, making the document available for search.
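The HTTP POST to the document-update service can be sketched with the HTTP client that ships with Java 11 and later. This is a minimal sketch under stated assumptions: Solr is assumed to run at `http://localhost:8983/solr`, the core name `my_core` is hypothetical, and `commit=true` is added so that the document becomes searchable right after the update. The class and method names are illustrative only.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SolrUpdatePostSketch {

    // Build (but do not send) a POST request carrying a Solr XML <add> message.
    public static HttpRequest buildUpdateRequest(String solrBaseUrl, String core, String addXml) {
        // commit=true asks Solr to commit the change as part of this update.
        URI uri = URI.create(solrBaseUrl + "/" + core + "/update?commit=true");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/xml")
                .POST(HttpRequest.BodyPublishers.ofString(addXml))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumed base URL and hypothetical core name; adjust to your installation.
        String addXml = "<add><doc><field name=\"id\">1</field></doc></add>";
        HttpRequest request = buildUpdateRequest("http://localhost:8983/solr", "my_core", addXml);

        // Sending requires a running Solr instance with that core.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP status: " + response.statusCode());
    }
}
```

Building the request is kept separate from sending it, so the URL, method, and headers can be inspected without a running server.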
The schema.xml file defines the fields and field types for your documents. For a simple application, knowing the fields to search and their types may be all that is required. Even so, it helps to do some up-front planning about your schema.
Designing your schema
With our example microblog search application, we could dive right in and define the document we want to index. In practice, this process isn’t always obvious for a real application, so it helps to do some up-front design and planning work. Now, we will learn about fundamental design considerations for search applications. Specifically, we’ll learn to answer the following critical questions about our search application:
- What are the documents in our index?
- How is each document uniquely identified?
- Which fields in our documents are generally searched by users?
- Which fields should be displayed to users in search results?
Let’s start by determining the appropriate granularity of a document in your search application, since it affects how you answer the other questions.
Adding document using Post Command
Inside the bin directory of Solr, there is a post command. Using it, we can index files in various formats, such as CSV, JSON, and XML.
Go to the bin directory of Solr and run the post command with the -h option, as given in the code block below.
When we execute the above command, we get a list of the options for the post command, as shown below.
Example
Let us take a file named sample.csv with the following content (in the bin directory).
Student ID | First Name | Phone | City |
---|---|---|---|
001 | Olivia | +148022337 | California |
002 | Emma | +148022338 | Hawaii |
003 | Sophia | +148022339 | Florida |
004 | Emily | +148022330 | Texas |
005 | Harper | +148022336 | Kansas |
006 | Scarlett | +148022335 | Kentucky |
The table above contains student details: ID, first name, phone number, and city. The CSV file to be indexed is given below. Note that the first line of the file must list the field names as defined in the schema.
id | first_name | phone_no | location |
---|---|---|---|
001 | Olivia | +848022337 | Michigan |
002 | Emma | +848022338 | Minnesota |
003 | Sophia | +848022339 | North Carolina |
004 | Emily | +848022330 | Ohio |
005 | Harper | +848022336 | Oregon |
006 | Scarlett | +848022335 | Pennsylvania |
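To illustrate why the first line matters, here is a minimal Java sketch that treats the first line of a CSV file as the field names and pairs every following row with them, producing one field-to-value map per document. It is not Solr's own CSV parser: the class name `CsvHeaderSketch` is hypothetical, and the naive comma split does not handle quoted values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CsvHeaderSketch {

    // Pair each data row with the field names taken from the first line.
    // Naive split on commas: quoted values containing commas are not handled.
    public static List<Map<String, String>> toDocuments(List<String> lines) {
        String[] fieldNames = lines.get(0).split(",");
        List<Map<String, String>> docs = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] values = line.split(",", -1);
            Map<String, String> doc = new LinkedHashMap<>();
            for (int i = 0; i < fieldNames.length && i < values.length; i++) {
                doc.put(fieldNames[i].trim(), values[i].trim());
            }
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        // First line: field names; remaining lines: one document each.
        List<String> lines = List.of(
                "id,first_name,phone_no,location",
                "001,Olivia,+848022337,Michigan",
                "002,Emma,+848022338,Minnesota");
        for (Map<String, String> doc : toDocuments(lines)) {
            System.out.println(doc);
        }
    }
}
```

If the header names do not match fields known to the schema, the values have nowhere to go, which is why the header line has to follow the schema.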
You can index this data in the core named “Solr_sample” using the post command given below:
When we execute the above command, the document is indexed under the specified core and the following output is generated.
Go to the homepage of Solr Web User Interface using the given URL:
http://localhost:8983/
Select the core Solr_sample on the homepage and open its Query page. Without making any modifications, click the Execute Query button at the bottom of the page.
When we execute the query, we can observe the data of the indexed CSV document in the default format (JSON), as displayed in the following screenshot.
Note: Similarly, we can index other file formats such as XML and JSON.
Adding Documents using the Solr Web Interface
You can also index documents using the web interface provided by Solr. Let us see how to index the following JSON document.
Step 1: Go to the Solr web interface using the given URL – http://localhost:8983/
Step 2: Choose the core “Solr_sample” and open its Documents page. By default, the fields Request Handler, Commit Within, Overwrite, and Boost are set to /update, 1000, true, and 1.0, respectively, as shown in the image below.
Step 3: Finally, select the document format you want, such as CSV, XML, or JSON. Enter the document you want to index in the text area and click the Submit Document button, as displayed in the following screenshot.
Adding Documents using Java Client API
The following Java source code adds documents to the Solr index. Save this program in a file named AddDocument.java.
Compile and run the above code by executing the following commands in the terminal:
When we run the above commands, we receive the following output.
Output:
Documents added.