Indexing in Apache Solr

Indexing is the systematic organization of documents or other entities so that the information in them can be located quickly.

Indexing collects, parses, and stores documents.
It increases the speed and performance of search queries when we look for a required document.

Overview of the Solr indexing process

The indexing process in Apache Solr breaks down into three essential tasks:

Convert a document from its native format into a format supported by Solr, such as XML or JSON.
Add the document to Solr using one of several well-defined interfaces, such as HTTP POST.
Configure Solr to apply transformations to the text in the document during indexing.

The figure below provides a high-level overview of these three steps for getting a document indexed in Solr.

[Figure: high-level overview of the three indexing steps in Solr]

Solr supports several indexing formats for documents, including JSON, XML, and CSV. In the figure above, we selected XML because its self-describing format is easy to understand and adapt. The following example shows how our tweet would look in the Solr XML format.

  <add>
    <doc>
      <field name="id">1</field>
      <field name="screen_name">@thelabdude</field>
      <field name="type">post</field>
      <field name="timestamp">2012-05-22T09:30:22Z</field>
      <field name="lang">en</field>
      <field name="user_id">99991234567890</field>
      <field name="favorites_count">10</field>
      <field name="text">#Yummm :) Drinking Cappuccino Grecco in SF's historic North Beach… Learning text analysis with #SolrInAction by @tutoraspire on my i-Phone</field>
    </doc>
  </add>

As you can see, each field is represented in the XML and the syntax is straightforward: we define only the name and value of each field. What we don't see is anything about text analysis or field types. This is because we define how fields are analyzed in the schema.xml document shown in the figure above.

Solr provides a basic HTTP-based interface to all of its core services, including a document-update service for adding and updating documents. At the top left of the figure, we depict sending the XML for our example tweet to Solr's document-update service with an HTTP POST. We will see how to add specific document types such as JSON, CSV, and XML later in this tutorial.

The document-update service validates the contents of every field in the document and then invokes the text-analysis process. Once all the fields have been analyzed, the resulting text is added to the index, making the document available for search.
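The conversion and HTTP POST described above can be sketched in Java using only the standard library (Java 11 or later for java.net.http). This is a minimal illustration, not Solr's own client code: the class SolrXmlPost, its helper methods, and the core name my_core are hypothetical, and the sketch assumes Solr is reachable under the default http://localhost:8983/solr path. The request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;

public class SolrXmlPost {

    // Escape the characters that are special in XML text and attribute values.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Task 1: convert field name/value pairs into Solr's XML <add> format.
    static String toAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            xml.append("<field name=\"").append(escapeXml(e.getKey())).append("\">")
               .append(escapeXml(e.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    // Task 2: build an HTTP POST to the core's update handler (assumed core URL).
    static HttpRequest buildUpdateRequest(String coreUrl, String xml) {
        return HttpRequest.newBuilder(URI.create(coreUrl + "/update?commit=true"))
                .header("Content-Type", "application/xml")
                .POST(HttpRequest.BodyPublishers.ofString(xml))
                .build();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> tweet = new LinkedHashMap<>();
        tweet.put("id", "1");
        tweet.put("screen_name", "@thelabdude");
        tweet.put("lang", "en");

        String xml = toAddXml(tweet);
        HttpRequest request = buildUpdateRequest("http://localhost:8983/solr/my_core", xml);

        System.out.println(xml);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request would additionally need a java.net.http.HttpClient and a running Solr instance. The third task, text analysis, happens inside Solr according to the schema, so it does not appear in client code.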

The schema.xml file defines the fields and field types for our documents. For a simple application, knowing the fields to search and their types may be all that is required. Even so, it helps to do some up-front planning about your schema.

Designing your schema

With our example microblog search application, we dove right in and defined the document we want to index. In practice, this process isn't always so obvious for a real application, so it helps to do some up-front design and planning work. In this section, we cover the fundamental design considerations for search applications. Specifically, we'll learn to answer the following critical questions about our search application:

What is a document in our index?
How is each document uniquely identified?
Which fields in our documents will users search?
Which fields should be displayed to users in the search results?

Let's begin by determining the appropriate granularity of a document in our search application, which affects how we answer the other questions.
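The answers to these questions translate fairly directly into schema settings: the unique identifier becomes the uniqueKey, a searchable field is declared with indexed="true", and a field shown in results is declared with stored="true". The sketch below is only a planning aid under stated assumptions: the class SchemaPlan and its nested PlannedField are hypothetical, the field names come from the tweet example, and the type names string and text_general are assumed to exist in the schema in use.

```java
import java.util.List;

public class SchemaPlan {

    // One planned field: "searchable" maps to indexed, "displayed" maps to stored.
    static final class PlannedField {
        final String name;
        final String type;
        final boolean searchable;
        final boolean displayed;

        PlannedField(String name, String type, boolean searchable, boolean displayed) {
            this.name = name;
            this.type = type;
            this.searchable = searchable;
            this.displayed = displayed;
        }

        // Render the plan as a schema <field> declaration.
        String toSchemaXml() {
            return "<field name=\"" + name + "\" type=\"" + type
                    + "\" indexed=\"" + searchable + "\" stored=\"" + displayed + "\"/>";
        }
    }

    public static void main(String[] args) {
        // Hypothetical answers to the design questions for the tweet document.
        List<PlannedField> plan = List.of(
                new PlannedField("id", "string", true, true),
                new PlannedField("screen_name", "string", true, true),
                new PlannedField("text", "text_general", true, true),
                new PlannedField("user_id", "string", false, true));

        for (PlannedField field : plan) {
            System.out.println(field.toSchemaXml());
        }
        // The field that uniquely identifies each document.
        System.out.println("<uniqueKey>id</uniqueKey>");
    }
}
```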

Adding Documents using the Post Command

Solr's bin directory contains a post command. With it, we can index files in various formats, such as CSV, JSON, and XML.

From Solr's bin directory, we can run the post command with the -h option (./post -h) to display its help.

Executing this command prints the list of options for the post command, as shown below.

  Usage: post -c <collection> [OPTIONS] <files|directories|urls|-d [".."]>
     or post -help

     collection name defaults to DEFAULT_SOLR_COLLECTION if not specified

  OPTIONS
  =======
    Solr options:
      -url <base Solr update URL> (overrides collection, host, and port)
      -host <host> (default: localhost)
      -p or -port <port> (default: 8983)
      -commit yes|no (default: yes)

    Web crawl options:
      -recursive <depth> (default: 1)
      -delay <seconds> (default: 10)

    Directory crawl options:
      -delay <seconds> (default: 0)

    stdin/args options:
      -type <content/type> (default: application/xml)

    Other options:
      -filetypes <type>[,<type>,...] (default:
        xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log)
      -params "<key>=<value>[&<key>=<value>...]" (values must be
        URL-encoded; these pass through to Solr update request)
      -out yes|no (default: no; yes outputs Solr response to console)
      -format solr (sends application/json content as Solr commands
        to /update instead of /update/json/docs)

Example

Let us take a file named sample.csv (in the bin directory) with the following content.

Student ID	First Name	Phone	City
001	Olivia	+148022337	California
002	Emma	+148022338	Hawaii
003	Sophia	+148022339	Florida
004	Emily	+148022330	Texas
005	Harper	+148022336	Kansas
006	Scarlett	+148022335	Kentucky

The table above contains personal details: student ID, first name, phone number, and city. The corresponding CSV data file is given below. Note that the first line of the file must list the field names (the schema).

id,first_name,phone_no,location
001,Olivia,+848022337,Michigan
002,Emma,+848022338,Minnesota
003,Sophia,+848022339,North Carolina
004,Emily,+848022330,Ohio
005,Harper,+848022336,Oregon
006,Scarlett,+848022335,Pennsylvania
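To illustrate why the first line matters, here is a rough sketch of how a header line can be used to name the values on every later line. It is not Solr's CSV loader: the class CsvHeaderDemo and its parse method are hypothetical, the sample data is embedded as a comma-separated string, and the naive split on commas does not handle quoted values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CsvHeaderDemo {

    // Use the first line as field names and turn each later line into a name -> value map.
    static List<Map<String, String>> parse(String csv) {
        String[] lines = csv.strip().split("\\R");
        String[] names = lines[0].split(",");
        List<Map<String, String>> docs = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            // The -1 limit keeps trailing empty values instead of dropping them.
            String[] values = lines[i].split(",", -1);
            Map<String, String> doc = new LinkedHashMap<>();
            for (int j = 0; j < names.length && j < values.length; j++) {
                doc.put(names[j].trim(), values[j].trim());
            }
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        String csv = "id,first_name,phone_no,location\n"
                + "001,Olivia,+848022337,Michigan\n"
                + "002,Emma,+848022338,Minnesota\n";

        // Each map plays the role of one document with named fields.
        for (Map<String, String> doc : parse(csv)) {
            System.out.println(doc);
        }
    }
}
```

Without the header line, the values would have no field names to be matched against in the schema.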

You can index this data in the core named Solr_sample by running the post command from the bin directory as ./post -c Solr_sample sample.csv.

Executing this command indexes the document under the specified core and generates the following output.

  /home/Hadoop/java/bin/java -classpath /home/Hadoop/Solr/dist/solr-core-6.2.0.jar -Dauto=yes -Dc=Solr_sample -Ddata=files org.apache.solr.util.SimplePostTool sample.csv
  SimplePostTool version 5.0.0
  Posting files to [base] url http://localhost:8983/solr/Solr_sample/update...
  Entering auto mode. File endings considered are
  xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
  POSTing file sample.csv (text/csv) to [base]
  1 files indexed.
  COMMITting Solr index changes to http://localhost:8983/solr/Solr_sample/update...
  Time spent: 0:00:00.228

Go to the home page of the Solr web interface using the following URL:

http://localhost:8983/

Select the core Solr_sample on the home page. Without making any modifications, click the Execute Query button at the bottom of the page.

[Screenshot: query page for the Solr_sample core]

When we execute the query, we can see the contents of the indexed CSV document in the default format (JSON), as displayed in the following screenshot.

[Screenshot: query results in JSON format]

Note – Similarly, we can index different file formats like XML, CSV, JSON, etc.

Adding Documents using the Solr Web Interface

You can also index documents using the web interface provided by Solr. Let us see how to index the following JSON document.

  [
     {
        "id" : "001",
        "name" : "Emma",
        "age" : 25,
        "Designation" : "Executive",
        "Location" : "Texas"
     },
     {
        "id" : "002",
        "name" : "Robert",
        "age" : 43,
        "Designation" : "SR.Programmer",
        "Location" : "New York"
     },
     {
        "id" : "003",
        "name" : "John",
        "age" : 25,
        "Designation" : "JR.Programmer",
        "Location" : "California"
     }
  ]

Step 1: Go to the Solr web interface using the given URL – http://localhost:8983/

[Screenshot: Solr web interface home page]

Step 2: Choose the core Solr_sample. By default, the fields Request Handler, Commit Within, Overwrite, and Boost have the values /update, 1000, true, and 1.0 respectively, as shown in the image below.

Step 3: Finally, select the document format you want, such as CSV, XML, or JSON. Enter the document to be indexed in the text area and click the Submit Document button, as displayed in the following screenshot.

[Screenshot: submitting a document from the Solr web interface]
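The Request Handler, Commit Within, and Overwrite fields from Step 2 correspond to the handler path and to the commitWithin and overwrite parameters of a Solr update request. As a rough sketch of that mapping, and not a description of how the web interface is implemented, the following Java code assembles such a URL. The class UpdateUrlDemo and the method updateUri are hypothetical, and the base URL assumes a default local installation.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class UpdateUrlDemo {

    // Combine the core name, request handler, and form values into an update URL.
    static URI updateUri(String baseUrl, String core, String handler,
                         int commitWithinMs, boolean overwrite) {
        String encodedCore = URLEncoder.encode(core, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/solr/" + encodedCore + handler
                + "?commitWithin=" + commitWithinMs
                + "&overwrite=" + overwrite);
    }

    public static void main(String[] args) {
        // Default form values from Step 2: /update, 1000, true.
        URI uri = updateUri("http://localhost:8983", "Solr_sample", "/update", 1000, true);
        System.out.println(uri);
    }
}
```

A JSON document such as the one above would then be sent to this URL in the body of an HTTP POST with the Content-Type header set to application/json.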

Adding Documents using Java Client API

The following Java source code adds a document to the Solr index. Save this program with the name AddingDocument.java.

  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class AddingDocument {
     public static void main(String[] args) throws Exception {
        // Prepare the Solr client
        String urlString = "http://localhost:8983/solr/my_core";
        SolrClient solr = new HttpSolrClient.Builder(urlString).build();

        // Prepare the Solr document
        SolrInputDocument doc = new SolrInputDocument();

        // Add fields to the document
        doc.addField("id", "003");
        doc.addField("name", "Rajaman");
        doc.addField("age", "34");
        doc.addField("addr", "vishakapatnam");

        // Add the document to Solr
        solr.add(doc);

        // Save the changes
        solr.commit();
        System.out.println("Documents added");

        // Release the client's resources
        solr.close();
     }
  }

With the SolrJ libraries on the classpath, the above code can be compiled and run by executing the commands javac AddingDocument.java and java AddingDocument on the terminal.

When we run the program, we will receive the following output on our display.

Output:

Documents added
