Analysis and Analyzers
Elastic Search breaks (tokenizes) the data in a document and builds an index of words (tokens). For each token, the index points to the documents that match that token. The words (tokens) are transformed in a particular manner before being stored in the index. This process of breaking a document into a set of words and then transforming each word as per the specification is known as Analysis. Analysis is performed by configured pieces of software commonly referred to as Analyzers. Several analyzers come bundled with Elastic Search (like the Standard Analyzer), but analyzers can also be configured to act in a custom manner, i.e. Custom Analyzers.
We can instruct Elastic Search to apply analysis to input documents in exactly the way we want. This process is called defining custom analyzers, and we can do it while updating an existing index as well as while creating a new index.
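As a quick illustration (the exact shape of the _analyze request varies a bit across Elastic Search versions; on older versions the text is passed as the request body), the bundled standard analyzer can be tried out like this –
GET /_analyze?analyzer=standard
The QUICK brown Fox!
The response should list the tokens the, quick, brown and fox, lowercased and with the punctuation stripped.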
Defining Custom Analyzer During Index Creation
Below is the structure of the request that will create an index, map individual fields to proper data types and define a custom analyzer –
PUT /members_idx
{
"settings": {
"index": { },
"analysis": {
"char_filter": { },
"tokenizer": { },
"filter": { },
"analyzer": { }
}
},
"mappings": {
"members": { }
}
}
The request above defines an index and associates analyzer(s) with it. The settings section defines the characteristics of the index and the analyzers, while the mappings section defines the schema of the “type” and links individual fields to analyzers.
The index section could be something like the below –
"index": {
"number_of_shards": 2,
"number_of_replicas": 1,
"refresh_interval": "2s"
}
In the index sub-section within settings, we can specify many different settings, though specifying shards and replicas is the most common and significant.
The settings top level section also contains the analysis sub-section, which groups all the configuration needed for defining analyzers. There are 4 types of configuration grouped under analysis – character filters, tokenizers, filters and analyzers.
Character filters are invoked first when analysis begins. They are optional; if needed, we can define 1 or more character filters in the char_filter section. Each character filter’s properties must be defined under its name. In the example below, I have defined a character filter “symbol_to_word” which acts on the input string and replaces special characters like “!”, “<” etc. with their word equivalents, like “not”, “less than” etc.
"char_filter": {
"symbol_to_word": {
"type": "mapping",
"mappings": [
"/=>or",
"!=>not",
"less than",
">=>greater than",
"*=>x",
"%=>percent"
]
}
}
Character Filters are optional to define and execute on the raw input string before tokenization happens. Tokenizers, however, are not optional: an analyzer must use exactly 1 tokenizer to tell Elastic Search how to break the input string into terms. We can use a built-in tokenizer (like standard), or we can define a new tokenizer as per our needs –
"tokenizer": {
"semicolon_tokenizer": {
"type": "pattern",
"pattern": ";+"
}
}
In this example, I have defined a tokenizer that will break the input string on the semicolon “;”. The pattern is actually a regular expression that says: look for 1 or more semicolons in the input string and break it into terms there. If no semicolon is found in the input string, then the whole string is treated as one term. Therefore, a string like “Michael Jackson; Indiana Jones; Jeff Richter; Vipul Pathak” will be broken at each semicolon and each name will become a term, as the example below shows.
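Once the index exists, this tokenizer can be tried directly with the _analyze API (a sketch; the exact request format depends on the Elastic Search version) –
GET /members_idx/_analyze?tokenizer=semicolon_tokenizer
Michael Jackson; Indiana Jones; Jeff Richter; Vipul Pathak
This should produce four terms. Note that the pattern tokenizer does not trim whitespace on its own, so a term like " Indiana Jones" keeps its leading space unless a trim filter is applied later in the chain (as the email_analyzer below does).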
The filter block within the analysis block defines the third building block of the analyzer. Including this section is optional, but if we do include it, it can define 1 or more filters. Filters are applied in the chain after tokenization is complete, and are called for each tokenized term, exactly in the order they are listed in the analyzer. The goal of a filter is to transform or remove a term before it is included in the index. Various types of filters can be defined in a single filter block –
"filter": {
"the_sgn_filter": {
"type": "stop",
"stopwords": [ "CANTT", "MAKKE", "ITT" ]
},
"eng_hindi_syn_filter": {
"type": "synonym",
"ignore_case": true,
"synonyms_path": "eng_hindi_synonyms.txt"
},
"engram_filter": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 17
}
}
The filter of type “stop” is used as a Stop Filter: it matches its list of words against the terms, and if a match is found, the term is dropped and not included in the index. An “edgeNGram” filter generates edge nGrams of the terms, i.e. nGrams that are between 2 and 17 characters long and whose one end is anchored to the edge of the term. For example, “typeahead” will generate nGrams like “ty”, “typ”, “type”, “typea”, “typeah”, “typeahe”, “typeahea” and “typeahead” (all starting from t) but not ones like “ype” or “ahead”.
A synonym type filter will generate synonyms of the terms used, and at search time the original term or its synonym will match identically. In this way, the chance of matching increases. The list of synonyms can be supplied inline, like stop words, if the list is small, or a separate synonym file can be supplied that lists all the synonyms in the format shown below –
sgn => shri ganeshay namah
aum => om
swagat => welcome
namaskar => greetings
ghadi => watch
sanganak => computer
guru => teacher
In the above synonym list, I am trying to map Hindi spoken words to their English equivalents.
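For a small list, the same filter could instead be declared with the synonyms supplied inline; a sketch using a few entries from the list above –
"eng_hindi_syn_filter": {
  "type": "synonym",
  "ignore_case": true,
  "synonyms": [
    "aum => om",
    "ghadi => watch",
    "guru => teacher"
  ]
}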
Finally, all three building blocks need to be brought together in the analyzer block. The analyzer block can contain one or more custom analyzer(s). A custom analyzer basically groups the use of character filters, a tokenizer and term filters. Since more than one character filter and more than one term filter can be supplied, they are accepted as arrays, i.e. the “char_filter” and “filter” fields inside a custom analyzer take their values in square brackets, “[” and “]”.
"analyzer": {
"description_analyzer": {
"type": "custom",
"char_filter": [ "html_strip", "symbol_to_word" ],
"tokenizer": "standard",
"filter": [ "uppercase", "the_sgn_filter", "eng_hindi_syn_filter" ]
},
"name_analyzer": {
"type": "custom",
"char_filter": [ "symbol_to_word" ],
"tokenizer": "standard",
"filter": [ "uppercase", "engram_filter" ]
},
"email_analyzer": {
"type": "custom",
"char_filter": [ "html_strip" ],
"tokenizer": "semicolon_tokenizer",
"filter": [ "lowercase", "trim" ]
}
}
Here, I am defining 3 custom analyzers. Every custom analyzer is defined with the type “custom”. Then we specify which char_filter(s), which tokenizer and which term filter(s) to use. I have defined one analyzer to apply to the description field, called description_analyzer. It uses 2 character filters: html_strip, which removes HTML markup and entities, and symbol_to_word, which converts certain characters to specific English words, e.g. ! is converted to not and < is converted to less than. It uses standard as the tokenizer, which breaks the text on whitespace and punctuation, but we can use a custom tokenizer as well, like the custom tokenizer (semicolon_tokenizer) used in the email_analyzer. The semicolon_tokenizer tokenizes the string on the ; character instead of on whitespace. Term filters like lowercase, uppercase etc. convert the case of the terms.
Let’s define the schema for a type and apply the analyzers we have defined so far. The index I created was named members_idx, and the type I am going to create is members. In RDBMS parlance, we can think of an index as a kind of database and a type as a kind of table. We define a schema by doing an HTTP PUT on the index URI. The request below puts everything we discussed so far together, including the settings part and the mappings part –
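A sketch of that request is below. The index and analysis sub-blocks are collapsed to { } because they are exactly the pieces shown earlier, and the field options simply follow the discussion that comes next (treat the exact values as illustrative) –
PUT /members_idx
{
  "settings": {
    "index": { },
    "analysis": { }
  },
  "mappings": {
    "members": {
      "_all": { "enabled": true },
      "_timestamp": { "enabled": true },
      "_ttl": { "enabled": true },
      "properties": {
        "name": {
          "type": "string",
          "index_analyzer": "name_analyzer",
          "search_analyzer": "name_analyzer"
        },
        "description": {
          "type": "string",
          "analyzer": "description_analyzer"
        },
        "email_address": {
          "type": "string",
          "analyzer": "email_analyzer"
        },
        "phone": {
          "type": "string",
          "index": "not_analyzed"
        },
        "state": {
          "type": "string",
          "copy_to": "my_meta"
        },
        "country": {
          "type": "string",
          "copy_to": "my_meta"
        },
        "my_meta": {
          "type": "string"
        }
      }
    }
  }
}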
The mappings block above defines the schema of the members type. It defines each field along with its data type and analyzer; for example, looking at the description field above, its data type is set to string and the analyzer we want to execute on this field is description_analyzer.
Any field within the mappings section whose index setting is set to not_analyzed is stored in the index, but no analysis is done on it. If the index setting is set to no, then the field is not added to the index at all and cannot be searched. The only other possible value for the index setting is analyzed, which enables storage as well as analysis. The phone field above is not_analyzed in our case.
We can specify a copy_to setting on any field. The idea of copy_to is to copy the content of that field into a destination field of the same data type. In our example above, the fields state and country are copied to a new field called my_meta. We can query on the my_meta field as well, as shown below.
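For example, a simple match query on the copied field (the search term here is just for illustration) could look like this –
GET /members_idx/members/_search
{
  "query": {
    "match": {
      "my_meta": "India"
    }
  }
}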
We defined a field email_address of data type string. The data in this field may look like this – “m_jackson@example.com; i_jones@anotherexample.com; j_richter@example.com; v_pathak@yetanotherexample.com”. We have applied the analyzer email_analyzer to this field. The expected way for this analyzer to work on this email data is to tokenize the string on “;” and then convert the terms to lowercase and trim the spaces around each term. The tokenized email addresses are then stored in the index.
There are special meta fields that Elastic Search offers, and they start with an underscore character _. I have enabled some of these special fields, like _all, _timestamp and _ttl. The _all field stores content from all the fields and is searched when a query doesn’t specify which field to search. The _timestamp field is a special field as well; it is populated by Elastic Search with the current time when a document is indexed. We can override this behavior by specifying the field from which the value should be picked up instead of the current time. The _ttl field defines the maximum age of a document, which is useful for expiring documents automatically based on time.
Finally, Elastic Search provides us the capability to define different analyzers for index time and for query time. This is what we used in the definition of our name field. We can ask Elastic Search to execute one analyzer when a document is indexed and another analyzer when a document is queried. The settings index_analyzer and search_analyzer are used to specify these analyzers, though in our example I specified the same analyzer in both of these settings. For the name field, the name_analyzer uses the edgeNGram filter, which enables type-ahead kind of functionality. For the name Indiana, the edgeNGram filter will generate terms of increasing length that start with the first letter of the word, i.e. “In”, “Ind”, “Indi”, “India”, “Indian” and “Indiana”. Therefore, if “Ind” is typed during a type-ahead search, it will match “Indiana”.
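As a sketch, a type-ahead style query against the name field would simply be a match query on the partial input –
GET /members_idx/members/_search
{
  "query": {
    "match": {
      "name": "Ind"
    }
  }
}
Because the indexed terms for “Indiana” include the edge nGram “Ind”, this partial input finds the document.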