Improving search relevance in Sphinx

Sphinx is a search engine for fast full-text search. It can pull data from MySQL, Oracle, and MS SQL, and can also serve as a data store in its own right (real-time indexes). Sphinx can be queried through its native API or through SphinxQL, an analog of the SQL protocol (with some restrictions) that lets you wire Sphinx search into a website with minimal code changes. It is one of the few great, large open-source projects developed in Russia. I have personally seen Sphinx handle 100-200 search queries per second over 2 million MySQL records while the server breathed freely, whereas MySQL on a similar config starts to die at just 10 such queries per second.

The main problem with the Sphinx documentation, in my opinion, is the small number of examples for the most interesting settings, so today I will try to cover them with examples. The options I will touch on concern the algorithms and variations of search. Those who already work closely with Sphinx will not learn anything new, but newcomers will hopefully be able to improve the quality of search on their sites.

Sphinx consists of two independent programs: indexer and searchd. The first builds indexes from data taken out of the database; the second searches the indexes the first has built. And now, on to the search settings in Sphinx.
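Roughly, a minimal sphinx.conf ties the two together: a source section tells indexer where to pull data from, an index section tells it where to put the result, and a searchd section tells the daemon where to listen. A sketch, where all names, paths, and credentials below are made-up placeholders:

source articles_src
{
	type		= mysql
	sql_host	= localhost
	sql_user	= sphinx
	sql_pass	= secret
	sql_db		= site
	# the first column must be a unique document id
	sql_query	= SELECT id, title, body FROM articles
}

index articles
{
	source	= articles_src
	path	= /usr/local/sphinx/data/articles
}

searchd
{
	listen		= 9312			# native API
	listen		= 9306:mysql41	# SphinxQL
	pid_file	= /usr/local/sphinx/log/searchd.pid
}

With that in place, indexer --all builds the indexes and searchd starts answering queries.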

morphology

Specifies the morphology applied to words; I only use stemming. A stemming algorithm uses a set of rules for a given language to cut off endings and suffixes. Stemming does not use a prebuilt dictionary of words, only language-specific trimming rules, which makes it small and fast, but that is also its weakness: it can make mistakes.

An example of stemming normalization in Russian:
The case forms «яблоко», «яблока», and «яблоку» are all truncated to the stem «яблок», and any search query containing a form of the word will be normalized the same way and will match records with the forms above.

For English, the words "dogs" and "dog" are normalized to "dog".
Note that the stem stored in the index need not be a real word: if, say, the word "curly" gets into the index, what is actually stored is its stem, and queries for other forms of the same word will be normalized to that stem and will match it.
Enable stemming for English, Russian, or both languages:

morphology = stem_en
morphology = stem_ru
morphology = stem_enru

You can also use Soundex and Metaphone; they work for English and normalize words by how they sound. I don't use these morphology algorithms myself, so if anyone knows them well, I would be glad to read about it. Such an algorithm reduces a word and a differently spelled variant that sounds the same to a single normalized form based on sound and pronunciation.

morphology = stem_enru, soundex, metaphone

You can also plug in external morphology engines or write your own.

Wordforms


Lets you map words or word forms to one another; this works well on specialized niche sites. There is a good example in the documentation:

core 2 duo > c2d
e6600 > c2d
core 2duo > c2d

This will let users find an article about the Core 2 Duo using anything from the model number to variations of its name.

hemp > pot
dope > pot
my precious > pot
grass of freedom > pot

And this dictionary will let your users easily find information about pot on your site.

Word forms are kept in files in ispell or MySpell dictionary format (MySpell dictionaries can be created in OpenOffice):

wordforms = /usr/local/sphinx/data/wordforms.txt
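For context, a sketch of how this fits into an index section (source name and paths are placeholders). Note that a word matched by a wordforms rule is replaced before stemming, so the right-hand side of the mapping is what actually lands in the index:

index products
{
	source		= products_src
	path		= /usr/local/sphinx/data/products
	morphology	= stem_enru
	wordforms	= /usr/local/sphinx/data/wordforms.txt
}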

enable_star


Allows the use of asterisks in queries: for example, the query *pr* will match "prospectus", "approximation", and other words containing "pr".

enable_star = 1
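One caveat worth showing: the stars only work against an index that was built with prefix or infix indexing enabled, so in practice enable_star goes together with min_prefix_len or min_infix_len (described below). A sketch with placeholder names:

index articles
{
	source			= articles_src
	path			= /usr/local/sphinx/data/articles
	min_infix_len	= 3
	enable_star		= 1
}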

expand_keywords


Automatically expands the search query into three subqueries:

running -> ( running | *running* | =running )

That is: the word subject to morphology, the word wrapped in asterisks, and an exact word match. This option didn't exist before; to search with stars you had to issue an extra query by hand, and now one option covers it all. It's simply great: an exact match ranks higher in the results than the starred match and the morphological match.

expand_keywords = 1

index_exact_words


Stores the original word in the index alongside its morphologically normalized form. This noticeably increases index size, but together with the previous option it yields more relevant results.

For example, suppose three records contain different grammatical forms of the word "melon". Without this option all three words are stored in the index as the same stem, and for a query with one particular form the records come back simply in the order they were added to the index.
With expand_keywords and index_exact_words enabled, however, the record containing the exact queried form is ranked first.

index_exact_words = 1
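The two options pull together like this (a sketch with placeholder names; min_infix_len is included because the *word* part of the expansion only matches if substrings were indexed):

index articles
{
	source				= articles_src
	path				= /usr/local/sphinx/data/articles
	morphology			= stem_enru
	min_infix_len		= 3
	index_exact_words	= 1
	expand_keywords		= 1
}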

min_infix_len


Indexes substrings of words (infixes), which can then be found using *, as in search*, *search, and *search*.
For example, with min_infix_len = 2, when the word "test" gets into the index, the index also stores "te", "es", "st", "tes", and "est", so the query *es* finds the word.

I usually use

min_infix_len = 3

A smaller value generates too much garbage; also remember that this option greatly increases index size.

min_prefix_len


A sibling of min_infix_len that does essentially the same thing, except it stores only the beginnings of words (prefixes).
For example, with min_prefix_len = 2, indexing the word "test" stores "te", "tes", and "test", and the query te* finds the word.

min_prefix_len = 3

min_word_len



The minimum word length to index; it defaults to 1, so all words are indexed.
I usually use

min_word_len = 3

Shorter words usually carry no meaning.

html_strip


Strips out all HTML tags and HTML comments. This option matters if you are building your own Google on top of Sphinx: you launch a spider, it parses sites into a database, you run indexer over that, and this option strips away the junk HTML tags so that only the page content is indexed.

Unfortunately I have not used it myself, but the documentation warns that it can misbehave on XML and non-standard HTML (stray opening and closing tags and the like).

html_strip = 1
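If you go down this road, the related options html_index_attrs and html_remove_elements are also worth knowing about; a sketch with placeholder names:

index crawled_pages
{
	source		= crawler_src
	path		= /usr/local/sphinx/data/crawled
	html_strip	= 1
	# keep the text of selected attributes despite stripping
	html_index_attrs	= img=alt,title; a=title
	# drop the contents of these elements entirely
	html_remove_elements	= style, script
}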

I will appreciate any questions and clarifications.
The official site is sphinxsearch.com.
If you found this interesting, please don't be lazy and upvote.
Article based on information from habrahabr.ru
