Solr stemming file
There are quite a few analysis components to choose from. Note: for good background on Lucene analysis, it's recommended that you read the relevant sections of Lucene in Action. Analyzers are documented in the Solr Reference Guide section Analyzers. Tokenizers are documented in the Solr Reference Guide section Tokenizers.

The ultimate decision depends largely on which Tokenizer you are using, and whether you need to "outsmart" it by preprocessing the stream of characters. For example, maybe you have a tokenizer such as StandardTokenizer and you are pretty happy with how it works overall, but you want to customize how some specific characters behave.

In such a situation you could modify the rules and re-build your own tokenizer with javacc, but perhaps it's easier to simply map some of the characters before tokenization with a CharFilter. The individual components are documented in the Solr Reference Guide: MappingCharFilterFactory, the Keyword Tokenizer, the Letter Tokenizer, and the White Space Tokenizer.
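As a sketch of the CharFilter approach (the field type and mapping file names here are illustrative), a MappingCharFilterFactory can be wired in ahead of the tokenizer in schema.xml:

```xml
<fieldType name="text_mapped" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Rewrite selected characters before the tokenizer ever sees them;
         mapping-chars.txt contains lines such as: "-" => " " -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chars.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the mapping happens at the character level, the tokenizer's own rules remain untouched; only the input it receives is adjusted.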

Also documented there: the Lower Case Tokenizer, the Standard Tokenizer, the Classic Tokenizer, the Regular Expression Pattern Tokenizer, and the ICU Tokenizer. Token filters are documented collectively under Filter Descriptions, with individual entries for the Classic Filter and the Lower Case Filter.

Further individual entries cover the Type Token Filter, the Trim Filter, and the Pattern Replace Filter. If I understand correctly, the SynonymFilterFactory does not stem synonyms in any way. I see that the SynonymFilterFactory has an optional argument where it can accept an analyzer. If analyzer is specified, then tokenizerFactory may not be, and vice versa.

I suspect that compiling a custom extension analyzer would work, but I would prefer to avoid that. Is there a way to define a named analyzer in configuration, or another method to accomplish this goal? The following does not answer my original question about how to do this via configuration only, but it is the solution I ended up using, in case anyone else wants to do the same.

First, write a custom analyzer that will be used to pre-process the synonyms coming in from the synonym filter (most importantly, stemming them with Snowball). This is packaged as a separate class placed on Solr's classpath. Next, make sure to tell Solr to use this analyzer in your synonym filter, usually in schema.xml.
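A sketch of the schema.xml side, assuming the custom analyzer class is named `com.example.StemmingSynonymAnalyzer` (a placeholder for your own implementation):

```xml
<fieldType name="text_keywords" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The 'analyzer' argument tells the synonym filter to run each
         synonyms.txt entry through the custom stemming analyzer -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            analyzer="com.example.StemmingSynonymAnalyzer" expand="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

Note that `analyzer` and `tokenizerFactory` are mutually exclusive, as mentioned above; specify only one of them.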

With this example, documents' keywords fields will be stemmed in the index. Queries matching the original exact term will get a better score while still maintaining the recall benefit of stemming. Another advantage of keeping the original token is that wildcard truncation will work as expected.

To configure, add the KeywordRepeatFilterFactory early in the analysis chain. It is recommended to also include RemoveDuplicatesTokenFilterFactory to avoid duplicates when tokens are not stemmed.
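A minimal analysis chain following that advice might look like this (tokenizer and stemmer choices are illustrative):

```xml
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Emit each token twice: one copy flagged as a keyword (left unstemmed),
       one copy that the stemmer below is free to modify -->
  <filter class="solr.KeywordRepeatFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
  <!-- Collapse the pair back to one token when stemming changed nothing -->
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```

With this chain, "running" is indexed as both "running" and "run", while "run" is indexed only once.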

Overrides stemming algorithms by applying a custom mapping, then protecting these terms from being modified by stemmers. A customized mapping of words to stems, in a tab-separated file, can be specified with the dictionary attribute in the schema. Words in this mapping will be stemmed to the stems from the file, and will not be further changed by any stemmer. This filter splits, or decompounds, compound words into individual words using a dictionary of the component words.
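A sketch of the stemmer-override configuration (the dictionary file name is illustrative; each line of the file is a tab-separated "word&lt;TAB&gt;stem" pair):

```xml
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- stemdict.txt might contain, e.g., "monkeys<TAB>monkey";
       matched words are mapped and marked as keywords so the
       stemmer below leaves them alone -->
  <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
```

Placing the override filter before the stemmer is essential: it both applies the custom mapping and sets the keyword flag that protects those tokens.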

Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is added to the stream at the same logical position. Assume that germanwords.txt contains the component words. In: "Donaudampfschiff dummkopf". Out: "Donaudampfschiff" (1), "Donau" (1), "dampf" (1), "schiff" (1), "dummkopf" (2), "dumm" (2), "kopf" (2). Unicode Collation is a language-sensitive method of sorting text that can also be used for advanced search purposes. The solr.CollationField and solr.ICUCollationField field type classes provide this functionality.
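The decompounding example above corresponds to a chain along these lines (the dictionary file name matches the germanwords.txt assumed in the text; the size limits are illustrative defaults):

```xml
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- germanwords.txt lists component words, one per line,
       e.g. Donau, dampf, schiff, dumm, kopf -->
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
          dictionary="germanwords.txt"
          minWordSize="5" minSubwordSize="4" maxSubwordSize="15"
          onlyLongestMatch="false"/>
</analyzer>
```

Because subwords share the position of the original compound, phrase queries against the original text continue to work.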

solr.ICUCollationField, which is backed by the ICU4J library, provides more flexible configuration, has more locales, is significantly faster, and requires less memory and less index space, since its keys are smaller than those produced by the JDK implementation that backs solr.CollationField.

To use solr.ICUCollationField, you must add additional .jars to Solr's classpath (it is distributed in the analysis-extras contrib). solr.ICUCollationField and solr.CollationField fields can be created in two ways: based on a system collator associated with a locale, or based on a tailored ruleset. The most important argument is the locale. Locales are typically defined as a combination of language and country, but you can specify just the language if you want. For example, if you specify "de" as the language, you will get sorting that works well for the German language. If you specify "de" as the language and "CH" as the country, you will get German sorting specifically tailored for Switzerland.

In the example above, we defined the strength as "primary". The strength of the collation determines how strict the sort order will be, but it also depends upon the language. For example, in English, "primary" strength ignores differences in case and accents. A type with "secondary" strength can be used for fields where the data contains Polish text: the "secondary" strength will still ignore case differences, but, unlike "primary" strength, a letter with diacritics will be sorted differently from the same base letter without diacritics.
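Declaring such collated field types is a one-liner each; a sketch (field type names are illustrative):

```xml
<!-- Case- and accent-insensitive German sorting -->
<fieldType name="collatedGERMAN" class="solr.ICUCollationField"
           locale="de" strength="primary"/>

<!-- Polish sorting that ignores case but respects diacritics -->
<fieldType name="collatedPOLISH" class="solr.ICUCollationField"
           locale="pl_PL" strength="secondary"/>
```

Fields of these types are then used purely for sorting (typically indexed but not stored), with copyField feeding them from the displayable text field.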

There are two approaches to supporting multiple languages: if there is a small list of languages you wish to support, consider defining collated fields for each language and using copyField. However, adding a large number of sort fields can increase disk and indexing costs. An alternative approach is to use the Unicode default collator.

To use the default locale, simply define the locale as the empty string. This Unicode default sort is still significantly more advanced than the standard Solr sort.
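A sketch of the default-locale declaration described above (the type name is illustrative):

```xml
<!-- Empty locale selects the Unicode root collator: reasonable,
     language-neutral ordering for mixed-language content -->
<fieldType name="collatedROOT" class="solr.ICUCollationField"
           locale="" strength="primary"/>
```

One field of this type can then serve as the sort field for all languages at once, avoiding the per-language field explosion mentioned above.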

You can define your own set of sorting rules. In the example below, we create a custom rule set for German called DIN 5007-2. This example shows how to create a custom rule set for solr.ICUCollationField and dump it to a file. The principles of JDK Collation are the same as those of ICU Collation; you just specify language, country, and variant arguments instead of the combined locale argument. Folding character differences in this way can increase recall by causing more matches. On the other hand, it can reduce precision, because language-specific character differences may be lost.

This filter converts any character in the Unicode "Decimal Number" general category (Nd) into its equivalent Basic Latin digit (0-9). In addition to these analysis components, Solr also provides an update request processor to extract named entities; see Update Processor Factories That Can Be Loaded as Plugins. To use the OpenNLP components, you must add additional .jars to Solr's classpath (they ship in the analysis-extras contrib). The OpenNLP Tokenizer takes two language-specific binary model files as parameters: a sentence detector model and a tokenizer model.

The last token in each sentence is flagged, so that following OpenNLP-based filters can use this information to apply operations to tokens one sentence at a time. See the OpenNLP website for information on downloading pre-trained models. A common use is to index only nouns: tag each token with its part of speech, then keep only tokens whose tags are noun types.
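A sketch of an OpenNLP chain that indexes only nouns, assuming the standard pre-trained English models (the model and type-list file names are placeholders for whatever you download and create):

```xml
<analyzer>
  <!-- Sentence-detector and tokenizer models from the OpenNLP site -->
  <tokenizer class="solr.OpenNLPTokenizerFactory"
             sentenceModel="en-sent.bin"
             tokenizerModel="en-token.bin"/>
  <!-- Writes each token's part-of-speech tag into its type attribute -->
  <filter class="solr.OpenNLPPOSFilterFactory"
          posTaggerModel="en-pos-maxent.bin"/>
  <!-- noun-types.txt lists the tags to keep, e.g. NN, NNS, NNP, NNPS -->
  <filter class="solr.TypeTokenFilterFactory"
          types="noun-types.txt" useWhitelist="true"/>
</analyzer>
```

The POS filter works sentence by sentence, which is why the tokenizer's end-of-sentence flagging matters.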

Usage of stemming in search increases recall. Pretty simple! Yes, it is, but not as simple as it seems. If stemming is applied only at index time, a query containing an inflected form will not match the stemmed terms in the index. So the first thing one should do is apply stemming at query time as well.

It will then return all the results as expected. Next, there are different stemming algorithms, resulting in different stemmers. We tried out the Porter and KStem stemmers that come with Solr, since we are only working with English text.
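A minimal sketch of applying the same stemmer on both sides, so indexed terms and query terms meet at the same stems (field type name and stemmer choice are illustrative):

```xml
<fieldType name="text_stemmed" class="solr.TextField" positionIncrementGap="100">
  <!-- Identical chains at index and query time: "running" in a document
       and "runs" in a query both reduce to "run" -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Swapping `solr.PorterStemFilterFactory` for `solr.KStemFilterFactory` in both analyzers is all it takes to compare the two stemmers.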

Each of these has its own advantages and limitations. In general, Porter gives good recall but lower precision. Porter and KStem also have certain limitations: Porter, being purely algorithmic, can produce stems that are not real words; similarly, KStem is not capable of finding lemmas for all words.

Synonyms

Matching synonyms is another way to increase the recall of a search. Depending on the context, synonyms could be generic synonyms at the language level or domain-specific synonym terms.

Apache Solr provides a flexible mechanism for synonym searches in the form of its SynonymFilter. Synonyms can be configured using a simple text file with a list of words (let us call them keywords) and their synonyms.

However, the flexibility comes with a price: a bit of complexity in understanding the mechanism! Here are two important options that contribute to its flexibility. Expand is false (the default): when there is no expansion, SynonymFilter collapses each group of synonyms to its keyword. What does that mean? Expand is true: when synonym expansion is set to true, SynonymFilter expands the word to all of its synonym words.
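The difference is easiest to see in the synonyms file itself. A sketch of the standard format (the entries are illustrative):

```text
# Comma-separated group. With expand=false, every term on the line is
# replaced by the first term ("gib"); with expand=true, any of the
# terms produces all of them.
gib,gigabyte,gigabytes

# Explicit mapping: the left-hand side is always replaced by the
# right-hand side, regardless of the expand setting.
teh => the
```

So expand=false normalizes synonyms down to one canonical token, while expand=true multiplies each matching token out into the whole group.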

When to apply SynonymFilter

SynonymFilter can be applied at index time, at query time, or at both index and query time.

Together with the expand option, this gives rise to a number of combinations. For example, SynonymFilter could be applied only at index time with the expand option set to false. Let us see what happens in this scenario. In most practical scenarios, however, this is not the case.


