Elasticsearch (ES) ships with a number of built-in analyzers for tokenizing text. An analyzer is typically composed of zero or more character filters, exactly one tokenizer, and zero or more token filters. The most commonly used built-in analyzers are listed below, each with an example:
1. Standard Analyzer
- The default analyzer, and a reasonable choice for most languages.
- Processing steps:
  - The Standard Tokenizer splits the text on word boundaries, as defined by the Unicode Text Segmentation algorithm, and strips most punctuation.
  - The Lowercase Token Filter converts all tokens to lowercase.
- Example (a configuration sketch follows):
  Request:
  POST _analyze
  {
    "analyzer": "standard",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["the", "2", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog's", "bone"]
2. Simple Analyzer
- Splits the text on any non-letter character (digits, punctuation, whitespace) and lowercases the tokens; note below that the digit "2" is dropped and "dog's" is split into "dog" and "s". An equivalent custom analyzer is sketched after the example.
- Example:
  Request:
  POST _analyze
  {
    "analyzer": "simple",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["the", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog", "s", "bone"]
3. Whitespace Analyzer
- Splits only on whitespace; it does not lowercase tokens and does not strip punctuation (a mapping sketch follows the example).
- Example:
  Request:
  POST _analyze
  {
    "analyzer": "whitespace",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["The", "2", "QUICK", "Brown-Foxes", "jumped", "over", "the", "lazy", "dog's", "bone."]
4. Keyword Analyzer
- A "noop" analyzer: it emits the entire input as a single token, with no tokenization at all (see the mapping note after the example).
- Example:
  Request:
  POST _analyze
  {
    "analyzer": "keyword",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]
5. Stop Analyzer
- Like the simple analyzer, but additionally removes stop words (by default the English list: "the", "a", "an", "and", etc.); the list is configurable, as sketched after the example. Note that "over" survives because it is not in the default English list.
- Example:
  Request:
  POST _analyze
  {
    "analyzer": "stop",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["quick", "brown", "foxes", "jumped", "over", "lazy", "dog", "s", "bone"]
6. Pattern Analyzer
- Uses a regular expression that matches the token separators, not the tokens themselves. The default pattern is \W+ (split on runs of non-word characters), and tokens are lowercased by default.
- Example (a custom-pattern sketch follows):
  Request:
  POST _analyze
  {
    "analyzer": "pattern",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["the", "2", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog", "s", "bone"]
7. Language Analyzers
- A family of analyzers tuned for specific languages (english, french, german, and many more), typically adding language-specific stop word removal and stemming (a stemming-exclusion sketch follows the example).
- Example (english; note the stemmed forms "fox", "jump", and "lazi"):
  Request:
  POST _analyze
  {
    "analyzer": "english",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["2", "quick", "brown", "fox", "jump", "over", "lazi", "dog", "bone"]
8. ICU Analyzer
- Based on the ICU (International Components for Unicode) library, providing Unicode-aware tokenization and normalization for multilingual text. It is shipped as the analysis-icu plugin rather than with core Elasticsearch; the install command follows the example.
- Example:
  Request:
  POST _analyze
  {
    "analyzer": "icu_analyzer",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["the", "2", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog's", "bone"]
9. Fingerprint Analyzer
- Tokenizes and lowercases the text, then sorts the tokens, removes duplicates, and concatenates the result into a single "fingerprint" token; this is useful for duplicate detection (a configuration sketch follows the example).
- Example (note the output is one token, not one token per word):
  Request:
  POST _analyze
  {
    "analyzer": "fingerprint",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["2 bone brown dog's foxes jumped lazy over quick the"]
Summary
Elasticsearch's built-in analyzers cover a wide range of scenarios. Pick the one that matches your use case, or, when none fits, combine character filters, a tokenizer, and token filters into a custom analyzer, as sketched below.
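A minimal custom analyzer sketch that strips HTML, tokenizes with the standard tokenizer, then lowercases, ASCII-folds, and removes stop words; every component named here is a built-in, and only the index and analyzer names are placeholders:

PUT my-index-9
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "stop"]
        }
      }
    }
  }
}

Any field mapped with "analyzer": "my_custom" will then run this chain at both index and search time.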