Elasticsearch (ES) ships with a number of built-in analyzers for tokenizing text. An analyzer is typically composed of zero or more character filters, exactly one tokenizer, and zero or more token filters. The most commonly used built-in analyzers are listed below, each with an example:
1. Standard Analyzer
- The default analyzer, and a reasonable choice for most languages.
- Processing steps:
  - The Standard Tokenizer splits the text on word boundaries, as defined by the Unicode Text Segmentation algorithm, and strips most punctuation.
  - The Lowercase Token Filter converts all tokens to lowercase.
- Example (a configuration sketch follows):
  Request:
  POST _analyze
  {
    "analyzer": "standard",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["the", "2", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog's", "bone"]
2. Simple Analyzer
- Splits the text on any non-letter character (digits, punctuation, whitespace) and lowercases the tokens; note below that the digit "2" is dropped and "dog's" is split into "dog" and "s". An equivalent custom analyzer is sketched after the example.
- Example:
  Request:
  POST _analyze
  {
    "analyzer": "simple",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["the", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog", "s", "bone"]
3. Whitespace Analyzer
- Splits only on whitespace; it does not lowercase tokens and does not strip punctuation (a mapping sketch follows the example).
- Example:
  Request:
  POST _analyze
  {
    "analyzer": "whitespace",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["The", "2", "QUICK", "Brown-Foxes", "jumped", "over", "the", "lazy", "dog's", "bone."]
4. Keyword Analyzer
- A "noop" analyzer: it emits the entire input as a single token, with no tokenization at all (see the mapping note after the example).
- Example:
  Request:
  POST _analyze
  {
    "analyzer": "keyword",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]
5. Stop Analyzer
- Like the simple analyzer, but additionally removes stop words (by default the English list: "the", "a", "an", "and", etc.); the list is configurable, as sketched after the example. Note that "over" survives because it is not in the default English list.
- Example:
  Request:
  POST _analyze
  {
    "analyzer": "stop",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["quick", "brown", "foxes", "jumped", "over", "lazy", "dog", "s", "bone"]
6. Pattern Analyzer
- Uses a regular expression that matches the token separators, not the tokens themselves. The default pattern is \W+ (split on runs of non-word characters), and tokens are lowercased by default.
- Example (a custom-pattern sketch follows):
  Request:
  POST _analyze
  {
    "analyzer": "pattern",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["the", "2", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog", "s", "bone"]
7. Language Analyzers
- A family of analyzers tuned for specific languages (english, french, german, and many more), typically adding language-specific stop word removal and stemming (a stemming-exclusion sketch follows the example).
- Example (english; note the stemmed forms "fox", "jump", and "lazi"):
  Request:
  POST _analyze
  {
    "analyzer": "english",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["2", "quick", "brown", "fox", "jump", "over", "lazi", "dog", "bone"]
8. ICU Analyzer
- Based on the ICU (International Components for Unicode) library, providing Unicode-aware tokenization and normalization for multilingual text. It is shipped as the analysis-icu plugin rather than with core Elasticsearch; the install command follows the example.
- Example:
  Request:
  POST _analyze
  {
    "analyzer": "icu_analyzer",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["the", "2", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog's", "bone"]
9. Fingerprint Analyzer
- Tokenizes and lowercases the text, then sorts the tokens, removes duplicates, and concatenates the result into a single "fingerprint" token; this is useful for duplicate detection (a configuration sketch follows the example).
- Example (note the output is one token, not one token per word):
  Request:
  POST _analyze
  {
    "analyzer": "fingerprint",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
  Output:
  ["2 bone brown dog's foxes jumped lazy over quick the"]
Summary
Elasticsearch's built-in analyzers cover a wide range of scenarios. Pick the one that matches your use case, or, when none fits, combine character filters, a tokenizer, and token filters into a custom analyzer, as sketched below.
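A minimal custom analyzer sketch that strips HTML, tokenizes with the standard tokenizer, then lowercases, ASCII-folds, and removes stop words; every component named here is a built-in, and only the index and analyzer names are placeholders:

PUT my-index-9
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "stop"]
        }
      }
    }
  }
}

Any field mapped with "analyzer": "my_custom" will then run this chain at both index and search time.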