Elasticsearch (ES) custom analyzers: tokenizing on special symbols, or keeping special symbols after tokenization

Published: 2024-12-22

1. The requirement:

Suppose we have the raw text "电阻@CAL-CHIP@55@55w@±1%@330Ω@1/8W@55℃@xiaolong(电脑)@3000HF". Running it through the built-in (standard) analyzer produces the result below.
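
A minimal sketch of that _analyze call, assuming the default standard analyzer (Kibana Dev Tools syntax; no index is needed):

# analyze the raw text with the built-in standard analyzer
GET /_analyze
{
    "analyzer": "standard",
    "text": "电阻@CAL-CHIP@55@55w@±1%@330Ω@1/8W@55℃@xiaolong(电脑)@3000HF"
}

The response: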

{
    "tokens": [
        {
            "token": "电",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "阻",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "cal",
            "start_offset": 3,
            "end_offset": 6,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "chip",
            "start_offset": 7,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "55",
            "start_offset": 12,
            "end_offset": 14,
            "type": "<NUM>",
            "position": 4
        },
        {
            "token": "55w",
            "start_offset": 15,
            "end_offset": 18,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "1",
            "start_offset": 20,
            "end_offset": 21,
            "type": "<NUM>",
            "position": 6
        },
        {
            "token": "330ω",
            "start_offset": 23,
            "end_offset": 27,
            "type": "<ALPHANUM>",
            "position": 7
        },
        {
            "token": "1",
            "start_offset": 28,
            "end_offset": 29,
            "type": "<NUM>",
            "position": 8
        },
        {
            "token": "8w",
            "start_offset": 30,
            "end_offset": 32,
            "type": "<ALPHANUM>",
            "position": 9
        },
        {
            "token": "55",
            "start_offset": 33,
            "end_offset": 35,
            "type": "<NUM>",
            "position": 10
        },
        {
            "token": "xiaolong",
            "start_offset": 37,
            "end_offset": 45,
            "type": "<ALPHANUM>",
            "position": 11
        },
        {
            "token": "电",
            "start_offset": 46,
            "end_offset": 47,
            "type": "<IDEOGRAPHIC>",
            "position": 12
        },
        {
            "token": "脑",
            "start_offset": 47,
            "end_offset": 48,
            "type": "<IDEOGRAPHIC>",
            "position": 13
        },
        {
            "token": "3000hf",
            "start_offset": 50,
            "end_offset": 56,
            "type": "<ALPHANUM>",
            "position": 14
        }
    ]
}

As you can see, the result is split on the symbols @ - ± % / and the like, and any Chinese text is broken into individual characters. What I need is to keep this splitting behavior unchanged, except that the ℃ symbol must not be stripped. In other words, the ideal result looks like this:

........
        {
            "token": "8w",
            "start_offset": 30,
            "end_offset": 32,
            "type": "<ALPHANUM>",
            "position": 9
        },
        {
            "token": "55℃",
            "start_offset": 33,
            "end_offset": 35,
            "type": "<NUM>",
            "position": 10
        },
        {
            "token": "xiaolong",
            "start_offset": 37,
            "end_offset": 45,
            "type": "<ALPHANUM>",
            "position": 11
        },
.....

2. Implementing it with a custom analyzer:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "combined_tokenizer",
          "char_filter": ["chinese_space_char_filter"]
        }
      },
      "tokenizer": {
        "combined_tokenizer": {
          "type": "pattern",
          "pattern": [
            "-|@|,|!|?|=|/|±| |(|)|?"
          ]
        }
      },
      "char_filter": {
        "chinese_space_char_filter": {
          "type": "pattern_replace",
          "pattern": "([\\u4e00-\\u9fa5])",
          "replacement": " $1 "
        }
      }
    }
  },
  "mappings": {
      "properties": {
        "parameterSplicing": {
          "type": "text",
          "analyzer": "custom_analyzer",
          "index": true,
          "store": false
        }
      }
  }
}
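
To apply this configuration, send it as the body of an index-creation request. A sketch, assuming a hypothetical index name my_custom_index:

# create the index; paste the full settings/mappings JSON shown above as the body
PUT /my_custom_index
{
    "settings": { ... },
    "mappings": { ... }
}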

Using this analyzer, we run the following _analyze test:

{
    "analyzer": "custom_analyzer",
    "text": "电阻@CAL-CHIP@55@55w@±1%@330Ω@1/8W@55℃@xiaolong(小龙牌电脑)@3000HF"
}



Tokenization result:
{
    "tokens": [
        {
            "token": "电",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 0
        },
        {
            "token": "阻",
            "start_offset": 1,
            "end_offset": 1,
            "type": "word",
            "position": 1
        },
        {
            "token": "CAL",
            "start_offset": 3,
            "end_offset": 6,
            "type": "word",
            "position": 2
        },
        {
            "token": "CHIP",
            "start_offset": 7,
            "end_offset": 11,
            "type": "word",
            "position": 3
        },
        {
            "token": "55",
            "start_offset": 12,
            "end_offset": 14,
            "type": "word",
            "position": 4
        },
        {
            "token": "55w",
            "start_offset": 15,
            "end_offset": 18,
            "type": "word",
            "position": 5
        },
        {
            "token": "1%",
            "start_offset": 20,
            "end_offset": 22,
            "type": "word",
            "position": 6
        },
        {
            "token": "330Ω",
            "start_offset": 23,
            "end_offset": 27,
            "type": "word",
            "position": 7
        },
        {
            "token": "1",
            "start_offset": 28,
            "end_offset": 29,
            "type": "word",
            "position": 8
        },
        {
            "token": "8W",
            "start_offset": 30,
            "end_offset": 32,
            "type": "word",
            "position": 9
        },
        {
            "token": "55℃",
            "start_offset": 33,
            "end_offset": 36,
            "type": "word",
            "position": 10
        },
        {
            "token": "xiaolong",
            "start_offset": 37,
            "end_offset": 45,
            "type": "word",
            "position": 11
        },
        {
            "token": "小",
            "start_offset": 46,
            "end_offset": 46,
            "type": "word",
            "position": 12
        },
        {
            "token": "龙",
            "start_offset": 47,
            "end_offset": 47,
            "type": "word",
            "position": 13
        },
        {
            "token": "牌",
            "start_offset": 48,
            "end_offset": 48,
            "type": "word",
            "position": 14
        },
        {
            "token": "电",
            "start_offset": 49,
            "end_offset": 49,
            "type": "word",
            "position": 15
        },
        {
            "token": "脑",
            "start_offset": 50,
            "end_offset": 50,
            "type": "word",
            "position": 16
        },
        {
            "token": "3000HF",
            "start_offset": 53,
            "end_offset": 59,
            "type": "word",
            "position": 17
        }
    ]
}

As you can see, 55℃ is now preserved intact as a single token.
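
Because the same analyzer is applied to query text by default, the preserved token is now directly matchable. A minimal sketch of such a search, assuming the hypothetical my_custom_index from above with a document indexed into the parameterSplicing field:

# the query text "55℃" analyzes to the single token "55℃", matching the indexed token
GET /my_custom_index/_search
{
    "query": {
        "match": {
            "parameterSplicing": "55℃"
        }
    }
}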

3. Explanation:

1. The settings section

  • The analysis node: the core of all analysis-related configuration, where components such as analyzers (analyzer), tokenizers (tokenizer), and character filters (char_filter) are defined.
    • The analyzer node (custom analyzer definition)
      • custom_analyzer: the name of our custom analyzer. Its type is custom, meaning we assemble its behavior ourselves from individual components (a tokenizer, character filters, and so on).
      • Component wiring: it uses the tokenizer named combined_tokenizer together with the character filter named chinese_space_char_filter. When text passes through this analyzer, chinese_space_char_filter pre-processes the text first (character filters always run before the tokenizer), and the result is then split into tokens by combined_tokenizer (details below).
    • The tokenizer node (tokenizer definition)
      • combined_tokenizer: defines a tokenizer named combined_tokenizer of type pattern, i.e. it splits text according to a regular expression.
      • The pattern setting: its value is -|@|,|!|?|=|/|±| |(|)|?, a regular expression listing the delimiter symbols as alternatives. Whenever any of the symbols - @ , ! ? = / ± ( ) or a space appears in the text, the tokenizer splits the text at that position. For example, the text "电阻@CAL-CHIP@0805@±1%@330Ω@1/8W@55℃" is pulled apart at each @ and at the other listed symbols (see the sketch after this list for a way to test the tokenizer in isolation).
    • The char_filter node (character filter definition)
      • chinese_space_char_filter: defines a character filter of type pattern_replace, used here to pre-process the Chinese characters in the text.
      • The pattern and replacement settings: the pattern value ([\\u4e00-\\u9fa5]) matches any single Chinese character by its Unicode range and captures it in a group; the replacement value " $1 " wraps each captured character in a leading and a trailing space. This way, before tokenization every Chinese character is surrounded by spaces, which makes the text structure clearer and lets each character be separated out for indexing and query matching.
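
As referenced above, a pattern tokenizer can be exercised on its own by defining it inline in an _analyze request; a sketch of that call:

# inline tokenizer definition; no index required
GET /_analyze
{
    "tokenizer": {
        "type": "pattern",
        "pattern": "-|@|,|!|?|=|/|±| |(|)|?"
    },
    "text": "电阻@CAL-CHIP@0805@±1%@330Ω@1/8W@55℃"
}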



A point worth emphasizing here: the reason Chinese text ends up tokenized character by character is that the character filter inserts a space before and after every Chinese character before tokenization takes place. Since the tokenizer's pattern includes the space character as a delimiter, each Chinese character then becomes its own token.
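
To see exactly what the character filter contributes, it can be paired with the built-in keyword tokenizer, which emits the whole input as a single token and therefore makes the inserted spaces visible. A minimal sketch:

# keyword tokenizer keeps the full string, exposing the spaces added by the char filter
GET /_analyze
{
    "tokenizer": "keyword",
    "char_filter": [
        {
            "type": "pattern_replace",
            "pattern": "([\\u4e00-\\u9fa5])",
            "replacement": " $1 "
        }
    ],
    "text": "小龙牌电脑"
}

This should return the single token " 小  龙  牌  电  脑 ", confirming that every Chinese character is wrapped in spaces before the pattern tokenizer (whose alternatives include the space) splits them into individual tokens.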