1. Requirement:
We have the raw text "电阻@CAL-CHIP@55@55w@±1%@330Ω@1/8W@55℃@xiaolong(电脑)@3000HF". Running it through the built-in analyzer produces the following tokens:
{
  "tokens": [
    {
      "token": "电",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "阻",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "cal",
      "start_offset": 3,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "chip",
      "start_offset": 7,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "55",
      "start_offset": 12,
      "end_offset": 14,
      "type": "<NUM>",
      "position": 4
    },
    {
      "token": "55w",
      "start_offset": 15,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "1",
      "start_offset": 20,
      "end_offset": 21,
      "type": "<NUM>",
      "position": 6
    },
    {
      "token": "330ω",
      "start_offset": 23,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "1",
      "start_offset": 28,
      "end_offset": 29,
      "type": "<NUM>",
      "position": 8
    },
    {
      "token": "8w",
      "start_offset": 30,
      "end_offset": 32,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "55",
      "start_offset": 33,
      "end_offset": 35,
      "type": "<NUM>",
      "position": 10
    },
    {
      "token": "xiaolong",
      "start_offset": 37,
      "end_offset": 45,
      "type": "<ALPHANUM>",
      "position": 11
    },
    {
      "token": "电",
      "start_offset": 46,
      "end_offset": 47,
      "type": "<IDEOGRAPHIC>",
      "position": 12
    },
    {
      "token": "脑",
      "start_offset": 47,
      "end_offset": 48,
      "type": "<IDEOGRAPHIC>",
      "position": 13
    },
    {
      "token": "3000hf",
      "start_offset": 50,
      "end_offset": 56,
      "type": "<ALPHANUM>",
      "position": 14
    }
  ]
}
As the output shows, the text is split into tokens at the symbols @, -, ±, %, / and ℃, and Chinese text is tokenized character by character. The goal is to keep this tokenization behavior unchanged, with one exception: the ℃ symbol must not be stripped. In other words, the ideal result looks like this:
........
{
  "token": "8w",
  "start_offset": 30,
  "end_offset": 32,
  "type": "<ALPHANUM>",
  "position": 9
},
{
  "token": "55℃",
  "start_offset": 33,
  "end_offset": 36,
  "type": "<NUM>",
  "position": 10
},
{
  "token": "xiaolong",
  "start_offset": 37,
  "end_offset": 45,
  "type": "<ALPHANUM>",
  "position": 11
},
.....
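For reference: the token types <IDEOGRAPHIC>, <ALPHANUM> and <NUM> in the output above are what Elasticsearch's standard analyzer emits. Assuming that is the built-in analyzer used here, the output can be reproduced with an _analyze request like this:

POST _analyze
{
  "analyzer": "standard",
  "text": "电阻@CAL-CHIP@55@55w@±1%@330Ω@1/8W@55℃@xiaolong(电脑)@3000HF"
}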
2. Implementation with a custom analyzer:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "combined_tokenizer",
          "char_filter": ["chinese_space_char_filter"]
        }
      },
      "tokenizer": {
        "combined_tokenizer": {
          "type": "pattern",
          "pattern": [
            "-|@|,|!|?|=|/|±| |(|)|?"
          ]
        }
      },
      "char_filter": {
        "chinese_space_char_filter": {
          "type": "pattern_replace",
          "pattern": "([\\u4e00-\\u9fa5])",
          "replacement": " $1 "
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "parameterSplicing": {
        "type": "text",
        "analyzer": "custom_analyzer",
        "index": true,
        "store": false
      }
    }
  }
}
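Before wiring these components into an index, they can be tried out directly: the _analyze API accepts inline tokenizer and char_filter definitions. A minimal sketch (the pattern and the filter are copied verbatim from the settings above, with pattern passed as a plain string; the test text is a shortened sample):

POST _analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": "-|@|,|!|?|=|/|±| |(|)|?"
  },
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "([\\u4e00-\\u9fa5])",
      "replacement": " $1 "
    }
  ],
  "text": "电阻@55℃@xiaolong(电脑)"
}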
Using this custom analyzer, we analyze the following (slightly extended) test text:
{
  "analyzer": "custom_analyzer",
  "text": "电阻@CAL-CHIP@55@55w@±1%@330Ω@1/8W@55℃@xiaolong(小龙牌电脑)@3000HF"
}
Tokenization result:
{
  "tokens": [
    {
      "token": "电",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "阻",
      "start_offset": 1,
      "end_offset": 1,
      "type": "word",
      "position": 1
    },
    {
      "token": "CAL",
      "start_offset": 3,
      "end_offset": 6,
      "type": "word",
      "position": 2
    },
    {
      "token": "CHIP",
      "start_offset": 7,
      "end_offset": 11,
      "type": "word",
      "position": 3
    },
    {
      "token": "55",
      "start_offset": 12,
      "end_offset": 14,
      "type": "word",
      "position": 4
    },
    {
      "token": "55w",
      "start_offset": 15,
      "end_offset": 18,
      "type": "word",
      "position": 5
    },
    {
      "token": "1%",
      "start_offset": 20,
      "end_offset": 22,
      "type": "word",
      "position": 6
    },
    {
      "token": "330Ω",
      "start_offset": 23,
      "end_offset": 27,
      "type": "word",
      "position": 7
    },
    {
      "token": "1",
      "start_offset": 28,
      "end_offset": 29,
      "type": "word",
      "position": 8
    },
    {
      "token": "8W",
      "start_offset": 30,
      "end_offset": 32,
      "type": "word",
      "position": 9
    },
    {
      "token": "55℃",
      "start_offset": 33,
      "end_offset": 36,
      "type": "word",
      "position": 10
    },
    {
      "token": "xiaolong",
      "start_offset": 37,
      "end_offset": 45,
      "type": "word",
      "position": 11
    },
    {
      "token": "小",
      "start_offset": 46,
      "end_offset": 46,
      "type": "word",
      "position": 12
    },
    {
      "token": "龙",
      "start_offset": 47,
      "end_offset": 47,
      "type": "word",
      "position": 13
    },
    {
      "token": "牌",
      "start_offset": 48,
      "end_offset": 48,
      "type": "word",
      "position": 14
    },
    {
      "token": "电",
      "start_offset": 49,
      "end_offset": 49,
      "type": "word",
      "position": 15
    },
    {
      "token": "脑",
      "start_offset": 50,
      "end_offset": 50,
      "type": "word",
      "position": 16
    },
    {
      "token": "3000HF",
      "start_offset": 53,
      "end_offset": 59,
      "type": "word",
      "position": 17
    }
  ]
}
As you can see, 55℃ is preserved intact as a single token.
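The payoff shows up at query time: a match query analyzes its query string with the same analyzer as the target field, so a search for 55℃ now produces the single token 55℃ and can match exactly. A minimal sketch, assuming documents have already been indexed into an index created with the settings above (the index name component_index is hypothetical):

# component_index is a hypothetical index created with the settings shown above
GET /component_index/_search
{
  "query": {
    "match": {
      "parameterSplicing": "55℃"
    }
  }
}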
3. Explanation:
1. The settings part
- The analysis node: the core node for all analyzer-related configuration. It is where the individual analysis components are defined: analyzers (analyzer), tokenizers (tokenizer) and character filters (char_filter).
- The analyzer node (custom analyzer definition):
  - custom_analyzer: the name of our custom analyzer. Its type is custom, meaning its behavior is assembled from individual components (a tokenizer, character filters, and so on).
  - Component wiring: it uses the tokenizer named combined_tokenizer together with the character filter named chinese_space_char_filter. When text passes through this analyzer, it is first pre-processed by chinese_space_char_filter (character filters always run before the tokenizer), and the filtered text is then split into tokens by combined_tokenizer (both are described in detail below).
- The tokenizer node (tokenizer definition):
  - combined_tokenizer: defines a tokenizer of type pattern, i.e. one that splits text based on a regular expression.
  - The pattern setting: its value is -|@|,|!|?|=|/|±| |(|)|?, a regular expression that lists the delimiter symbols as alternatives ("or"). Whenever any of "-", "@", ",", "!", "?", "=", "/", "±", a space, "(" or ")" appears in the text, the tokenizer splits the text at that position. For example, the text "电阻@CAL-CHIP@0805@±1%@330Ω@1/8W@55℃" is broken apart at each of these symbols, e.g. each @ separates the individual segments (see the sketch just after this list).
- The char_filter node (character filter definition):
  - chinese_space_char_filter: defines a character filter of type pattern_replace, used here to pre-process the Chinese characters in the text.
  - The pattern and replacement settings: the pattern value ([\\u4e00-\\u9fa5]) matches any single Chinese character via its Unicode range and captures it in a group; the replacement value " $1 " puts one space before and one space after the captured character. After this step every Chinese character in the text stands on its own, surrounded by spaces, which makes the text structure clearer and the subsequent tokenization, indexing and query matching behave predictably.
One point deserves special emphasis here: the reason Chinese ends up tokenized character by character is that the character filter puts a space before and after every Chinese character before tokenization happens, and the tokenizer's pattern includes the space as a delimiter. Splitting on those spaces therefore yields exactly one token per Chinese character.
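To make the character filter's effect visible in isolation, the following sketch pairs it with the built-in keyword tokenizer, which returns the (filtered) input as one single token, so the output shows exactly where the spaces were inserted. The filter definition is copied verbatim from the settings above; the test text is arbitrary:

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "([\\u4e00-\\u9fa5])",
      "replacement": " $1 "
    }
  ],
  "text": "电阻@55℃"
}

If everything works as described, the single returned token should come back roughly as " 电  阻 @55℃", with the inserted spaces plainly visible.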