1. Installing a Chinese Word Segmentation Plugin
IK Analysis for Elasticsearch is a popular open-source Chinese word segmentation plugin.
Official site: https://github.com/medcl/elasticsearch-analysis-ik
I originally tried to install it with these two methods:
.\bin\elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.4.2/elasticsearch-analysis-ik-5.4.2.zip
bin\elasticsearch-plugin install file:///C:\Users\jk\Desktop\elasticsearch-analysis-ik-5.4.2.zip
but both kept failing with:
ERROR: `elasticsearch` directory is missing in the plugin zip
In the end I had to install it manually: create a folder named ik under the plugins folder of the Elasticsearch installation directory and unzip elasticsearch-analysis-ik-5.4.2.zip into it. The official site actually states that versions below 5.5.1 should be installed this way, by unzipping.
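Concretely, the manual installation looks roughly like this (a sketch assuming Elasticsearch is installed at C:\elasticsearch-5.4.2; adjust the path to your own setup):
cd C:\elasticsearch-5.4.2\plugins
mkdir ik
rem extract elasticsearch-analysis-ik-5.4.2.zip into the new ik folder, then restart Elasticsearch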
Reference: Elasticsearch5.x安装IK分词器以及使用
Installing version 7.14.1 on Linux:
Download: https://github.com/medcl/elasticsearch-analysis-ik/releases
[hadoop@node01 ~]$ elasticsearch-7.14.1/bin/elasticsearch-plugin install file:///mnt/elasticsearch-analysis-ik-7.14.1.zip
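After installation, you can confirm the plugin was picked up with the standard elasticsearch-plugin list command (restart the node before using the new analyzers); the output should include the IK plugin, typically listed as analysis-ik:
[hadoop@node01 ~]$ elasticsearch-7.14.1/bin/elasticsearch-plugin list
analysis-ik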
The plugin provides two analyzers:
- ik_smart: performs the coarsest-grained segmentation
- ik_max_word: performs the finest-grained segmentation of the text
Test 1: ik_smart
GET /_analyze
{
"analyzer": "ik_smart",
"text":"中华人民共和国"
}
Result:
{
"tokens": [
{
"token": "中华人民共和国",
"start_offset": 0,
"end_offset": 7,
"type": "CN_WORD",
"position": 0
}
]
}
Test 2: ik_max_word
GET /_analyze
{
"analyzer": "ik_max_word",
"text":"中华人民共和国"
}
Result:
{
"tokens": [
{
"token": "中华人民共和国",
"start_offset": 0,
"end_offset": 7,
"type": "CN_WORD",
"position": 0
},
{
"token": "中华人民",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "中华",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 2
},
{
"token": "华人",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 3
},
{
"token": "人民共和国",
"start_offset": 2,
"end_offset": 7,
"type": "CN_WORD",
"position": 4
},
{
"token": "人民",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 5
},
{
"token": "共和国",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 6
},
{
"token": "共和",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 7
},
{
"token": "国",
"start_offset": 6,
"end_offset": 7,
"type": "CN_CHAR",
"position": 8
}
]
}
Test 3: ik_max_word on English text
GET /_analyze
{
"analyzer": "ik_max_word",
"text":"I love you"
}
Result:
{
"tokens" : [
{
"token" : "i",
"start_offset" : 0,
"end_offset" : 1,
"type" : "ENGLISH",
"position" : 0
},
{
"token" : "love",
"start_offset" : 2,
"end_offset" : 6,
"type" : "ENGLISH",
"position" : 1
},
{
"token" : "you",
"start_offset" : 7,
"end_offset" : 10,
"type" : "ENGLISH",
"position" : 2
}
]
}
Reference: https://blog.csdn.net/wenxindiaolong061/article/details/82562450
2. Elasticsearch's Built-in Analyzers
- standard analyzer
- simple analyzer
- whitespace analyzer
- language analyzer (language-specific analyzers)
Example sentence: Set the shape to semi-transparent by calling set_trans(5)
Results from the different analyzers:
- standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)
- simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
- whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
- language analyzer (language-specific, e.g. english, the English analyzer): set, shape, semi, transpar, call, set_tran, 5
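You can reproduce the language-analyzer line above with the _analyze API; for example, with the built-in english analyzer:
GET /_analyze
{
  "analyzer": "english",
  "text": "Set the shape to semi-transparent by calling set_trans(5)"
}
The english analyzer lowercases, removes stop words such as the and to, and stems the remaining terms, which is why transparent becomes transpar and calling becomes call.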
Testing the standard analyzer:
GET /_analyze
{
"analyzer": "standard",
"text":"The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone."
}
Result:
{
"tokens" : [
{
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "2",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "quick",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "brown",
"start_offset" : 12,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "foxes",
"start_offset" : 18,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "jumped",
"start_offset" : 24,
"end_offset" : 30,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "over",
"start_offset" : 31,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "the",
"start_offset" : 36,
"end_offset" : 39,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "lazy",
"start_offset" : 40,
"end_offset" : 44,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "dog’s",
"start_offset" : 45,
"end_offset" : 50,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "bone",
"start_offset" : 51,
"end_offset" : 55,
"type" : "<ALPHANUM>",
"position" : 10
}
]
}
As you can see, the input text was split on whitespace and non-letter characters: uppercase terms such as QUICK were lowercased, and stop words such as the were not removed. In the output, token is the term produced by segmentation, start_offset is the starting offset, end_offset is the ending offset, and position is the token's position.
Configurable options:
Option | Description |
---|---|
max_token_length | The maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255. |
stopwords | A pre-defined stop-word list such as _english_, or an array containing a list of stop words. Defaults to _none_. |
stopwords_path | The path to a file containing stop words. |
For example, a custom analyzer based on standard that uses these options (my_english_analyzer and my_index are placeholder names):
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_english_analyzer": {
"type": "standard",
"max_token_length": 5,
"stopwords": "_english_"
}
}
}
}
}
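You can then test the custom analyzer against the placeholder index; with max_token_length set to 5, a long token such as jumped is split into jumpe and d, and the English stop words are removed:
GET /my_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}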
Different analyzers produce different tokenization results. The built-in analyzers are listed below; broadly speaking, the built-in analyzers, Language Analyzers included, are not friendly to Chinese, so Chinese segmentation requires installing a separate analyzer.
Analyzer | Description | Input | Result |
---|---|---|---|
standard | The standard analyzer is the default analyzer, used if none is specified. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages. | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog’s, bone ] |
simple | The simple analyzer breaks text into tokens at any non-letter character (digits, whitespace, hyphens, apostrophes), discards the non-letter characters, and lowercases the terms. | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ] |
whitespace | The whitespace analyzer breaks text into terms whenever it encounters a whitespace character. | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog’s, bone. ] |
stop | The stop analyzer is the same as the simple analyzer but adds support for removing stop words. It defaults to the english stop words. | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ] |
keyword | Does not tokenize; the whole field is returned as a single term. | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.] |
pattern | The pattern analyzer uses a regular expression to split text into terms. The regular expression should match token separators, not the tokens themselves; it defaults to \W+ (all non-word characters). | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ] |
Language analyzers: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, etc. | A set of analyzers aimed at analyzing text in a specific language. | | |
The simplest Chinese analyzer to start with is the IK analyzer; there are also jieba, the Harbin Institute of Technology (HIT) segmenter, and others.
Analyzer | Description | Input | Result |
---|---|---|---|
ik_smart | The coarse-grained analyzer of the IK plugin; supports custom and remote dictionaries | 学如逆水行舟,不进则退 | [学如逆水行舟,不进则退] |
ik_max_word | The exhaustive analyzer of the IK plugin; supports custom and remote dictionaries | 学如逆水行舟,不进则退 | [学如逆水行舟,学如逆水,逆水行舟,逆水,行舟,不进则退,不进,则,退] |
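Both IK analyzers read their extension dictionaries from plugins/ik/config/IKAnalyzer.cfg.xml. A minimal sketch (the .dic file names are placeholders; a .dic file is a plain-text file with one word per line; local dictionary changes require a restart, while remote dictionaries are polled periodically):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- local extension dictionary and stop-word dictionary -->
    <entry key="ext_dict">custom/mydict.dic</entry>
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- remote dictionaries are URLs returning one word per line -->
    <!-- <entry key="remote_ext_dict">http://example.com/mydict.txt</entry> -->
</properties>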
References:
【9种】ElasticSearch分词器详解,一文get!!!| 博学谷狂野架构师
[转]中英文停止词表(stopword)
3. Installing the Pinyin Plugin (and Using IK + pinyin)
When searching for products on Taobao, you may notice that Chinese characters, pinyin, or pinyin mixed with Chinese characters all bring up the results you want; this is implemented with a pinyin search plugin.
Download: https://github.com/infinilabs/analysis-pinyin/releases
Pick the release matching your ES version. Downloading the pre-built zip package is recommended; if you download the source package instead, you have to build it yourself with mvn clean package to produce the zip. Online installation (restart Elasticsearch afterwards):
elasticsearch-7.14.1/bin/elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-pinyin/7.14.1
Create an index that uses a custom pinyin analyzer:
PUT /medcl/
{
"settings" : {
"analysis" : {
"analyzer" : {
"pinyin_analyzer" : {
"tokenizer" : "my_pinyin"
}
},
"tokenizer" : {
"my_pinyin" : {
"type" : "pinyin",
"keep_separate_first_letter" : false,
"keep_full_pinyin" : true,
"keep_original" : true,
"limit_first_letter_length" : 16,
"lowercase" : true,
"remove_duplicated_term" : true
}
}
}
}
}
Test:
POST medcl/_analyze
{
"analyzer": "pinyin_analyzer",
"text": "刘德华"
}
Result:
{
"tokens" : [
{
"token" : "liu",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "刘德华",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "ldh",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "de",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 1
},
{
"token" : "hua",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 2
}
]
}
Configuring IK + pinyin analysis
Settings:
curl -XPUT -H 'Content-Type: application/json' "http://localhost:9200/medcl/" -d'
{
"index": {
"analysis": {
"analyzer": {
"default": {
"tokenizer": "ik_max_word"
},
"pinyin_analyzer": {
"tokenizer": "shopmall_pinyin"
}
},
"tokenizer": {
"shopmall_pinyin": {
"keep_joined_full_pinyin": "true",
"keep_first_letter": "true",
"keep_separate_first_letter": "false",
"lowercase": "true",
"type": "pinyin",
"limit_first_letter_length": "16",
"keep_original": "true",
"keep_full_pinyin": "true"
}
}
}
}
}'
Create the mapping:
curl -XPOST -H 'Content-Type: application/json' http://localhost:9200/medcl/folks/_mapping -d'
{
"folks": {
"properties": {
"name": {
"type": "text",
"analyzer": "ik_max_word",
"include_in_all": true,
"fields": {
"pinyin": {
"type": "text",
"analyzer": "pinyin_analyzer"
}
}
}
}
}
}'
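Note that these curl examples are written for Elasticsearch 5.x: the include_in_all mapping parameter was removed in 6.0, and mapping types such as folks were removed in 7.x. On the 7.14.1 installation from section 1, the equivalent typeless mapping would look roughly like this:
curl -XPUT -H 'Content-Type: application/json' http://localhost:9200/medcl/_mapping -d'
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "ik_max_word",
      "fields": {
        "pinyin": {
          "type": "text",
          "analyzer": "pinyin_analyzer"
        }
      }
    }
  }
}'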
Index some test documents:
curl -XPOST -H 'Content-Type: application/json' http://localhost:9200/medcl/folks/ -d'{"name":"刘德华"}'
curl -XPOST -H 'Content-Type: application/json' http://localhost:9200/medcl/folks/ -d'{"name":"中华人民共和国国歌"}'
Pinyin search:
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu"
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:de"
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:hua"
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh"
Chinese search test:
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name:刘"
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name:刘德"
Note: classify the user's query with regular expressions into Chinese, pinyin, Chinese + pinyin, Chinese + pinyin + digits + special characters, etc., and search accordingly (see the sketch after this list):
- For a Chinese-character query with no results, convert it to pinyin and search again; if the pinyin search still returns nothing, fall back to a fuzzy search; if there is still nothing, consider showing recommendations
- For a pinyin query with no results, fall back to a fuzzy search
- For a Chinese + pinyin query, treat it as pinyin for now
- For pinyin, digits, and special characters, treat it as pinyin for now
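One simple way to cover the mixed cases without classifying the input yourself is to query the Chinese field and its pinyin sub-field together; a sketch against the index from this section, with a mixed Chinese + pinyin query:
GET /medcl/_search
{
  "query": {
    "multi_match": {
      "query": "liu德hua",
      "fields": ["name", "name.pinyin"]
    }
  }
}
Each field analyzes the query with its own search analyzer, so the pinyin sub-field converts the Chinese characters in the query to pinyin before matching.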