ElasticSearch Analyzers

Published: 2025-03-27

一、Installing a Chinese Analysis Plugin

IK Analysis for Elasticsearch is a popular open-source Chinese analysis plugin.
Project page: https://github.com/medcl/elasticsearch-analysis-ik

  I originally wanted to install it in one of these two ways:

.\bin\elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.4.2/elasticsearch-analysis-ik-5.4.2.zip
bin\elasticsearch-plugin install file:///C:\Users\jk\Desktop\elasticsearch-analysis-ik-5.4.2.zip

  but both kept failing with:

ERROR: `elasticsearch` directory is missing in the plugin zip

  In the end I had to install it manually: create a folder named ik under the plugins directory of the Elasticsearch installation and unzip elasticsearch-analysis-ik-5.4.2.zip into it. The project README actually states that versions below 5.5.1 must be installed this way, by unzipping.
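A minimal sketch of the manual install (directory names assumed; adjust to your installation path):

mkdir -p elasticsearch-5.4.2/plugins/ik
unzip elasticsearch-analysis-ik-5.4.2.zip -d elasticsearch-5.4.2/plugins/ik

Restart Elasticsearch afterwards so the plugin is picked up.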
Reference: Elasticsearch5.x安装IK分词器以及使用

Installing version 7.14.1 on Linux:

Download: https://github.com/medcl/elasticsearch-analysis-ik/releases

[hadoop@node01 ~]$ elasticsearch-7.14.1/bin/elasticsearch-plugin install file:///mnt/elasticsearch-analysis-ik-7.14.1.zip
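After installation, verify that the plugin shows up in the plugin list (restart Elasticsearch if it was already running):

[hadoop@node01 ~]$ elasticsearch-7.14.1/bin/elasticsearch-plugin list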

ik_smart: performs the coarsest-grained segmentation.
ik_max_word: performs the finest-grained segmentation of the text.

Test 1: ik_smart
GET /_analyze
{
  "analyzer": "ik_smart",
  "text":"中华人民共和国"
}

Result:

{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    }
  ]
}

Test 2: ik_max_word
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text":"中华人民共和国"
}

Result:

{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "中华人民",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "中华",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "华人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "人民共和国",
      "start_offset": 2,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "共和国",
      "start_offset": 4,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "共和",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "国",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 8
    }
  ]
}

Test 3: ik_max_word with English text
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text":"I love you"
}
Result:
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "ENGLISH",
      "position" : 0
    },
    {
      "token" : "love",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "ENGLISH",
      "position" : 1
    },
    {
      "token" : "you",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "ENGLISH",
      "position" : 2
    }
  ]
}

Reference: https://blog.csdn.net/wenxindiaolong061/article/details/82562450

二、Built-in ES Analyzers

  • standard analyzer
  • simple analyzer
  • whitespace analyzer
  • language analyzer (analyzers for specific languages)

Example sentence: Set the shape to semi-transparent by calling set_trans(5)
How the different analyzers tokenize it:

  • standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)
  • simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
  • whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
  • language analyzer (for a specific language, e.g. english): set, shape, semi, transpar, call, set_tran, 5 (verified below)
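The stemmed output of the english language analyzer can be checked directly with the same _analyze API used in the tests below:

GET /_analyze
{
  "analyzer": "english",
  "text": "Set the shape to semi-transparent by calling set_trans(5)"
}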

Analyzer test:

GET /_analyze
{
  "analyzer": "standard",
  "text":"The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone."
}

Result:

{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "foxes",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "jumped",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "lazy",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "dog’s",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "bone",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

  As you can see, the text was split on whitespace and non-letter characters; uppercase was folded to lowercase (QUICK became quick), and stop words such as the were not removed. In each entry, token is the term produced, start_offset is its starting offset, end_offset its ending offset, and position its position in the token stream.

Configurable options:

  • max_token_length: the maximum token length; a token longer than this is split at max_token_length intervals. Defaults to 255.
  • stopwords: a pre-defined stop-word list such as _english_, or an array containing stop words. Defaults to _none_.
  • stopwords_path: the path to a file containing stop words.

Example settings:
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_english_analyzer": {
                    "type": "standard",
                    "max_token_length": 5,
                    "stopwords": "_english_"
                }
            }
        }
    }
}
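To see these options in action, apply the settings when creating an index (the index name standard_demo here is only an example) and test through _analyze:

PUT /standard_demo
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_english_analyzer": {
                    "type": "standard",
                    "max_token_length": 5,
                    "stopwords": "_english_"
                }
            }
        }
    }
}

POST /standard_demo/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone."
}

With max_token_length set to 5, jumped (6 letters) comes back split as jumpe and d, and english stop words such as the are dropped.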

  Different analyzers produce different tokenizations. The built-in analyzers are listed below; by and large, none of them, the Language Analyzers included, handle Chinese well, so Chinese segmentation requires installing a separate analyzer.

Example input for every analyzer below: The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.

  • standard: the default analyzer, used when none is specified. It provides grammar-based tokenization (following the Unicode Text Segmentation algorithm, Unicode Standard Annex #29) and works well for most languages. → [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog’s, bone ]
  • simple: breaks text into tokens at any non-letter character (digits, whitespace, hyphens, apostrophes), discards the non-letter characters, and lowercases. → [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
  • whitespace: breaks text into terms on whitespace. → [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog’s, bone. ]
  • stop: the same as simple, plus stop-word removal; uses the english stop list by default. → [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
  • keyword: no tokenization; the whole field is returned as a single term. → [ The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. ]
  • pattern: splits text into terms with a regular expression that matches the token separators rather than the tokens themselves; defaults to \W+ (all non-word characters). → [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
  • language analyzers: a set of analyzers aimed at specific languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, and so on.

  For Chinese, the simplest option is the IK analyzer; alternatives include jieba and the HIT (哈工大) analyzer.

Example input for both: 学如逆水行舟,不进则退

  • ik_smart: IK's coarse-grained analyzer; supports custom and remote dictionaries. → [学如逆水行舟,不进则退]
  • ik_max_word: IK's exhaustive analyzer; supports custom and remote dictionaries (configured as sketched below). → [学如逆水行舟, 学如逆水, 逆水行舟, 逆水, 行舟, 不进则退, 不进, 则, 退]
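The custom and remote dictionaries are configured in the plugin's config/IKAnalyzer.cfg.xml (a sketch following the format documented in the IK README; the .dic paths are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- local extension dictionaries, multiple entries separated by semicolons -->
	<entry key="ext_dict">custom/mydict.dic</entry>
	<!-- local extension stop-word dictionaries -->
	<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
</properties>

Local dictionary changes require a restart; remote dictionaries (remote_ext_dict) are polled by the plugin, so they can be updated without restarting.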

References:
【9种】ElasticSearch分词器详解,一文get!!!| 博学谷狂野架构师
[转]中英文停止词表(stopword)

三、Installing the Pinyin Plugin (and Using IK + pinyin)

  When searching for products on Taobao, you may notice that Chinese characters, pinyin, or a mix of both all return the results you want; this is implemented with a pinyin search plugin.

  Download: https://github.com/infinilabs/analysis-pinyin/releases

  Pick the release that matches your ES version; downloading the pre-built zip is recommended. If you download the source package instead, you have to build the zip yourself with mvn clean package. Online install: elasticsearch-7.14.1/bin/elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-pinyin/7.14.1

  Create an index with a custom pinyin analyzer:

PUT /medcl/ 
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true
                }
            }
        }
    }
}

  Test:

POST medcl/_analyze
{
  "analyzer": "pinyin_analyzer",
  "text": "刘德华"
}

  Result:

{
  "tokens" : [
    {
      "token" : "liu",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "刘德华",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "de",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "hua",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    }
  ]
}
Configuring IK + pinyin analysis

  Settings (on ES 6+, curl requests need an explicit Content-Type header):

curl -XPUT -H 'Content-Type: application/json' "http://localhost:9200/medcl/" -d'
{
	"settings": {
		"analysis": {
			"analyzer": {
				"default": {
					"tokenizer": "ik_max_word"
				},
				"pinyin_analyzer": {
					"tokenizer": "shopmall_pinyin"
				}
			},
			"tokenizer": {
				"shopmall_pinyin": {
					"keep_joined_full_pinyin": "true",
					"keep_first_letter": "true",
					"keep_separate_first_letter": "false",
					"lowercase": "true",
					"type": "pinyin",
					"limit_first_letter_length": "16",
					"keep_original": "true",
					"keep_full_pinyin": "true"
				}
			}
		}
	}
}'

  Create the mapping (mapping types and the include_in_all field were removed in ES 7.x, so the fields sit directly under properties):

curl -XPOST -H 'Content-Type: application/json' http://localhost:9200/medcl/_mapping -d'
{
	"properties": {
		"name": {
			"type": "text",
			"analyzer": "ik_max_word",
			"fields": {
				"pinyin": {
					"type": "text",
					"analyzer": "pinyin_analyzer"
				}
			}
		}
	}
}'

  Add test documents (on 7.x, documents are indexed through the _doc endpoint):

curl -XPOST -H 'Content-Type: application/json' http://localhost:9200/medcl/_doc/ -d'{"name":"刘德华"}'

curl -XPOST -H 'Content-Type: application/json' http://localhost:9200/medcl/_doc/ -d'{"name":"中华人民共和国国歌"}'

  Pinyin search in action:

curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu"
 
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:de"
 
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:hua"
 
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh"

  Chinese search test:

curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name:刘"
 
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name:刘德"

Note: classify the user's search input by regular expression into Chinese, pinyin, Chinese + pinyin, or Chinese + pinyin + digits + special characters, and handle each case as follows:

  1. For a Chinese query with no results, convert it to pinyin and search again; if the pinyin search also returns nothing, fall back to a fuzzy search (sketched below); if there are still no results, consider showing recommendations
  2. For a pinyin query with no results, fall back to a fuzzy search
  3. For a Chinese + pinyin query, treat it as pinyin for now
  4. For pinyin + digits + special characters, treat it as pinyin for now
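One way to implement the fuzzy fallback in steps 1 and 2 is a match query with fuzziness against the name.pinyin field mapped earlier (a sketch, not the only option; the misspelled query string liudehau is deliberate):

curl -XPOST -H 'Content-Type: application/json' "http://localhost:9200/medcl/_search" -d'
{
	"query": {
		"match": {
			"name.pinyin": {
				"query": "liudehau",
				"fuzziness": "AUTO"
			}
		}
	}
}'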

Reference: elasticsearch拼音插件安装以及(IK+pinyin使用)

