A Practical Guide to Integrating Elasticsearch with Spring Boot

Published: 2024-06-28

I've written similar articles before; today's upgraded installment is a hands-on walkthrough of integrating Elasticsearch with Spring Boot for a real internet-facing application.

I. Requirements Analysis

It started with a product requirement: the titles, content, and keywords in our data had to support structured and semi-structured search, spatio-temporal retrieval, query understanding, intent recognition, full-pinyin search, pinyin-initial search, hot-query logging, and recommendation of the top-N keywords.

1. Fuzzy pinyin search


First, fuzzy-match the user's full pinyin or pinyin initials to recommend the top-10 matching keywords, then let the user pick a keyword to search for the data they actually want.

2. Query understanding

Query understanding means using natural language processing to interpret the intent behind a user's input and turn it into an effective search request. In Elasticsearch, we can implement it by combining an NLP library with Elasticsearch's search capabilities.

3. Intent recognition

Implementing intent-recognition search usually combines natural language processing (NLP) with Elasticsearch's search capabilities. In this example we use the open-source NLP toolkit OpenNLP to recognize the intent of the user's query and then execute the corresponding Elasticsearch search request. This feature requires model training.

4. Spatio-temporal retrieval

Querying data by time plus latitude/longitude: for semi-structured spatio-temporal data, Elasticsearch provides geo-spatial and temporal data types and query capabilities. Java sample code for this kind of retrieval appears later in this article.

5. Other queries and logging

Plus simple fuzzy search and logging of the keywords users query.

II. Technology Selection

Based on the requirements we chose our stack: the ES version had to support NLP processing and have matching releases of the analysis plugins (a Chinese analyzer and a pinyin analyzer), and it had to fit our JDK. We initially planned on 7.17.22, but found no analysis plugins published for that version; the most recent version with freely available plugins was 7.17.18, so that's what we settled on.

1. Download and install ES and Kibana

See my earlier article.

Just change the version to 7.17.18.

2. Download and install the Chinese (IK) analysis plugin

Be sure to download the plugin version matching your ES version, or ES will fail to start. Download: see the reference link.

Create a folder named ik under the plugins directory of your Elasticsearch installation and copy the contents of the archive into it.

Restart ES after installation for the plugin to take effect.
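A quick way to confirm the IK plugin works is to run its analyzer through the _analyze API from Kibana (ik_max_word and ik_smart are the two analyzers the plugin registers):

```json
POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "北京天安门广场"
}
```

The response lists the Chinese tokens the analyzer produced; if the plugin is missing or version-mismatched, ES returns an error instead.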

3. Download and install the pinyin analysis plugin

Download: see the reference link.

Or browse all available versions.

Manually copy the extracted contents into a folder named elasticsearch-analysis-pinyin under Elasticsearch's plugins directory.

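After restarting, the pinyin tokenizer can be exercised directly via the _analyze API (the tokenizer name pinyin is registered by the plugin):

```json
POST /_analyze
{
  "tokenizer": "pinyin",
  "text": "张三"
}
```

With default settings it typically emits the full pinyin syllables plus a first-letter token (e.g. zhang, san, zs).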

4. Download and install the multi-language (ICU) analysis plugin

Use the analysis-icu-7.17.18 release (direct download link).

Move the extracted plugin directory into a folder named analysis-icu under Elasticsearch's plugins directory.

5. Verify the installation

1) On the ES server

Open a cmd window in ES's bin directory and run elasticsearch-plugin list to list the installed plugins.

Note: the bin folder of the Elasticsearch installation directory must be added to the system PATH environment variable.

2) From the Kibana console

GET /_cat/plugins?v

3) In a browser

http://localhost:9200/_cat/plugins?v

4) Via the REST API on the server

curl -X GET "localhost:9200/_cat/plugins?v"

III. Integrating ES with Spring Boot

1. Base configuration

See my earlier article.

The only addition is the NLP dependency:

        <!-- Apache OpenNLP -->
        <dependency>
            <groupId>org.apache.opennlp</groupId>
            <artifactId>opennlp-tools</artifactId>
            <version>1.9.3</version>
        </dependency>

2. The query-understanding requirement

 public boolean queryComprehendData(String userQuery) throws IOException {
        // Demo input; in production, use the caller's query as-is
        if (userQuery == null || userQuery.isEmpty()) {
            userQuery = "Find latest iPhone and Samsung phones";
        }

        // Simple query understanding with OpenNLP: tokenize the input
        SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
        Span[] tokenSpans = tokenizer.tokenizePos(userQuery);
        String[] tokens = Span.spansToStrings(tokenSpans, userQuery);

        // Print the tokens
        for (String token : tokens) {
            System.out.println(token);
        }

        // Build the Elasticsearch query: each token becomes a should clause
        BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
        for (String token : tokens) {
            MatchQueryBuilder matchQuery = QueryBuilders.matchQuery("description", token);
            boolQuery.should(matchQuery);
        }

        // Create the search request
        SearchRequest searchRequest = new SearchRequest("products");
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(boolQuery);
        searchRequest.source(searchSourceBuilder);

        // Execute the search
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        System.out.println(searchResponse);

        return true;
    }

Here we combine an NLP library with Elasticsearch's search capabilities to implement query-understanding search.
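For reference, the request the code above ends up sending is equivalent to this Kibana query (only a few of the tokens are shown):

```json
GET /products/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "description": "iPhone" } },
        { "match": { "description": "Samsung" } },
        { "match": { "description": "phones" } }
      ]
    }
  }
}
```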

1) Create the index and add data in Kibana

PUT /products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "description": {
        "type": "text"
      },
      "price": {
        "type": "double"
      },
      "category": {
        "type": "keyword"
      }
    }
  }
}

POST /products/_doc/1
{
  "name": "Apple iPhone 13",
  "description": "Latest model of Apple iPhone with A15 Bionic chip.",
  "price": 999.99,
  "category": "Electronics"
}

POST /products/_doc/2
{
  "name": "Samsung Galaxy S21",
  "description": "Latest model of Samsung Galaxy with Exynos 2100 chip.",
  "price": 799.99,
  "category": "Electronics"
}

POST /products/_doc/3
{
  "name": "Sony WH-1000XM4",
  "description": "Noise cancelling wireless headphones from Sony.",
  "price": 349.99,
  "category": "Electronics"
}

3. The intent-recognition requirement

 // A simple rule-based intent recognizer (for demonstration)
    private static String recognizeIntent(String[] tokens) {
        for (String token : tokens) {
            if (token.equalsIgnoreCase("find")) {
                return "findProducts";
            } else if (token.equalsIgnoreCase("under") || token.equalsIgnoreCase("below")) {
                return "findCheaperProducts";
            }
        }
        return "unknownIntent";
    }

    public boolean queryIntentRecognitionData(String indexName) throws IOException {
        // User's input query
        String userQuery = "Find latest smartphones under 1000 dollars";

        // Sentence detection with OpenNLP, using the pre-trained en-sent.bin model
        String[] sentences;
        try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
            SentenceModel model = new SentenceModel(modelIn);
            SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
            sentences = sentenceDetector.sentDetect(userQuery);
        }

        // Tokenize the first sentence for intent recognition
        String querySentence = sentences[0];
        SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
        Span[] tokenSpans = tokenizer.tokenizePos(querySentence);
        String[] tokens = Span.spansToStrings(tokenSpans, querySentence);

        // Print the tokens
        System.out.println("Tokens:");
        for (String token : tokens) {
            System.out.println(token);
        }

        // Recognize the intent
        String intent = recognizeIntent(tokens);

        // Build the Elasticsearch query
        SearchRequest searchRequest = new SearchRequest("products");
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();

        // Build the query according to the recognized intent
        switch (intent) {
            case "findProducts":
                // Products whose description matches any of the tokens
                BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
                for (String token : tokens) {
                    MatchQueryBuilder matchQuery = QueryBuilders.matchQuery("description", token);
                    boolQuery.should(matchQuery);
                }
                searchSourceBuilder.query(boolQuery);
                break;
            case "findCheaperProducts":
                // Products priced at or below the threshold
                searchSourceBuilder.query(QueryBuilders.rangeQuery("price").lte(1000));
                break;
            default:
                System.out.println("Intent not recognized.");
                break;
        }

        // Set the search request source and execute the search
        searchRequest.source(searchSourceBuilder);
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        System.out.println("Search response:");
        System.out.println(searchResponse);
        return false;
    }

4. Counting keyword queries

public boolean saveKeywordAndPinYinQuery(String keyword) throws IOException {
        /**
         * Record how many times this keyword has been queried
         */
        String INDEX_NAME = "keyword_stats_3";

        // Script parameters
        Map<String, Object> params = new HashMap<>();
        params.put("increment", 1);

        // Painless script: create the counter if absent, otherwise increment it
        String scriptSource = "if (ctx._source.query_count == null) { ctx._source.query_count = params.increment } else { ctx._source.query_count += params.increment }";
        Script script = new Script(ScriptType.INLINE, "painless", scriptSource, params);

        // Update request: the keyword doubles as the document id; the upsert seeds the counter at 1
        UpdateRequest updateRequest = new UpdateRequest(INDEX_NAME, keyword)
                .script(script)
                .upsert(Collections.singletonMap("query_count", 1));

        // Execute the update
        restHighLevelClient.update(updateRequest, RequestOptions.DEFAULT);

        /**
         * Read back the query count for this keyword
         */
        GetRequest getRequest = new GetRequest(INDEX_NAME, keyword);
        GetResponse getResponse = restHighLevelClient.get(getRequest, RequestOptions.DEFAULT);

        // If the document exists, extract the counter
        if (getResponse.isExists()) {
            Integer queryCount = (Integer) getResponse.getSourceAsMap().get("query_count");
            System.out.println(keyword + ": " + queryCount);
        }

        return false;
    }

An improved version of the upsert:

       // Fields for the upsert document
        Map<String, Object> updatedFields = new HashMap<>();
        updatedFields.put("keyword", keyword);
        updatedFields.put("query_count", 1);

        // Update request
        UpdateRequest updateRequest = new UpdateRequest(INDEX_NAME, keyword)
                .script(script)
                .upsert(updatedFields);

The keyword field must be included in the upsert document; the queries later on rely on it.
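For comparison, the same scripted upsert can be issued from Kibana; this mirrors the Java update request above, with the keyword doubling as the document id:

```json
POST /keyword_stats_3/_update/北京
{
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.query_count == null) { ctx._source.query_count = params.increment } else { ctx._source.query_count += params.increment }",
    "params": { "increment": 1 }
  },
  "upsert": {
    "keyword": "北京",
    "query_count": 1
  }
}
```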

1) Index and mapping

PUT /keyword_stats
{
  "mappings": {
    "properties": {
      "keyword": {
        "type": "keyword"
      },
      "query_count": {
        "type": "integer"
      }
    }
  }
}

This index does not support fuzzy matching on keyword, because the field is of type "keyword" and is indexed as a single unanalyzed term: queries must match the exact value.
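Exact and wildcard lookups still work on a keyword field even though analyzed (partial) matching does not, for example:

```json
GET /keyword_stats/_search
{
  "query": {
    "wildcard": {
      "keyword": { "value": "北*" }
    }
  }
}
```

Note that wildcard patterns, especially leading wildcards, can be slow on large indices.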


5. Querying the top-N hot keywords

public boolean queryTopKeywords(String queryKeyword) throws IOException {
        String INDEX_NAME = "keyword_stats_3";

        /**
         * Query the top-10 hot keywords
         */
        SearchRequest searchRequest = new SearchRequest(INDEX_NAME);

        // Build the search source
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.size(10); // limit the number of returned documents
        searchSourceBuilder.sort(SortBuilders.fieldSort("query_count").order(SortOrder.DESC));
        // The keyword field is pinyin-analyzed, so Chinese and pinyin input both match:
        // searchSourceBuilder.query(QueryBuilders.matchQuery("keyword", "北京"));
        searchSourceBuilder.query(QueryBuilders.matchQuery("keyword", "beijing"));

        searchRequest.source(searchSourceBuilder);

        // Execute the search request
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);

        // Process the results
        System.out.println("Top 10 hot keywords:");
        for (SearchHit hit : searchResponse.getHits().getHits()) {
            String keyword = (String) hit.getSourceAsMap().get("keyword");
            Integer queryCount = (Integer) hit.getSourceAsMap().get("query_count");
            System.out.println("Keyword: " + keyword + ", queryCount: " + queryCount);
        }

        return false;
    }

1) Query via Kibana


POST /keyword_stats_3/_search
{
  "size": 0,
  "query": {
    "match": {
      "keyword": "bei"  
    }
  },
  "aggs": {
    "top_keywords": {
      "terms": {
        "field": "keyword.raw",  
        "order": {
          "sum_query_count": "desc"
        },
        "size": 10
      },
      "aggs": {
        "sum_query_count": {
          "sum": {
            "field": "query_count"
          }
        }
      }
    }
  }
}

2) Create test data in Kibana


POST /keyword_stats_3/_doc/1
{
  "keyword": "北京",
  "query_count": 10
}

POST /keyword_stats_3/_doc/2
{
  "keyword": "北京包邮",
  "query_count": 8
}

POST /keyword_stats_3/_doc/3
{
  "keyword": "美丽北京",
  "query_count": 5
}

POST /keyword_stats_3/_doc/4
{
  "keyword": "昌平打印",
  "query_count": 12
}

POST /keyword_stats_3/_doc/5
{
  "keyword": "big machine learning_big",
  "query_count": 7
}

POST /keyword_stats_3/_doc/6
{
  "keyword": "杯子",
  "query_count": 7
}

3) Create the index and settings in Kibana


PUT /keyword_stats_3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_pinyin",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "my_pinyin": {
          "type": "pinyin",
          "first_letter": "none",
          "padding_char": " "
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keyword": {
        "type": "text",
        "analyzer": "pinyin_analyzer",
        "search_analyzer": "pinyin_analyzer",
		    "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      },
      "query_count": {
        "type": "integer"
      }
    }
  }
}

The keyword field is defined as type text (pinyin-analyzed) and has a sub-field named raw of type keyword.

4) Index and settings without pinyin support

PUT /keyword_stats
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keyword": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      },
      "query_count": {
        "type": "integer"
      }
    }
  }
}

Or, more simply:

PUT /keyword_stats
{
  "mappings": {
    "properties": {
      "keyword": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      },
      "query_count": {
        "type": "integer"
      }
    }
  }
}

Incrementing the query count from Kibana:

POST /keyword_stats/_update_by_query
{
  "script": {
    "source": "ctx._source.query_count += params.count",
    "lang": "painless",
    "params": {
      "count": 1
    }
  },
  "query": {
    "term": {
      "keyword.raw": "java"
    }
  }
}

The corresponding query:

POST /keyword_stats/_search
{
  "size": 0,
  "query": {
    "match": {
      "keyword": "your_search_keyword"
    }
  },
  "aggs": {
    "top_keywords": {
      "terms": {
        "field": "keyword.raw",  // use the raw sub-field of type keyword
        "order": {
          "sum_query_count": "desc"
        },
        "size": 10
      },
      "aggs": {
        "sum_query_count": {
          "sum": {
            "field": "query_count"
          }
        }
      }
    }
  }
}

6. Spatial data queries

 public boolean queryGeoDoc(String indexName) throws IOException {
        // Create the search request
        SearchRequest searchRequest = new SearchRequest("spatial_data");

        // Build the query
        BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();

        // Time range query
        RangeQueryBuilder rangeQuery = QueryBuilders.rangeQuery("timestamp")
                .from("2023-03-15T00:00:00Z")
                .to("2023-03-16T23:59:59Z");
        boolQuery.must(rangeQuery);

        // Geo query: corners are (top lat, left lon, bottom lat, right lon) - a New York City bounding box
        GeoBoundingBoxQueryBuilder geoQuery = QueryBuilders.geoBoundingBoxQuery("location")
                .setCorners(40.9176, -74.2591, 40.4774, -73.7004);
        boolQuery.must(geoQuery);

        // Build the search request source
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        sourceBuilder.query(boolQuery);
        searchRequest.source(sourceBuilder);

        // Execute the search
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        System.out.println(searchResponse);

        return false;
    }

In Elasticsearch, the geo_point type represents a geographic location and the date type represents time.

Time range query: RangeQueryBuilder restricts results to the specified time window.
Geo query: GeoBoundingBoxQueryBuilder restricts results to the specified geographic bounding box.
Combined query: BoolQueryBuilder joins the time-range and geo queries together.
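The Java query above corresponds to this Kibana request (note the top-left corner must be the north-west point of the box):

```json
GET /spatial_data/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "timestamp": {
              "gte": "2023-03-15T00:00:00Z",
              "lte": "2023-03-16T23:59:59Z"
            }
          }
        },
        {
          "geo_bounding_box": {
            "location": {
              "top_left":     { "lat": 40.9176, "lon": -74.2591 },
              "bottom_right": { "lat": 40.4774, "lon": -73.7004 }
            }
          }
        }
      ]
    }
  }
}
```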

1) Create the index in Kibana

PUT /spatial_data
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      },
      "timestamp": {
        "type": "date"
      },
      "data": {
        "type": "text"
      }
    }
  }
}

2) Insert spatio-temporal data from Java

 IndexRequest request = new IndexRequest("spatial_data");
 String jsonString = "{" +
                    "\"location\": {\"lat\": 40.7128, \"lon\": -74.0060}," +
                    "\"timestamp\": \"2023-03-16T12:00:00Z\"," +
                    "\"data\": \"Example data\"" +
                    "}";
 request.source(jsonString, XContentType.JSON);
 client.index(request, RequestOptions.DEFAULT);
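The same document can also be added from Kibana:

```json
POST /spatial_data/_doc/1
{
  "location": { "lat": 40.7128, "lon": -74.0060 },
  "timestamp": "2023-03-16T12:00:00Z",
  "data": "Example data"
}
```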

 7. Pinyin search

    public boolean queryPYDoc(String pinyinQuery) throws IOException {
        // Search documents by pinyin (full pinyin or first letters)
        SearchRequest searchRequest = new SearchRequest("pinyin_index");
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(QueryBuilders.matchQuery("name", pinyinQuery));
        searchRequest.source(searchSourceBuilder);

        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        searchResponse.getHits().forEach(hit -> System.out.println(hit.getSourceAsString()));

        return false;
    }

To implement pinyin search in Elasticsearch, you need analyzers that can process Chinese text, plus a plugin that performs the pinyin conversion. In this example we use the pinyin analysis plugin installed earlier to build a custom pinyin analyzer.

1) Create the index from Java

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class PinyinIndexConfig {
    public static void main(String[] args) {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        try {
            // 创建索引并配置映射
            Request request = new Request("PUT", "/pinyin_index");
            String jsonString = "{\n" +
                    "  \"settings\": {\n" +
                    "    \"analysis\": {\n" +
                    "      \"analyzer\": {\n" +
                    "        \"pinyin_analyzer\": {\n" +
                    "          \"type\": \"custom\",\n" +
                    "          \"tokenizer\": \"my_pinyin\",\n" +
                    "          \"filter\": [\n" +
                    "            \"word_delimiter\"\n" +
                    "          ]\n" +
                    "        }\n" +
                    "      },\n" +
                    "      \"tokenizer\": {\n" +
                    "        \"my_pinyin\": {\n" +
                    "          \"type\": \"pinyin\",\n" +
                    "          \"keep_first_letter\": true,\n" +
                    "          \"keep_separate_first_letter\": false,\n" +
                    "          \"keep_full_pinyin\": true,\n" +
                    "          \"keep_original\": false,\n" +
                    "          \"limit_first_letter_length\": 16,\n" +
                    "          \"lowercase\": true,\n" +
                    "          \"trim_whitespace\": true\n" +
                    "        }\n" +
                    "      }\n" +
                    "    }\n" +
                    "  },\n" +
                    "  \"mappings\": {\n" +
                    "    \"properties\": {\n" +
                    "      \"name\": {\n" +
                    "        \"type\": \"text\",\n" +
                    "        \"analyzer\": \"pinyin_analyzer\",\n" +
                    "        \"search_analyzer\": \"pinyin_analyzer\"\n" +
                    "      }\n" +
                    "    }\n" +
                    "  }\n" +
                    "}";
            request.setJsonEntity(jsonString);

            Response response = client.getLowLevelClient().performRequest(request);
            System.out.println(response.getStatusLine().getStatusCode());

        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                client.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

Create the index in Kibana:

PUT /my_pinyin_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_pinyin",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "my_pinyin": {
          "type": "pinyin",
          "first_letter": "none",
          "padding_char": " "
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "pinyin_analyzer",
        "search_analyzer": "pinyin_analyzer"
      }
    }
  }
}

2) Add data from Java

 
            IndexRequest request = new IndexRequest("pinyin_index");
            request.id("1");
            String jsonString = "{ \"name\": \"张三\" }";
            request.source(jsonString, XContentType.JSON);

            IndexResponse indexResponse = client.index(request, RequestOptions.DEFAULT);
            System.out.println(indexResponse.getResult().name());

Add data in Kibana:

POST /my_pinyin_index/_doc/1
{
  "name": "张三"
}

POST /my_pinyin_index/_doc/2
{
  "name": "李四"
}

POST /my_pinyin_index/_doc/3
{
  "name": "王五"
}

 Query data in Kibana:

GET /my_pinyin_index/_search
{
  "query": {
    "match": {
      "name": "zhangsan"
    }
  }
}

8. Keyword match queries

            String userInput = "北京 天安门 广场";

            // Build the search request
            SearchRequest searchRequest = new SearchRequest("articles");
            SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();

            // Build the query using the standard analyzer
            searchSourceBuilder.query(QueryBuilders.matchQuery("content", userInput));

            // Set the request source and execute the search
            searchRequest.source(searchSourceBuilder);
            SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);

            // Process the results
            System.out.println("Search results:");
            System.out.println(searchResponse);

String userInput = "北京 天安门 广场"; defines the user's input, containing the keywords "北京", "天安门", and "广场".
searchSourceBuilder.query(QueryBuilders.matchQuery("content", userInput)); uses the matchQuery builder to run a keyword match against the content field.
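For comparison, the equivalent Kibana request (assuming an articles index whose content field uses a Chinese analyzer such as ik_max_word):

```json
GET /articles/_search
{
  "query": {
    "match": {
      "content": "北京 天安门 广场"
    }
  }
}
```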

That wraps up this tour of practical ES scenarios. More advanced topics are coming later, so stay tuned!