By Peter Steenbergen, Elastic
Deduplicating product variants in Elasticsearch? Here is how to get pagination right.
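Duplicates are unavoidable in any large catalog, from e-commerce products to article listings. Elasticsearch's collapse feature is well suited to grouping these variants and presenting a clean UI, but it creates a key pagination problem: the total hit count reflects the number of original documents, not the number of unique groups that remain after collapsing, so a pager built on it cannot be reliable. This article shows how to solve the problem by combining collapse with a cardinality aggregation to obtain the true number of results.
Pagination with collapse and cardinality
A common challenge when building an efficient search experience in Elasticsearch, especially in e-commerce, is combining deduplication with pagination, that is, splitting a large result set into smaller, manageable pages. A product catalog, for example, may contain many versions of the same product design (sizes, colors, and SKUs, or Stock Keeping Units).
With the collapse feature, we can group hits by a single field and show only one representative result per group, so the UI is not flooded with near-duplicate items. The practical example below walks through the process.
For a more detailed description of collapse, see "Elasticsearch: 运用 Field collapsing 来减少基于单个字段的搜索结果" (on using field collapsing to reduce search results based on a single field).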
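The grouping idea can be sketched in plain Python. This is only an in-memory illustration of "one representative per group", not how Elasticsearch implements collapse; the `collapse_hits` helper and the sample hits are hypothetical and assume the hits are already in ranked order.

```python
from typing import Any


def collapse_hits(hits: list[dict[str, Any]], field: str) -> list[dict[str, Any]]:
    """Keep the first (best-ranked) hit for each distinct value of `field`."""
    seen: set[Any] = set()
    collapsed: list[dict[str, Any]] = []
    for hit in hits:
        key = hit[field]
        if key not in seen:
            # First time this group appears: it becomes the representative.
            seen.add(key)
            collapsed.append(hit)
    return collapsed


# Hypothetical hits, assumed to be sorted by relevance already.
ranked_hits = [
    {"product_id": "P001", "sku": "sku-1001"},
    {"product_id": "P001", "sku": "sku-1002"},
    {"product_id": "P002", "sku": "sku-2001"},
    {"product_id": "P001", "sku": "sku-1004"},
    {"product_id": "P003", "sku": "sku-3001"},
]

representatives = collapse_hits(ranked_hits, "product_id")
print([hit["sku"] for hit in representatives])
# prints ['sku-1001', 'sku-2001', 'sku-3001']
```

Five hits shrink to three representatives, one per `product_id`, which is the same effect the collapse feature has on a result page.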
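Example: combining SKUs with collapse
Below are some sample JSON documents that you can import through Kibana DevTools. They illustrate the scenario described in this article: you have multiple variants of a product (sizes and colors) and want to group them.
JSON structure:
product_id: the unique identifier of the core product. You will use this field for both collapse and the cardinality aggregation.
sku: the unique Stock Keeping Unit (SKU) of a specific variant.
name: the generic name of the product.
color: the color of the product variant.
size: the size of the product variant.
price: the price of the variant.
timestamp: a timestamp, used for sorting and for selecting the latest version if needed.
To import the data below, copy and paste these lines into Kibana DevTools, then click the play button in the gray box on the right to run them.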
POST your-products-index/_bulk
{ "index" : { "_id" : "sku-1001" } }
{ "product_id": "P001", "sku": "sku-1001", "name": "Classic T-Shirt", "color": "White", "size": "S", "price": 19.99, "timestamp": "2025-06-23T10:00:00Z" }
{ "index" : { "_id" : "sku-1002" } }
{ "product_id": "P001", "sku": "sku-1002", "name": "Classic T-Shirt", "color": "White", "size": "M", "price": 19.99, "timestamp": "2025-06-23T10:01:00Z" }
{ "index" : { "_id" : "sku-1003" } }
{ "product_id": "P001", "sku": "sku-1003", "name": "Classic T-Shirt", "color": "White", "size": "L", "price": 19.99, "timestamp": "2025-06-23T10:02:00Z" }
{ "index" : { "_id" : "sku-1004" } }
{ "product_id": "P001", "sku": "sku-1004", "name": "Classic T-Shirt", "color": "Black", "size": "M", "price": 21.99, "timestamp": "2025-06-23T11:00:00Z" }
{ "index" : { "_id" : "sku-2001" } }
{ "product_id": "P002", "sku": "sku-2001", "name": "V-Neck Sweater", "color": "Navy", "size": "M", "price": 49.95, "timestamp": "2025-06-22T14:00:00Z" }
{ "index" : { "_id" : "sku-2002" } }
{ "product_id": "P002", "sku": "sku-2002", "name": "V-Neck Sweater", "color": "Navy", "size": "L", "price": 49.95, "timestamp": "2025-06-22T14:01:00Z" }
{ "index" : { "_id" : "sku-2003" } }
{ "product_id": "P002", "sku": "sku-2003", "name": "V-Neck Sweater", "color": "Grey", "size": "S", "price": 45.95, "timestamp": "2025-06-22T15:00:00Z" }
{ "index" : { "_id" : "sku-3001" } }
{ "product_id": "P003", "sku": "sku-3001", "name": "Running Shorts", "color": "Blue", "size": "32", "price": 35.00, "timestamp": "2025-06-21T09:30:00Z" }
{ "index" : { "_id" : "sku-3002" } }
{ "product_id": "P003", "sku": "sku-3002", "name": "Running Shorts", "color": "Red", "size": "34", "price": 35.00, "timestamp": "2025-06-21T09:31:00Z" }
Now we can run a search that collapses on the product_id.keyword field to get grouped results:
GET your-products-index/_search
{
"query": {
"match_all": {}
},
"collapse": {
"field": "product_id.keyword"
}
}
The result looks like this:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 9,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "your-products-index",
"_id": "sku-1003",
"_score": 1,
"_source": {
"product_id": "P001",
"sku": "sku-1003",
"name": "Classic T-Shirt",
"color": "White",
"size": "L",
"price": 19.99,
"timestamp": "2025-06-23T10:02:00Z"
},
"fields": {
"product_id.keyword": [
"P001"
]
}
},
{
"_index": "your-products-index",
"_id": "sku-2001",
"_score": 1,
"_source": {
"product_id": "P002",
"sku": "sku-2001",
"name": "V-Neck Sweater",
"color": "Navy",
"size": "M",
"price": 49.95,
"timestamp": "2025-06-22T14:00:00Z"
},
"fields": {
"product_id.keyword": [
"P002"
]
}
},
{
"_index": "your-products-index",
"_id": "sku-3001",
"_score": 1,
"_source": {
"product_id": "P003",
"sku": "sku-3001",
"name": "Running Shorts",
"color": "Blue",
"size": "32",
"price": 35,
"timestamp": "2025-06-21T09:30:00Z"
},
"fields": {
"product_id.keyword": [
"P003"
]
}
}
]
}
}
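But there is an important caveat: the returned hits.total.value still reports the total number of documents, not the number of unique groups after collapsing. Relying on hits.total.value for pagination therefore shows the wrong number of pages and breaks the user experience. If the UI shows 5 results per page, it will assume that a second page exists, but loading page 2 fails because there are really only 3 unique groups to display, not 9.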
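The mismatch is easy to see with a little arithmetic. The sketch below uses the numbers from this example (9 documents, 3 groups, a page size of 5); the `page_count` helper is a hypothetical name introduced only for illustration.

```python
import math


def page_count(total: int, page_size: int) -> int:
    """Number of pages needed to show `total` items, `page_size` per page."""
    return math.ceil(total / page_size)


page_size = 5
total_documents = 9  # hits.total.value from the response above
unique_groups = 3    # collapsed results actually returned

pages_from_hits_total = page_count(total_documents, page_size)
pages_from_groups = page_count(unique_groups, page_size)

print(pages_from_hits_total)  # prints 2: the pager offers a second page
print(pages_from_groups)      # prints 1: only one page has anything to show
```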
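Solution: combining collapse and cardinality
By adding a cardinality aggregation on the same field used for collapse, we can count the number of unique groups, which makes pagination reliable and predictable.
Here is an example query: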
GET your-products-index/_search
{
"query": {
"match_all": {}
},
"collapse": {
"field": "product_id.keyword"
},
"aggs": {
"total_uniques": {
"cardinality": {
"field": "product_id.keyword"
}
}
},
"sort": [
{
"timestamp": {
"order": "desc"
}
}
],
"size": 5,
"from": 0
}
The command above returns:
{
"took": 13,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 9,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "your-products-index",
"_id": "sku-1004",
"_score": null,
"_source": {
"product_id": "P001",
"sku": "sku-1004",
"name": "Classic T-Shirt",
"color": "Black",
"size": "M",
"price": 21.99,
"timestamp": "2025-06-23T11:00:00Z"
},
"fields": {
"product_id.keyword": [
"P001"
]
},
"sort": [
1750676400000
]
},
{
"_index": "your-products-index",
"_id": "sku-2003",
"_score": null,
"_source": {
"product_id": "P002",
"sku": "sku-2003",
"name": "V-Neck Sweater",
"color": "Grey",
"size": "S",
"price": 45.95,
"timestamp": "2025-06-22T15:00:00Z"
},
"fields": {
"product_id.keyword": [
"P002"
]
},
"sort": [
1750604400000
]
},
{
"_index": "your-products-index",
"_id": "sku-3002",
"_score": null,
"_source": {
"product_id": "P003",
"sku": "sku-3002",
"name": "Running Shorts",
"color": "Red",
"size": "34",
"price": 35,
"timestamp": "2025-06-21T09:31:00Z"
},
"fields": {
"product_id.keyword": [
"P003"
]
},
"sort": [
1750498260000
]
}
]
},
"aggregations": {
"total_uniques": {
"value": 3
}
}
}
As shown above, total_uniques has a value of 3. This is the value to use for pagination.
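As a rough sketch of how a client might use that value, the snippet below derives pager metadata from a response shaped like the one above. The `pagination_info` helper, its return keys, and the 1-based `page` convention are assumptions made for illustration; the response is a trimmed, hand-written copy rather than a live call to Elasticsearch.

```python
import math
from typing import Any


def pagination_info(
    response: dict[str, Any],
    page: int,
    page_size: int,
    agg_name: str = "total_uniques",
) -> dict[str, Any]:
    """Derive pager metadata from the cardinality aggregation, not hits.total."""
    total_groups = response["aggregations"][agg_name]["value"]
    total_pages = math.ceil(total_groups / page_size)
    return {
        "total_groups": total_groups,
        "total_pages": total_pages,
        "has_next": page < total_pages,  # `page` is assumed to be 1-based
    }


# Trimmed copy of the response above (the hits themselves are omitted).
response = {
    "hits": {"total": {"value": 9, "relation": "eq"}, "hits": []},
    "aggregations": {"total_uniques": {"value": 3}},
}

info = pagination_info(response, page=1, page_size=5)
print(info)
# prints {'total_groups': 3, 'total_pages': 1, 'has_next': False}
```

The pager now reports a single page, which matches the three collapsed results actually returned.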
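Conclusion
You should now have a clear picture of the challenge of paginating collapsed search results, and of its solution. Relying on the default hits.total.value when using collapse leads to a broken pagination experience: the total page count is wrong, which ends up frustrating users.
The key takeaway: pairing a collapse query with a cardinality aggregation on the same field is a robust solution.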
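collapse: deduplicates at query time by grouping on the collapse field.
cardinality: returns an approximate count of the unique values of the collapse field; it is the key to paginating grouped results.
sort + from: still work, but are applied after collapse.
hits.total.value: returns the total number of documents, not the deduplicated count. Do not use it for pagination math in collapse queries.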
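The points above can be tied together in one request builder. This is a minimal sketch, assuming a 1-based page number and the field names used in this article; `build_collapsed_page_request` is a hypothetical helper, and it only builds the request body as a Python dict without sending it anywhere. Because cardinality is approximate, the sketch also sets the aggregation's `precision_threshold` option; the default value chosen here is an assumption, so check it against the documentation for your version.

```python
from typing import Any


def build_collapsed_page_request(
    field: str,
    page: int,
    page_size: int,
    precision_threshold: int = 3000,  # assumed default; tune for your data
) -> dict[str, Any]:
    """Build a search body that collapses and counts on the same field."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {
        "query": {"match_all": {}},
        "collapse": {"field": field},
        "aggs": {
            "total_uniques": {
                "cardinality": {
                    "field": field,  # must match the collapse field
                    "precision_threshold": precision_threshold,
                }
            }
        },
        "sort": [{"timestamp": {"order": "desc"}}],
        "size": page_size,
        "from": (page - 1) * page_size,  # offset counted in collapsed results
    }


body = build_collapsed_page_request("product_id.keyword", page=2, page_size=5)
print(body["from"], body["size"])  # prints 5 5
```

Deriving both the collapse field and the aggregation field from one argument keeps the count and the grouped results from drifting apart.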
Original: Efficient pagination with collapse and cardinality in Elasticsearch - Elasticsearch Labs