Counting the Top N Elements in a List with Duplicates

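Counting top-N values comes up in many scenarios, and it has also come up in interviews, so this article walks through the ideas and candidate solutions in detail.

In Java, there are several common ways to find the N most frequent elements in a list. Their implementations, advantages and drawbacks are as follows: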

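Approach 1: HashMap counting + full sort

Steps:

Count the frequency of each element with a HashMap.
Convert the counts to a list and sort it by frequency in descending order.
Take the first N elements.

Code: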

public static List<Map.Entry<String, Integer>> topNWithSort(List<String> list, int n) {
    // Count frequencies
    Map<String, Integer> freqMap = new HashMap<>();
    for (String item : list) {
        freqMap.put(item, freqMap.getOrDefault(item, 0) + 1);
    }
    // Convert to a list and sort by frequency, highest first
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(freqMap.entrySet());
    entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
    // Take the first N (fewer if there are fewer distinct elements)
    return entries.subList(0, Math.min(n, entries.size()));
}

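Pros and cons:

Pro: simple to implement; the code is straightforward.
Con: the full sort costs $O(m \log m)$, where $m$ is the number of distinct elements, so it becomes inefficient when $m$ is large.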
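One detail the sort-based version leaves open is the order of elements with equal frequency, which follows HashMap iteration order and is therefore unspecified. The following is a minimal sketch, not part of the original code, that adds a tie-break on the key; the class and method names (`TopNTieBreak`, `topNStable`) are hypothetical, and it uses `Map.merge` for counting as an alternative to `getOrDefault`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopNTieBreak {

    // Sorts by frequency (descending) and, for equal frequencies, by key (ascending),
    // so the output does not depend on HashMap iteration order.
    static List<Map.Entry<String, Integer>> topNStable(List<String> list, int n) {
        Map<String, Integer> freqMap = new HashMap<>();
        for (String item : list) {
            freqMap.merge(item, 1, Integer::sum); // insert 1 or add 1 to the existing count
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freqMap.entrySet());
        entries.sort(
            Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.comparingByKey()));
        // Clamp n to [0, number of distinct elements]
        int k = Math.max(0, Math.min(n, entries.size()));
        // Copy the slice so the result is independent of the backing list
        return new ArrayList<>(entries.subList(0, k));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("b", "a", "c", "a", "b", "d");
        System.out.println(topNStable(words, 3)); // prints [a=2, b=2, c=1]
    }
}
```

The tie-break costs nothing asymptotically and makes the result reproducible across runs, which helps when the output is compared in tests.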

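Approach 2: HashMap counting + min-heap (priority queue)

Steps:

Count frequencies with a HashMap.
Keep a min-heap of size N while iterating over the frequency table, so that the top of the heap is always the smallest frequency among the current candidates.
Output the heap's elements in descending order of frequency.

Code: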

public static List<Map.Entry<String, Integer>> topNWithHeap(List<String> list, int n) {
    // Nothing to return for n <= 0; this also avoids calling peek() on an empty heap below
    if (n <= 0) {
        return new ArrayList<>();
    }
    // Count frequencies
    Map<String, Integer> freqMap = new HashMap<>();
    for (String item : list) {
        freqMap.put(item, freqMap.getOrDefault(item, 0) + 1);
    }
    // Min-heap ordered by frequency (ascending)
    PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(
        (a, b) -> Integer.compare(a.getValue(), b.getValue())
    );
    // Walk the frequency table, keeping at most N entries in the heap
    for (Map.Entry<String, Integer> entry : freqMap.entrySet()) {
        if (heap.size() < n) {
            heap.offer(entry);
        } else if (entry.getValue() > heap.peek().getValue()) {
            // The new entry beats the weakest candidate, so replace it
            heap.poll();
            heap.offer(entry);
        }
    }
    // Copy the heap into a list and sort by frequency, highest first
    List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
    result.sort((a, b) -> b.getValue().compareTo(a.getValue()));
    return result;
}

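Pros and cons:

Pro: time complexity is $O(m \log n)$, which suits large data sets where $n \ll m$.
Con: the heap has to be maintained by hand, so the code is slightly more complex.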
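The heap idea is not tied to strings. As a hedged illustration, here is a generic variant; the names `GenericTopN` and `topN` are hypothetical. It uses a slightly different but equivalent maintenance rule: always offer the entry, then evict the smallest one whenever the heap grows beyond n.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class GenericTopN {

    // Returns up to n entries with the highest counts, most frequent first.
    static <T> List<Map.Entry<T, Integer>> topN(List<T> items, int n) {
        List<Map.Entry<T, Integer>> result = new ArrayList<>();
        if (n <= 0) {
            return result; // nothing requested
        }
        Map<T, Integer> freq = new HashMap<>();
        for (T item : items) {
            freq.merge(item, 1, Integer::sum);
        }
        // Min-heap on the count: the root is the weakest of the current candidates.
        PriorityQueue<Map.Entry<T, Integer>> heap = new PriorityQueue<>(
            Comparator.comparingInt((Map.Entry<T, Integer> e) -> e.getValue()));
        for (Map.Entry<T, Integer> entry : freq.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // evict the smallest so the heap holds at most n entries
            }
        }
        // Polling yields ascending counts; reverse to get descending order.
        while (!heap.isEmpty()) {
            result.add(heap.poll());
        }
        Collections.reverse(result);
        return result;
    }
}
```

Draining the heap with `poll()` and reversing replaces the final sort of the original version; both cost $O(n \log n)$ for the n retained entries.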

Approach 3: Java Stream API

Steps:

Count frequencies with the Stream collectors groupingBy and counting.
Sort by frequency in descending order and take the first N.

Code:

public static List<Map.Entry<String, Long>> topNWithStream(List<String> list, int n) {
    return list.stream()
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
        .entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(n)
        .collect(Collectors.toList());
}

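Pros and cons:

Pro: concise code in a functional style.
Con: the implementation details are hidden, which leaves little control over memory use and performance.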

Complete example code

import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public class TopNFrequency {

    public static void main(String[] args) {
        List<String> list = Arrays.asList("apple", "banana", "apple", "orange", "banana", "apple");
        int n = 2;

        // Method 1: full sort
        System.out.println("HashMap + Sorting: " + topNWithSort(list, n));
        // Method 2: min-heap
        System.out.println("HashMap + Heap: " + topNWithHeap(list, n));
        // Method 3: Stream API
        System.out.println("Stream API: " + topNWithStream(list, n));
    }

    // Method 1: full sort
    public static List<Map.Entry<String, Integer>> topNWithSort(List<String> list, int n) {
        Map<String, Integer> freqMap = new HashMap<>();
        for (String item : list) {
            freqMap.put(item, freqMap.getOrDefault(item, 0) + 1);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freqMap.entrySet());
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        return entries.subList(0, Math.min(n, entries.size()));
    }

    // Method 2: min-heap
    public static List<Map.Entry<String, Integer>> topNWithHeap(List<String> list, int n) {
        // Nothing to return for n <= 0; this also avoids calling peek() on an empty heap
        if (n <= 0) {
            return new ArrayList<>();
        }
        Map<String, Integer> freqMap = new HashMap<>();
        for (String item : list) {
            freqMap.put(item, freqMap.getOrDefault(item, 0) + 1);
        }
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(
            (a, b) -> Integer.compare(a.getValue(), b.getValue())
        );
        for (Map.Entry<String, Integer> entry : freqMap.entrySet()) {
            if (heap.size() < n) {
                heap.offer(entry);
            } else if (entry.getValue() > heap.peek().getValue()) {
                heap.poll();
                heap.offer(entry);
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        return result;
    }

    // Method 3: Stream API
    public static List<Map.Entry<String, Long>> topNWithStream(List<String> list, int n) {
        return list.stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
            .entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(n)
            .collect(Collectors.toList());
    }
}

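Key takeaways

The full sort suits small data sets: the code is simple but less efficient.
The min-heap suits large data sets and has better time complexity.
The Stream API wins on conciseness, but watch out for type conversions (its counts are Long, not Integer) and for performance.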
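The type mismatch arises because `Collectors.counting()` produces `Long` values, while the HashMap-based methods return `Integer` counts. One possible way to align them, sketched here under the assumption that counts fit in an `int`, is to swap in `Collectors.summingInt`; the names `StreamIntCounts` and `topNAsInt` are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class StreamIntCounts {

    // Same pipeline as the Stream approach, but summingInt yields Integer counts,
    // so the return type matches the HashMap-based methods.
    static List<Map.Entry<String, Integer>> topNAsInt(List<String> list, int n) {
        return list.stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.summingInt(s -> 1)))
            .entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(Math.max(0, n)) // limit() rejects negative sizes, so clamp at 0
            .collect(Collectors.toList());
    }
}
```

With this variant all three methods in the complete example can share the return type `List<Map.Entry<String, Integer>>`.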

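Approach 4: Parallel stream

Steps:

Use a parallel stream to speed up counting and sorting.
Rely on ConcurrentHashMap (via Collectors.groupingByConcurrent) for thread safety.

Code: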

public static List<Map.Entry<String, Long>> topNParallelStream(List<String> list, int n) {
    return list.parallelStream()
        .collect(Collectors.groupingByConcurrent(Function.identity(), Collectors.counting()))
        .entrySet().parallelStream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(n)
        .collect(Collectors.toList());
}

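Pros and cons:

Pro: uses multiple cores in parallel, which suits very large data sets.
Con: thread-safety control is complex, and data skew may limit the performance gain.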
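To make the role of ConcurrentHashMap explicit, the counting step can also be written by hand. This is only a sketch with hypothetical names (`ParallelCount`, `countParallel`); it relies on `ConcurrentHashMap.merge` being atomic per key, and it assumes the list contains no null elements, since ConcurrentHashMap rejects null keys.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

class ParallelCount {

    // Counts occurrences from multiple threads into one shared map.
    static Map<String, Long> countParallel(List<String> list) {
        ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();
        // merge() is atomic per key, so concurrent increments are not lost
        list.parallelStream().forEach(item -> counts.merge(item, 1L, Long::sum));
        return counts;
    }

    // Sorts the counted entries sequentially; the map is usually far smaller than the list.
    static List<Map.Entry<String, Long>> topN(List<String> list, int n) {
        return countParallel(list).entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(Math.max(0, n))
            .collect(Collectors.toList());
    }
}
```

When a few keys dominate the input, all threads contend on the same map entries, which is one concrete form of the data-skew problem mentioned above.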

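Approach 5: Bucket sort

Steps:

Count frequencies and record the maximum frequency.
Create frequency buckets: the index is a frequency and the value is the list of elements with that frequency.
Walk the buckets from high to low and collect the first N elements.

Code: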

public static List<Map.Entry<String, Integer>> topNBucketSort(List<String> list, int n) {
    Map<String, Integer> freqMap = new HashMap<>();
    int maxFreq = 0;
    for (String item : list) {
        int freq = freqMap.getOrDefault(item, 0) + 1;
        freqMap.put(item, freq);
        maxFreq = Math.max(maxFreq, freq);
    }
    // Create the buckets (the index is the frequency)
    List<List<String>> buckets = new ArrayList<>(maxFreq + 1);
    for (int i = 0; i <= maxFreq; i++) {
        buckets.add(new ArrayList<>());
    }
    freqMap.forEach((k, v) -> buckets.get(v).add(k));
    // Collect results from the highest frequency down
    List<Map.Entry<String, Integer>> result = new ArrayList<>();
    for (int i = maxFreq; i >= 0 && result.size() < n; i--) {
        for (String item : buckets.get(i)) {
            result.add(new AbstractMap.SimpleEntry<>(item, i));
            if (result.size() == n) break;
        }
    }
    return result;
}

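Pros and cons:

Pro: time complexity is $O(m + k)$, where $k$ is the maximum frequency; this suits cases where frequencies are concentrated in a small range.
Con: the bucket list takes $O(k)$ space on top of the $O(m)$ frequency map, so a very high maximum frequency wastes memory.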
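If the maximum frequency is large but only a few distinct frequencies actually occur, one possible workaround is to key the buckets by frequency in a sorted map instead of a dense list. The sketch below swaps in a `TreeMap` for that purpose; it is a different trade-off rather than the bucket sort above, costing $O(m \log d)$ time for $d$ distinct frequencies. The class name `SparseBuckets` is hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SparseBuckets {

    static List<Map.Entry<String, Integer>> topN(List<String> list, int n) {
        Map<String, Integer> freq = new HashMap<>();
        for (String item : list) {
            freq.merge(item, 1, Integer::sum);
        }
        // Buckets keyed by frequency, highest first; only frequencies that occur get a bucket.
        TreeMap<Integer, List<String>> buckets = new TreeMap<>(Comparator.reverseOrder());
        freq.forEach((k, v) -> buckets.computeIfAbsent(v, f -> new ArrayList<>()).add(k));

        List<Map.Entry<String, Integer>> result = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> bucket : buckets.entrySet()) {
            for (String item : bucket.getValue()) {
                if (result.size() >= n) {
                    return result; // collected enough (also covers n <= 0)
                }
                result.add(new AbstractMap.SimpleEntry<>(item, bucket.getKey()));
            }
        }
        return result;
    }
}
```

Memory then scales with the number of distinct elements and distinct frequencies rather than with the maximum frequency.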

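Approach 6: Quickselect

Steps:

Count frequencies and store the entries in a list.
Use quickselect to find the cut-off point for the N-th highest frequency.
Sort the first N elements.

Code (core part):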

public static List<Map.Entry<String, Integer>> topNQuickSelect(List<String> list, int n) {
    Map<String, Integer> freqMap = new HashMap<>();
    for (String item : list) {
        freqMap.put(item, freqMap.getOrDefault(item, 0) + 1);
    }
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(freqMap.entrySet());
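    // Clamp n so subList cannot run past the end when n exceeds the number of distinct elements
    int k = Math.min(n, entries.size());
    quickSelect(entries, k);
    return entries.subList(0, k).stream()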
        .sorted((a, b) -> b.getValue().compareTo(a.getValue()))
        .collect(Collectors.toList());
}

private static void quickSelect(List<Map.Entry<String, Integer>> list, int n) {
    int left = 0, right = list.size() - 1;
    while (left <= right) {
        int pivotIndex = partition(list, left, right);
        if (pivotIndex == n) break;
        else if (pivotIndex < n) left = pivotIndex + 1;
        else right = pivotIndex - 1;
    }
}

private static int partition(List<Map.Entry<String, Integer>> list, int low, int high) {
    int pivotValue = list.get(high).getValue();
    int i = low;
    for (int j = low; j < high; j++) {
        if (list.get(j).getValue() > pivotValue) {
            Collections.swap(list, i, j);
            i++;
        }
    }
    Collections.swap(list, i, high);
    return i;
}

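Pros and cons:

Pro: average time complexity is $O(m)$, which suits scenarios with very high performance requirements.
Con: complex to implement, with many edge cases to handle; with a fixed pivot (the last element, as in the code above) the worst case is $O(m^2)$.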
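A common mitigation for the fixed-pivot worst case is to pick the pivot at random. The following sketch is an assumption-laden variant rather than the article's method: the names `RandomizedSelect`, `topN` and `select` are hypothetical, and it keeps the same Lomuto-style descending partition, so inputs with many equal counts can still degrade it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

class RandomizedSelect {

    static List<Map.Entry<String, Integer>> topN(List<String> list, int n) {
        Map<String, Integer> freq = new HashMap<>();
        for (String item : list) {
            freq.merge(item, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        int k = Math.max(0, Math.min(n, entries.size()));
        if (k > 0 && k < entries.size()) {
            select(entries, k); // otherwise there is nothing to separate
        }
        List<Map.Entry<String, Integer>> top = new ArrayList<>(entries.subList(0, k));
        top.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return top;
    }

    // Rearranges the list so that the k largest counts occupy indices 0..k-1.
    // Requires 0 < k < list.size().
    private static void select(List<Map.Entry<String, Integer>> list, int k) {
        int left = 0;
        int right = list.size() - 1;
        while (left < right) {
            // Move a randomly chosen pivot to the end of the current range
            int p = ThreadLocalRandom.current().nextInt(left, right + 1);
            Collections.swap(list, p, right);
            int pivotValue = list.get(right).getValue();
            int store = left;
            for (int j = left; j < right; j++) {
                if (list.get(j).getValue() > pivotValue) {
                    Collections.swap(list, store++, j);
                }
            }
            Collections.swap(list, store, right);
            // Now [left, store) > pivot, list[store] is the pivot, (store, right] <= pivot
            if (store == k - 1 || store == k) {
                return; // the first k positions hold the k largest counts
            }
            if (store < k - 1) {
                left = store + 1;
            } else {
                right = store - 1;
            }
        }
    }
}
```

Random pivots make the quadratic case unlikely for any particular input order, although they do not remove it.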

Approach 7: Guava's Multiset (third-party dependency)

Steps:

Count frequencies with Guava's Multiset.
Sort by frequency and take the first N.

Code:

public static List<Multiset.Entry<String>> topNGuava(List<String> list, int n) {
    Multiset<String> multiset = HashMultiset.create(list);
    return multiset.entrySet().stream()
        .sorted((a, b) -> b.getCount() - a.getCount())
        .limit(n)
        .collect(Collectors.toList());
}

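Pros and cons:

Pro: very concise code, relying on Guava's utility classes.
Con: requires a third-party library, so it does not fit JDK-only environments.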

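II. Comparison of the approaches

Approach	Time complexity	Space complexity	Best suited for
Full sort	$O(m \log m)$	$O(m)$	Small data sets; simple code
Min-heap	$O(m \log n)$	$O(n)$ for the heap, plus $O(m)$ for the frequency map	Large data sets with $n \ll m$
Stream API	$O(m \log m)$	$O(m)$	Rapid development; concise code
Parallel stream	$O(m \log m / p)$	$O(m)$	Multi-core environments; very large data sets
Bucket sort	$O(m + k)$	$O(m + k)$	Concentrated frequencies with a known maximum
Quickselect	$O(m)$ (average)	$O(m)$	High performance needs; complex implementation acceptable
Guava Multiset	$O(m \log m)$	$O(m)$	Third-party dependencies allowed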

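III. Summary and recommendations

Small data sets: prefer the Stream API or the full sort for concise code.
Large data sets: choose the min-heap or a parallel stream to balance performance and memory.
Known frequency distribution: try bucket sort to optimise time and space.
Very high performance needs: consider quickselect (you have to handle the implementation complexity yourself).
Third-party libraries allowed: Guava can simplify the code considerably.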
