Java Data Cleaning: Deduplicating a List - EW帮帮网

🚀 A Practical Guide to List Deduplication in Java 8 (Multi-Property Deduplication)

📌 Method 1: Best-Performance Approach (Custom Loop + Key Wrapper)

import java.util.*;
import java.util.function.Function;

public class DistinctUtil {
    
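    // Deduplication helper: pre-sizes its collections and caches each key's hash.
    // Note: primitives placed in the Object[] key are still autoboxed.
    public static <T> List<T> distinctByKeys(List<T> list, Function<T, Object[]> keyExtractor) {
        // Pre-size the set so it does not rehash while the list is processed
        Set<KeyWrapper> seen = new HashSet<>(list.size() * 2);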
        List<T> result = new ArrayList<>(list.size());
        
        for (T obj : list) {
            // Extract the key properties (one array and one wrapper are allocated per element)
            Object[] keys = keyExtractor.apply(obj);
            KeyWrapper keyWrapper = new KeyWrapper(keys);
            
            if (seen.add(keyWrapper)) {  // expected O(1) per insertion
                result.add(obj);
            }
        }
        return result;
    }

    // Key wrapper: gives the key array content-based equals/hashCode
    private static class KeyWrapper {
        private final Object[] keys;
        private final int hash;

        public KeyWrapper(Object[] keys) {
            this.keys = keys;
            this.hash = Arrays.deepHashCode(keys);  // compute the hash once, up front
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (o == null || getClass() != o.getClass()) return false;
            KeyWrapper that = (KeyWrapper) o;
            return Arrays.deepEquals(keys, that.keys);  // compare array contents, not references
        }

        @Override
        public int hashCode() {
            return hash;  // return the precomputed hash
        }
    }
}

Usage example

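// Sample domain object
class Employee {
    private String name;
    private int deptId;
    private String location;

    public Employee(String name, int deptId, String location) {
        this.name = name;
        this.deptId = deptId;
        this.location = location;
    }

    public String getName() { return name; }
    public int getDeptId() { return deptId; }
    public String getLocation() { return location; }
}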

// Sample data
List<Employee> employees = Arrays.asList(
    new Employee("Alice", 101, "New York"),
    new Employee("Bob", 102, "London"),
    new Employee("Alice", 101, "Chicago"), // duplicate (same name + deptId)
    new Employee("Bob", 103, "Tokyo")
);

// Deduplicate by name + deptId
List<Employee> distinctEmp = DistinctUtil.distinctByKeys(
    employees,
    e -> new Object[]{e.getName(), e.getDeptId()} // key properties
);

// Result: records 1, 2 and 4 are kept; only the first "Alice-101" record survives
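
The wrapper is needed because Java arrays inherit identity-based `equals` and `hashCode` from `Object`, so a `HashSet<Object[]>` never treats two arrays with the same contents as duplicates. A minimal sketch of the difference follows; the class `ArrayKeyDemo` and its methods are hypothetical, and a `List` view stands in for `KeyWrapper` as the content-based key.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class, not part of the original utility.
class ArrayKeyDemo {

    // Counts distinct keys when raw arrays are used as set elements.
    static int distinctWithRawArrays(Object[][] keys) {
        Set<Object[]> seen = new HashSet<>();
        for (Object[] key : keys) {
            seen.add(key); // arrays compare by identity, not by content
        }
        return seen.size();
    }

    // Counts distinct keys when each array is wrapped in a List view.
    static int distinctWithListKeys(Object[][] keys) {
        Set<List<Object>> seen = new HashSet<>();
        for (Object[] key : keys) {
            seen.add(Arrays.asList(key)); // List.equals/hashCode compare elements
        }
        return seen.size();
    }

    public static void main(String[] args) {
        Object[][] keys = {
            {"Alice", 101},
            {"Bob", 102},
            {"Alice", 101} // same content as the first key, different array object
        };
        System.out.println(distinctWithRawArrays(keys)); // 3
        System.out.println(distinctWithListKeys(keys));  // 2
    }
}
```

Compared with a plain `List` key, `KeyWrapper` additionally caches the hash and uses `Arrays.deepEquals`, so nested arrays inside the key are compared by content as well.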

📌 Method 2: Stream-Based Approach (Order-Preserving + Concise)

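// Uses only standard-library collectors and keeps the original order.
// Requires java.util.*, java.util.function.Function and java.util.stream.Collectors.
List<Employee> distinctList = employees.stream()
    .collect(Collectors.collectingAndThen(
        Collectors.toMap(
            e -> Arrays.asList(e.getName(), e.getDeptId()), // composite key
            Function.identity(),
            (exist, newObj) -> exist,  // keep the first occurrence
            LinkedHashMap::new          // preserve insertion order
        ),
        map -> new ArrayList<>(map.values())
    ));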
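
The merge function `(exist, newObj) -> exist` is what keeps the first occurrence; without a merge function, `Collectors.toMap` throws `IllegalStateException` when it meets a duplicate key. Below is a self-contained sketch of the same pattern. It assumes a hypothetical `String[]` row of `{name, deptId, location}` in place of `Employee`, and the class name `ToMapDistinctDemo` is illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical demo class; rows are {name, deptId, location} string triples.
class ToMapDistinctDemo {

    // Keeps the first row for each (name, deptId) pair, in encounter order.
    static List<String[]> distinctByNameAndDept(List<String[]> rows) {
        return rows.stream()
            .collect(Collectors.collectingAndThen(
                Collectors.toMap(
                    row -> Arrays.asList(row[0], row[1]), // composite key
                    Function.identity(),
                    (first, later) -> first,              // keep the first occurrence
                    LinkedHashMap::new),                  // preserve insertion order
                map -> new ArrayList<>(map.values())));
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"Alice", "101", "New York"},
            new String[]{"Bob", "102", "London"},
            new String[]{"Alice", "101", "Chicago"},
            new String[]{"Bob", "103", "Tokyo"});
        for (String[] row : distinctByNameAndDept(rows)) {
            System.out.println(String.join(", ", row));
        }
        // Prints the New York, London and Tokyo rows; the Chicago row is dropped.
    }
}
```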

📌 Method 3: Dynamic Filter (Flexible Stream Version)

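// Factory for a deduplicating filter. Call it once per stream: the returned
// predicate remembers every key it has already seen.
// Requires java.util.concurrent.ConcurrentHashMap and java.util.function.Predicate.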
public static <T> Predicate<T> distinctByProperty(Function<T, Object[]> keyExtractor) {
    Map<Object, Boolean> seen = new ConcurrentHashMap<>();
    return t -> {
        Object[] keys = keyExtractor.apply(t);
        List<Object> keyList = Arrays.asList(keys);
        return seen.putIfAbsent(keyList, Boolean.TRUE) == null;
    };
}

// Usage example
List<Employee> result = employees.stream()
    .filter(distinctByProperty(
        e -> new Object[]{e.getName(), e.getLocation()}
    ))
    .collect(Collectors.toList());
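
One caveat with this style: the predicate returned by the factory is stateful, so a single instance should not be shared between streams. The sketch below illustrates this with a simplified single-key variant. `StatefulFilterDemo`, `distinctBy` and `filter` are hypothetical names, and `ConcurrentHashMap.newKeySet()` is used here in place of the `Map<Object, Boolean>` above (note that it rejects `null` keys).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical demo class showing why the predicate must be created per stream.
class StatefulFilterDemo {

    // Simplified single-key variant of the filter; keys must not be null.
    static <T> Predicate<T> distinctBy(Function<? super T, ?> keyExtractor) {
        Set<Object> seen = ConcurrentHashMap.newKeySet();
        return t -> seen.add(keyExtractor.apply(t));
    }

    // Applies the given predicate to a sequential stream over the input.
    static List<String> filter(List<String> input, Predicate<String> predicate) {
        return input.stream().filter(predicate).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Alice", "Bob", "Alice");

        // Fresh predicate per stream: duplicates are removed as expected.
        System.out.println(filter(names, distinctBy(s -> s))); // [Alice, Bob]

        // Shared predicate: the second pass finds every key already recorded.
        Predicate<String> shared = distinctBy(s -> s);
        System.out.println(filter(names, shared)); // [Alice, Bob]
        System.out.println(filter(names, shared)); // []
    }
}
```

The same state also means that on a parallel stream the surviving element for a key is not guaranteed to be the first one in encounter order.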

💡 Advanced Usage: Dynamic Property Combinations

// Combine any number of properties on demand (requires java.lang.reflect.Method)
public static <T> List<T> distinctByProperties(List<T> list, String... propNames) {
    return DistinctUtil.distinctByKeys(list, obj -> {
        try {
            Object[] values = new Object[propNames.length];
            Class<?> clazz = obj.getClass();
            
            for (int i = 0; i < propNames.length; i++) {
                // Read the value through the public getter via reflection (cache the Method in production)
                Method method = clazz.getMethod("get" + capitalize(propNames[i]));
                values[i] = method.invoke(obj);
            }
            return values;
        } catch (Exception e) {
            throw new RuntimeException("Property access failed", e);
        }
    });
}

private static String capitalize(String s) {
    return s.substring(0, 1).toUpperCase() + s.substring(1);
}

// Usage: choose the properties at call time
distinctByProperties(employees, "name", "deptId"); // deduplicate by name and department ID
distinctByProperties(employees, "location");       // deduplicate by location only
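
The reflection code suggests caching `Method` lookups in production. One possible sketch is shown below, assuming a process-wide `ConcurrentHashMap` keyed by class and property name. `GetterCache` and `read` are hypothetical names, not part of the original utility, and the cache holds strong references to classes, which may matter in environments that reload classes.

```java
import java.lang.reflect.Method;
import java.util.Date;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: resolves each public getter once per (class, property).
final class GetterCache {

    private static final Map<Class<?>, Map<String, Method>> CACHE = new ConcurrentHashMap<>();

    private GetterCache() {}

    // Reads a property through its public "getXxx" method.
    static Object read(Object target, String propName) {
        if (propName == null || propName.isEmpty()) {
            throw new IllegalArgumentException("Property name must not be empty");
        }
        Class<?> clazz = target.getClass();
        Method getter = CACHE
            .computeIfAbsent(clazz, c -> new ConcurrentHashMap<>())
            .computeIfAbsent(propName, p -> lookup(clazz, p));
        try {
            return getter.invoke(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read property " + propName, e);
        }
    }

    // Performs the (comparatively slow) reflective lookup; called once per cache miss.
    private static Method lookup(Class<?> clazz, String propName) {
        String getterName = "get" + Character.toUpperCase(propName.charAt(0)) + propName.substring(1);
        try {
            return clazz.getMethod(getterName);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No public getter " + getterName + " on " + clazz.getName(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(new Date(0L), "time")); // 0
        System.out.println(read("abc", "class"));       // class java.lang.String
    }
}
```

Inside the key extractor, `values[i] = GetterCache.read(obj, propNames[i])` would then replace the per-element `getMethod` call.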

⚡ Key Performance Points

Optimizing property handling

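// A primitive stored in an Object[] is always autoboxed; the (Integer) cast below
// only makes that boxing explicit, it does not avoid it
e -> new Object[]{e.getName(), (Integer)e.getDeptId()}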
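
Since any primitive placed in an `Object[]` ends up boxed, one way to keep `deptId` primitive is a small dedicated key class used directly as the `HashSet` element. This is an alternative technique rather than part of `DistinctUtil`; `NameDeptKey` is a hypothetical name, and whether it is measurably faster depends on the workload.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical typed key for (name, deptId); not part of the original utility.
final class NameDeptKey {
    private final String name;
    private final int deptId; // stored as a primitive, so no Integer is allocated

    NameDeptKey(String name, int deptId) {
        this.name = name;
        this.deptId = deptId;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof NameDeptKey)) return false;
        NameDeptKey that = (NameDeptKey) o;
        return deptId == that.deptId && Objects.equals(name, that.name);
    }

    @Override
    public int hashCode() {
        return 31 * Objects.hashCode(name) + deptId; // null-safe for name
    }

    public static void main(String[] args) {
        Set<NameDeptKey> seen = new HashSet<>();
        System.out.println(seen.add(new NameDeptKey("Alice", 101))); // true
        System.out.println(seen.add(new NameDeptKey("Alice", 101))); // false
        System.out.println(seen.add(new NameDeptKey("Alice", 102))); // true
    }
}
```

In a hand-written loop, `seen.add(new NameDeptKey(e.getName(), e.getDeptId()))` would take the place of the `Object[]` plus wrapper pair.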

Null-safe handling

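// Map null values to placeholders. Arrays.deepHashCode/deepEquals already accept null
// elements, so this is only needed when null and the placeholder should count as the same key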
e -> new Object[]{
    Optional.ofNullable(e.getName()).orElse(""),
    Optional.ofNullable(e.getLocation()).orElse("N/A")
}

Chunked processing for large data sets

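// Processing 100,000+ records in batches.
// Caveat: every distinctByKeys call starts with an empty "seen" set, so duplicates that
// span two batches survive the loop; the final pass over the merged list removes them.
List<Employee> results = new ArrayList<>();

int batchSize = 5000;
for (int i = 0; i < employees.size(); i += batchSize) {
    List<Employee> batch = employees.subList(i, Math.min(i+batchSize, employees.size()));
    results.addAll(DistinctUtil.distinctByKeys(batch, ...));
}
results = DistinctUtil.distinctByKeys(results, ...); // same key extractor as in the loop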
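
Because `DistinctUtil.distinctByKeys` creates a fresh `seen` set on every call, batches that are deduplicated separately can still contain duplicates of each other. A sketch of one way to carry the state across batches is shown below. `BatchDeduplicator` and `accept` are hypothetical names, and `List<Object>` keys are used because `KeyWrapper` is private to `DistinctUtil`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper: keeps one "seen" set alive across all batches.
final class BatchDeduplicator<T> {

    private final Function<? super T, Object[]> keyExtractor;
    private final Set<List<Object>> seen = new HashSet<>();

    BatchDeduplicator(Function<? super T, Object[]> keyExtractor) {
        this.keyExtractor = keyExtractor;
    }

    // Returns the items of this batch whose key did not occur earlier,
    // either in this batch or in any previously accepted batch.
    List<T> accept(List<? extends T> batch) {
        List<T> fresh = new ArrayList<>();
        for (T item : batch) {
            if (seen.add(Arrays.asList(keyExtractor.apply(item)))) {
                fresh.add(item);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        BatchDeduplicator<String> dedup = new BatchDeduplicator<String>(s -> new Object[]{s});
        System.out.println(dedup.accept(Arrays.asList("a", "b", "a"))); // [a, b]
        System.out.println(dedup.accept(Arrays.asList("b", "c")));      // [c]
    }
}
```

This variant is not thread-safe; it assumes batches are fed in from a single thread.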

🏆 Choosing an Approach

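Scenario	Recommended approach	Advantage
Large data sets (100,000+ items)	Custom loop + key wrapper	Best performance
Original order must be preserved	Stream-based toMap approach	Concise code + order preserved
Properties combined dynamically in different methods	Dynamic property combination	Maximum flexibility
Filtering inside a stream pipeline	`distinctByProperty` filter	Stream integration + readability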

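Production tip: for very large data sets (1,000,000+ items), prefer the custom loop combined with batching, and adjust the heap size with -Xmx. Where thread safety is required, track the keys with a ConcurrentHashMap instead.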
