Merging small Parquet files

Published: 2024-12-23

Small Parquet files to merge, for example:
./2024-7-26/0049b78b48b65d63-7ec94dbc00000028_383261519_data.0.parq

hadoop jar ./parquet-tools-1.9.0.jar --help
WARNING: Use "yarn jar" to launch YARN applications.
usage: parquet-tools cat [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
    -j,--json      Show records in JSON format.
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools head [option...] <input>
where option is one of:
       --debug          Enable debug output
    -h,--help           Show this help string
    -n,--records <arg>  The number of records to show (default: 5)
       --no-color       Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools schema [option...] <input>
where option is one of:
    -d,--detailed  Show detailed information about the schema.
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file containing the schema to show

usage: parquet-tools meta [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools dump [option...] <input>
where option is one of:
    -c,--column <arg>  Dump only the given column, can be specified more than
                       once
    -d,--disable-data  Do not dump column data
       --debug         Enable debug output
    -h,--help          Show this help string
    -m,--disable-meta  Do not dump row group and page metadata
    -n,--disable-crop  Do not crop the output based on console width
       --no-color      Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools merge [option...] <input> [<input> ...] <output>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the source parquet files/directory to be merged
   <output> is the destination parquet file

View the schema:
hadoop jar ./parquet-tools-1.9.0.jar schema ./0049b78b48b65d63-7ec94dbc00000028_383261519_data.0.parq
message schema {
  optional binary id;
  optional binary sn;
  optional binary mes_sn;
  optional binary line_code;
  optional binary section_code;
  optional binary station_code;
  optional binary station_slot;
  optional binary test_software_version;
  optional binary test_time;
  optional double elapsed_time;
  optional binary test_result;
  optional binary failitem;
  optional binary failitems;
  optional binary bg;
  optional binary bu;
  optional binary project_code;
  optional binary project_name;
}

View the contents (first 10 records):
hadoop jar ./parquet-tools-1.9.0.jar head -n 10 ./0049b78b48b65d63-7ec94dbc00000028_383261519_data.0.parq

Merge the small Parquet files. The source files are not deleted; a new merged file is created:
hadoop jar ./parquet-tools-1.9.0.jar merge ./2024-7-26/ /tmp/all.parquet
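
The merge step above can be wrapped in a small helper so the command can be reviewed before anything is written. This is only a sketch: the function name merge_parquet and the DRY_RUN and PARQUET_TOOLS_JAR variables are hypothetical names introduced here for illustration, and the jar path is assumed to be the one used in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge one source directory (or file) into one destination file.
# PARQUET_TOOLS_JAR defaults to the jar used in this post (an assumption).
PARQUET_TOOLS_JAR="${PARQUET_TOOLS_JAR:-./parquet-tools-1.9.0.jar}"

merge_parquet() {
    local src="$1" dst="$2"
    if [ -z "$src" ] || [ -z "$dst" ]; then
        echo "usage: merge_parquet <input> <output>" >&2
        return 2
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: print the command instead of executing it.
        echo "hadoop jar $PARQUET_TOOLS_JAR merge $src $dst"
        return 0
    fi
    hadoop jar "$PARQUET_TOOLS_JAR" merge "$src" "$dst"
}
```

For example, DRY_RUN=1 merge_parquet ./2024-7-26/ /tmp/all.parquet prints the command without running it. As far as I know, merge places the row groups of the inputs one after another rather than rewriting them, so it may be worth running the meta command on the result to check the row-group sizes.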
Merge result:
hdfs dfs -du -h /tmp/all.parquet
280.6 M 841.7 M /tmp/all.parquet
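
In the output above, the first value is the file size and the second is the disk space consumed across all replicas, which is how hdfs dfs -du reports it in recent Hadoop versions. A rough sketch of the arithmetic, using the sample line from this post (the variable names are illustrative only):

```shell
# Sample line copied from the `hdfs dfs -du -h` output above.
du_line='280.6 M 841.7 M /tmp/all.parquet'

# Fields: $1 size, $2 unit, $3 space consumed by all replicas, $4 unit, $5 path.
# The ratio is only computed when both values use the same unit.
replication=$(echo "$du_line" | awk '$2 == $4 { printf "%.1f", $3 / $1 }')
echo "approximate replication factor: $replication"
```

For the sample line this prints an approximate replication factor of 3.0, which matches the common HDFS default of three replicas.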