目录
1、问题描述
根据uuid、dlb_file_name_key、frame_num分区,根据distance排序。
row_number() over(partition by uuid,dlb_file_name_key,frame_num order by distance) rank
当使用spark默认引擎的时候,会出现分区错误的情况。
完整sql:
set spark.executor.instances=20; set spark.executor.cores=3; set spark.executor.memory=4g; set spark.driver.memory=4g; select *, -- 后面可以根据distance绝对值的大小来判定哪个离trigger_time最近 row_number() over(partition by uuid,dlb_file_name_key,frame_num order by distance) rank from ( select t.*, CASE WHEN (cast(t.delta as double) / 50) >= 0 THEN ceil(cast(t.delta as double) / 50) ELSE floor(cast(t.delta as double) / 50) END AS frame_num from ( select vehicle_id, vehicle_uuid, uuid, -- 规避同一个uuid 出现多个baseline rule_name max(baseline) over(partition by uuid) as baseline, max(rule_name) over(partition by uuid) as rule_name, dlb_file_name_key, trigger_time, cast(trigger_time as bigint)/1000000 AS trigger_time_ms, date_format(from_utc_timestamp(bigint(trigger_time/1000000),'GMT+8'),'yyyy-MM-dd HH:mm:ss.SSS') as trigger_time_ms_format, cast(utc_timestamp as bigint)/1000000 AS utc_timestamp, (cast(substr(utc_timestamp,1,13) as bigint) - cast(substr(trigger_time,1,13) as bigint)) AS delta, -- 差值 cast((cast(utc_timestamp as bigint) - cast(trigger_time as bigint)) as bigint) AS distance FROM ad_d_ods.ex_s_adw_production_collection_datafiles_raw_delta_parquet_hh WHERE pt between '2022090600' and '2022090623' and rule_name = 'aeb_act_event' and uuid = '1b9c213a-f299-4522-9072-cb9046707dab' and dlb_file_name_key in ( 'np-debug-dgb_aebstrategy' ) ) t ) t1 order by frame_num,distance |
2、Spark引擎
期望结果:
但出现了下面的情况:
分区错误。
3、MR引擎
相同的代码切换成mr引擎时,又可以了。
set hive.execution.engine=mr;
4、直接在presto里运行
完全相同的语句,直接在presto里面运行也是可以的
5、将中中间过程落hive表,再row_number
现将row_number之前的结果落到hive表中,然后再重新读取,再使用row_number分区排序又可以。
6、原因排查
涉及hive3.1.2原生的bug,在做row_number(partition by 字段A)排序之前,如发生cast( 字段A as 类型B) as 字段A情况,那分区排序结果是有问题的。而使用hive on mr方式是无问题的。