Verifying that a Python environment specified via PySpark submit parameters takes effect

Published: 2025-08-01

1. Background

We need to let users run their own Python environments on our built-in, workflow-based submission platform.
2. The submit command executed by default from our platform page is shown below. (The #python3.6 fragment after the archive URI makes YARN unpack the archive under the alias python3.6 in each container's working directory, which is why the interpreter paths are relative, e.g. ./python3.6/python3.6/bin/python.)

/opt/apps/ali/spark-3.5.2-bin-hadoop3-scala2.13/bin/spark-submit \
  --master yarn --deploy-mode cluster --name print.py_6 \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./python3.6/python3.6/bin/python \
  --archives hdfs:///ali/ai/python3.6.zip#python3.6 \
  --conf spark.executorEnv.PYSPARK_DRIVER_PYTHON=./python3.6/python3.6/bin/python \
  --executor-cores 2 --executor-memory 8g \
  file:/opt/apps/ali/print.py
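
For later context: a platform like ours typically assembles this command as an argument list and launches it directly, with no shell in between. Below is a minimal hypothetical sketch (build_submit_command and the subprocess call are illustrative assumptions, not our platform's actual code); the point it demonstrates is that argument values are handed to spark-submit verbatim, so nothing ever strips quote characters from them.

# Hypothetical launcher sketch: spark-submit is started from an argv list,
# so conf values are delivered exactly as given (no shell quote removal).
import subprocess

SPARK_SUBMIT = "/opt/apps/ali/spark-3.5.2-bin-hadoop3-scala2.13/bin/spark-submit"

def build_submit_command(app_file, confs, archives):
    cmd = [SPARK_SUBMIT, "--master", "yarn", "--deploy-mode", "cluster"]
    for key, value in confs.items():
        cmd += ["--conf", f"{key}={value}"]  # value used verbatim, quotes included
    for archive in archives:
        cmd += ["--archives", archive]
    cmd.append(app_file)
    return cmd

cmd = build_submit_command(
    "file:/opt/apps/ali/print.py",
    {"spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./python3.6/python3.6/bin/python"},
    ["hdfs:///ali/ai/python3.6.zip#python3.6"],
)
subprocess.run(cmd, check=True)  # no shell=True: argv goes straight to the OS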

3. Parameters added by the user at submission time

spark.yarn.dist.archives="hdfs://ali/testpysaprk/dns_fenxi.tar.gz#pyenv";spark.executorEnv.PYTHONPATH=pyenv/lib/python3.10/site-packages; spark.pyspark.python=pyenv/python3.10/bin/python3.10

By default, our platform appends these user-configured parameters to the submit command, producing:

/opt/apps/ali/spark-3.5.2-bin-hadoop3-scala2.13/bin/spark-submit \
  --master yarn --deploy-mode cluster --name print.py_6 \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./python3.6/python3.6/bin/python \
  --archives hdfs:///ali/ai/python3.6.zip#python3.6 \
  --conf spark.executorEnv.PYSPARK_DRIVER_PYTHON=./python3.6/python3.6/bin/python \
  --executor-cores 2 --executor-memory 8g \
  file:/opt/apps/ali/print.py \
  spark.yarn.dist.archives="hdfs://ali/testpysaprk/dns_fenxi.tar.gz#pyenv";spark.executorEnv.PYTHONPATH=pyenv/lib/python3.10/site-packages; spark.pyspark.python=pyenv/python3.10/bin/python3.10
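
How the appending step might work, as a sketch (an assumption about the platform internals, not its real code): split the user string on ';', treat each piece as key=value, and turn it into a --conf flag. Values pass through unchanged, which is how the user's quotes survive into the Spark conf:

def parse_user_confs(raw):
    # Split "k1=v1;k2=v2" into spark-submit --conf arguments.
    # Note: values are NOT unquoted, so user-typed " characters survive.
    args = []
    for item in raw.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        args += ["--conf", f"{key}={value}"]
    return args

raw = ('spark.yarn.dist.archives="hdfs://ali/testpysaprk/dns_fenxi.tar.gz#pyenv";'
       'spark.executorEnv.PYTHONPATH=pyenv/lib/python3.10/site-packages;'
       'spark.pyspark.python=pyenv/python3.10/bin/python3.10')
print(parse_user_confs(raw))
# ['--conf', 'spark.yarn.dist.archives="hdfs://..."', ...] -- quotes still there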

The job then fails with the following error:

submit-spark: Exception in thread "main" java.io.FileNotFoundException: File file:/apps/"/opt/apps/dns_fenxi.tar.gz#pyenv" does not exist

4. Problem found: the double quotes

Because the platform builds the command as an argument list without going through a shell, the quotes are never stripped and end up as literal characters in the archive path, as the error above shows. We changed the submitted parameters to drop the quotes:

spark.yarn.dist.archives=hdfs://everdc/mzqtestpysaprk/dns_det.tar.gz#pyenv;spark.executorEnv.PYTHONPATH=./pyenv/dns_det/bin/python3.10/site-packages; spark.pyspark.python=./pyenv/dns_det/bin/python3.10 

The submission succeeded and the job ran normally. The full command executed was:

/opt/apps/spark_ali/bin/spark-submit \
  --master yarn --deploy-mode cluster --name testprint.py_237 \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --principal hdfs/ali14@ali.COM --keytab /opt/apps/ali_cluster_file/tickets/215/keytab \
  --conf spark.pyspark.python=./pyenv/dns_det/bin/python3.10 \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./python3.6/python3.6/bin/python \
  --conf spark.executorEnv.PYTHONPATH=./pyenv/dns_det/bin/python3.10/site-packages \
  --archives /opt/apps/python3.6.zip#python3.6 \
  --driver-memory 8g --conf spark.default.parallelism=10 --num-executors 1 \
  --conf spark.yarn.dist.archives="hdfs://ali/mzqtestpysaprk/dns_det.tar.gz#pyenv" \
  --conf spark.executorEnv.PYSPARK_DRIVER_PYTHON=./python3.6/python3.6/bin/python \
  --executor-cores 2 --executor-memory 8g \
  --queue root.default \
  file:/opt/apps/resource/testprint.py

5. Priority among Spark parameters that specify the Python environment
Our submit command contained both the platform's built-in Python 3.6 environment and the user-supplied Python 3.10 environment; a test script showed that the user's environment was the one in effect. Comparing the settings, spark.pyspark.python has the highest priority. This matches Spark's documented behavior: spark.pyspark.python takes precedence over the PYSPARK_PYTHON environment variable (and spark.pyspark.driver.python over PYSPARK_DRIVER_PYTHON for the driver).
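
For reference, a verification script of the kind mentioned above could look like the following sketch (verify_env.py is an illustrative name; this is not the exact script we ran). It prints the interpreter path used by the driver and by the executors, so you can see which configured Python actually won:

# verify_env.py -- illustrative: report the Python interpreter actually used
# by the driver and by executor tasks.
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("verify_python_env").getOrCreate()
sc = spark.sparkContext

# Interpreter used by the driver process
print("driver python :", sys.executable)

def report(_partition):
    # Runs inside an executor; yields that executor's interpreter path.
    import sys
    yield sys.executable

for exe in sorted(set(sc.parallelize(range(8), 4).mapPartitions(report).collect())):
    print("executor python:", exe)

spark.stop()

In our case the printed paths pointed into the user's ./pyenv/... environment rather than ./python3.6/..., matching the conclusion above.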