Setting Up Spark on K8s

2024-03-20

Versions

Kubernetes: v1.22.2

Spark: spark-3.3.2-bin-hadoop3-scala2.13

Basic environment setup

Download the Spark package

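# If dlcdn.apache.org no longer hosts 3.3.2, the same file is kept under
# https://archive.apache.org/dist/spark/spark-3.3.2/
wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3-scala2.13.tgz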
tar -zxvf spark-3.3.2-bin-hadoop3-scala2.13.tgz

cd spark-3.3.2-bin-hadoop3-scala2.13

Build the image

Spark ships with a Dockerfile in the kubernetes/dockerfiles/spark directory.

tree kubernetes/dockerfiles/spark

kubernetes/dockerfiles/spark
├── bindings
│   ├── python
│   │   └── Dockerfile
│   └── R
│       └── Dockerfile
├── decom.sh
├── Dockerfile
├── Dockerfile.java17
└── entrypoint.sh

3 directories, 6 files

Spark also ships with a script for building and pushing images: ./bin/docker-image-tool.sh. Run it with --help to see the full usage.

# Note: run the script from the Spark home directory
# Example
./bin/docker-image-tool.sh -r docker.io/myrepo -t v2.3.0 build
./bin/docker-image-tool.sh -r docker.io/myrepo -t v2.3.0 push
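
As a minimal sketch of how the two flags fit together: docker-image-tool.sh combines the -r repository and the -t tag into an image named <repo>/spark:<tag>, and that reference is what gets passed to spark-submit as spark.kubernetes.container.image. The registry and tag below are placeholders, and build_and_push is a hypothetical wrapper, not something Spark provides.

```shell
# Placeholder values (assumptions): use your own registry and tag.
REPO="docker.io/myrepo"
TAG="3.3.2-hadoop3"

# The JVM image built by docker-image-tool.sh is named <repo>/spark:<tag>.
IMAGE="${REPO}/spark:${TAG}"
echo "Image reference: ${IMAGE}"

# Hypothetical wrapper: build first, push only if the build succeeded.
# Must be called from the Spark home directory.
build_and_push() {
  ./bin/docker-image-tool.sh -r "${REPO}" -t "${TAG}" build &&
    ./bin/docker-image-tool.sh -r "${REPO}" -t "${TAG}" push
}
```

Calling build_and_push from the Spark home directory runs both steps; nothing is built until the function is invoked.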

Create the namespace and RBAC

apiVersion: v1
kind: Namespace
metadata:
  name: spark
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-sa
  namespace: spark
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: spark-role
rules:
- apiGroups:
  - ""
  resources:
  - "pods"
  - "services"
  - "events"
  - "configmaps"
  - "endpoints"
  - "secrets"
  verbs:
  - "*"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-role-binding
subjects:
- kind: ServiceAccount
  name: spark-sa
  namespace: spark
roleRef:
  kind: ClusterRole
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
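
One way to apply the manifest and confirm that the binding took effect is sketched below. It assumes the YAML above was saved as spark-rbac.yaml (a hypothetical file name) and that kubectl points at the target cluster. kubectl auth can-i impersonates the service account and answers yes or no; the helper function names are illustrative only.

```shell
NAMESPACE="spark"
SERVICE_ACCOUNT="spark-sa"

# Kubernetes identifies a service account as
# system:serviceaccount:<namespace>:<name>.
sa_subject() {
  echo "system:serviceaccount:$1:$2"
}

# Apply the manifest, then ask the API server whether the service account
# may create the resources Spark needs for driver and executor pods.
apply_and_check_rbac() {
  kubectl apply -f spark-rbac.yaml || return 1
  subject="$(sa_subject "${NAMESPACE}" "${SERVICE_ACCOUNT}")"
  for resource in pods services configmaps; do
    printf '%s: ' "${resource}"
    kubectl auth can-i create "${resource}" --as="${subject}" -n "${NAMESPACE}"
  done
}
```

Running apply_and_check_rbac should print yes for each resource if the ClusterRoleBinding is in place.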

Submit a Spark job

Cluster mode

Spark ships with an example application that estimates pi. It is a simple way to verify that the setup works.

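Note: local:///opt/spark/examples/jars/spark-examples_2.13-3.3.2.jar is a path inside the container's filesystem, not on the machine that runs spark-submit. Instead of local://, the JAR can also come from HTTP, HDFS, and similar sources. A path with no scheme is not treated as local://: Spark resolves it as a file on the submitting machine, which in cluster mode has to be uploaded to storage the pods can reach (see spark.kubernetes.file.upload.path).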
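
To make the container-path point concrete, here is a small sketch that strips the local:// scheme and lists the resulting path inside the image with docker run. IMAGE is a placeholder for whatever image you built, and both helper functions are hypothetical.

```shell
# Placeholder (assumption): the image passed as spark.kubernetes.container.image.
IMAGE="docker.io/myrepo/spark:3.3.2-hadoop3"
JAR_URI="local:///opt/spark/examples/jars/spark-examples_2.13-3.3.2.jar"

# Strip the local:// scheme to get the path as seen inside the container.
container_path() {
  echo "${1#local://}"
}

# List the JAR inside the image; ls exits non-zero if the file is missing.
check_jar_in_image() {
  docker run --rm --entrypoint ls "${IMAGE}" -l "$(container_path "${JAR_URI}")"
}
```

If check_jar_in_image fails, the JAR name in the spark-submit command probably does not match the Scala or Spark version baked into the image.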

/opt/spark/bin/spark-submit \
 --name SparkPi \
 --verbose \
 --master k8s://https://master.k8s.io:6443 \
 --deploy-mode cluster \
 --conf spark.network.timeout=300 \
 --conf spark.executor.instances=3 \
 --conf spark.driver.cores=1 \
 --conf spark.executor.cores=1 \
 --conf spark.driver.memory=1024m \
 --conf spark.executor.memory=1024m \
 --conf spark.kubernetes.namespace=spark \
 --conf spark.kubernetes.container.image.pullPolicy=Always \
 --conf spark.kubernetes.container.image.pullSecrets=registry-pull-secret \
 --conf spark.kubernetes.container.image=swr.cn-north-4.myhuaweicloud.com/cotte-internal/spark:3.3.2-hadoop3 \
 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
 --conf spark.kubernetes.authenticate.executor.serviceAccountName=spark-sa \
 --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
 --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
 --class org.apache.spark.examples.SparkPi \
 local:///opt/spark/examples/jars/spark-examples_2.13-3.3.2.jar






Parameter reference

/opt/spark/bin/spark-submit: full path to the submit script.
--name SparkPi: name of the Spark application.
--verbose: prints more detail, including the resolved Spark configuration.
--master k8s://https://master.k8s.io:6443: the master URL. The k8s:// prefix tells Spark to submit to the Kubernetes API server at that address.
--deploy-mode cluster: runs the driver inside the cluster, in its own pod, rather than on the machine that runs spark-submit.
--conf spark.network.timeout=300: sets the network timeout for the application.
--conf spark.executor.instances=3: number of executor instances to request.
--conf spark.driver.cores=1: number of CPU cores for the driver.
--conf spark.executor.cores=1: number of CPU cores for each executor.
--conf spark.driver.memory=1024m: amount of memory for the driver.
--conf spark.executor.memory=1024m: amount of memory for each executor.
--conf spark.kubernetes.namespace=spark: namespace in which the driver and executor pods are created.
--conf spark.kubernetes.container.image.pullPolicy=Always: always pull the container image when a pod starts.
--conf spark.kubernetes.container.image.pullSecrets=registry-pull-secret: name of the Kubernetes secret used to pull the image; it must exist in the job's namespace.
--conf spark.kubernetes.container.image=swr.cn-north-4.myhuaweicloud.com/cotte-internal/spark:3.3.2-hadoop3: container image used for the driver and executor pods.
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa: Kubernetes ServiceAccount the driver pod runs as.
--conf spark.kubernetes.authenticate.executor.serviceAccountName=spark-sa: Kubernetes ServiceAccount the executor pods run as.
--conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true": extra JVM options for the driver; here it sets the Netty system property io.netty.tryReflectionSetAccessible.
--conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true": the same extra JVM option for the executors.
--class org.apache.spark.examples.SparkPi: main class to run.
local:///opt/spark/examples/jars/spark-examples_2.13-3.3.2.jar: the JAR that contains the application code. The local:// scheme means it is read from the container's filesystem.
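
After submission, the result can be read from the driver pod's log. The sketch below assumes kubectl points at the same cluster. It relies on the spark-role=driver label that Spark sets on driver pods and on the "Pi is roughly" line that SparkPi prints. The function names are made up for illustration.

```shell
NAMESPACE="spark"

# Keep only the result line that SparkPi writes to stdout.
extract_pi() {
  grep -o 'Pi is roughly [0-9.]*'
}

# Find the most recently created driver pod and print the result from its log.
show_pi_result() {
  driver_pod="$(kubectl get pods -n "${NAMESPACE}" -l spark-role=driver \
    --sort-by=.metadata.creationTimestamp -o name | tail -n 1)"
  [ -n "${driver_pod}" ] || { echo "no driver pod found" >&2; return 1; }
  kubectl logs -n "${NAMESPACE}" "${driver_pod}" | extract_pi
}
```

Calling show_pi_result once the driver pod has completed should print a single line with the estimate; if nothing is printed, the full kubectl logs output is the next place to look.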
