Spark initial job has not accepted any resources

I'm having a hard time getting my program to run on my Spark cluster. I set the cluster up with 1 master and 4 slaves. I started the master and, after that, the slaves, and they show up in the master's web UI.

Then I launch a small Python script to check whether jobs can be executed:

from pyspark import * #SparkContext, SparkConf, spark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import SQLContext

from files import files

import sys


if __name__ =="__main__":

    appName = 'SparkExample'
    masterUrl = 'spark://10.0.2.55:7077'

    conf = SparkConf()
    conf.setAppName(appName)
    conf.setMaster(masterUrl)
    conf.set("spark.driver.cores","1")
    conf.set("spark.driver.memory","1g")
    conf.set("spark.executor.cores","1")
    conf.set("spark.executor.memory","4g")
    conf.set("spark.python.worker.memory","256m")
   
    conf.set("spark.cores.max","4")
   
    conf.set("spark.shuffle.service.enabled","true")
    conf.set("spark.dynamicAllocation.enabled","true")
    conf.set("spark.dynamicAllocation.maxExecutors","1")
   
   
    for k,v in conf.getAll():
        print(k+":"+v)
   
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    #spark = SparkSession.builder.master(masterUrl).appName(appName).config("spark.executor.memory","1g").getOrCreate()
   
    l = [('Alice', 1)]
    spark.createDataFrame(l).collect()
    spark.createDataFrame(l, ['name', 'age']).collect()


    print("#############")
    print("Test finished")
    print("#############")

But as soon as something has to be returned (line 45: "spark.createDataFrame(l).collect()"), Spark seems to hang. After a while I see the message:

"WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources"

So I checked the cluster UI:

worker-20171027105227-xx.x.x.x6-35309   10.0.2.56:35309 ALIVE   4 (0 Used)  6.8 GB (0.0 B Used)
worker-20171027110202-xx.x.x.x0-43433   10.0.2.10:43433 ALIVE   16 (1 Used) 30.4 GB (4.0 GB Used)
worker-20171027110746-xx.x.x.x5-45126   10.0.2.65:45126 ALIVE   8 (0 Used)  30.4 GB (0.0 B Used)
worker-20171027110939-xx.x.x.x4-42477   10.0.2.64:42477 ALIVE   16 (0 Used) 30.4 GB (0.0 B Used)

It looks like there are plenty of resources available for the small task I created. I can also see that the task is actually running there. When I click on it, I see that it was launched on 5 executors, all but one of which exited. When I open the log of one of the exited executors, I see the following error message:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/10/27 16:45:23 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 14443@CODA
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for TERM
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for HUP
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for INT
17/10/27 16:45:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/27 16:45:24 INFO SecurityManager: Changing view acls to: root,geissler
17/10/27 16:45:24 INFO SecurityManager: Changing modify acls to: root,geissler
17/10/27 16:45:24 INFO SecurityManager: Changing view acls groups to:
17/10/27 16:45:24 INFO SecurityManager: Changing modify acls groups to:
17/10/27 16:45:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root, geissler); groups with view permissions: Set(); users  with modify permissions: Set(root, geissler); groups with modify permissions: Set()
17/10/27 16:47:25 ERROR RpcOutboxMessage: Ask timeout before connecting successfully
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
    at scala.util.Try$.apply(Try.scala:192)
    at scala.util.Failure.recover(Try.scala:216)
    at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
    at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
    at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at scala.concurrent.Promise$class.complete(Promise.scala:55)
    at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
    at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
    at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
    at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
    ... 8 more

It looks as if the slaves cannot deliver their results back to the master, but I don't know what to do about it. The slaves are on the same network layer as the master, but on different virtual machines (not Docker containers). Is there a way to check whether they can or cannot reach the master? Are there any configuration settings I overlooked when setting up the cluster?
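One way to narrow this down: the executor log above fails with "Ask timeout before connecting successfully", i.e. the executor times out while connecting back to the driver, so a raw TCP reachability test from a worker VM to the machine running the Python script is a useful first check. A minimal sketch, assuming the driver runs on 10.0.2.55 and spark.driver.port has been pinned to a fixed value such as 40000 (hypothetical; by default the port is random) via conf.set("spark.driver.port", "40000"):

import socket

def can_connect(host, port, timeout=5):
    """Return True if a plain TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print("Connection to {}:{} failed: {}".format(host, port, exc))
        return False

# Run this on one of the worker VMs; host and port are placeholders for the
# driver machine and whatever value spark.driver.port was pinned to.
print(can_connect("10.0.2.55", 40000))

If this prints False, the problem is plain network reachability (firewall, routing) rather than Spark configuration.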

Spark version: 2.1.2 (on the master, the nodes, and PySpark)


The mistake here was that the Python script was executed locally as a plain program. Always launch your Spark scripts through spark-submit; never run them as an ordinary Python program. The same applies to Java Spark programs.
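For illustration, assuming the script above is saved as spark_example.py (the file name and the resource flags below are placeholders, not part of the original answer), launching it through spark-submit against the standalone cluster would look roughly like this:

spark-submit \
  --master spark://10.0.2.55:7077 \
  --driver-memory 1g \
  --executor-memory 4g \
  --total-executor-cores 4 \
  spark_example.py

The --master flag replaces conf.setMaster in the script, so the same code can be pointed at a different cluster without changes.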