Spark initial job has not accepted any resources
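I am having trouble getting my program to run on my Spark cluster. I set the cluster up with 1 master and 4 workers. I started the master, then the workers, and they all show up in the master's web UI.
Then I launch a small Python script to check whether jobs can be executed: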
```python
from pyspark import *  # SparkContext, SparkConf, spark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import SQLContext
from files import files
import sys

if __name__ == "__main__":

    appName = 'SparkExample'
    masterUrl = 'spark://10.0.2.55:7077'

    conf = SparkConf()
    conf.setAppName(appName)
    conf.setMaster(masterUrl)
    conf.set("spark.driver.cores", "1")
    conf.set("spark.driver.memory", "1g")
    conf.set("spark.executor.cores", "1")
    conf.set("spark.executor.memory", "4g")
    conf.set("spark.python.worker.memory", "256m")
    conf.set("spark.cores.max", "4")
    conf.set("spark.shuffle.service.enabled", "true")
    conf.set("spark.dynamicAllocation.enabled", "true")
    conf.set("spark.dynamicAllocation.maxExecutors", "1")

    for k, v in conf.getAll():
        print(k + ":" + v)

    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    # spark = SparkSession.builder.master(masterUrl).appName(appName).config("spark.executor.memory","1g").getOrCreate()

    l = [('Alice', 1)]
    spark.createDataFrame(l).collect()
    spark.createDataFrame(l, ['name', 'age']).collect()

    print("#############")
    print("Test finished")
    print("#############")
```
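However, as soon as I try to get something back (the first call that returns results, `spark.createDataFrame(l).collect()`), Spark seems to hang. After a while I see the message:
"WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources"
So I checked the cluster UI: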
```text
Worker Id                               Address          State  Cores         Memory
worker-20171027105227-xx.x.x.x6-35309   10.0.2.56:35309  ALIVE  4 (0 Used)    6.8 GB (0.0 B Used)
worker-20171027110202-xx.x.x.x0-43433   10.0.2.10:43433  ALIVE  16 (1 Used)   30.4 GB (4.0 GB Used)
worker-20171027110746-xx.x.x.x5-45126   10.0.2.65:45126  ALIVE  8 (0 Used)    30.4 GB (0.0 B Used)
worker-20171027110939-xx.x.x.x4-42477   10.0.2.64:42477  ALIVE  16 (0 Used)   30.4 GB (0.0 B Used)
```
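It looks like there are plenty of resources available for the small task I created. I can also see the application actually running there. When I click on it, I see that it was launched on 5 executors, all but one of which have exited. When I open the log of one of the exited executors, I see the following error message: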
```text
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/10/27 16:45:23 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 14443@CODA
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for TERM
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for HUP
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for INT
17/10/27 16:45:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/27 16:45:24 INFO SecurityManager: Changing view acls to: root,geissler
17/10/27 16:45:24 INFO SecurityManager: Changing modify acls to: root,geissler
17/10/27 16:45:24 INFO SecurityManager: Changing view acls groups to:
17/10/27 16:45:24 INFO SecurityManager: Changing modify acls groups to:
17/10/27 16:45:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, geissler); groups with view permissions: Set(); users with modify permissions: Set(root, geissler); groups with modify permissions: Set()
17/10/27 16:47:25 ERROR RpcOutboxMessage: Ask timeout before connecting successfully
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
	at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
	at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
	at scala.util.Try$.apply(Try.scala:192)
	at scala.util.Failure.recover(Try.scala:216)
	at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
	at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
	at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
	at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
	at scala.concurrent.Promise$class.complete(Promise.scala:55)
	at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
	at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
	at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
	at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
	at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
	at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
	at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
	at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
	at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
	at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
	at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
	at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
	... 8 more
```
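It looks to me as if the workers cannot deliver their results back to the master, but I do not know what to do at this point. The workers are on the same network level as the master, but on different virtual machines (not Docker containers). Is there a way to check whether they can or cannot reach the master? Are there any configuration settings I overlooked when setting up the cluster?
Spark version: 2.1.2 (on the master, the nodes, and pyspark)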
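On the reachability question: a quick first check is a plain TCP connect test, run from each worker VM against the master address. Below is a minimal sketch using only the standard library. The helper name `can_connect` is made up for illustration, and the host and port are simply the master URL from the script above. Executors also have to connect back to the driver, so the same test may be worth repeating against the driver machine, assuming its port has been fixed through `spark.driver.port`.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, unreachable hosts, DNS failures and timeouts
        return False


if __name__ == "__main__":
    # Master address taken from the question; replace with your own values
    targets = [("10.0.2.55", 7077)]
    for host, port in targets:
        status = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{host}:{port} {status}")
```

A successful connect only shows that the port is open from that machine. It does not prove that Spark's RPC handshake works, but a failure here usually points to a firewall or routing problem.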
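The error here was that the Python script was executed locally. Always launch your Spark scripts via spark-submit; never just run them as ordinary programs. The same goes for Java Spark programs.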
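To make the spark-submit advice concrete, here is a small sketch that assembles the command line for submitting the script rather than running it with plain `python`. The helper `build_spark_submit_command` and the file name `spark_example.py` are hypothetical. The `--master` and `--conf key=value` options are standard spark-submit flags, and the settings are the ones from the question.

```python
import shlex


def build_spark_submit_command(script, master_url, conf=None):
    """Build the argv list for submitting `script` to the given master."""
    cmd = ["spark-submit", "--master", master_url]
    # Each Spark property is passed as its own --conf key=value pair
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    cmd = build_spark_submit_command(
        "spark_example.py",          # hypothetical file name for the script above
        "spark://10.0.2.55:7077",    # master URL from the question
        {"spark.executor.memory": "4g", "spark.cores.max": "4"},
    )
    # Print a shell-safe command line that can be pasted into a terminal
    print(" ".join(shlex.quote(part) for part in cmd))
```

When the master and resource settings are passed on the command line like this, the script itself may not need to set them on `SparkConf`, though that depends on how you prefer to manage configuration.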