AWS EMR 上 Spark 任务 Exit status: -100 Container released on a *lost* node 错误

一、问题描述

近期,使用 AWS EMR 集群上跑 Spark 任务时常出现 Exit status: -100. Diagnostics: Container released on a lost node 这样的报错信息,导致任务运行失败

报错日志如下:

ERROR cluster.YarnClusterScheduler: Lost executor 6 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a lost node

二、原因分析

大体日志情况和 Exit code 137 错误 基本一致,找不到任何原因,Container 就被释放了

后来咨询了 AWS 的技术人员,找到了原因,我们使用的时 AWS EMR 的 spot instance。这种实列会在资源池不够的情况下,强制释放掉

三、解决方案

要避免由于实列被强制释放导致 Container 进程丢失的概率,在数仓开发侧能做的是,尽量避免有耗时较长且占用节点较多的任务

大体方案就两个:

1,Executor 尽量不要太多,因为过多的 Executor 就会增加被分配倒 spot instance 节点上的盖率

2,提升任务的运行速度,确保分配倒 spot instance 上的任务能够较快的运行完

参考链接:Exit status: -100


后续跟进这个问题,发现导致该错误的原因还有一个:

AWS EMR 当某个节点磁盘利用率超过 90%(该参数可以配置) 时会触发 Auto Scaling 机制,该磁盘将被视为运行不正常。YARN ResourceManager 会逐步停用该节点

对应节点的日中中会有如下 log:

local-dirs usable space is below configured utilization percentage/no more usable space [ /mnt/yarn : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /var/log/hadoop-yarn/containers : used space above threshold of 90.0% ]

解决该问题需要更改集群相关配置

参考链接:Exit status: -100 AWS EMR 官方解释

已标记关键词 清除标记
在执行spark的exsample里sparkpi的时候报错了 ----------------------------------------------------- 17/11/08 17:13:14 WARN AppClient$ClientEndpoint: Failed to connect to master namenode:7077 java.lang.RuntimeException: java.io.InvalidClassException: org.apache.spark.rpc.RpcEndpointRef; local class incompatible: stream classdesc serialVersionUID = 18257903091306170, local class serialVersionUID = 5017373498943810947 at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:621) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109) at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1$$anonfun$apply$1.apply(NettyRpcEnv.scala:261) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:313) at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:260) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:259) at org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:590) at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:572) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:154) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:186) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:106) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745)
©️2020 CSDN 皮肤主题: 编程工作室 设计师:CSDN官方博客 返回首页