[v313] Cache eviction issue with multiple requests and unequal page/requested size #18674

Open
CharlesBuell opened this issue Aug 23, 2024 · 6 comments
Labels
type-bug This issue is about a bug

Comments

CharlesBuell commented Aug 23, 2024

Alluxio Version

v313

Describe the bug

There are two problems: (1) multiple write requests can try to evict the same page from the cache, but only one of them can succeed; (2) the size of the evicted page is not checked against the requested size, so eviction may not free enough space for the write request.

The workload is a TPC benchmark over big data on Hadoop, using Alluxio as a caching layer.

When multiple queries arrive at the same time during the throughput stage of the benchmark, some of them end up failing due to the cache eviction issues described above.
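
To illustrate what I mean, here is a rough sketch of the kind of eviction loop that would address both problems. The PageStore and Evictor interfaces are hypothetical stand-ins, not the actual LocalCacheManager/DefaultPageMetaStore code: the loop keeps evicting until enough bytes are free for the requested write, and a victim page that a concurrent request already evicted counts as zero bytes freed and is simply skipped instead of failing the put.

import java.util.Optional;

// Sketch only: hypothetical interfaces, not the real Alluxio classes.
final class SizeAwareEviction {
  interface PageStore {
    long usedBytes();
    long capacityBytes();
    // Removes the page if it is still present; returns the bytes freed (0 if already gone).
    long removeIfPresent(String pageId);
  }

  interface Evictor {
    // Picks the next eviction candidate, or empty if nothing is left to evict.
    Optional<String> selectVictim();
  }

  static final int MAX_ATTEMPTS = 16;

  static boolean evictUntilFits(long requestedBytes, PageStore store, Evictor evictor) {
    // Bytes still missing after counting the space that is already free.
    long toFree = requestedBytes - (store.capacityBytes() - store.usedBytes());
    for (int i = 0; toFree > 0 && i < MAX_ATTEMPTS; i++) {
      Optional<String> victim = evictor.selectVictim();
      if (!victim.isPresent()) {
        return false; // the cache genuinely has nothing left to evict
      }
      // A page already evicted by a concurrent put frees 0 bytes here;
      // the loop just moves on to another victim instead of aborting.
      toFree -= store.removeIfPresent(victim.get());
    }
    return toFree <= 0;
  }
}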

Please let me know if additional information is required to get this fixed. Thanks!

CharlesBuell added the type-bug label on Aug 23, 2024
jja725 (Contributor) commented Aug 24, 2024

Do you mind adding more environment info or the error message?

CharlesBuell (Author) commented:

Thanks for the quick response. Here's some more environment information.

Alluxio is running with one master and one worker node on the same system, attached to a local HDFS, with the following site configuration:

alluxio-site.properties

alluxio.master.mount.table.root.ufs=hdfs://127.0.0.1:9000/
alluxio.dora.client.ufs.root=hdfs://127.0.0.1:9000/

alluxio.worker.tieredstore.levels=0

alluxio-env.sh

JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

The error comes from an environment running TPC-DS against data stored in Apache Hadoop, with Apache Spark executing the queries and Alluxio caching the queried data. The following is output when a query fails between Spark and Alluxio:

Spark error

org.apache.spark.SparkException: Multiple failures in stage materialization.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.multiFailuresInStageMaterializationError(QueryExecutionErrors.scala:2076)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.cleanUpAndThrowException(AdaptiveSparkPlanExec.scala:809)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:333)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:272)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:417)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:390)
	at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:475)
	at org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:76)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$2(SparkSQLDriver.scala:76)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:76)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:501)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:619)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1$adapted(SparkSQLCLIDriver.scala:613)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:613)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
	at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
	at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:229)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1029)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
	Suppressed: org.apache.spark.SparkException: Job aborted due to stage failure: Task 65 in stage 3.0 failed 1 times, most recent failure: Lost task 65.0 in stage 3.0 (TID 96) (localhost executor driver): java.lang.IllegalArgumentException: Self-suppression not permitted
	at java.base/java.lang.Throwable.addSuppressed(Throwable.java:1054)
	at alluxio.client.file.DoraFileOutStream.writeInternal(DoraFileOutStream.java:263)
	at alluxio.client.file.DoraFileOutStream.write(DoraFileOutStream.java:226)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
	at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
	at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
	at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
	at org.apache.spark.shuffle.S3MeasureOutputStream.flush(S3MeasureOutputStream.scala:40)
	at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.getPartitionWriter(S3ShuffleMapOutputWriter.scala:74)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.writePartitionedData(BypassMergeSortShuffleWriter.java:211)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:180)
	at org.apache.spark.shuffle.S3ShuffleWriter.write(S3ShuffleWriter.scala:8)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
	Suppressed: java.lang.IllegalArgumentException: Self-suppression not permitted
		at java.base/java.lang.Throwable.addSuppressed(Throwable.java:1054)
		at alluxio.client.file.DoraFileOutStream.writeInternal(DoraFileOutStream.java:263)
		at alluxio.client.file.DoraFileOutStream.write(DoraFileOutStream.java:226)
		at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
		at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
		at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
		at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
		at org.apache.spark.shuffle.S3MeasureOutputStream.close(S3MeasureOutputStream.scala:49)
		at java.base/java.nio.channels.Channels$WritableByteChannelImpl.implCloseChannel(Channels.java:476)
		at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
		at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.cleanUp(S3ShuffleMapOutputWriter.scala:123)
		at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.abort(S3ShuffleMapOutputWriter.scala:118)
		at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:185)
		... 14 more
	Caused by: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4
		at alluxio.exception.status.AlluxioStatusException.from(AlluxioStatusException.java:174)
		at alluxio.util.CommonUtils.unwrapResponseFrom(CommonUtils.java:723)
		at alluxio.client.file.dora.netty.NettyDataWriter$PacketWriteResponseHandler.channelRead(NettyDataWriter.java:448)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
		at alluxio.shaded.client.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
		at alluxio.shaded.client.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
		at alluxio.shaded.client.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
		at alluxio.shaded.client.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
		at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
		at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
		at alluxio.shaded.client.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
		at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
		at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
		at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
		at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
		at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
		at alluxio.shaded.client.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
		... 1 more
		Suppressed: alluxio.shaded.client.io.netty.channel.StacklessClosedChannelException
			at alluxio.shaded.client.io.netty.channel.AbstractChannel.close(ChannelPromise)(Unknown Source)
			Suppressed: [CIRCULAR REFERENCE: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4]
		Suppressed: java.io.IOException: Channel [id: 0x17db5241, L:0.0.0.0/0.0.0.0:42622 ! R:localhost/127.0.0.1:29997] is closed when writing block 0.
			at alluxio.client.file.dora.netty.NettyDataWriter$PacketWriteResponseHandler.channelUnregistered(NettyDataWriter.java:476)
			at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:219)
			at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:195)
			at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:188)
			at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline$HeadContext.channelUnregistered(DefaultChannelPipeline.java:1388)
			at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:215)
			at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:195)
			at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:821)
			at alluxio.shaded.client.io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:821)
			at alluxio.shaded.client.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
			at alluxio.shaded.client.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
			at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
			at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
			... 3 more
Caused by: [CIRCULAR REFERENCE: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4]

Driver stacktrace:
		at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
		at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
		at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
		at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
		at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
		at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
		at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
		at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
		at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
		at scala.Option.foreach(Option.scala:407)
		at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
		at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
		at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
		at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
		at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: [CIRCULAR REFERENCE: org.apache.spark.SparkException: Job aborted due to stage failure: Task 65 in stage 3.0 failed 1 times, most recent failure: Lost task 65.0 in stage 3.0 (TID 96) (localhost executor driver): java.lang.IllegalArgumentException: Self-suppression not permitted
	at java.base/java.lang.Throwable.addSuppressed(Throwable.java:1054)
	at alluxio.client.file.DoraFileOutStream.writeInternal(DoraFileOutStream.java:263)
	at alluxio.client.file.DoraFileOutStream.write(DoraFileOutStream.java:226)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
	at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
	at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
	at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
	at org.apache.spark.shuffle.S3MeasureOutputStream.flush(S3MeasureOutputStream.scala:40)
	at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.getPartitionWriter(S3ShuffleMapOutputWriter.scala:74)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.writePartitionedData(BypassMergeSortShuffleWriter.java:211)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:180)
	at org.apache.spark.shuffle.S3ShuffleWriter.write(S3ShuffleWriter.scala:8)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
	Suppressed: java.lang.IllegalArgumentException: Self-suppression not permitted
		at java.base/java.lang.Throwable.addSuppressed(Throwable.java:1054)
		at alluxio.client.file.DoraFileOutStream.writeInternal(DoraFileOutStream.java:263)
		at alluxio.client.file.DoraFileOutStream.write(DoraFileOutStream.java:226)
		at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
		at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
		at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
		at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
		at org.apache.spark.shuffle.S3MeasureOutputStream.close(S3MeasureOutputStream.scala:49)
		at java.base/java.nio.channels.Channels$WritableByteChannelImpl.implCloseChannel(Channels.java:476)
		at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
		at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.cleanUp(S3ShuffleMapOutputWriter.scala:123)
		at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.abort(S3ShuffleMapOutputWriter.scala:118)
		at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:185)
		... 14 more
	Caused by: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4
		at alluxio.exception.status.AlluxioStatusException.from(AlluxioStatusException.java:174)
		at alluxio.util.CommonUtils.unwrapResponseFrom(CommonUtils.java:723)
		at alluxio.client.file.dora.netty.NettyDataWriter$PacketWriteResponseHandler.channelRead(NettyDataWriter.java:448)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
		at alluxio.shaded.client.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
		at alluxio.shaded.client.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
		at alluxio.shaded.client.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
		at alluxio.shaded.client.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
		at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
		at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
		at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
		at alluxio.shaded.client.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
		at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
		at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
		at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
		at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
		at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
		at alluxio.shaded.client.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
		... 1 more
		Suppressed: alluxio.shaded.client.io.netty.channel.StacklessClosedChannelException
			at alluxio.shaded.client.io.netty.channel.AbstractChannel.close(ChannelPromise)(Unknown Source)
			Suppressed: [CIRCULAR REFERENCE: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4]
		Suppressed: java.io.IOException: Channel [id: 0x17db5241, L:0.0.0.0/0.0.0.0:42622 ! R:localhost/127.0.0.1:29997] is closed when writing block 0.
			at alluxio.client.file.dora.netty.NettyDataWriter$PacketWriteResponseHandler.channelUnregistered(NettyDataWriter.java:476)
			at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:219)
			at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:195)
			at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:188)
			at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline$HeadContext.channelUnregistered(DefaultChannelPipeline.java:1388)
			at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:215)
			at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:195)
			at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:821)
			at alluxio.shaded.client.io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:821)
			at alluxio.shaded.client.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
			at alluxio.shaded.client.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
			at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
			at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
			... 3 more
Caused by: [CIRCULAR REFERENCE: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4]

Driver stacktrace:]

Alluxio error (left some additional debug logs around it for context if needed)

2024-08-23 00:34:43,049 DEBUG [worker-rpc-executor-TPE-thread-2424](AuthenticatedUserInjector.java:96) - Acquiring credentials for service-method: alluxio.grpc.block.BlockWorker/GetStatus on channel: e22b1d91-1ed6-4233-a3be-b31e79320e50
2024-08-23 00:34:43,049 DEBUG [NettyReaderExecutor-12443](LocalCacheManager.java:430) - putInternal(PageId{FileId=eb9ea488437311a93e0aead7398d0892df232298485b5a814407d8250f575105, PageIndex=0},1048576 bytes) enters
2024-08-23 00:34:43,050 ERROR [NettyReaderExecutor-12817](DefaultPageMetaStore.java:223) - Invalid result returned by evictor: page PageId{FileId=f7e2d893c181d37ea09a20af015f59c381e3493ad4dbab67b8d43ba4a488c1cb, PageIndex=0} not available
2024-08-23 00:34:43,052 ERROR [NettyReaderExecutor-12817](LocalCacheManager.java:461) - Unable to find page to evict: space used 487142919, page length 1042648, cache size 488064465
2024-08-23 00:34:43,050 DEBUG [NettyReaderExecutor-12839](LocalCacheManager.java:430) - putInternal(PageId{FileId=b2f20af747c70dea2c2b3a220e5bdb28e12ce0a3e94fd7df095584cfd283a9f9, PageIndex=0},1048576 bytes) enters
2024-08-23 00:34:43,049 DEBUG [worker-rpc-executor-TPE-thread-2434](AuthenticatedUserInjector.java:96) - Acquiring credentials for service-method: alluxio.grpc.block.BlockWorker/GetStatus on channel: 10930bcf-43ac-4b39-9053-9937cd42498e
2024-08-23 00:34:43,052 ERROR [NettyReaderExecutor-12795](DefaultPageMetaStore.java:223) - Invalid result returned by evictor: page PageId{FileId=a67ab3c108d458fe607c8e24f346de082724612baef30a3d2a6969b9e47707fc, PageIndex=0} not available
2024-08-23 00:34:43,053 ERROR [NettyReaderExecutor-12795](LocalCacheManager.java:461) - Unable to find page to evict: space used 487142919, page length 1048576, cache size 488064465
2024-08-23 00:34:43,052 DEBUG [NettyReaderExecutor-12817](LocalCacheManager.java:338) - put(PageId{FileId=1d712123a45e300893de23c24703af413e98634c60fa588a73e0b3f22c9107a3, PageIndex=0},0 bytes) exits: false
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12817](DebugLoggingTracer.java:42) - State transitioned from CHANNEL_IDLE to READING_DATA because of trigger REQUEST_RECEIVED
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12817](DebugLoggingTracer.java:37) - Trigger fired: DATA_AVAILABLE
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12817](DebugLoggingTracer.java:42) - State transitioned from READING_DATA to SENDING_DATA because of trigger DATA_AVAILABLE
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12817](DebugLoggingTracer.java:37) - Trigger fired: OUTPUT_LENGTH_FULFILLED
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12860](LocalCacheManager.java:504) - Page PageId{FileId=ebf731c6c100e4b2cb802f7701be3f8bfe6443f914e941f970374026e27d9e52, PageIndex=0} is unavailable to evict, likely due to a benign race
2024-08-23 00:34:43,054 DEBUG [NettyReaderExecutor-12860](LocalCacheManager.java:430) - putInternal(PageId{FileId=b2f20af747c70dea2c2b3a220e5bdb28e12ce0a3e94fd7df095584cfd283a9f9, PageIndex=0},1048576 bytes) enters
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12795](LocalCacheManager.java:338) - put(PageId{FileId=eb9ea488437311a93e0aead7398d0892df232298485b5a814407d8250f575105, PageIndex=0},0 bytes) exits: false
2024-08-23 00:34:43,054 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:42) - State transitioned from CHANNEL_IDLE to READING_DATA because of trigger REQUEST_RECEIVED
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:37) - Trigger fired: DATA_AVAILABLE
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:42) - State transitioned from READING_DATA to SENDING_DATA because of trigger DATA_AVAILABLE
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:37) - Trigger fired: OUTPUT_LENGTH_FULFILLED
2024-08-23 00:34:43,052 DEBUG [NettyReaderExecutor-12598](DebugLoggingTracer.java:42) - State transitioned from COMPLETING_REQUEST to CHANNEL_IDLE because of trigger REQUEST_COMPLETED
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12598](DebugLoggingTracer.java:37) - Trigger fired: REQUEST_RECEIVED
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:42) - State transitioned from SENDING_DATA to EOF because of trigger OUTPUT_LENGTH_FULFILLED
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:37) - Trigger fired: COMPLETE_REQUES
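
If I am reading the ERROR lines above correctly, the cache has only 488064465 - 487142919 = 921546 bytes free, less than the 1042648-byte and 1048576-byte pages being written, so a page has to be evicted first; but the evictor's candidate had already been removed by a concurrent request, and the put fails instead of retrying with another victim or freeing enough space for the requested size.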

CharlesBuell (Author) commented:

@jja725 Hello, do you need any more information for this issue? Would it be possible to work around this through alluxio-site.properties? Thanks.

YichuanSun (Contributor) commented:

How about setting a bigger cache size in alluxio-site.properties?

YichuanSun (Contributor) commented:

Try allocating more cache: set alluxio.worker.page.store.sizes=<a bigger size> in alluxio-site.properties; the default size is 512MB.
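
For example (the value below is only illustrative; size it to your working set and available disk):

alluxio-site.properties

alluxio.worker.page.store.sizes=4GB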

CharlesBuell (Author) commented:

Increasing it does help somewhat with how many queries are able to complete, but it doesn't fully solve the issue. Once the cache size gets to around the size of the dataset, all of the queries complete.
