This repository was archived by the owner on Jan 9, 2020. It is now read-only.

can't find snappy lib #504

Open
luck02 opened this issue Sep 22, 2017 · 26 comments

Comments

@luck02 commented Sep 22, 2017

I built off the 2.2 branch. It works great until it tries to write Parquet files.

Then it can't seem to initialize / find org.xerial.snappy.Snappy, even though I see snappy in the jars directory of the Spark distro. My endpoint is s3a://bucket/etc and I've included the following jars.

--jars local:/opt/jars/hadoop-aws-2.7.3.jar,local:/opt/jars/aws-java-sdk-1.7.4.jar,local:/opt/jars/joda-time-2.9.3.jar,local:/opt/jars/mysql-connector-java-5.1.35.jar \

I'm not sure whether this is something odd in the Spark distro or something odd I'm doing.

Thanks!

I can provide the full error log output if that'll help.

2017-09-22 22:51:57 INFO  DAGScheduler:54 - ResultStage 3 (parquet at NativeMethodAccessorImpl.java:0) failed in 4.996 s due to Job aborted due to stage failure: Task 8 in stage 3.0 failed 4 times, most recent failure: Lost task 8.3 in stage 3.0 (TID 434, 10.2.106.11, executor 7): org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
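
As a side note, one quick way to confirm the failure is isolated to the Snappy codec (rather than Parquet writes in general) would be to temporarily switch the Parquet compression codec when submitting the job; the job path below is a placeholder:

# Hypothetical sanity check: write with a non-Snappy codec; if this succeeds,
# only the Snappy native library is the problem. Paths are placeholders.
bin/spark-submit \
  --deploy-mode cluster \
  --master k8s://http://127.0.0.1:8001 \
  --conf spark.sql.parquet.compression.codec=gzip \
  file:///path/to/job.py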
@foxish (Member) commented Sep 22, 2017

Botched dependencies? Hmm..

@ash211 @mccheah seen this?

@luck02 (Author) commented Sep 22, 2017

garylucas@gary-lucas:~/dev/proj/infra/spark-2.2.0-k8s-0.4.0-SNAPSHOT-bin-2.7.3/jars|DAT-spark-on-k8s⚡
⇒  ls -alh | grep snappy
-rw-r--r--    1 garylucas  staff    48K Sep 16 20:54 snappy-0.2.jar
-rw-r--r--    1 garylucas  staff   1.0M Sep 16 20:54 snappy-java-1.1.2.6.jar

I think the snappy deps are there. I'll take a peek at the Dockerfiles.
Edit: yeah, they copy the whole jars directory.

@mccheah commented Sep 22, 2017

Can you post the result of kubectl describe pod <driver> and kubectl describe pod <executor>?

@luck02 (Author) commented Sep 22, 2017

driver output

Name:		kmderived-1506122359203-driver
Namespace:	default
Node:		ip-10-253-50-77.ec2.internal/10.253.50.77
Start Time:	Fri, 22 Sep 2017 16:19:25 -0700
Labels:		role=kissmetrics
		spark-app-selector=spark-e65d6a8c895f47eead96caad62edfc7d
		spark-role=driver
Annotations:	podpreset.admission.kubernetes.io/podpreset-kissmetrics-env=9039918
		spark-app-name=kmderived
Status:		Running
IP:		10.2.106.14
Containers:
  spark-kubernetes-driver:
    Container ID:	docker://872c49fbee70f53dc3a8163e89d0a80abe0a0fce6360223641359c3c7749b403
    Image:		ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-kissmetrics:latest
    Image ID:		docker-pullable://ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-kissmetrics@sha256:41d872998dbbb28882f176e164add56fd22d40a76d27381bb5f243b5c9bf1aa3
    Port:		<none>
    State:		Running
      Started:		Fri, 22 Sep 2017 16:19:37 -0700
    Ready:		True
    Restart Count:	0
    Limits:
      memory:	15769Mi
    Requests:
      cpu:	8
      memory:	14Gi
    Environment:
      SPARK_DRIVER_MEMORY:				14G
      SPARK_DRIVER_CLASS:				org.apache.spark.deploy.PythonRunner
      SPARK_DRIVER_ARGS:				
      SPARK_MOUNTED_CLASSPATH:				/opt/jars/hadoop-aws-2.7.3.jar:/opt/jars/aws-java-sdk-1.7.4.jar:/opt/jars/joda-time-2.9.3.jar:/opt/jars/mysql-connector-java-5.1.35.jar:/opt/jars/snappy-java-1.1.4.jar
      PYSPARK_PRIMARY:					/opt/ub-data-kissmetrics/jobs/kissmetrics_warehouse/Derived_NonAppPageViewed/Compute__Derived_NonAppPageViewed.py
      PYSPARK_FILES:					/opt/ub-data-kissmetrics/jobs/kissmetrics_warehouse/Derived_NonAppPageViewed/Compute__Derived_NonAppPageViewed.py
      SPARK_JAVA_OPT_0:					-Dspark.kubernetes.docker.image.pullPolicy=Always
      SPARK_JAVA_OPT_1:					-Dspark.kubernetes.namespace=default
      SPARK_JAVA_OPT_2:					-Dspark.app.id=spark-e65d6a8c895f47eead96caad62edfc7d
      SPARK_JAVA_OPT_3:					-Dspark.kubernetes.driver.docker.image=ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-kissmetrics:latest
      SPARK_JAVA_OPT_4:					-Dspark.submit.deployMode=cluster
      SPARK_JAVA_OPT_5:					-Dspark.driver.blockManager.port=7079
      SPARK_JAVA_OPT_6:					-Dspark.app.name=kmderived
      SPARK_JAVA_OPT_7:					-Dspark.driver.bindAddress=kmderived-1506122359203-driver-svc.default.svc.cluster.local
      SPARK_JAVA_OPT_8:					-Dspark.driver.host=kmderived-1506122359203-driver-svc.default.svc.cluster.local
      SPARK_JAVA_OPT_9:					-Dspark.kubernetes.authenticate.driver.serviceAccountName=spark-k8s-service-account
      SPARK_JAVA_OPT_10:				-Dspark.driver.memory=14G
      SPARK_JAVA_OPT_11:				-Dspark.driver.port=7078
      SPARK_JAVA_OPT_12:				-Dspark.driver.cores=8
      SPARK_JAVA_OPT_13:				-Dspark.kubernetes.executor.labels=role=kissmetrics
      SPARK_JAVA_OPT_14:				-Dspark.executor.memory=20G
      SPARK_JAVA_OPT_15:				-Dspark.kubernetes.executor.podNamePrefix=kmderived-1506122359203
      SPARK_JAVA_OPT_16:				-Dspark.executor.instances=10
      SPARK_JAVA_OPT_17:				-Dspark.executor.cores=6
      SPARK_JAVA_OPT_18:				-Dspark.kubernetes.driver.labels=role=kissmetrics
      SPARK_JAVA_OPT_19:				-Dspark.files=/opt/ub-data-kissmetrics/jobs/kissmetrics_warehouse/Derived_NonAppPageViewed/Compute__Derived_NonAppPageViewed.py,/opt/ub-data-kissmetrics/jobs/kissmetrics_warehouse/Derived_NonAppPageViewed/Compute__Derived_NonAppPageViewed.py
      SPARK_JAVA_OPT_20:				-Dspark.master=k8s://http://127.0.0.1:8001
      SPARK_JAVA_OPT_21:				-Dspark.kubernetes.initcontainer.docker.image=ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-init-kissmetrics:latest
      SPARK_JAVA_OPT_22:				-Dspark.kubernetes.executor.docker.image=ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-executor-py-kissmetrics:latest
      SPARK_JAVA_OPT_23:				-Dspark.jars=/opt/jars/hadoop-aws-2.7.3.jar,/opt/jars/aws-java-sdk-1.7.4.jar,/opt/jars/joda-time-2.9.3.jar,/opt/jars/mysql-connector-java-5.1.35.jar,/opt/jars/snappy-java-1.1.4.jar
      SPARK_JAVA_OPT_24:				-Dspark.kubernetes.driver.pod.name=kmderived-1506122359203-driver
      REPORTSDB_HOSTNAME:				reports-db.unbounce.net
      REPORTSDB_PORT:					3306
      REPORTSDB_KISSMETRICS_INBOUND_WAREHOUSE_NAME:	kissmetrics_inbound
      KM_MANIFEST_FILE:					s3a://manifest/index.csv
      REPORTSDB_KISSMETRICS_WAREHOUSE_NAME:		kissmetrics_warehouse
      MAX_FILES_TO_CONSUME:				250
      OUTPUT_LOCATION:					s3a://output
      AWS_DATA_S3_ACCESS_KEY:				<set to the key 'AWS_DATA_S3_ACCESS_KEY' in secret 'kissmetrics-k8s'>				Optional: false
      AWS_DATA_S3_SECRET_ACCESS_KEY:			<set to the key 'AWS_DATA_S3_SECRET_ACCESS_KEY' in secret 'kissmetrics-k8s'>			Optional: false
      REPORTSDB_KISSMETRICS_INBOUND_WRITE_JDBCURL:	<set to the key 'REPORTSDB_KISSMETRICS_INBOUND_WRITE_JDBCURL' in secret 'kissmetrics-k8s'>	Optional: false
      REPORTSDB_KISSMETRICS_INBOUND_READ_JDBCURL:	<set to the key 'REPORTSDB_KISSMETRICS_INBOUND_READ_JDBCURL' in secret 'kissmetrics-k8s'>	Optional: false
      REPORTSDB_KISSMETRICS_INBOUND_WRITE_USERNAME:	<set to the key 'REPORTSDB_KISSMETRICS_INBOUND_WRITE_USERNAME' in secret 'kissmetrics-k8s'>	Optional: false
      REPORTSDB_KISSMETRICS_INBOUND_WRITE_PASSWORD:	<set to the key 'REPORTSDB_KISSMETRICS_INBOUND_WRITE_PASSWORD' in secret 'kissmetrics-k8s'>	Optional: false
      REPORTSDB_KISSMETRICS_WAREHOUSE_WRITE_USERNAME:	<set to the key 'REPORTSDB_KISSMETRICS_WAREHOUSE_WRITE_USERNAME' in secret 'kissmetrics-k8s'>	Optional: false
      REPORTSDB_KISSMETRICS_WAREHOUSE_WRITE_PASSWORD:	<set to the key 'REPORTSDB_KISSMETRICS_WAREHOUSE_WRITE_PASSWORD' in secret 'kissmetrics-k8s'>	Optional: false
      REPORTSDB_PROD_APPLICATIONDBUSER_WRITE_JDBCURL:	<set to the key 'REPORTSDB_PROD_APPLICATIONDBUSER_WRITE_JDBCURL' in secret 'kissmetrics-k8s'>	Optional: false
      REPORTSDB_KISSMETRICS_WAREHOUSE_READ_JDBCURL:	<set to the key 'REPORTSDB_PROD_APPLICATIONDBUSER_WRITE_JDBCURL' in secret 'kissmetrics-k8s'>	Optional: false
      REPORTSDB_PROD_APPLICATIONDBUSER_READ_JDBCURL:	<set to the key 'REPORTSDB_PROD_APPLICATIONDBUSER_READ_JDBCURL' in secret 'kissmetrics-k8s'>	Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from spark-k8s-service-account-token-fn4qg (ro)
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	True 
  PodScheduled 	True 
Volumes:
  spark-k8s-service-account-token-fn4qg:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	spark-k8s-service-account-token-fn4qg
    Optional:	false
QoS Class:	Burstable
Node-Selectors:	<none>
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From					SubObjectPath					Type		Reason			Message
  ---------	--------	-----	----					-------------					--------	------			-------
  47s		47s		1	default-scheduler									Normal		Scheduled		Successfully assigned kmderived-1506122359203-driver to ip-10-253-50-77.ec2.internal
  47s		47s		1	kubelet, ip-10-253-50-77.ec2.internal							Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "spark-k8s-service-account-token-fn4qg" 
  46s		46s		1	kubelet, ip-10-253-50-77.ec2.internal	spec.containers{spark-kubernetes-driver}	Normal		Pulling			pulling image "ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-kissmetrics:latest"
  35s		35s		1	kubelet, ip-10-253-50-77.ec2.internal	spec.containers{spark-kubernetes-driver}	Normal		Pulled			Successfully pulled image "ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-kissmetrics:latest"
  35s		35s		1	kubelet, ip-10-253-50-77.ec2.internal	spec.containers{spark-kubernetes-driver}	Normal		Created			Created container
  35s		35s		1	kubelet, ip-10-253-50-77.ec2.internal	spec.containers{spark-kubernetes-driver}	Normal		Started			Started container

executor:

Name:		kmderived-1506122359203-exec-1
Namespace:	default
Node:		ip-10-253-46-101.ec2.internal/10.253.46.101
Start Time:	Fri, 22 Sep 2017 16:19:41 -0700
Labels:		role=kissmetrics
		spark-app-selector=spark-e65d6a8c895f47eead96caad62edfc7d
		spark-exec-id=1
		spark-role=executor
Annotations:	podpreset.admission.kubernetes.io/podpreset-kissmetrics-env=9039918
Status:		Running
IP:		10.2.36.12
Controlled By:	Pod/kmderived-1506122359203-driver
Containers:
  executor:
    Container ID:	docker://67035573284ccc7cb11808afbc66df0ea28923222e8d1a77326d40669a410f4c
    Image:		ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-executor-py-kissmetrics:latest
    Image ID:		docker-pullable://ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-executor-py-kissmetrics@sha256:9c1515b8c5150717835d5405e5f052163ec453ed56d865dae354e11dd83b542f
    Ports:		10000/TCP, 7079/TCP
    State:		Running
      Started:		Fri, 22 Sep 2017 16:19:42 -0700
    Ready:		True
    Restart Count:	0
    Limits:
      memory:	22Gi
    Requests:
      cpu:	6
      memory:	20Gi
    Environment:
      SPARK_EXECUTOR_PORT:				10000
      SPARK_DRIVER_URL:					spark://CoarseGrainedScheduler@kmderived-1506122359203-driver-svc.default.svc.cluster.local:7078
      SPARK_EXECUTOR_CORES:				6
      SPARK_EXECUTOR_MEMORY:				20G
      SPARK_APPLICATION_ID:				spark-e65d6a8c895f47eead96caad62edfc7d
      SPARK_EXECUTOR_ID:				1
      SPARK_MOUNTED_CLASSPATH:				/var/spark-data/spark-jars/*
      SPARK_EXECUTOR_POD_IP:				 (v1:status.podIP)
      REPORTSDB_HOSTNAME:				reports-db.unbounce.net
      REPORTSDB_PORT:					3306
      REPORTSDB_KISSMETRICS_INBOUND_WAREHOUSE_NAME:	kissmetrics_inbound
      KM_MANIFEST_FILE:					s3a://manfiest/index.csv
      REPORTSDB_KISSMETRICS_WAREHOUSE_NAME:		kissmetrics_warehouse
      MAX_FILES_TO_CONSUME:				250
      OUTPUT_LOCATION:					s3a://output
      AWS_DATA_S3_ACCESS_KEY:				<set to the key 'AWS_DATA_S3_ACCESS_KEY' in secret 'kissmetrics-k8s'>				Optional: false
      AWS_DATA_S3_SECRET_ACCESS_KEY:			<set to the key 'AWS_DATA_S3_SECRET_ACCESS_KEY' in secret 'kissmetrics-k8s'>			Optional: false
      REPORTSDB_KISSMETRICS_INBOUND_WRITE_JDBCURL:	<set to the key 'REPORTSDB_KISSMETRICS_INBOUND_WRITE_JDBCURL' in secret 'kissmetrics-k8s'>	Optional: false
      REPORTSDB_KISSMETRICS_INBOUND_READ_JDBCURL:	<set to the key 'REPORTSDB_KISSMETRICS_INBOUND_READ_JDBCURL' in secret 'kissmetrics-k8s'>	Optional: false
      REPORTSDB_KISSMETRICS_INBOUND_WRITE_USERNAME:	<set to the key 'REPORTSDB_KISSMETRICS_INBOUND_WRITE_USERNAME' in secret 'kissmetrics-k8s'>	Optional: false
      REPORTSDB_KISSMETRICS_INBOUND_WRITE_PASSWORD:	<set to the key 'REPORTSDB_KISSMETRICS_INBOUND_WRITE_PASSWORD' in secret 'kissmetrics-k8s'>	Optional: false
      REPORTSDB_KISSMETRICS_WAREHOUSE_WRITE_USERNAME:	<set to the key 'REPORTSDB_KISSMETRICS_WAREHOUSE_WRITE_USERNAME' in secret 'kissmetrics-k8s'>	Optional: false
      REPORTSDB_KISSMETRICS_WAREHOUSE_WRITE_PASSWORD:	<set to the key 'REPORTSDB_KISSMETRICS_WAREHOUSE_WRITE_PASSWORD' in secret 'kissmetrics-k8s'>	Optional: false
      REPORTSDB_PROD_APPLICATIONDBUSER_WRITE_JDBCURL:	<set to the key 'REPORTSDB_PROD_APPLICATIONDBUSER_WRITE_JDBCURL' in secret 'kissmetrics-k8s'>	Optional: false
      REPORTSDB_KISSMETRICS_WAREHOUSE_READ_JDBCURL:	<set to the key 'REPORTSDB_PROD_APPLICATIONDBUSER_WRITE_JDBCURL' in secret 'kissmetrics-k8s'>	Optional: false
      REPORTSDB_PROD_APPLICATIONDBUSER_READ_JDBCURL:	<set to the key 'REPORTSDB_PROD_APPLICATIONDBUSER_READ_JDBCURL' in secret 'kissmetrics-k8s'>	Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-00b84 (ro)
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	True 
  PodScheduled 	True 
Volumes:
  default-token-00b84:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-00b84
    Optional:	false
QoS Class:	Burstable
Node-Selectors:	<none>
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From					SubObjectPath			Type		Reason			Message
  ---------	--------	-----	----					-------------			--------	------			-------
  50s		50s		1	default-scheduler							Normal		Scheduled		Successfully assigned kmderived-1506122359203-exec-1 to ip-10-253-46-101.ec2.internal
  50s		50s		1	kubelet, ip-10-253-46-101.ec2.internal					Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "default-token-00b84" 
  49s		49s		1	kubelet, ip-10-253-46-101.ec2.internal	spec.containers{executor}	Normal		Pulling			pulling image "ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-executor-py-kissmetrics:latest"
  49s		49s		1	kubelet, ip-10-253-46-101.ec2.internal	spec.containers{executor}	Normal		Pulled			Successfully pulled image "ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-executor-py-kissmetrics:latest"
  49s		49s		1	kubelet, ip-10-253-46-101.ec2.internal	spec.containers{executor}	Normal		Created			Created container
  49s		49s		1	kubelet, ip-10-253-46-101.ec2.internal	spec.containers{executor}	Normal		Started			Started container

@ash211 commented Sep 24, 2017

Hmm I haven't seen this, and we've been reading/writing parquet with Spark in k8s. Have you confirmed that class org.xerial.snappy.Snappy is in that snappy-java jar?
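
A minimal way to check, assuming the jar from the distribution's jars/ directory, might be:

# List the jar contents and look for the class Parquet needs.
unzip -l jars/snappy-java-1.1.2.6.jar | grep 'org/xerial/snappy/Snappy.class'
# or, with a JDK on the PATH:
jar tf jars/snappy-java-1.1.2.6.jar | grep 'org/xerial/snappy/Snappy.class'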

@luck02 (Author) commented Sep 24, 2017

@ash211 you're right, it's not there.

META-INF/MANIFEST.MF
META-INF/maven/org.iq80.snappy/snappy/pom.properties
META-INF/maven/org.iq80.snappy/snappy/pom.xml
org/iq80/snappy/BufferRecycler.class
org/iq80/snappy/CorruptionException.class
org/iq80/snappy/Crc32C.class
org/iq80/snappy/HadoopSnappyCodec$SnappyCompressionInputStream.class
org/iq80/snappy/HadoopSnappyCodec$SnappyCompressionOutputStream.class
org/iq80/snappy/HadoopSnappyCodec.class
org/iq80/snappy/Main.class
org/iq80/snappy/Memory.class
org/iq80/snappy/SlowMemory.class
org/iq80/snappy/Snappy.class
org/iq80/snappy/SnappyCompressor.class
org/iq80/snappy/SnappyDecompressor.class
org/iq80/snappy/SnappyInputStream.class
org/iq80/snappy/SnappyInternalUtils.class
org/iq80/snappy/SnappyOutputStream.class
org/iq80/snappy/UnsafeMemory.class

Missing / broken snappy jar.

I built this off the tip of the 2.2 branch:

garylucas@gary-lucas:~/dev/spark|branch-2.2-kubernetes⚡
⇒  git rev-parse HEAD
8a0f485d83f84a9b556f7634fed8cb1b65e12159

So either my build command was incorrect or there's some issue in the branch.

Here was my build command:

garylucas@gary-lucas:~/dev/spark|branch-2.2-kubernetes⚡
⇒  git rev-parse HEAD
8a0f485d83f84a9b556f7634fed8cb1b65e12159
garylucas@gary-lucas:~/dev/spark|branch-2.2-kubernetes⚡
⇒  ./dev/make-distribution.sh --pip --tgz -Pmesos -Pyarn -Pkinesis-asl -Phive -Phive-thriftserver -Pkubernetes -Phadoop-2.7 -Dhadoop.version=2.7.3

I got that from: #438

The reason I'm building my own is listed here: #494

The other issue i'm experiencing on this build is: #505

Which i'm currently trying to debug.

@luck02 (Author) commented Sep 25, 2017

Sorry, I didn't realize there were two snappy jars in there and inspected the wrong one (i.e. snappy-0.2.jar).

This appears to be correct:
snappy-java-1.1.2.6.jar

META-INF/MANIFEST.MF
org/xerial/snappy/OSInfo.class
org/xerial/snappy/PureJavaCrc32C.class
org/xerial/snappy/Snappy.class
org/xerial/snappy/SnappyBundleActivator.class
org/xerial/snappy/SnappyCodec.class
org/xerial/snappy/SnappyError.class
org/xerial/snappy/SnappyErrorCode.class
org/xerial/snappy/SnappyException.class
org/xerial/snappy/SnappyFramed.class
org/xerial/snappy/SnappyFramedInputStream$FrameAction.class
org/xerial/snappy/SnappyFramedInputStream$FrameData.class
org/xerial/snappy/SnappyFramedInputStream$FrameMetaData.class
org/xerial/snappy/SnappyFramedInputStream.class
org/xerial/snappy/SnappyFramedOutputStream.class
org/xerial/snappy/SnappyIOException.class
org/xerial/snappy/SnappyInputStream.class
org/xerial/snappy/SnappyLoader.class
org/xerial/snappy/SnappyNative.class
org/xerial/snappy/SnappyOutputStream.class
org/xerial/snappy/VERSION
org/xerial/snappy/buffer/BufferAllocator.class
org/xerial/snappy/buffer/BufferAllocatorFactory.class
org/xerial/snappy/buffer/CachedBufferAllocator$1.class
org/xerial/snappy/buffer/CachedBufferAllocator.class
org/xerial/snappy/buffer/DefaultBufferAllocator$1.class
org/xerial/snappy/buffer/DefaultBufferAllocator.class
org/xerial/snappy/native/AIX/ppc/libsnappyjava.a
org/xerial/snappy/native/AIX/ppc64/libsnappyjava.a
org/xerial/snappy/native/Linux/aarch64/libsnappyjava.so
org/xerial/snappy/native/Linux/arm/libsnappyjava.so
org/xerial/snappy/native/Linux/armhf/libsnappyjava.so
org/xerial/snappy/native/Linux/ppc64/libsnappyjava.so
org/xerial/snappy/native/Linux/ppc64le/libsnappyjava.so
org/xerial/snappy/native/Linux/s390x/libsnappyjava.so
org/xerial/snappy/native/Linux/x86/libsnappyjava.so
org/xerial/snappy/native/Linux/x86_64/libsnappyjava.so
org/xerial/snappy/native/Mac/x86/libsnappyjava.jnilib
org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib
org/xerial/snappy/native/README
org/xerial/snappy/native/SunOS/sparc/libsnappyjava.so
org/xerial/snappy/native/SunOS/x86/libsnappyjava.so
org/xerial/snappy/native/SunOS/x86_64/libsnappyjava.so
org/xerial/snappy/native/Windows/x86/snappyjava.dll
org/xerial/snappy/native/Windows/x86_64/snappyjava.dll

@luck02 (Author) commented Sep 25, 2017

Looks like everything is present on the executors:

garylucas@gary-lucas:~/dev/ub-data-kissmetrics|DAT-spark-on-k8s⚡
⇒ kubectl exec -it kmderived-1506306351957-exec-3 /bin/bash
bash-4.3# ls
Compute__Derived_NonAppPageViewed.py hadoop-aws-2.7.3.jar mysql-connector-java-5.1.35.jar
aws-java-sdk-1.7.4.jar joda-time-2.9.3.jar snappy-java-1.1.2.6.jar
bash-4.3# cd /opt/spark/
RELEASE/ bin/ conf/ examples/ jars/ python/ sbin/ work-dir/
bash-4.3# cd /opt/spark/jars/
bash-4.3# ls -alh | grep snappy
-rw-r--r-- 18 root root 47.6K Sep 17 03:54 snappy-0.2.jar
-rw-r--r-- 18 root root 1.0M Sep 17 03:54 snappy-java-1.1.2.6.jar
bash-4.3#

@luck02 (Author) commented Sep 25, 2017

Huh, also when I expose the Spark UI I see this on the 'Environment' page's classpath entries:

/opt/spark/jars/snappy-0.2.jar | System Classpath
/opt/spark/jars/snappy-java-1.1.2.6.jar | System Classpath

@luck02 (Author) commented Sep 27, 2017

I upgraded to the recent build (https://github.com/apache-spark-on-k8s/spark/releases/tag/v2.2.0-kubernetes-0.4.0) in the hopes that my personal build was somehow hooped.

Same results:

3.535 s due to Job aborted due to stage failure: Task 28 in stage 3.0 failed 4 times, most recent failure: Lost task 28.3 in stage 3.0 (TID 363, 10.2.88.8, executor 13): org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
	at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
	at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
	at org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
	at org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
	at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
	at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
	at org.apache.parquet.column.impl.ColumnWriterV1.accountForValueWritten(ColumnWriterV1.java:109)
	at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:203)
	at 

I'm going through the executor logs now, but that'll take some time. Wondering if anyone else has the same issue.

@mccheah commented Sep 27, 2017

What's the nature of the other libraries on the classpath? Are there any jars with shaded dependencies or large jars that contain library classes like Snappy?
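
A rough sketch for scanning the submitted jars for bundled or shaded Snappy classes (the /opt/jars path matches the --jars entries above):

# Print any snappy-related entries in each user jar; a hit here would suggest
# a conflicting copy of the Snappy classes on the classpath.
for j in /opt/jars/*.jar; do
  if unzip -l "$j" | grep -qi 'snappy'; then
    echo "== $j"
    unzip -l "$j" | grep -i 'snappy'
  fi
done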

@mccheah commented Sep 27, 2017

Can you try running a Java or Scala job instead and see if there's similar behavior?

@luck02 (Author) commented Sep 27, 2017

Good thoughts; here are my jars:

      --jars 
local:/opt/jars/hadoop-aws-2.7.3.jar,
local:/opt/jars/aws-java-sdk-1.7.4.jar,
local:/opt/jars/joda-time-2.9.3.jar,
local:/opt/jars/mysql-connector-java-5.1.35.jar \

I'll inspect them later this evening and also test with a simple raw Scala job.

@luck02 (Author) commented Oct 1, 2017

Ok, so no mention of snappy in those jars.

I submitted a simple Scala-based Spark job that consumes a Parquet file from S3 that I know is valid, samples it down to 10% of the original, and writes it back out as a new Parquet file.

I included all the jars I've been using, mentioned in the comment above (however, I'm not running any queries against MySQL or using joda-time).

No sign of the problems I'm seeing above in PySpark; read/write works.

Next experiment, port sample job over to pyspark and repeat.

@luck02 (Author) commented Oct 2, 2017

So this just gets weirder and weirder.

Both test jobs (Scala and PySpark) work, but my actual original job still fails. At this stage my theories are:

1. The MySQL dependency (which is used in my job) is causing issues.
2. I'm doing something weird with my script and overwriting something in the wrong namespace somewhere.

Here's my spark-submit script:

#!/usr/bin/env bash

PATH_TO_DRIVER="$PWD/jobs/kissmetrics_warehouse/Derived_NonAppPageViewed/Compute__Derived_NonAppPageViewed.py"


  ./infra/spark-2.2.0-k8s-0.4.0-bin-2.7.3/bin/spark-submit \
      --verbose \
      --deploy-mode cluster \
      --master k8s://http://127.0.0.1:8001 \
      --jars file://"$PWD/jars/hadoop-aws-2.7.3.jar",file://"$PWD/jars/aws-java-sdk-1.7.4.jar",file://"$PWD/jars/joda-time-2.9.3.jar",file://"$PWD/jars/mysql-connector-java-5.1.35.jar"\
      --kubernetes-namespace default \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-k8s-service-account \
      --conf spark.kubernetes.docker.image.pullPolicy=IfNotPresent \
      --conf spark.driver.memory=8G \
      --conf spark.driver.cores=4 \
      --conf spark.executor.instances=10 \
      --conf spark.executor.cores=4\
      --conf spark.executor.memory=8G \
      --conf spark.app.name=pytest-parquet \
      --conf spark.kubernetes.driver.label.role=kissmetrics \
      --conf spark.kubernetes.executor.label.role=kissmetrics \
      --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver-py:v2.2.0-kubernetes-0.4.0 \
      --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor-py:v2.2.0-kubernetes-0.4.0 \
      --conf spark.kubernetes.initcontainer.docker.image=kubespark/spark-init:v2.2.0-kubernetes-0.4.0 \
      --conf spark.kubernetes.resourceStagingServer.internal.uri=http://spark-resource-staging-service:10000\
      --conf spark.kubernetes.resourceStagingServer.uri=http://rss-url-numbers.us-east-1.elb.amazonaws.com:10000 \
      file://$PATH_TO_DRIVER

The test scripts consumed a small-ish Parquet file, sampled it down, and produced a smaller file. This job looks like it's trying to write about 10GB, i.e. not tiny but hardly huge.

Next I'm probably going to distill it down to a minimal problem and run that; right now it takes a while to iterate. I'll likely replace the current script with:

1. Hit the MySQL db and get a bunch of rows.
2. Do some simple aggregation and filtering.
3. Write to Parquet.

That's essentially what I'm doing now, but a 5-10 minute turnaround to failure is a bit slow. On the bright side, I'm now using base images and don't have to build and push to ECR every darn time.

The error still seems to boil down to:

Suppressed: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy

Here's a subset of the logs I'm seeing (the whole trace is huge):

Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-8_default_executor: 2017-10-02 05:32:47 INFO  InternalParquetRecordWriter:160 - Flushing mem columnStore to file. allocated memory: 2172814
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-7_default_executor:   } ]
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-7_default_executor: }
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-7_default_executor: and corresponding Parquet message type:
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-7_default_executor: message spark_schema {
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-7_default_executor:   required binary Fact_PageViewedEventId (UTF8);
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-8_default_executor: 2017-10-02 05:32:47 ERROR Utils:91 - Aborting task
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-8_default_executor: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-8_default_executor: 	at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-8_default_executor: 	at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)

Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  SQLHadoopMapReduceCommitProtocol:54 - Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  SQLHadoopMapReduceCommitProtocol:54 - Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  CodecConfig:95 - Compression: SNAPPY
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  CodecConfig:95 - Compression: SNAPPY
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:329 - Parquet block size to 134217728
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:330 - Parquet page size to 1048576
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:331 - Parquet dictionary page size to 1048576
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:332 - Dictionary is on
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:333 - Validation is off
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:334 - Writer version is: PARQUET_1_0
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:335 - Maximum row group padding size is 0 bytes
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:336 - Page size checking is: estimated
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:337 - Min row count for page size check is: 100
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:338 - Max row count for page size check is: 10000
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetWriteSupport:54 - Initialized Parquet WriteSupport with Catalyst schema:

Oct 01 22:32:47 : 	at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
Oct 01 22:32:47 : 	at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
Oct 01 22:32:47 : 	at org.apache.parquet.column.impl.ColumnWriterV1.accountForValueWritten(ColumnWriterV1.java:109)
Oct 01 22:32:47 : 	at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:203)
Oct 01 22:32:47 : 	at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:467)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$makeWriter$8.apply(ParquetWriteSupport.scala:163)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$makeWriter$8.apply(ParquetWriteSupport.scala:162)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$writeFields$1.apply$mcV$sp(ParquetWriteSupport.scala:124)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$consumeField(ParquetWriteSupport.scala:437)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$writeFields(ParquetWriteSupport.scala:123)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$write$1.apply$mcV$sp(ParquetWriteSupport.scala:114)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:425)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:113)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:51)
Oct 01 22:32:47 : 	at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)
Oct 01 22:32:47 : 	at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180)
Oct 01 22:32:47 : 	at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.write(ParquetOutputWriter.scala:40)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:327)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
Oct 01 22:32:47 : 	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
Oct 01 22:32:47 : 	... 8 more
Oct 01 22:32:47 : 	Suppressed: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
Oct 01 22:32:47 : 		at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
Oct 01 22:32:47 : 		at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
Oct 01 22:32:47 : 		at org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
Oct 01 22:32:47 : 		at org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
Oct 01 22:32:47 : 		at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
Oct 01 22:32:47 : 		at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
Oct 01 22:32:47 : 		at org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
Oct 01 22:32:47 : 		at org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
Oct 01 22:32:47 : 		at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
Oct 01 22:32:47 : 		at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
Oct 01 22:32:47 : 		at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
Oct 01 22:32:47 : 		at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
Oct 01 22:32:47 : 		at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:337)
Oct 01 22:32:47 : 		at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$1.apply$mcV$sp(FileFormatWriter.scala:264)
Oct 01 22:32:47 : 		at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1384)
Oct 01 22:32:47 : 		... 9 more
Oct 01 22:32:47 : 2017-10-02 05:32:47 WARN  FileOutputCommitter:569 - Could not delete s3a://bucketname/kissmetrics_warehouse/dev/gary/out/Derived_NonAppPageViewed.parquet/_temporary/0/_temporary/attempt_20171002053247_0003_m_000025_1
Oct 01 22:32:47 : 2017-10-02 05:32:47 ERROR FileFormatWriter:70 - Job job_20171002053247_0003 aborted.
Oct 01 22:32:47 : 2017-10-02 05:32:47 WARN  Utils:87 - Suppressing exception in catch: Could not initialize class org.xerial.snappy.Snappy
Oct 01 22:32:47 : java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
Oct 01 22:32:47 : 	at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
Oct 01 22:32:47 : 	at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
Oct 01 22:32:47 : 	at org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
Oct 01 22:32:47 : 	at org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
Oct 01 22:32:47 : 	at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
Oct 01 22:32:47 : 	at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
Oct 01 22:32:47 : 	at org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
Oct 01 22:32:47 : 	at org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
Oct 01 22:32:47 : 	at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
Oct 01 22:32:47 : 	at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
Oct 01 22:32:47 : 	at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:337)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$1.apply$mcV$sp(FileFormatWriter.scala:264)
Oct 01 22:32:47 : 	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1384)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
Oct 01 22:32:47 : 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
Oct 01 22:32:47 : 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
Oct 01 22:32:47 : 	at org.apache.spark.scheduler.Task.run(Task.scala:108)
Oct 01 22:32:47 : 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
Oct 01 22:32:47 : 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Oct 01 22:32:47 : 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Oct 01 22:32:47 : 	at java.lang.Thread.run(Thread.java:748)

@luck02 (Author) commented Oct 2, 2017

Does anyone know if Snappy is used for all Parquet writes to S3? The test writes were quite a bit smaller.

@mccheah commented Oct 2, 2017

One should be able to inspect the written files to know if they're compressed with Snappy or not. My recollection is that they are always compressed.
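
For reference, one way to check a written file's codec, assuming the parquet-tools CLI (from parquet-mr) is available, with bucket and file names as placeholders:

# Copy one output part file locally and dump its footer metadata;
# the column chunks list the compression codec (SNAPPY, GZIP, UNCOMPRESSED, ...).
aws s3 cp s3://<bucket>/<output-path>/<part-file>.parquet /tmp/sample.parquet
parquet-tools meta /tmp/sample.parquet | grep -iE 'SNAPPY|GZIP|UNCOMPRESSED'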

@luck02 (Author) commented Oct 3, 2017

Just tested a write that works (got rid of the sample(true, 0.1)) and looked in the executor logs:
Oct 02 23:01:43 data-cluster-slow-test k8s.pytest-parquet-1507010416342-exec-1_default_executor: 2017-10-03 06:01:05 INFO CodecConfig:95 - Compression: SNAPPY

So this is just weird.

@akakitani

@luck02 I ran into this as well and found this in the stack trace:

java.lang.UnsatisfiedLinkError: /tmp/snappy-1.1.2-6e299ad6-46dd-4ff3-aaac-112b4383ccf8-libsnappyjava.so: Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by /tmp/snappy-1.1.2-6e299ad6-46dd-4ff3-aaac-112b4383ccf8-libsnappyjava.so)
  at java.lang.ClassLoader$NativeLibrary.load(Native Method)
  at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
  at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
  at java.lang.Runtime.load0(Runtime.java:809)
  at java.lang.System.load(System.java:1086)
  at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:174)
  at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:152)
  at org.xerial.snappy.Snappy.<clinit>(Snappy.java:47)
  ... 48 elided

It seems like this is happening because the Spark Docker images are based on Alpine, which uses musl libc. I tried adding the libc6-compat apk package to custom images for the driver/executor, and Snappy seems to work fine now.
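
A sketch of that workaround, assuming custom images built on the stock kubespark images (the registry and output tag below are placeholders):

# Extend the stock Alpine-based executor image with musl/glibc compatibility.
cat > Dockerfile.executor <<'EOF'
FROM kubespark/spark-executor-py:v2.2.0-kubernetes-0.4.0
RUN apk add --no-cache libc6-compat
EOF
docker build -f Dockerfile.executor -t <registry>/spark-executor-py:v2.2.0-kubernetes-0.4.0-libc6 .
# Repeat for the driver (and init-container) images, then point
# spark.kubernetes.driver.docker.image / spark.kubernetes.executor.docker.image at the new tags.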

@ifilonenko (Member)

Thank you for this investigation @akakitani !!!!!

@luck02 (Author) commented Oct 4, 2017

That's awesome @akakitani, where did you find that? In the executor logs? I'll reproduce. The logs are super verbose, so it's hard to find useful information.

@akakitani

This was in the driver logs actually, when trying to read some Parquet files. It has worked so far for me on writes too 👍

@ifilonenko (Member)

@mccheah thoughts on modifying the Docker images to include libc6-compat, if this is in fact the fix? Is this one of the extra packages you guys are using in your customized Docker images?

@ash211 commented Oct 5, 2017

We install https://github.com/sgerrand/alpine-pkg-glibc/releases/download/2.25-r0/glibc-2.25-r0.apk which might be similar?

@luck02 (Author) commented Oct 5, 2017

Found it in our trace as well:

Oct 04 21:38:00 data-cluster-slow-test k8s.pytest-parquet-1507176181001-exec-2_default_executor: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.1.2-2506cd9c-0dc3-49eb-958a-a032e8b7b784-libsnappyjava.so: Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by /tmp/snappy-1.1.2-2506cd9c-0dc3-49eb-958a-a032e8b7b784-libsnappyjava.so)
Oct 04 21:38:00 data-cluster-slow-test k8s.pytest-parquet-1507176181001-exec-2_default_executor: 	at java.lang.ClassLoader$NativeLibrary.load(Native Method)

Ugh, it only appears once in about 20k lines of logs. Wow, needle in a haystack. I don't have time to complete the experiment of adding it to the images this evening.
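
For anyone else hunting the same needle, a minimal grep over the pod logs might surface it (the pod name is a placeholder):

# Pull the one-off native-load error and some surrounding context out of the verbose logs.
kubectl logs <executor-pod-name> | grep -A 8 'UnsatisfiedLinkError'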

Great Catch @akakitani ! 🎉

@echarles (Member)

A fix is proposed in #550
