can't find snappy lib #504
I built off the 2.2 branch. Works great until it tries to write parquet files. Then it can't seem to initialize / find org.xerial.snappy.Snappy. I see snappy in the jars directory of the Spark distro. My endpoint is s3a://bucket/etc and I've included the following jars. I'm not sure if this is something odd in the Spark distro or if I'm doing something odd. Thanks!
I can provide the full error log output if that'll help.

Comments
I think the snappy deps are there. I'll take a peek at the dockerfiles.
Can you post the result of …
driver output:
executor output:
Hmm, I haven't seen this, and we've been reading/writing parquet with Spark in k8s. Have you confirmed that the class org.xerial.snappy.Snappy is actually present in your image's jars?
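For anyone checking the same thing, a quick driver-side sanity check from a PySpark shell might look like the sketch below; it assumes an active SparkSession named spark and only exercises the driver JVM, not the executors:

```python
# Driver-side check from a PySpark session: ask the JVM for the Snappy class
# and force its native library to load. Assumes `spark` is an active SparkSession.
jvm = spark.sparkContext._jvm

# Raises (wrapping ClassNotFoundException) if org.xerial.snappy.Snappy
# is not on the driver classpath
snappy_cls = jvm.java.lang.Class.forName("org.xerial.snappy.Snappy")
print("found:", snappy_cls.getName())

# Forces the bundled native library (libsnappyjava.so) to load; fails with an
# UnsatisfiedLinkError / SnappyError if the native library can't initialize
print("maxCompressedLength(100) =", jvm.org.xerial.snappy.Snappy.maxCompressedLength(100))
```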
@ash211 you're right it's not there.
Missing / broken snappy jar. I built this off the tip of the 2.2 branch, so it could be that my build command was incorrect or there could be some issue in the branch. Here was my build command:
I got that from #438. The reason I'm building my own is listed here: #494. The other issue I'm experiencing on this build is #505, which I'm currently trying to debug.
Sorry, I didn't realize there were two snappy jars in there and inspected the wrong one (i.e. snappy-0.2.jar). This appears to be correct:
Looks like everything is present on the executors:
Huh, also when I expose the Spark UI I see this in the 'environment variables':
I upgraded to the recent build (https://github.com/apache-spark-on-k8s/spark/releases/tag/v2.2.0-kubernetes-0.4.0) in the hopes that my personal build was somehow hooped. Same results:
I'm going through the executor logs now, but that'll take some time. Wondering if anyone else has the same issue.
What's the nature of the other libraries on the classpath? Are there any jars with shaded dependencies or large jars that contain library classes like Snappy?
Can you try running a Java or Scala job instead and see if there's similar behavior?
Good thoughts, here are my jars:

I'll introspect them later this eve and also test with a simple raw Scala job.
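A rough way to do that introspection, assuming the jars live in a directory like /opt/spark/jars (a placeholder; adjust to your image/app layout), is to scan each jar for bundled Snappy classes or native libraries:

```python
# Scan a directory of jars for bundled Snappy classes or native snappy libraries.
# `jars_dir` is a placeholder path; point it at the application/Spark jars.
import glob
import os
import zipfile

jars_dir = "/opt/spark/jars"  # placeholder; adjust to your layout

for jar_path in sorted(glob.glob(os.path.join(jars_dir, "*.jar"))):
    try:
        with zipfile.ZipFile(jar_path) as jar:
            hits = [name for name in jar.namelist()
                    if "org/xerial/snappy" in name or "libsnappy" in name]
    except zipfile.BadZipFile:
        continue
    if hits:
        print(jar_path)
        for name in hits:
            print("   ", name)
```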
Ok, so no mention of snappy in those jars. I submitted a simple Scala-based Spark job that consumes a parquet from S3 that I know is valid, samples it down to 10% of the original, and writes it back to a new parquet. I included all the jars I've been using, mentioned in the above comment (however I'm not running any queries against MySQL or using joda-time). No sign of problems similar to what I'm seeing above in pyspark; read / write works. Next experiment: port the sample job over to pyspark and repeat.
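A minimal PySpark version of that test job might look something like this sketch (the s3a paths and the sample fraction are placeholders, not the actual ones from this thread):

```python
# Hypothetical PySpark port of the Scala test job: read a known-good parquet
# from S3, sample it down to ~10%, and write the result back out.
# The s3a paths below are placeholders, not the real buckets from this issue.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snappy-parquet-repro").getOrCreate()

df = spark.read.parquet("s3a://some-bucket/input-data/")
sampled = df.sample(withReplacement=True, fraction=0.1)
sampled.write.mode("overwrite").parquet("s3a://some-bucket/output-data/")

spark.stop()
```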
So this just gets weirder and weirder. Both test jobs (Scala and pyspark) work, but my actual original job still fails. At this stage my theories are:

Here's my spark-submit script:
The test scripts consumed a small-ish parquet file, sampled it down, and produced a smaller file. The failing job is trying to write about 10GB it looks like, i.e. not tiny but hardly huge. Next I'm probably going to distill it down to a minimal problem and run that; right now it takes a while to iterate. I'll likely replace the current script with:

That's essentially what I'm doing now, but a 5-10 minute turnaround to fail is a bit slow. On the bright side, I'm now using base images and don't have to build and push to ECR every darn time. The error still seems to boil down to:
Here's a subset of the logs I'm seeing (the whole trace is huge):
Does anyone know if Snappy is used for all parquet writes to S3? The test writes were quite a bit smaller.
One should be able to inspect the written files to know if they're compressed with Snappy or not. My recollection is that they are always compressed.
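One way to do that inspection, sketched with pyarrow against a part file downloaded locally (the filename is a placeholder), is to read the footer metadata, which records the compression codec per column chunk; Spark's part-file names also usually embed the codec (e.g. .snappy.parquet):

```python
# Read the parquet footer of a downloaded part file and print the compression
# codec recorded for each column chunk. The filename is a placeholder.
import pyarrow.parquet as pq

meta = pq.ParquetFile("part-00000-example.snappy.parquet").metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        print(f"row group {rg}, column {chunk.path_in_schema}: {chunk.compression}")
```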
Just tested a write that works; I got rid of the sample(true, 0.1). Looked in the executor logs:

So this is just weird.
@luck02 I ran into this as well and found this in the stack trace:
It seems like this is happening because the Spark Docker images are based on Alpine, which uses musl libc, so the glibc-linked native library bundled in snappy-java can't load? I tried adding the …
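A rough way to confirm the musl/glibc mismatch inside the container is to pull the bundled native library out of the snappy-java jar and run ldd on it; the jar location and the entry path inside the jar below are assumptions and may differ between snappy-java versions:

```python
# Extract libsnappyjava.so from the snappy-java jar and run `ldd` on it inside
# the Alpine-based image. For a glibc-linked library, musl's ldd typically
# reports a missing loader / unresolved symbols.
# The jar location and the entry path inside the jar are assumptions; adjust
# them to whatever your image actually contains.
import glob
import subprocess
import zipfile

jar_path = glob.glob("/opt/spark/jars/snappy-java-*.jar")[0]

with zipfile.ZipFile(jar_path) as jar:
    entries = [n for n in jar.namelist() if n.endswith("libsnappyjava.so")]
    print("native libs bundled in jar:", entries)
    # Assumed layout: org/xerial/snappy/native/Linux/x86_64/libsnappyjava.so
    entry = next(n for n in entries if "Linux/x86_64" in n)
    extracted_path = jar.extract(entry, "/tmp")

print("ldd", extracted_path)
subprocess.run(["ldd", extracted_path])
```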
Thank you for this investigation @akakitani !!!!!
That's awesome @akakitani, where did you find that? In the executor logs? I'll reproduce. The logs are super verbose so it's hard to find useful information.
This was in the driver logs actually, and it was when trying to read some Parquet files. This has worked so far for me on writes too 👍
@mccheah thoughts on modifying the docker images to include …
We install https://github.com/sgerrand/alpine-pkg-glibc/releases/download/2.25-r0/glibc-2.25-r0.apk which might be similar? |
Found it in our trace as well:
Ugh, it only appears once in about 20k lines of logs. Wow, needle in a haystack. I don't have time this evening to complete the experiment of adding it to the images. Great catch @akakitani! 🎉
A fix is proposed in #550 |