Spark does not talk to Lambda at all #5
@habemusne Thanks for trying Spark on Lambda out. I understand that in its current form it's not easy to set up and try out. Some time back, @faromero also had issues running it, but once the AWS setup was fixed it worked fine. As far as the AWS Lambda, EC2, and VPC setup is concerned, this is what I did which worked.
There are a couple of questions I have:
Let me try running this by this week and share with you the set of configs I used to make it work.
A quick check found that these two configs have to be set with the access key and secret key.
Maybe you can check if it fails in the write phase itself? These keys are used at the Lambda end to write data to S3.
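The exact config names referenced above were not captured in this thread; as an assumption, in a standard Spark-with-Hadoop-S3A setup the credentials would be set along these lines:

```properties
# spark-defaults.conf -- a sketch, not the exact keys from this comment.
# These are the standard Hadoop s3a credential properties; the keys used by
# the Spark-on-Lambda fork may differ.
spark.hadoop.fs.s3a.access.key   <AWS_ACCESS_KEY_ID>
spark.hadoop.fs.s3a.secret.key   <AWS_SECRET_ACCESS_KEY>
```

The same values can also be passed per-job via `--conf` on the spark-submit command line.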
@venkata91 Thanks a lot for your quick and detailed response! To answer your two questions: before all the procedures below, instead of running my kmeans application, I switched to testing the SparkPi example. So technically my original problem is not yet solved, but I found some interesting problems by running SparkPi. My SparkPi is now able to run on Lambda (though not thoroughly --- the
Having these documented will be helpful to other people who want to try it out.
Glad you figured it out, @habemusne. For your second point, I think the combination of this section and this section in my documentation should cover it as well. Also, regarding 3), if I remember correctly, you would actually benefit from increasing the Lambda timeout in general, since the master benefits from persisting the workers and not having to restart them constantly. When the job is done, the workers will automatically be shut down. This is especially useful for long-running sort functions, etc. Qubole talks about their experiments with 5-minute Lambdas (back when that was the maximum) here. However, I agree: these limits should be mentioned somewhere, so I updated some of my documentation to capture this.
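For reference, the timeout and memory can also be raised from the command line; a sketch using the function name that appears later in this thread (the limit values are illustrative, not taken from the docs being discussed):

```shell
# Raise the Lambda timeout and memory so long-lived Spark workers are not
# killed mid-job. 900 seconds is the current Lambda maximum timeout.
aws lambda update-function-configuration \
  --function-name spark-lambda \
  --timeout 900 \
  --memory-size 3008
```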
@habemusne Would you mind helping fix the docs? If you feel comfortable, please file a PR.
@faromero Thanks for the detailed pointers. I don't have much experience with Lambda, so at the beginning I wasn't too concerned about the memory and timeout. Your documentation was much clearer, and I followed it almost exactly (as mentioned in the issue description). So I did add your "endpoint" line in my config file, but somehow I needed to add an AWS VPC Endpoint from the AWS console to fix some error. One thing to point out in your doc is that the build step takes a long time; it would be helpful if the doc made the user aware of that. I used a t2.micro EC2 instance and it took forever to compile the spark-lambda package. I simply moved on by downloading the zip from Qubole's S3, and I haven't yet hit an error caused by not compiling from source. @venkata91 More than happy to contribute! I will help with the docs when I am done with this issue and the whole setup. I think the doc in this repo is pretty minimal, and faromero's mentions more details. I wrote mine for my classmates, which has even more detail and may be easier for newbies (link). To what extent would you like it changed?
I am almost getting it working after some tough debugging. The first reason was that I did not open my security group to all inbound traffic. That was not the only cause of the error: running with the wrong path also counted. I wrote my own kmeans script, and the script used a data file path that was under
It seems tricky to use user-defined Python code and user-defined data on the framework. I guess there could be a framework fix for this issue, but directions on how to run user-defined data would be very useful, as an alternative to spending time modifying the framework.
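One generic workaround for the path problem is to keep the input data on S3 so that the driver and the Lambda workers resolve the same path; a sketch, where the bucket and file names are placeholders and the script is assumed to read its data path from its first argument:

```shell
# Upload the dataset once (bucket/key are placeholders), then point the
# script at the S3 URI instead of a path local to the EC2 driver, which the
# Lambda workers cannot see.
aws s3 cp ./kmeans_data.txt s3://my-spark-lambda-bucket/kmeans_data.txt
./bin/spark-submit --master lambda://test ml_kmeans.py \
  s3://my-spark-lambda-bucket/kmeans_data.txt
```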
Hey, I am trying to replicate the same setup, but I am facing a similar issue. @venkata91 talks about VPC stuff. I have a VPC established with both private and public subnets and a default NAT Gateway associated within the VPC. However, I do not seem to get the logs in CloudWatch after I have run the SparkPi example from my EC2 instance. I have made sure my configuration file has the right key-value pairs, and I can issue "aws lambda invoke --function-name spark-lambda ~/test.txt" from my EC2 instance. Can anyone suggest what I am missing here?
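A couple of generic checks that can narrow this down (standard AWS CLI commands; the log group name assumes the function is called spark-lambda, as elsewhere in this thread):

```shell
# Confirm the function's VPC wiring: it should sit in the private subnets
# that route through the NAT gateway, or it cannot reach S3 or the driver.
aws lambda get-function-configuration \
  --function-name spark-lambda --query VpcConfig

# Confirm whether any invocations from Spark reached CloudWatch at all
# (manual "aws lambda invoke" runs would also create streams here).
aws logs describe-log-streams \
  --log-group-name /aws/lambda/spark-lambda
```

If the manual invoke produces log streams but spark-submit never does, the problem is likely that the driver never calls the function, pointing back at the Spark configuration rather than the VPC.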
I have been struggling to set up this framework on my EC2 server. I tried my best to follow the instructions of both this repo and faromero's forked repo, but I get this error message each time I run
sudo ./../driver/bin/spark-submit ml_kmeans.py --master lambda://test
Here is the full log:
My config file:
The region in
~/.aws/config
is us-east-1. VPC subnets are configured following the forked repo's instructions. My Lambda function is tested to be able to write and read to S3. My spark-submit command run on EC2 is able to write to S3 (it generates a
tmp/
folder on the bucket), but does not trigger the Lambda at all. CloudWatch for my Lambda has no logs. However, I am able to run my Lambda from EC2 using something like aws lambda invoke --function-name spark-lambda ~/test.txt. I guess I configured Spark-on-Lambda wrong, but I have been following the instructions. I am now trying to dive into the source code. Is there any clue for this message?
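As a quick local sanity check that the driver and the CLI agree on the region, the value can be parsed straight out of the AWS-CLI-style config file; a minimal sketch that builds a sample file (on a real box, point CFG at ~/.aws/config instead):

```shell
# Parse the region out of an AWS-CLI-style config file. A sample file is
# created here so the sketch is self-contained; the [default] profile and
# us-east-1 value mirror the setup described in this issue.
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
[default]
region = us-east-1
EOF
awk -F' *= *' '/^region/ {print $2; exit}' "$CFG"   # prints: us-east-1
rm "$CFG"
```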