########################################################################################################################
##### Spark deploy instructions - First step of A3 #####
A Spark/HDFS cluster has been deployed for you to use. First we'll configure ports and security.
You can use a new or existing VM.
# 0. The web GUIs for Spark and HDFS are not open publicly, so we'll configure SSH port forwarding so that you
can access them on local TCP ports.
To do this, create or modify the file ~/.ssh/config on your local (laptop) computer by adding a section like the one shown below.
(This applies to unix-like systems and Windows Subsystem for Linux (WSL); you may have to adapt the instructions if you are using some other system.)
# replace 130.238.x.y and ~/.ssh/id_rsa with your floating IP and key path appropriately.
Host 130.238.x.y
    User ubuntu
    # modify this to match the name of your key
    IdentityFile ~/.ssh/id_rsa
    # Spark master web GUI
    LocalForward 8080 192.168.2.87:8080
    # HDFS namenode web GUI
    LocalForward 50070 192.168.2.87:50070
    # Python notebook
    LocalForward 8888 localhost:8888
    # Spark applications
    LocalForward 4040 localhost:4040
    LocalForward 4041 localhost:4041
    LocalForward 4042 localhost:4042
    LocalForward 4043 localhost:4043
    LocalForward 4044 localhost:4044
    LocalForward 4045 localhost:4045
    LocalForward 4046 localhost:4046
    LocalForward 4047 localhost:4047
    LocalForward 4048 localhost:4048
    LocalForward 4049 localhost:4049
    LocalForward 4050 localhost:4050
    LocalForward 4051 localhost:4051
    LocalForward 4052 localhost:4052
    LocalForward 4053 localhost:4053
    LocalForward 4054 localhost:4054
    LocalForward 4055 localhost:4055
    LocalForward 4056 localhost:4056
    LocalForward 4057 localhost:4057
    LocalForward 4058 localhost:4058
    LocalForward 4059 localhost:4059
    LocalForward 4060 localhost:4060
Notes:
- The 'IdentityFile' line follows the same syntax whether you are using a .pem key file or an OpenSSH key file (without an extension), as shown above. For a .pem, write something like this:
  IdentityFile ~/.ssh/my_key.pem
- If you are using Windows Subsystem for Linux (WSL), the path to the identity file needs to be relative to the root of the Ubuntu filesystem.
- You may get a warning about an "UNPROTECTED PRIVATE KEY FILE!" - to fix this, change the permissions on your key file to 600:
  chmod 600 ~/.ssh/mykey.pem
- If you are using Windows Subsystem for Linux (WSL), you may need to copy your SSH key into the Ubuntu filesystem to be able to modify the permissions.
With these settings, you can connect to your host like this (without any additional parameters):
ssh 130.238.x.y
Now when you open http://localhost:8080 in your browser, the connection is forwarded to 192.168.2.87:8080 - the web GUI of the Spark master.
# 0. Check that the Spark and HDFS cluster is operating by opening these links in your browser:
# http://localhost:8080
# http://localhost:50070
For HDFS, try Utilities > Browse the file system to see the files on the cluster.
# 0. Assign the security group 'spark-cluster-client' to your virtual machine for it to work correctly with Spark.
# (The machines in the Spark cluster need to be able to connect to your VM)
#####################
### These instructions are for Ubuntu 18.04
# update apt repo metadata
sudo apt update
# install java
sudo apt-get install -y openjdk-8-jdk
# manually define a hostname for all the hosts on the ldsa project; this will make networking easier with spark:
# NOTE! if you have added entries to /etc/hosts yourself, you need to remove those first.
for i in {1..255}; do echo "192.168.1.$i host-192-168-1-$i-ldsa" | sudo tee -a /etc/hosts; done
for i in {1..255}; do echo "192.168.2.$i host-192-168-2-$i-ldsa" | sudo tee -a /etc/hosts; done
# set the hostname according to the scheme above (e.g. a VM with IP 192.168.2.87 becomes host-192-168-2-87-ldsa):
sudo hostname host-$(hostname -I | awk '{print $1}' | sed 's/\./-/g')-ldsa ; hostname
########################################################################################################################
##### Install the Python Notebook #####
# Env variable so the workers know which Python to use...
echo "export PYSPARK_PYTHON=python3" >> ~/.bashrc
source ~/.bashrc
# install git
sudo apt-get install -y git
# install python dependencies, start notebook
# install the python package manager 'pip' (via apt)
sudo apt-get install -y python3-pip
# check the version -- this is a very old version of pip:
python3 -m pip --version
# upgrade it (installs the latest pip for your user)
python3 -m pip install --upgrade --user pip
# check the version again -- now it's 20.0.2 -- much more up to date!
python3 -m pip --version
# install pyspark (the same version as the cluster), and some other useful deps
python3 -m pip install pyspark==2.4.5 --user
python3 -m pip install pandas --user
python3 -m pip install matplotlib --user
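# Optional sanity check, as a minimal sketch: confirm that the installed pyspark matches the
# cluster version before connecting. Run this in python3 (or later, in a notebook cell):
#
# import pyspark
# print(pyspark.__version__)   # should print 2.4.5, the same version as the cluster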
# clone the examples from the lectures, so we have a copy to experiment with
git clone https://github.com/benblamey/jupyters-public.git
# install jupyter (installing via pip seems to be broken)
sudo apt install -y jupyter-notebook
# start the notebook!
jupyter notebook
# ...follow the instructions you see -- copy the link into your browser.
# Now you can run the examples from the lectures in your own notebook.
# Using the Jupyter Notebook, navigate into the directory you just cloned from GitHub.
# Start with ldsa-2020/Lecture1_Example1_ArraySquareandSum.ipynb
# You'll need to change the host names for the Spark master and the HDFS namenode to:
# 192.168.2.87
# When you start your application, you'll see it running in the Spark master web GUI (link at the top).
# Hover over the link to your application to see the port number of its web GUI.
# It will be 4040, 4041, ...
# You can open the GUI in your web browser like this (e.g.):
# http://localhost:4040
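# As a minimal sketch of the hostname change above: the snippet below connects to the Spark
# master and reads a text file from HDFS. The HDFS port (9000) and the file path here are
# assumptions for illustration only -- use the exact URL from the lecture notebook.
#
# from pyspark.sql import SparkSession
# spark_session = SparkSession.builder\
#     .master("spark://192.168.2.87:7077")\
#     .appName("yourname_hdfs_example")\
#     .getOrCreate()
# # 9000 and /some/example/file.txt are assumed values -- check the notebook
# df = spark_session.read.text("hdfs://192.168.2.87:9000/some/example/file.txt")
# print(df.count())   # number of lines in the file
# spark_session.stop()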
########################################################################################################################
##### Creating your own notebook that deploys spark jobs to the cluster #####
# When working on your own notebooks, save them in your own git repository (the one you created in A1 -- do a git clone) and
# make sure to commit and push changes often (for backup purposes).
# You need to share the Spark cluster with the other students:
# 1. Start your application with dynamic allocation enabled, a timeout of no more than 30 seconds, and a cap on CPU cores:
# from pyspark.sql import SparkSession
#
# spark_session = SparkSession\
#     .builder\
#     .master("spark://192.168.2.87:7077")\
#     .appName("blameyben_lecture1_simple_example")\
#     .config("spark.dynamicAllocation.enabled", True)\
#     .config("spark.shuffle.service.enabled", True)\
#     .config("spark.dynamicAllocation.executorIdleTimeout", "30s")\
#     .config("spark.executor.cores", 4)\
#     .getOrCreate()
# 2. Put your name in the name of your application.
# 3. Kill your application when you have finished with it (see the sketch below).
# 4. Don't interfere with any of the virtual machines in the cluster.
# 5. Run one app at a time.
# 6. When the lab is not running, you can use more resources, but keep an eye on other people using the system.
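# As a minimal sketch (reusing the spark_session created above): run a small job to confirm
# executors get allocated, then stop the session to release them for other students. The job
# itself is just an illustration.
#
# # a trivial job -- sums the squares of 0..999 on the cluster
# rdd = spark_session.sparkContext.parallelize(range(1000))
# print(rdd.map(lambda x: x * x).sum())
# # kill the application when finished (rule 3)
# spark_session.stop()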