This repository defines multiple Ansible roles to help deploy different modes of a Spark cluster and a Data Science Platform based on the Anaconda and Jupyter Notebook stack.
You will need a driver machine with ansible installed and a clone of the current repository:
- If you are running on cloud (public/private network)
  - Install ansible on the edge node (with public IP)
- If you are running on private cloud (public network access to all nodes)
  - Install ansible on your laptop and drive the deployment from it
curl -O https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo rpm -i epel-release-latest-7.noarch.rpm
sudo yum update -y
sudo yum install -y ansible
- Install Anaconda
- Use pip to install ansible
pip install --upgrade ansible
To allow variables to be overridden from the host inventory, add the following configuration to your ~/.ansible.cfg file:
[defaults]
host_key_checking = False
hash_behaviour = merge
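To illustrate what `hash_behaviour = merge` changes, here is a minimal Python sketch of the two behaviours for dictionary variables. It is not Ansible's own code. The `merge_hash` helper and the `defaults` and `override` values are hypothetical; only the `notebook` key names are borrowed from the notebook play in this README. Which source actually wins in a real run is decided by Ansible's variable precedence, so the sketch simply assumes one lower-precedence and one higher-precedence dictionary.

```python
from copy import deepcopy


def merge_hash(base, override):
    """Recursively merge `override` into a copy of `base`; `override` wins on conflicts."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Both sides are dictionaries: descend instead of overwriting.
            result[key] = merge_hash(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


# Hypothetical lower-precedence values.
defaults = {"notebook": {"use_anaconda": True, "deploy_kernelspecs_to_workers": False}}
# Hypothetical higher-precedence values that set only one key.
override = {"notebook": {"deploy_kernelspecs_to_workers": True}}

# "replace" (Ansible's default): the whole `notebook` dictionary is swapped out,
# so `use_anaconda` is lost.
replaced = {**defaults, **override}

# "merge": keys from both dictionaries survive, and the override wins on conflicts.
merged = merge_hash(defaults, override)

print(replaced)
print(merged)
```

With the merge behaviour an override only needs to list the keys it changes, which is presumably why the setting is requested here.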
- RHEL 7.x
- Ansible 2.6.3
Ansible uses 'host inventory' files to define the cluster configuration, nodes, and groups of nodes that serve a given purpose (e.g. master node).
Below is a host inventory sample definition:
[all:vars]
ansible_connection=ssh
#ansible_user=root
#ansible_ssh_private_key_file=~/.ssh/ibm_rsa
gather_facts=True
gathering=smart
host_key_checking=False
install_java=True
install_temp_dir=/tmp/ansible-install
install_dir=/opt
python_version=2
[master]
lresende-elyra-node-1 ansible_host=IP ansible_host_private=IP ansible_host_id=1
[nodes]
lresende-elyra-node-2 ansible_host=IP ansible_host_private=IP ansible_host_id=2
lresende-elyra-node-3 ansible_host=IP ansible_host_private=IP ansible_host_id=3
lresende-elyra-node-4 ansible_host=IP ansible_host_private=IP ansible_host_id=4
lresende-elyra-node-5 ansible_host=IP ansible_host_private=IP ansible_host_id=5
Some specific configurations are:
- install_java=True : install/update java 8
- install_temp_dir=/tmp/ansible-install : temporary folder used for install files
- install_dir=/opt : where packages are installed (e.g. Spark)
- python_version=2 : Python version to use; influences which version of Anaconda is downloaded
Note: ansible_host_id is only used when deploying a "Spark Standalone" cluster.
Note: Ambari currently supports only Python 2.x.
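Before running a deployment it can help to sanity-check the inventory. The Python sketch below is not part of this repository. It reads an INI-style inventory like the sample above and reports any `ansible_host_id` value used by more than one host. That ids should be unique is an assumption drawn from the sample, where they run from 1 to 5. The function names `parse_hosts` and `duplicate_host_ids` are hypothetical.

```python
import shlex
from collections import defaultdict


def parse_hosts(inventory_text):
    """Return {group: {host: {var: value}}} for the host lines of an INI-style inventory."""
    groups = defaultdict(dict)
    section = None
    for raw in inventory_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section is None or ":" in section:
            # Skip ungrouped lines and sections such as [all:vars], which hold no hosts.
            continue
        host, *pairs = shlex.split(line)
        groups[section][host] = dict(p.split("=", 1) for p in pairs if "=" in p)
    return dict(groups)


def duplicate_host_ids(groups):
    """Return {id: [hosts]} for every ansible_host_id shared by more than one host."""
    seen = defaultdict(list)
    for hosts in groups.values():
        for host, host_vars in hosts.items():
            if "ansible_host_id" in host_vars:
                seen[host_vars["ansible_host_id"]].append(host)
    return {host_id: names for host_id, names in seen.items() if len(names) > 1}
```

A possible use is `duplicate_host_ids(parse_hosts(open("hosts-fyre-spark").read()))`, where an empty dictionary means no id is repeated.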
In this scenario, a minimal blueprint is used to deploy the required components to run YARN and Spark.
- Common: Deploys Java and common dependencies
- Ambari: Deploys Ambari cluster with HDP Stack
The sample playbook below can be used to deploy Spark using an HDP distribution:
- name: ambari setup
  hosts: all
  remote_user: root
  roles:
    - role: common
    - role: ambari
ansible-playbook --verbose <deployment playbook.yml> -i <hosts inventory>
Example:
ansible-playbook --verbose setup-ambari.yml -c paramiko -i hosts-fyre-ambari
In this scenario, a Standalone Spark cluster will be deployed with a few optional components.
- Common: Deploys Java and common dependencies
- HDFS: Deploys HDFS filesystem using slave nodes as data nodes
- Spark: Deploys Spark in Standalone mode using slave nodes as workers
- Spark-Cluster-Admin: Utility scripts for managing Spark cluster
- ElasticSearch: Deploys ElasticSearch nodes on all slave nodes
- Zookeeper: Deploys Zookeeper on all nodes (required by Kafka)
- Kafka: Deploys Kafka nodes on all slave nodes
- name: spark setup
  hosts: all
  remote_user: root
  roles:
    - role: common
    - role: hdfs
    - role: spark
    - role: spark-cluster-admin
Note: When deploying Kafka, the Zookeeper role is required
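The Kafka requirement can be checked mechanically before a playbook is run. The Python sketch below is only an illustration. The lowercase role names `kafka` and `zookeeper` are assumed (the README gives only the display names), and the `REQUIRES` table and `missing_dependencies` function are hypothetical. Because roles in a play run in the order listed, the sketch also assumes the dependency should be listed before the role that needs it.

```python
# Assumed role names; adjust to match the role directories in your clone.
REQUIRES = {"kafka": ["zookeeper"]}


def missing_dependencies(roles):
    """Return (role, dependency) pairs where the dependency is not listed before the role."""
    problems = []
    for index, role in enumerate(roles):
        for dependency in REQUIRES.get(role, []):
            if dependency not in roles[:index]:
                problems.append((role, dependency))
    return problems


# Role list from the sample playbook, extended with the optional Kafka role.
print(missing_dependencies(["common", "hdfs", "spark", "spark-cluster-admin", "kafka"]))
```

An empty list means every required role is listed ahead of its dependent.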
ansible-playbook --verbose <deployment playbook.yml> -i <hosts inventory>
Example:
ansible-playbook --verbose setup-spark-standalone.yml -c paramiko -i hosts-fyre-spark
In this scenario, an existing Spark cluster is updated with the components necessary to build a data science platform based on the Anaconda and Jupyter Notebook stack.
- Anaconda: Deploys Anaconda Python distribution on all nodes
- Notebook: Deploys Notebook Platform
- name: anaconda
  hosts: all
  vars:
    anaconda:
      update_path: true
  remote_user: root
  roles:
    - role: anaconda
- name: notebook platform dependencies
  hosts: all
  vars:
    notebook:
      use_anaconda: true
      deploy_kernelspecs_to_workers: false
  remote_user: root
  roles:
    - role: notebook
Playbook Configuration
- use_anaconda: flag indicating whether Anaconda is available and should be used as the Python package manager
- deploy_kernelspecs_to_workers: optionally deploy kernelspecs for Python, R, and Scala to all nodes
- The Ambari role will install MySQL Community Edition, which is available under the GPL license.
- The Notebook role will install R, which is available under GPL-2 | GPL-3.
By deploying these packages via the ansible utility scripts in this project you are accepting the license terms for these components.