This repository has been archived by the owner on Nov 23, 2017. It is now read-only.

Config file devel #41

Open

ar-ms wants to merge 5 commits into branch-1.6

Conversation

ar-ms

@ar-ms ar-ms commented Jul 20, 2016

Configuration file

The -c or --conf-file option lets you use a YAML configuration file to simplify the way you interact with the cluster.

Launch a new cluster

./spark-ec2 --conf-file config.yml launch my_cluster

Example of a configuration file

slaves: 2
instance_type: r3.xlarge
region: eu-west-1
zone: eu-west-1c
ami: ami-xxxxxxxx
spot_price: 0.07
hadoop_major_version: yarn

spark_ec2_git_repo: https://github.com/tirami-su/spark-ec2
spark_ec2_git_branch: config_file-devel

copy_aws_credentials: true
no_ganglia: y
resume: on

## Network and security
key_pair: xxxxx
identity_file: xxxxx.pem

credentials: 
   aws_access_key_id: XXXXX
   aws_secret_access_key: XXXXX

Note

  • In this version, when the -c or --conf-file option is specified, only the parameters present in the configuration file are used. Any other options given on the command line are ignored (an illustrative sketch follows below).
  • The identity file must be located in ~/.ssh/.
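
For illustration (not part of the PR): a minimal sketch of how the behaviour described in the first note could look, assuming the YAML keys match the option destinations listed later in this thread; load_conf_file is an illustrative name, not the PR's code.

import yaml

def load_conf_file(path, parser):
    # Illustrative helper: build an options object from the parser's
    # defaults plus the values found in the YAML file, ignoring the CLI.
    with open(path) as f:
        conf = yaml.safe_load(f) or {}
    opts = parser.get_default_values()   # optparse's built-in defaults
    for key, value in conf.items():
        setattr(opts, key, value)
    return opts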

Modifications

  • JSON to YAML
  • Keys must be located in ~/.ssh/

@nchammas
Contributor

This is a much-requested feature from the early days of spark-ec2, so thank you for working on this.

I wonder, though, if there is a way to implement this feature without using JSON -- which IMO is not appropriate as a configuration format -- and without manually maintaining this mapping of config name to variable name.

Are there no features of argparse that can be used to offer this functionality in a more straightforward way?

@ar-ms
Author

ar-ms commented Jul 25, 2016

I really enjoy using this feature; it helps me make fewer errors when managing a cluster. I didn't know it was a much-requested feature.

I thought JSON was an appropriate format; it seems clear and simple to manipulate.
We could use ConfigParser, which uses the INI file format.
The YAML format could be nice too, but it requires installing the PyYAML library...

I didn't find a feature of argparse that gives us what we want, but I wrote a function to avoid maintaining the manual mapping.

# Options that should not be settable from the config file.
unneeded_opts = ("version", "help")

def mapping_conf(opt):
    # str(opt) looks like "-s/--slaves"; keep the long form only.
    normal_opt = str(opt).split("/")[-1]
    # "--hadoop-major-version" -> "hadoop_major_version"
    trans_opt = normal_opt.strip("-").replace("-", "_")
    return {trans_opt: normal_opt}

map_conf = {}
for opt in parser.option_list:
    map_conf.update(mapping_conf(opt))
for unneeded_opt in unneeded_opts:
    map_conf.pop(unneeded_opt, None)

map_conf:

{
    'D': '-D',
    'additional_security_group': '--additional-security-group',
    'additional_tags': '--additional-tags',
    'ami': '--ami',
    'authorized_address': '--authorized-address',
    'conf_file': '--conf-file',
    'copy_aws_credentials': '--copy-aws-credentials',
    'delete_groups': '--delete-groups',
    'deploy_root_dir': '--deploy-root-dir',
    'ebs_vol_num': '--ebs-vol-num',
    'ebs_vol_size': '--ebs-vol-size',
    'ebs_vol_type': '--ebs-vol-type',
    'ganglia': '--ganglia',
    'hadoop_major_version': '--hadoop-major-version',
    'identity_file': '--identity-file',
    'instance_initiated_shutdown_behavior': '--instance-initiated-shutdown-behavior',
    'instance_profile_name': '--instance-profile-name',
    'instance_type': '--instance-type',
    'key_pair': '--key-pair',
    'master_instance_type': '--master-instance-type',
    'master_opts': '--master-opts',
    'no_ganglia': '--no-ganglia',
    'placement_group': '--placement-group',
    'private_ips': '--private-ips',
    'profile': '--profile',
    'region': '--region',
    'resume': '--resume',
    'slaves': '--slaves',
    'spark_ec2_git_branch': '--spark-ec2-git-branch',
    'spark_ec2_git_repo': '--spark-ec2-git-repo',
    'spark_git_repo': '--spark-git-repo',
    'spark_version': '--spark-version',
    'spot_price': '--spot-price',
    'subnet_id': '--subnet-id',
    'swap': '--swap',
    'use_existing_master': '--use-existing-master',
    'user': '--user',
    'user_data': '--user-data',
    'vpc_id': '--vpc-id',
    'wait': '--wait',
    'worker_instances': '--worker-instances',
    'zone': '--zone'
}
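
For illustration (not part of the PR): a hedged sketch of how map_conf could be used to validate the keys read from the YAML file before applying them; only yaml.safe_load and optparse's parse_args / parser.error are assumed to exist.

import yaml

(opts, args) = parser.parse_args()
with open(opts.conf_file) as f:
    conf = yaml.safe_load(f) or {}
for key, value in conf.items():
    if key not in map_conf:
        # Reject keys that do not correspond to any known command-line flag.
        parser.error("Unknown option in config file: %s" % key)
    setattr(opts, key, value)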

Do you think this solution could do the job?

@nchammas
Contributor

Yes, I think it's a step up to have a function that does the mapping for us.

Regarding JSON vs. something else, I am biased since I chose YAML for Flintrock. 😄 The main problem I have with JSON is that you can't put comments in it. In Flintrock I also used Click to parse command-line arguments, since that lets the user load configs from a file but override individual settings at the command-line.

Obviously, we don't have to do that here, as it would be a pretty big change. If this works well enough for @shivaram then it's fine by me too.

@shivaram
Contributor

I prefer YAML as well compared to JSON -- it's just more human-friendly to read and format.
@Tirami-su Would it be possible to have a similar function that generates the mapping for YAML?

Also, for YAML parsing we should download the library similar to how we download boto -- that should make it transparent to the user?
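
For illustration (not part of the PR): a hedged sketch of what this could look like, assuming PyYAML is simply appended to the external_libs list used for boto (shown in the review snippet further down); the version and md5 below are placeholders, not verified values.

# Only PyPI libraries are supported.
external_libs = [
    {
        "name": "boto",
        "version": "2.34.0",
        "md5": "5556223d2d0cc4d06dd4829e671dcecd",
    },
    {
        "name": "PyYAML",
        "version": "3.11",             # placeholder version
        "md5": "PLACEHOLDER-MD5",      # placeholder checksum of the release tarball
    },
]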

@ar-ms
Author

ar-ms commented Jul 25, 2016

@nchammas haha you're biased but you're right 😄 !

@shivaram I will take a look and let you know, I think that it should be possible. Yep you're right, we should do this 👍 !

@ar-ms
Author

ar-ms commented Jul 27, 2016

@nchammas, @shivaram I pushed the new version with the YAML configuration file format. You can check :) !



# Only PyPI libraries are supported.
external_libs = [
    {
        "name": "boto",
        "version": "2.34.0",
        "md5": "5556223d2d0cc4d06dd4829e671dcecd",
        "add-path": None
Contributor

What's the purpose of add-path?

Author

@ar-ms ar-ms Jul 27, 2016

The purpose of add-path is to specify a custom function to execute, because some libraries need to be added to the system path differently than just doing sys.path.insert(1, libs/LIBRARY_NAME).
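
For illustration (not part of the PR): a hypothetical sketch of how such an add-path hook might be consumed when setting up an external library; setup_lib and lib_dir are made-up names.

import os
import sys

def setup_lib(lib, lib_dir="lib"):
    lib_path = os.path.join(lib_dir, "%s-%s" % (lib["name"], lib["version"]))
    if lib.get("add-path") is not None:
        # Library needs a custom path-setup function.
        lib["add-path"](lib_path)
    else:
        # Default behaviour: put the library directory on sys.path.
        sys.path.insert(1, lib_path)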

@nchammas
Contributor

@Tirami-su you will probably also want to add a note to the README, perhaps along with a brief example config file, to show users how this feature works.

@ar-ms
Author

ar-ms commented Jul 27, 2016

@nchammas No prob, I will add a description of this feature with an example.

@@ -327,13 +411,19 @@ def parse_args():
parser.add_option(
"--instance-profile-name", default=None,
help="IAM profile name to launch instances under")
parser.add_option("-c", "--conf-file", default=None,
help="Specify config file", metavar="FILE")
Contributor

Can we add some more description of what the config file format is here? A pointer to some other readme / markdown file will be fine if we don't want to make this too long. Also it'll be good to clarify precedence -- i.e. if a config option is present both on the command line and in the file, which one is used?

Author

Yep, we need to add some more description in the help section. It would be great to document this feature; can we add a pointer to the README.md file?

If -c/--conf-file is used, only the arguments present in the configuration file are used, and the options present on the command line are ignored. It would be nice to put this information somewhere everyone can spot it directly. Can we print a message to the console when -c/--conf-file is used, to indicate that only the arguments in the configuration file are used? Or write it in the help section of add_option?

When we switch from optparse to argparse, we will be able to manage the config file and the other command-line arguments at the same time.
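
For illustration (not part of the PR): a rough sketch of the argparse-based precedence mentioned above, where config-file values become defaults and explicit command-line flags still win; the option subset and defaults are placeholders.

import argparse
import yaml

parser = argparse.ArgumentParser()
parser.add_argument("-c", "--conf-file")
parser.add_argument("--slaves", type=int, default=1)          # placeholder default
parser.add_argument("--instance-type", default="m3.large")    # placeholder default

# First pass: find out whether a config file was given.
args, _ = parser.parse_known_args()
if args.conf_file:
    with open(args.conf_file) as f:
        parser.set_defaults(**(yaml.safe_load(f) or {}))
# Second pass: command-line flags override config-file defaults.
args = parser.parse_args()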

Contributor

I just realized we haven't merged #34 yet -- will it be fine if I merge that first?
