Just a few comments as a point of departure in working with this repository:
There are four parts to this archive:
- predictions.tsv.gz is the file containing the model predictions for my submission.
- so1rb_doc contains the documentation.
- so1rb is a reusable python package including a set of scripts for fitting a model and running it on data.
- so1rb_explore is a sequence of self-contained python scripts that I used to explore the data.
so1rb_doc/SOLUTION.html is a description of what's under so1rb and of what the code does, without, however, going into detail on why I believe that solution is the right solution to the problem.
so1rb_doc/EXPLORATION.html describes how I explored the data, which informed the solution I came up with.
To generate documentation from source, you will need pandoc (I used 1.12.2.1). Use the Makefile provided under so1rb_doc/Makefile.
I'm listing the versions I used for development and testing. It probably works with other versions, too, but there are no guarantees, obviously.
- Python 3.4.2
- kyotocabinet 1.2.76 plus kyotocabinet-python 1.22
- leveldb 0.193 plus plyvel 0.9
- numpy 1.9.0
- scikit-learn 0.15.2
For so1rb_explore you will also need
- matplotlib 1.4.0
There is a script to generate a zip file for each of so1rb and so1rb_explore:
pack.sh
You can then conveniently run the zip file from python:
python3 so1rb_explore.zip step01_separate_dev_data /dta/so1
python3 so1rb.zip so1rb01_separate_dev_data /dta/so1
The first argument is always the path to a directory where the scripts will read input files and write output files.
For the scripts in so1rb_explore, this is the only argument required, everything else being hardcoded.
For so1rb, the call signatures are as follows:
- so1rb01_separate_dev_data [working dir]
- so1rb02_fit_frontend [working dir]
- so1rb03_fit_discretizer [working dir]
- so1rb04_fit_fselector [working dir]
- so1rb05_fit_model [working dir] [model filename], where [model filename] needs to be one of mdlp.kch if you'd like to fit a model with homebrew feature engineering or mdlq.kch if you'd like to fit a model with KPCA feature engineering.
- so1rb06_run_model [working dir] [model filename] [data filename], where [model filename] is as before (now being read, not written), and [data filename] is the name of the data you'd like to run the model on.
- so1rb07_evaluate_run [working dir] [data filename] [run filename], where [data filename] is as before, and [run filename] is the name of an output file generated by so1rb06_run_model.
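The call signatures above could be chained from a small driver. What follows is a minimal sketch and not part of the archive: the helper name `pipeline_commands` is hypothetical, it assumes the so1rb.zip produced by pack.sh sits in the current directory, and the data and run filenames in the demo are placeholders. It only builds the argument lists in order; actually running them (for example with `subprocess.check_call`) is left to the caller.

```python
import sys


def pipeline_commands(working_dir, model_filename, data_filename, run_filename,
                      zip_path="so1rb.zip"):
    """Build the argv lists for the seven so1rb steps, in order.

    Hypothetical helper: it only mirrors the call signatures listed above.
    """
    if model_filename not in ("mdlp.kch", "mdlq.kch"):
        raise ValueError("model filename must be mdlp.kch or mdlq.kch")
    # Run the zip file with the same interpreter that runs this driver.
    base = [sys.executable, zip_path]
    return [
        base + ["so1rb01_separate_dev_data", working_dir],
        base + ["so1rb02_fit_frontend", working_dir],
        base + ["so1rb03_fit_discretizer", working_dir],
        base + ["so1rb04_fit_fselector", working_dir],
        base + ["so1rb05_fit_model", working_dir, model_filename],
        base + ["so1rb06_run_model", working_dir, model_filename, data_filename],
        base + ["so1rb07_evaluate_run", working_dir, data_filename, run_filename],
    ]


if __name__ == "__main__":
    # Placeholder filenames; substitute the files in your own working dir.
    for argv in pipeline_commands("/dta/so1", "mdlp.kch",
                                  "data.tsv.gz", "run.tsv.gz"):
        print(" ".join(argv))
```

Keeping the steps as plain argument lists makes it easy to print them for inspection before anything is executed.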
Filenames are referred to without path prefix (it is assumed that they
reside in [working dir]). Data files are not unpacked; they are always
read and written with gzip compression.
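The convention of bare filenames plus transparent gzip compression can be illustrated roughly as follows. This is a sketch under assumptions, not the code in so1rb: the helper names `open_data`, `read_rows` and `write_rows` are hypothetical, and it assumes tab-separated, UTF-8 encoded data files.

```python
import csv
import gzip
import os


def open_data(working_dir, filename, mode="rt"):
    """Open a gzip-compressed data file that lives in the working directory.

    Hypothetical helper; the filename is given without a path prefix.
    """
    path = os.path.join(working_dir, filename)
    return gzip.open(path, mode, encoding="utf-8", newline="")


def read_rows(working_dir, filename):
    """Yield the rows of a gzip'd, tab-separated file as lists of strings."""
    with open_data(working_dir, filename) as handle:
        for row in csv.reader(handle, delimiter="\t"):
            yield row


def write_rows(working_dir, filename, rows):
    """Write rows to a gzip'd, tab-separated file in the working directory."""
    with open_data(working_dir, filename, "wt") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
```

With helpers like these, the data never needs to be unpacked on disk: `read_rows("/dta/so1", "train.tsv.gz")` would stream the rows directly from the compressed file.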
Here at So1, we (the "algo" team) do your typical data-science-y things:
- We write a lot of code (to make models, explore data, etc.)
- We pull our hair out when said code doesn't work
- We pull our hair out when said code works, but the models are crap
- We use git to manage our code
- We try to facilitate knowledge transfer through written media (read: we occasionally write notes to each other)
Justifiably, 1-in-3 of our data scientists are bald, and So1 occasionally sponsors experimental hair replacement surgeries*.
However, aside from your typical data-science-y things, we are always looking for new talent to join the team. To give you a taste of the kind of work we do here, and the way we do our work, we've made this git repo, and some fake data for you to noodle around with. Completing this challenge requires:
- Command of at least one programming language (we mainly use R, but other languages are okay)
- A basic fluency in git
- Fork this repo.
- Download and unzip the training and testing data, or (if you're a CLI junkie) use
wget http://algo-recruitment-data.s3-website.eu-central-1.amazonaws.com/train.tsv.gz
wget http://algo-recruitment-data.s3-website.eu-central-1.amazonaws.com/test.tsv.gz
gunzip train.tsv.gz
gunzip test.tsv.gz
- Load up your favorite text editor (mine is Vim 😉) and start playing with/modeling the data.
- When you're done, open a Pull Request, and we'll check out what you've done and benchmark your predictions.
Per usual github etiquette, if you spot any issues with the data, or need any clarifications, raise an issue.
Being able to write code, make models, and generate predictions is all well and good, but we need people who can also communicate insight. The point of this challenge is three-fold:
- Find out how good your ML-fu is
- Find out how you think
- Find out how well you can communicate what you're thinking
Consequently, your deliverables are
- A gzip'd .csv (or .tsv) of (binary) predictions containing columns named id and y
  - At the risk of being pedantic, your y should come from data provided in the id row of test.tsv
  - We will be evaluating their F1-score
- A report detailing how you arrived at your predictions
  - While other data scientists are your intended audience, assume that they have no interest in seeing code snippets in here
  - Your report should contain complete sentences (and graphics, if you're inclined)
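For the first deliverable, a gzip'd .tsv with the columns id and y could be written along these lines. This is only a sketch: the function name `write_predictions` and the output filename are hypothetical, and it assumes the binary predictions are encoded as 0 and 1, which the text above does not spell out.

```python
import csv
import gzip


def write_predictions(path, ids, predictions):
    """Write a gzip'd TSV with the columns id and y.

    Assumes binary predictions encoded as 0/1 (an assumption, see above).
    """
    ids = list(ids)
    predictions = list(predictions)
    if len(ids) != len(predictions):
        raise ValueError("need exactly one prediction per id")
    if any(y not in (0, 1) for y in predictions):
        raise ValueError("predictions must be binary (0 or 1)")
    with gzip.open(path, "wt", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["id", "y"])  # header row with the required names
        for row_id, y in zip(ids, predictions):
            writer.writerow([row_id, int(y)])


if __name__ == "__main__":
    # Toy ids and predictions, for illustration only.
    write_predictions("predictions_example.tsv.gz", [101, 102, 103], [0, 1, 0])
```

Validating before opening the file avoids leaving a half-written archive behind when the inputs are inconsistent.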
The report doesn't need to be Pulitzer Prize-worthy, but a few lines about how,
...[you] tried an SVM, and logistic regression with a cross-validated thresholding parameter...
is a one-way ticket to the trash bin.
- A (quantified) comparison of multiple models
- Any insight as to why certain methods performed better than others
- Graphics from any EDA that might have led to feature-related insights
- An explanation of any feature engineering that occurred
- An analysis of the final model chosen to produce the uploaded predictions
- Model performance (perhaps as a function of certain model parameters)
- Decision boundaries or other interesting clustering results/visualizations
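Since predictions are evaluated by F1-score, one way to report model performance as a function of a model parameter is to tabulate F1 over a range of decision thresholds. The following is a plain-Python sketch, not a prescribed method: the function names and the toy labels and scores are made up for illustration, and it assumes labels encoded as 0/1 with 1 as the positive class. Thresholds should of course be chosen on held-out training data, not on the test set.

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: define F1 as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_threshold(y_true, scores, thresholds):
    """Return (threshold, F1) pairs, predicting 1 when score >= threshold."""
    results = []
    for threshold in thresholds:
        y_pred = [1 if s >= threshold else 0 for s in scores]
        results.append((threshold, f1_score(y_true, y_pred)))
    return results


if __name__ == "__main__":
    # Toy data, for illustration only.
    labels = [1, 0, 1, 1, 0]
    scores = [0.9, 0.6, 0.7, 0.4, 0.1]
    for threshold, f1 in f1_by_threshold(labels, scores, [0.3, 0.5, 0.8]):
        print("threshold=%.1f  F1=%.3f" % (threshold, f1))
```

A table or plot of such pairs, one per candidate model, gives the kind of quantified comparison asked for above.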
In case you're interested, you can see how well you stack up against previous participants.
* So1 doesn't actually offer experimental hair surgery as an explicit employee benefit. Maybe the German government might, though!