plasticc-kaggle

PLAsTiCC Kaggle Challenge Submission

This repository contains the source code for the approach described in the Kaggle kernel CNN based Classification of Light Curves.

There's also an associated EDA kernel here.

Dependencies

The requirements.txt file is auto-generated with pigar; here are the steps I followed to create the environment:

  • Install Anaconda 3
  • Create a conda environment with python 3.6
  • Install Tensorflow
  • Install Keras
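
As shell commands, those steps look roughly like this (the environment name plasticc is an arbitrary choice here; the exact package set pigar detected is in requirements.txt):

conda create -n plasticc python=3.6
conda activate plasticc        # "source activate plasticc" on older conda
pip install tensorflow keras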

Source

I assume the input directory has the following structure:

input/
  training_set.csv            # time series data
  training_set_metadata.csv   # object metadata
  train/
    train_csv/                # base directory where the per-object csv files are generated (see below)
    train_dmdt/               # base directory where the DMDT images are generated (see below)

I've divided the main kernel into four files:

  • split_csv.py - breaks the time series data into one csv file per object, which is especially useful when the number of objects is huge, as it was for the test set. It expects a base directory (input/train_csv by default) and generates objects.csv, a list of all unique object ids, plus one csv file per object containing that object's time series data (a sketch of this step follows this list).
  • dmdtize.py - generates DMDT images for each object from its individual csv file. It expects the location where the csv files are stored (input/train_csv by default) and a base directory in which to store the DMDT images (input/train_dmdt by default; see the dm-dt sketch after the run sequence below).
  • train.py - trains a Keras model on the DMDT images (a model sketch appears further below). It expects the location where the DMDT images are stored (input/train_dmdt by default). The resulting model is saved to model/model_<timestamp>.h5.
  • predict.py - uses the trained model to generate predictions for a set of images (sketched at the end of this README). It expects the location where the DMDT images are stored (input/train_dmdt by default) and a model (model/model_<timestamp>.h5 by default). The results are written to output/test_results.csv.
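
As a rough illustration of the first step, here is a minimal pandas sketch of a per-object split. The object_id column name matches the PLAsTiCC CSVs, but the per-object file naming is an assumption, and any chunked reading the real script does for the huge test set is omitted:

import os
import pandas as pd

base_dir = "input/train_csv"  # default base directory described above
os.makedirs(base_dir, exist_ok=True)

df = pd.read_csv("input/training_set.csv")

# objects.csv: the list of all unique object ids.
df[["object_id"]].drop_duplicates().to_csv(
    os.path.join(base_dir, "objects.csv"), index=False)

# One csv per object with that object's time series rows.
# The "<object_id>.csv" naming is an assumption for illustration.
for object_id, rows in df.groupby("object_id"):
    rows.to_csv(os.path.join(base_dir, f"{object_id}.csv"), index=False)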

The general sequence to run the source files would be:

python split_csv.py
python dmdtize.py
python train.py
python predict.py
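
For intuition about what dmdtize.py produces: a dm-dt image bins every pairwise (change in brightness, change in time) combination from a light curve into a 2D histogram. A minimal numpy sketch follows; the bin edges, per-passband handling, and normalization are illustrative assumptions and may differ from what dmdtize.py actually does:

import numpy as np
import pandas as pd

def dmdt_image(times, values, dm_edges, dt_edges):
    # All pairs i < j; with times sorted ascending, dt is non-negative.
    i, j = np.triu_indices(len(times), k=1)
    dm = values[j] - values[i]
    dt = times[j] - times[i]
    img, _, _ = np.histogram2d(dm, dt, bins=[dm_edges, dt_edges])
    # Normalize so the image does not depend on the number of observations.
    return img / max(img.sum(), 1.0)

# Illustrative edges: a 23x23 image, log-spaced in time (days).
dm_edges = np.linspace(-8, 8, 24)
dt_edges = np.logspace(-2, 3, 24)

lc = pd.read_csv("input/train_csv/615.csv").sort_values("mjd")  # hypothetical per-object file
img = dmdt_image(lc["mjd"].to_numpy(), lc["flux"].to_numpy(), dm_edges, dt_edges)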

I have pre-computed the images for the training set as a Kaggle dataset (see Plasticc DMDT Images), so you can skip the first two source files if you only need to run on the training set.
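
Once the images exist (generated or downloaded), train.py fits a CNN on them. As a reference point, here is a minimal Keras sketch; the layer sizes and the 23x23x6 input shape (one channel per LSST passband) are illustrative assumptions, not the kernel's exact architecture:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(23, 23, 6)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    Flatten(),
    Dense(128, activation="relu"),
    Dropout(0.5),
    Dense(14, activation="softmax"),  # the 14 PLAsTiCC training classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])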

I also have the images for the test set, but they are understandably huge and hard to share. I keep them on an EC2 instance; let me know if you'd like access.
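
Finally, the prediction step amounts to loading the saved model, running it over the images, and writing one row of class probabilities per object. A minimal sketch; how dmdtize.py serializes the images is not specified in this README, so the .npy loading below is purely hypothetical:

import numpy as np
import pandas as pd
from keras.models import load_model

model = load_model("model/model_<timestamp>.h5")  # substitute the actual timestamp

object_ids = pd.read_csv("input/train_csv/objects.csv")["object_id"]
# Hypothetical: assumes one .npy image per object under input/train_dmdt/.
images = np.stack([np.load(f"input/train_dmdt/{oid}.npy") for oid in object_ids])

probs = model.predict(images)  # shape: (n_objects, n_classes)
pd.DataFrame(probs, index=object_ids).to_csv("output/test_results.csv")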
