PipelineProfiler

PipelineProfiler is an AutoML pipeline exploration tool for Jupyter Notebooks. It supports the Auto-Sklearn, Alpha-AutoML, and D3M pipeline formats.

(Shift click to select multiple pipelines)

Paper: https://arxiv.org/abs/2005.00160

Video: https://youtu.be/2WSYoaxLLJ8

Blog: Medium post

Demo

Live demo (Google Colab):

In Jupyter Notebook:

import PipelineProfiler
data = PipelineProfiler.get_heartstatlog_data()
PipelineProfiler.plot_pipeline_matrix(data)

You can also find multiple examples of PipelineProfiler in the Alpha-AutoML repository, an extensible AutoML system for multiple ML tasks.

Install

Option 1: install via pip:

pip install pipelineprofiler

Option 2: build and run the Docker image:

docker build -t pipelineprofiler .
docker run -p 9999:8888 pipelineprofiler

Then copy the access token and log in to Jupyter in your browser at:

localhost:9999

Data preprocessing

PipelineProfiler reads data from the D3M Metalearning database. You can download this data from: https://metalearning.datadrivendiscovery.org/dumps/2020/03/04/metalearningdb_dump_20200304.tar.gz

To explore the pipelines, you first need to merge two files from the dump: pipelines.json and pipeline_runs.json. To do so, run:

python -m PipelineProfiler.pipeline_merge [-n NUMBER_PIPELINES] pipeline_runs_file pipelines_file output_file

Pipeline exploration

In a Jupyter notebook, import the package and load the output file produced by the merge step:

import json
import PipelineProfiler

with open("output_file.json", "r") as f:
    pipelines = json.load(f)

and then plot it using:

PipelineProfiler.plot_pipeline_matrix(pipelines[:10])
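
The slice `pipelines[:10]` simply takes the first ten records, so it can help to check which problems the merged file covers before choosing what to plot. The sketch below is a minimal, self-contained illustration: it uses a few hand-made records in place of the real file and assumes each record carries a `problem` object with an `id` key. The helper name `count_pipelines_per_problem` and the example ids are hypothetical.

```python
from collections import Counter


def count_pipelines_per_problem(pipelines):
    """Return a Counter mapping problem id -> number of pipelines."""
    return Counter(p["problem"]["id"] for p in pipelines)


# Hand-made stand-ins for records loaded from output_file.json.
sample_pipelines = [
    {"problem": {"id": "problem_a"}},
    {"problem": {"id": "problem_b"}},
    {"problem": {"id": "problem_a"}},
]

counts = count_pipelines_per_problem(sample_pipelines)
for problem_id, n in counts.most_common():
    print(problem_id, n)
```

With real data, you could then pass only the records for one problem id to `PipelineProfiler.plot_pipeline_matrix`.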

Data postprocessing

You might want to group pipelines by problem and keep only the top k pipelines from each team. To do so, use the following code:

from collections import defaultdict

def get_top_k_pipelines_team(pipelines, k):
    # Group pipelines by the team (source) that produced them.
    team_pipelines = defaultdict(list)
    for pipeline in pipelines:
        source = pipeline['pipeline_source']['name']
        team_pipelines[source].append(pipeline)
    # Keep each team's k best pipelines, ranked by normalized score.
    for team in team_pipelines.keys():
        team_pipelines[team] = sorted(team_pipelines[team], key=lambda x: x['scores'][0]['normalized'], reverse=True)
        team_pipelines[team] = team_pipelines[team][:k]
    # Flatten the per-team lists back into a single list.
    new_pipelines = []
    for team in team_pipelines.keys():
        new_pipelines.extend(team_pipelines[team])
    return new_pipelines

def sort_pipeline_scores(pipelines):
    # Sort by the raw value of the first score, highest first.
    return sorted(pipelines, key=lambda x: x['scores'][0]['value'], reverse=True)

# Group the loaded pipelines by problem id.
pipelines_problem = {}
for pipeline in pipelines:
    problem_id = pipeline['problem']['id']
    if problem_id not in pipelines_problem:
        pipelines_problem[problem_id] = []
    pipelines_problem[problem_id].append(pipeline)
# For each problem, keep the top 100 pipelines per team, sorted by score.
for problem in pipelines_problem.keys():
    pipelines_problem[problem] = sort_pipeline_scores(get_top_k_pipelines_team(pipelines_problem[problem], k=100))
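
To check the selection logic without the full database dump, the following sketch applies the same idea to a handful of made-up records. It is a compact illustration rather than the code above verbatim: `top_k_per_team` and `make_record` are hypothetical helpers, the team names and scores are invented, and only the fields read above (`pipeline_source.name` and `scores[0].normalized`) are assumed to exist.

```python
from collections import defaultdict


def top_k_per_team(pipelines, k):
    """Keep the k best pipelines of each team, ranked by normalized score."""
    by_team = defaultdict(list)
    for pipeline in pipelines:
        by_team[pipeline["pipeline_source"]["name"]].append(pipeline)
    selected = []
    for team_pipelines in by_team.values():
        ranked = sorted(
            team_pipelines,
            key=lambda p: p["scores"][0]["normalized"],
            reverse=True,
        )
        selected.extend(ranked[:k])
    return selected


def make_record(team, score):
    # Hypothetical minimal record with only the fields the helper reads.
    return {"pipeline_source": {"name": team}, "scores": [{"normalized": score}]}


toy = [
    make_record("team_a", 0.70),
    make_record("team_a", 0.90),
    make_record("team_a", 0.80),
    make_record("team_b", 0.60),
]

best = top_k_per_team(toy, k=2)
summary = [(p["pipeline_source"]["name"], p["scores"][0]["normalized"]) for p in best]
print(summary)
```

With `k=2`, the lowest-scoring `team_a` record is dropped, while `team_b` keeps its single pipeline because it has fewer than k entries.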
