
Scientific research has produced many model definitions, in the form of Python classes calling the TensorFlow or PyTorch API, or of a set of XGBoost parameters in a text file. We want to reuse these models in production. To do so, we need to preprocess large training and test datasets, set the right hyperparameters, run distributed training jobs on computing clusters given the model definition and the preprocessed training dataset, and evaluate the trained model parameters using the preprocessed test dataset. Together, these steps are usually called an end-to-end machine learning pipeline.

In practice, a model definition often takes only dozens of lines of Python code, while the end-to-end steps around it usually take hundreds if not thousands of lines. SQLFlow allows users to write the end-to-end pipeline in SQL with a few syntax extensions. Such a rewrite often reduces the thousands of lines of Python code to a few lines of SQL.

The following code snippets illustrate the value of SQLFlow for developers.

Suppose we have a model definition written by a machine learning specialist as a Python class:

import torch

class MyModel(torch.nn.Module):
  def __init__(self):
    super().__init__()
    self.layers = ...  # e.g., a stack of torch.nn layers; details omitted
  def forward(self, inputs):
    return self.layers(inputs)

End-to-end applications of such model definitions include business intelligence, market analysis, search result ranking, advertising, and recommendation systems. Companies running such applications hire software engineers to write the end-to-end pipeline in Python, Java, C++, or other languages.

An oversimplified example of an end-to-end pipeline in Python might look like the following.

def preprocess_data(db, sql_preprocess_stmts):
  # Run the preprocessing SQL statements; results go into a temporary table.
  mysql.run_sql_stmts(db, sql_preprocess_stmts)

def dump_data(db, tmp_table, dataset_file):
  # Export the temporary table row by row into a local dataset file.
  tbl = db.open(tmp_table)
  for row in tbl:
    append_to_file(dataset_file, row)

def do_some_statistics(dataset_file):
  # Scan the dataset once to derive per-column statistics.
  for row in open(dataset_file):
    ... compute mean, covariance, vocabulary list for each column ...
  return statistics

def transform_data_to_features(dataset_file, statistics):
  # Convert raw rows into model inputs using the statistics.
  for row in open(dataset_file):
    ...

In SQLFlow, we can express the same end-to-end pipeline in a few lines of extended SQL.
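
The snippet below is a minimal sketch of what such a SQLFlow program could look like. The table names (my_db.train, my_db.test, my_db.predictions), the column names, the hyperparameter settings, and the model name sqlflow_models.MyModel are illustrative assumptions rather than part of the example above; TO TRAIN, WITH, COLUMN, LABEL, INTO, TO PREDICT, and USING are SQLFlow's syntax extensions.

-- Select (and, if needed, preprocess) the training data, then train the model
-- definition registered as sqlflow_models.MyModel; hyperparameters go in WITH.
SELECT c1, c2, c3, class
FROM my_db.train
TO TRAIN sqlflow_models.MyModel
WITH train.epoch = 10, train.batch_size = 256
COLUMN c1, c2, c3
LABEL class
INTO my_db.my_trained_model;

-- Apply the trained model to the test data and write predictions to a table.
SELECT c1, c2, c3
FROM my_db.test
TO PREDICT my_db.predictions.class
USING my_db.my_trained_model;

Because the data never leaves the database, the explicit dump, statistics, and feature-transform steps of the Python version disappear: the SELECT prefix carries the preprocessing, and SQLFlow takes care of turning the result set into model inputs.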