minigo updates for 0.7 (mlcommons#334)
* minigo updates for 0.7

* Delete CODE_OF_CONDUCT.md

* Delete CONTRIBUTING.md
pkanwar23 authored Feb 4, 2020
1 parent 3c1a0c6 commit b58c18e
Showing 248 changed files with 36,737 additions and 2,930 deletions.
155 changes: 147 additions & 8 deletions reinforcement/tensorflow/minigo/README.md
@@ -26,6 +36 @@
abridged in Minigo documentation as *AG* (for AlphaGo), *AGZ* (for AlphaGo
Zero), and *AZ* (for AlphaZero) respectively.


Goals of the Project
==================================================

1. Provide a clear set of learning examples using TensorFlow, Kubernetes, and
Google Cloud Platform for establishing Reinforcement Learning pipelines on
various hardware accelerators.

2. Reproduce the methods of the original DeepMind AlphaGo papers as faithfully
as possible, through an open-source implementation and open-source pipeline
tools.

3. Provide our data, results, and discoveries in the open to benefit the Go,
machine learning, and Kubernetes communities.

An explicit non-goal of the project is to produce a competitive Go program that
establishes itself as the top Go AI. Instead, we strive for a readable,
understandable implementation that can benefit the community, even if that
means our implementation is not as fast or efficient as possible.

While this project might produce such a strong model, we hope to focus on the
process. Remember, getting there is half the fun. :)

We hope this project gives interested developers accessible access to a strong
Go model, along with an easy-to-understand platform of Python code available
for extension and adaptation.

If you'd like to read about our experiences training models, see [RESULTS.md](RESULTS.md).

To see our guidelines for contributing, see [CONTRIBUTING.md](CONTRIBUTING.md).

Getting Started
===============

@@ -35,7 +65,6 @@
This project assumes you have the following:
- Python 3.5+
- [Docker](https://docs.docker.com/install/)
- [Cloud SDK](https://cloud.google.com/sdk/downloads)
- Bazel v0.11 or greater

The [Hitchhiker's guide to
python](http://docs.python-guide.org/en/latest/dev/virtualenvs/) has a good
@@ -47,6 +76,16 @@
```shell
pip3 install virtualenv
pip3 install virtualenvwrapper
```

Install Bazel
------------------

```shell
BAZEL_VERSION=0.24.1
wget https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}/bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh
chmod 755 bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh
sudo ./bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh
```

Install TensorFlow
------------------
First set up and enter your virtualenv and then the shared requirements:
@@ -57,11 +96,10 @@
```shell
pip3 install -r requirements.txt
```

Then, you'll need to choose either the GPU or the CPU TensorFlow package:

- GPU: `pip3 install "tensorflow-gpu>=1.11,<1.12"`.
  - *Note*: You must install [CUDA 9.0](https://developer.nvidia.com/cuda-90-download-archive) for TensorFlow 1.5+.
- CPU: `pip3 install "tensorflow>=1.11,<1.12"`.
- GPU: `pip3 install "tensorflow-gpu==1.15.0"`.
  - *Note*: You must install [CUDA 10.0](https://developer.nvidia.com/cuda-10.0-download-archive) for TensorFlow 1.13.0+.
- CPU: `pip3 install "tensorflow==1.15.0"`.
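If you're not sure which package applies to your machine, a small shell check can pick the name for you. This is a sketch of ours, not part of the Minigo scripts; the `nvidia-smi` heuristic assumes the NVIDIA driver is installed exactly when a usable GPU is present:

```shell
# Pick the TensorFlow package to install based on whether an NVIDIA driver
# is visible. nvidia-smi ships with the driver, so its absence is a
# reasonable proxy for "no usable GPU".
if command -v nvidia-smi >/dev/null 2>&1; then
  TF_PACKAGE="tensorflow-gpu==1.15.0"
else
  TF_PACKAGE="tensorflow==1.15.0"
fi
echo "${TF_PACKAGE}"
# then: pip3 install "${TF_PACKAGE}"
```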

Setting up the Environment
--------------------------
@@ -237,10 +275,10 @@
it gets picked up by selfplay workers.

Configuration for things like "where do debug SGFs get written", "where does
training data get written", and "where do the latest models get published" is
managed by the helper scripts in the rl\_loop directory. Those helper scripts
execute the same commands as demonstrated below. Configuration for things like
"what size network is being used?" or "how many readouts during selfplay" can
be passed in as flags. The mask\_flags.py utility helps ensure all parts of the
pipeline are using the same network configuration.

All local paths in the examples can be replaced with `gs://` GCS paths, and the
@@ -352,6 +390,107 @@
The validate.py script will glob all the .tfrecord.zz files under the
directories given as positional arguments and compute the validation error
for the positions from those files.
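As a quick sanity check before running it, you can count the files that would be globbed. The directory name below is an assumption about your layout, not something the pipeline guarantees:

```shell
# HOLDOUT_DIR is a hypothetical location; point it at whichever directories
# hold your selfplay holdout data. validate.py globs *.tfrecord.zz under
# each directory passed as a positional argument.
HOLDOUT_DIR=outputs/data/holdout
find "${HOLDOUT_DIR}" -name '*.tfrecord.zz' | wc -l
# python3 validate.py "${HOLDOUT_DIR}"
```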


Retraining a model
======================

The training data for most of Minigo's models up to v13 is publicly available in
the `minigo-pub` Cloud storage bucket, e.g.:

```shell
gsutil ls gs://minigo-pub/v13-19x19/data/golden_chunks/
```

For models v14 and onwards, we started using Cloud BigTable and are still
working on making that data public.
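If you'd rather work from a local copy, one way to mirror a version's golden chunks is sketched below. The destination directory is our own choice, the copy is large, and the snippet is guarded so it's a no-op on machines without `gsutil`:

```shell
# Mirror the v13 golden chunks locally; `gsutil -m` parallelizes the copy.
DEST=golden_chunks/v13
mkdir -p "${DEST}"
if command -v gsutil >/dev/null 2>&1; then
  gsutil -m cp "gs://minigo-pub/v13-19x19/data/golden_chunks/*" "${DEST}/"
fi
```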

Here's how to retrain your own model from this source data using a Cloud TPU:

```shell
# I wrote these notes using our existing TPU-enabled project, so they're missing
# a few preliminary steps, like setting up a Cloud account, creating a project,
# etc. New users will also need to enable Cloud TPU on their project using the
# TPUs panel.

###############################################################################

# Note that you will be billed for any storage you use and also while you have
# VMs running. Remember to shut down your VMs when you're not using them!

# To use a Cloud TPU on GCE, you need to create a special TPU-enabled VM using
# the `ctpu` tool. First, set up some environment variables:
# GCE_PROJECT=<your project name>
# GCE_VM_NAME=<your VM's name>
# GCE_ZONE=<the zone in which you want to bring up your VM, e.g. us-central1-f>

# In this example, we will use the following values:
GCE_PROJECT=example-project
GCE_VM_NAME=minigo-etpu-test
GCE_ZONE=us-central1-f

# Create the Cloud TPU enabled VM.
ctpu up \
--project="${GCE_PROJECT}" \
--zone="${GCE_ZONE}" \
--name="${GCE_VM_NAME}" \
--tf-version=1.13

# This will take a few minutes and you should see output similar to the
# following:
# ctpu will use the following configuration values:
# Name: minigo-etpu-test
# Zone: us-central1-f
# GCP Project: example-project
# TensorFlow Version: 1.13
# OK to create your Cloud TPU resources with the above configuration? [Yn]: y
# 2019/04/09 10:50:04 Creating GCE VM minigo-etpu-test (this may take a minute)...
# 2019/04/09 10:50:04 Creating TPU minigo-etpu-test (this may take a few minutes)...
# 2019/04/09 10:50:11 GCE operation still running...
# 2019/04/09 10:50:12 TPU operation still running...

# Once the Cloud TPU is created, `ctpu` will have SSHed you into the machine.

# Remember to set the same environment variables on your VM.
GCE_PROJECT=example-project
GCE_VM_NAME=minigo-etpu-test
GCE_ZONE=us-central1-f

# Clone the Minigo Github repository:
git clone https://github.com/tensorflow/minigo
cd minigo

# Install virtualenv.
pip3 install virtualenv virtualenvwrapper

# Create a virtual environment
virtualenv -p /usr/bin/python3 --system-site-packages "${HOME}/.venvs/minigo"

# Activate the virtual environment.
source "${HOME}/.venvs/minigo/bin/activate"

# Install Minigo dependencies (TensorFlow for Cloud TPU is already installed as
# part of the VM image).
pip install -r requirements.txt

# When training on a Cloud TPU, the training work directory must be on Google Cloud Storage.
# You'll need to choose your own globally unique bucket name.
# The bucket location should be close to your VM.
GCS_BUCKET_NAME=minigo_test_bucket
GCE_BUCKET_LOCATION=us-central1
gsutil mb -p "${GCE_PROJECT}" -l "${GCE_BUCKET_LOCATION}" "gs://${GCS_BUCKET_NAME}"

# Run the training script and note the location of the training work_dir
# it reports, e.g.
# Writing to gs://minigo_test_bucket/train/2019-04-25-18
./oneoffs/train.sh "${GCS_BUCKET_NAME}"

# Launch tensorboard, pointing it at the work_dir reported by the train.sh script.
tensorboard --logdir=gs://minigo_test_bucket/train/2019-04-25-18

# After a few minutes, TensorBoard should start updating.
# Interesting graphs to look at are value_cost_normalized, policy_cost and policy_entropy.
```
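The walkthrough above leaves a billable VM and TPU running. A cleanup sketch follows; the `ctpu pause` vs. `ctpu delete` distinction is standard `ctpu` behavior, and the guard just makes the snippet a no-op on machines without `ctpu`:

```shell
# Stop billing when you're done: `ctpu pause` stops the VM and TPU but keeps
# the disk; `ctpu delete` removes them entirely. Run this from your
# workstation, not from inside the VM.
GCE_PROJECT=example-project
GCE_VM_NAME=minigo-etpu-test
GCE_ZONE=us-central1-f

if command -v ctpu >/dev/null 2>&1; then
  ctpu pause --project="${GCE_PROJECT}" --zone="${GCE_ZONE}" --name="${GCE_VM_NAME}"
  # ctpu delete --project="${GCE_PROJECT}" --zone="${GCE_ZONE}" --name="${GCE_VM_NAME}"
fi
```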

Running Minigo on a Kubernetes Cluster
==============================

78 changes: 32 additions & 46 deletions reinforcement/tensorflow/minigo/WORKSPACE
@@ -1,63 +1,49 @@
http_archive(
    name = "com_google_protobuf",
    strip_prefix = "protobuf-3.6.1",
    url = "https://github.com/google/protobuf/archive/v3.6.1.tar.gz",
)
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive", "http_file")

# These must be kept up to date with the rules from tensorflow/WORKSPACE.
# vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
http_archive(
    name = "com_github_gflags_gflags",
    strip_prefix = "gflags-e292e0452fcfd5a8ae055b59052fc041cbab4abf",
    urls = ["https://github.com/gflags/gflags/archive/e292e0452fcfd5a8ae055b59052fc041cbab4abf.zip"],
)

http_archive(
    name = "com_google_absl",
    strip_prefix = "abseil-cpp-666fc1266bccfd8e6eaaa084e7b42580bb8eb199",
    urls = [
        "http://mirror.tensorflow.org/github.com/abseil/abseil-cpp/archive/666fc1266bccfd8e6eaaa084e7b42580bb8eb199.tar.gz",
        "https://github.com/abseil/abseil-cpp/archive/666fc1266bccfd8e6eaaa084e7b42580bb8eb199.tar.gz",
    ],
)

http_archive(
    name = "io_bazel_rules_closure",
    sha256 = "5b00383d08dd71f28503736db0500b6fb4dda47489ff5fc6bed42557c07c6ba9",
    strip_prefix = "rules_closure-308b05b2419edb5c8ee0471b67a40403df940149",
    urls = [
        "https://storage.googleapis.com/mirror.tensorflow.org/github.com/bazelbuild/rules_closure/archive/308b05b2419edb5c8ee0471b67a40403df940149.tar.gz",
        "https://github.com/bazelbuild/rules_closure/archive/308b05b2419edb5c8ee0471b67a40403df940149.tar.gz",  # 2019-06-13
    ],
)

http_archive(
    name = "com_github_googlecloudplatform_google_cloud_cpp",
    strip_prefix = "google-cloud-cpp-0.4.0",
    url = "https://github.com/GoogleCloudPlatform/google-cloud-cpp/archive/v0.4.0.zip",
)

http_archive(
    name = "bazel_skylib",
    sha256 = "2ef429f5d7ce7111263289644d233707dba35e39696377ebab8b0bc701f7818e",
    urls = ["https://github.com/bazelbuild/bazel-skylib/releases/download/0.8.0/bazel-skylib.0.8.0.tar.gz"],
)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# These must be kept up to date with the rules from tensorflow/WORKSPACE.

new_http_archive(
    name = "com_google_benchmark",
    build_file = "cc/benchmark.BUILD",
    strip_prefix = "benchmark-1.3.0",
    urls = ["https://github.com/google/benchmark/archive/v1.3.0.zip"],
)

# This should also be kept up to date with the version used by Tensorflow.
http_file(
    name = "com_github_nlohmann_json_single_header",
    sha256 = "63da6d1f22b2a7bb9e4ff7d6b255cf691a161ff49532dcc45d398a53e295835f",
    urls = [
        "https://github.com/nlohmann/json/releases/download/v3.4.0/json.hpp",
    ],
)

new_http_archive(
    name = "com_github_nlohmann_json",
    build_file = "cc/json.BUILD",
    strip_prefix = "json-3.2.0",
    urls = ["https://github.com/nlohmann/json/archive/v3.2.0.zip"],
)

http_archive(
    name = "org_tensorflow",
    sha256 = "76abfd5045d1474500754566edd54ce4c386a1fbccf22a3a91d6832c6b7e90ad",
    strip_prefix = "tensorflow-1.15.0",
    urls = ["https://github.com/tensorflow/tensorflow/archive/v1.15.0.zip"],
)

http_archive(
    name = "com_google_googletest",
    strip_prefix = "googletest-master",
    urls = ["https://github.com/google/googletest/archive/master.zip"],
)

http_archive(
    name = "wtf",
    build_file = "//cc:wtf.BUILD",
    sha256 = "1837833cd159060f8bd6f6dd87edf854ed3135d07a6937b7e14b0efe70580d74",
    strip_prefix = "tracing-framework-fb639271fa3d56ed1372a792d74d257d4e0c235c",
    urls = ["https://github.com/google/tracing-framework/archive/fb639271fa3d56ed1372a792d74d257d4e0c235c.zip"],
)

load("@com_github_googlecloudplatform_google_cloud_cpp//bazel:google_cloud_cpp_deps.bzl", "google_cloud_cpp_deps")

google_cloud_cpp_deps()

# Have to manually call the corresponding function for gRPC:
# https://github.com/bazelbuild/bazel/issues/1550
load("@com_github_grpc_grpc//bazel:grpc_deps.bzl", "grpc_deps")

grpc_deps()

load("//cc:cuda_configure.bzl", "cuda_configure")
load("//cc:tensorrt_configure.bzl", "tensorrt_configure")

cuda_configure(name = "local_config_cuda")

tensorrt_configure(name = "local_config_tensorrt")

load("@org_tensorflow//tensorflow:workspace.bzl", "tf_workspace")

tf_workspace()