minigo updates for 0.7 (mlcommons#334)
* minigo updates for 0.7

* Delete CODE_OF_CONDUCT.md

* Delete CONTRIBUTING.md
pkanwar23 authored Feb 4, 2020
1 parent 3c1a0c6 commit b58c18e
Showing 248 changed files with 36,737 additions and 2,930 deletions.
155 changes: 147 additions & 8 deletions reinforcement/tensorflow/minigo/README.md
@@ -26,6 +36 @@
abridged in Minigo documentation as *AG* (for AlphaGo), *AGZ* (for AlphaGo
Zero), and *AZ* (for AlphaZero) respectively.


Goals of the Project
==================================================

1. Provide a clear set of learning examples using TensorFlow, Kubernetes, and
Google Cloud Platform for establishing Reinforcement Learning pipelines on
various hardware accelerators.

2. Reproduce the methods of the original DeepMind AlphaGo papers as faithfully
as possible, through an open-source implementation and open-source pipeline
tools.

3. Provide our data, results, and discoveries in the open to benefit the Go,
machine learning, and Kubernetes communities.

An explicit non-goal of the project is to produce a competitive Go program that
establishes itself as the top Go AI. Instead, we strive for a readable,
understandable implementation that can benefit the community, even if that
means our implementation is not as fast or efficient as possible.

While this project might produce such a strong model, we hope to focus on the
process. Remember, getting there is half the fun. :)

We hope this project gives interested developers accessible access to a strong
Go model, along with an easy-to-understand platform of Python code available
for extension and adaptation.

If you'd like to read about our experiences training models, see [RESULTS.md](RESULTS.md).

To see our guidelines for contributing, see [CONTRIBUTING.md](CONTRIBUTING.md).

Getting Started
===============

@@ -35,7 +65,6 @@
This project assumes you have the following:
- Python 3.5+
- [Docker](https://docs.docker.com/install/)
- [Cloud SDK](https://cloud.google.com/sdk/downloads)
- Bazel v0.11 or greater

The [Hitchhiker's guide to
python](http://docs.python-guide.org/en/latest/dev/virtualenvs/) has a good
@@ -47,6 +76,16 @@
```shell
pip3 install virtualenv
pip3 install virtualenvwrapper
```

Install Bazel
------------------

```shell
BAZEL_VERSION=0.24.1
wget https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}/bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh
chmod 755 bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh
sudo ./bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh
```

Install TensorFlow
------------------
First set up and enter your virtualenv and then the shared requirements:
@@ -57,11 +96,10 @@
```shell
pip3 install -r requirements.txt
```

Then, you'll need to choose either the GPU or the CPU TensorFlow package:

- GPU: `pip3 install "tensorflow-gpu>=1.11,<1.12"`.
  - *Note*: You must install [CUDA 9.0](https://developer.nvidia.com/cuda-90-download-archive) for TensorFlow 1.5+.
- CPU: `pip3 install "tensorflow>=1.11,<1.12"`.
- GPU: `pip3 install "tensorflow-gpu==1.15.0"`.
  - *Note*: You must install [CUDA 10.0](https://developer.nvidia.com/cuda-10.0-download-archive) for TensorFlow 1.13.0+.
- CPU: `pip3 install "tensorflow==1.15.0"`.
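If you're not sure which package applies to your machine, a small shell check can pick the name for you. This is a sketch of ours, not part of the Minigo scripts; the `nvidia-smi` heuristic assumes the NVIDIA driver is installed exactly when a usable GPU is present:

```shell
# Pick the TensorFlow package to install based on whether an NVIDIA driver
# is visible. nvidia-smi ships with the driver, so its absence is a
# reasonable proxy for "no usable GPU".
if command -v nvidia-smi >/dev/null 2>&1; then
  TF_PACKAGE="tensorflow-gpu==1.15.0"
else
  TF_PACKAGE="tensorflow==1.15.0"
fi
echo "${TF_PACKAGE}"
# then: pip3 install "${TF_PACKAGE}"
```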

Setting up the Environment
--------------------------
@@ -237,10 +275,10 @@
it gets picked up by selfplay workers.

Configuration for things like "where do debug SGFs get written", "where does
training data get written", and "where do the latest models get published" is
managed by the helper scripts in the rl\_loop directory. Those helper scripts
execute the same commands as demonstrated below. Configuration for things like
"what size network is being used?" or "how many readouts during selfplay" can
be passed in as flags. The mask\_flags.py utility helps ensure all parts of the
pipeline are using the same network configuration.

All local paths in the examples can be replaced with `gs://` GCS paths, and the
@@ -352,6 +390,107 @@
The validate.py script will glob all the .tfrecord.zz files under the
directories given as positional arguments and compute the validation error
for the positions from those files.
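As a quick sanity check before running it, you can count the files that would be globbed. The directory name below is an assumption about your layout, not something the pipeline guarantees:

```shell
# HOLDOUT_DIR is a hypothetical location; point it at whichever directories
# hold your selfplay holdout data. validate.py globs *.tfrecord.zz under
# each directory passed as a positional argument.
HOLDOUT_DIR=outputs/data/holdout
find "${HOLDOUT_DIR}" -name '*.tfrecord.zz' | wc -l
# python3 validate.py "${HOLDOUT_DIR}"
```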


Retraining a model
======================

The training data for most of Minigo's models up to v13 is publicly available in
the `minigo-pub` Cloud storage bucket, e.g.:

```shell
gsutil ls gs://minigo-pub/v13-19x19/data/golden_chunks/
```

For models v14 and onwards, we started using Cloud BigTable and are still
working on making that data public.
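If you'd rather work from a local copy, one way to mirror a version's golden chunks is sketched below. The destination directory is our own choice, the copy is large, and the snippet is guarded so it's a no-op on machines without `gsutil`:

```shell
# Mirror the v13 golden chunks locally; `gsutil -m` parallelizes the copy.
DEST=golden_chunks/v13
mkdir -p "${DEST}"
if command -v gsutil >/dev/null 2>&1; then
  gsutil -m cp "gs://minigo-pub/v13-19x19/data/golden_chunks/*" "${DEST}/"
fi
```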

Here's how to retrain your own model from this source data using a Cloud TPU:

```shell
# I wrote these notes using our existing TPU-enabled project, so they're missing
# a few preliminary steps, like setting up a Cloud account, creating a project,
# etc. New users will also need to enable Cloud TPU on their project using the
# TPUs panel.

###############################################################################

# Note that you will be billed for any storage you use and also while you have
# VMs running. Remember to shut down your VMs when you're not using them!

# To use a Cloud TPU on GCE, you need to create a special TPU-enabled VM using
# the `ctpu` tool. First, set up some environment variables:
# GCE_PROJECT=<your project name>
# GCE_VM_NAME=<your VM's name>
# GCE_ZONE=<the zone in which you want to bring up your VM, e.g. us-central1-f>

# In this example, we will use the following values:
GCE_PROJECT=example-project
GCE_VM_NAME=minigo-etpu-test
GCE_ZONE=us-central1-f

# Create the Cloud TPU enabled VM.
ctpu up \
--project="${GCE_PROJECT}" \
--zone="${GCE_ZONE}" \
--name="${GCE_VM_NAME}" \
--tf-version=1.13

# This will take a few minutes and you should see output similar to the
# following:
# ctpu will use the following configuration values:
# Name: minigo-etpu-test
# Zone: us-central1-f
# GCP Project: example-project
# TensorFlow Version: 1.13
# OK to create your Cloud TPU resources with the above configuration? [Yn]: y
# 2019/04/09 10:50:04 Creating GCE VM minigo-etpu-test (this may take a minute)...
# 2019/04/09 10:50:04 Creating TPU minigo-etpu-test (this may take a few minutes)...
# 2019/04/09 10:50:11 GCE operation still running...
# 2019/04/09 10:50:12 TPU operation still running...

# Once the Cloud TPU is created, `ctpu` will have SSHed you into the machine.

# Remember to set the same environment variables on your VM.
GCE_PROJECT=example-project
GCE_VM_NAME=minigo-etpu-test
GCE_ZONE=us-central1-f

# Clone the Minigo Github repository:
git clone https://github.com/tensorflow/minigo
cd minigo

# Install virtualenv.
pip3 install virtualenv virtualenvwrapper

# Create a virtual environment
virtualenv -p /usr/bin/python3 --system-site-packages "${HOME}/.venvs/minigo"

# Activate the virtual environment.
source "${HOME}/.venvs/minigo/bin/activate"

# Install Minigo dependencies (TensorFlow for Cloud TPU is already installed as
# part of the VM image).
pip install -r requirements.txt

# When training on a Cloud TPU, the training work directory must be on Google Cloud Storage.
# You'll need to choose your own globally unique bucket name.
# The bucket location should be close to your VM.
GCS_BUCKET_NAME=minigo_test_bucket
GCE_BUCKET_LOCATION=us-central1
gsutil mb -p "${GCE_PROJECT}" -l "${GCE_BUCKET_LOCATION}" "gs://${GCS_BUCKET_NAME}"

# Run the training script and note the location of the training work_dir
# it reports, e.g.
# Writing to gs://minigo_test_bucket/train/2019-04-25-18
./oneoffs/train.sh "${GCS_BUCKET_NAME}"

# Launch tensorboard, pointing it at the work_dir reported by the train.sh script.
tensorboard --logdir=gs://minigo_test_bucket/train/2019-04-25-18

# After a few minutes, TensorBoard should start updating.
# Interesting graphs to look at are value_cost_normalized, policy_cost and policy_entropy.
```
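The walkthrough above leaves a billable VM and TPU running. A cleanup sketch follows; the `ctpu pause` vs. `ctpu delete` distinction is standard `ctpu` behavior, and the guard just makes the snippet a no-op on machines without `ctpu`:

```shell
# Stop billing when you're done: `ctpu pause` stops the VM and TPU but keeps
# the disk; `ctpu delete` removes them entirely. Run this from your
# workstation, not from inside the VM.
GCE_PROJECT=example-project
GCE_VM_NAME=minigo-etpu-test
GCE_ZONE=us-central1-f

if command -v ctpu >/dev/null 2>&1; then
  ctpu pause --project="${GCE_PROJECT}" --zone="${GCE_ZONE}" --name="${GCE_VM_NAME}"
  # ctpu delete --project="${GCE_PROJECT}" --zone="${GCE_ZONE}" --name="${GCE_VM_NAME}"
fi
```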

Running Minigo on a Kubernetes Cluster
==============================

78 changes: 32 additions & 46 deletions reinforcement/tensorflow/minigo/WORKSPACE
@@ -1,63 +1,49 @@
http_archive(
    name = "com_google_protobuf",
    strip_prefix = "protobuf-3.6.1",
    url = "https://github.com/google/protobuf/archive/v3.6.1.tar.gz",
)
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive", "http_file")

# These must be kept up to date with the rules from tensorflow/WORKSPACE.
# vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
http_archive(
    name = "com_github_gflags_gflags",
    strip_prefix = "gflags-e292e0452fcfd5a8ae055b59052fc041cbab4abf",
    urls = ["https://github.com/gflags/gflags/archive/e292e0452fcfd5a8ae055b59052fc041cbab4abf.zip"],
)

http_archive(
    name = "com_google_absl",
    strip_prefix = "abseil-cpp-666fc1266bccfd8e6eaaa084e7b42580bb8eb199",
    urls = [
        "http://mirror.tensorflow.org/github.com/abseil/abseil-cpp/archive/666fc1266bccfd8e6eaaa084e7b42580bb8eb199.tar.gz",
        "https://github.com/abseil/abseil-cpp/archive/666fc1266bccfd8e6eaaa084e7b42580bb8eb199.tar.gz",
    ],
)

http_archive(
    name = "io_bazel_rules_closure",
    sha256 = "5b00383d08dd71f28503736db0500b6fb4dda47489ff5fc6bed42557c07c6ba9",
    strip_prefix = "rules_closure-308b05b2419edb5c8ee0471b67a40403df940149",
    urls = [
        "https://storage.googleapis.com/mirror.tensorflow.org/github.com/bazelbuild/rules_closure/archive/308b05b2419edb5c8ee0471b67a40403df940149.tar.gz",
        "https://github.com/bazelbuild/rules_closure/archive/308b05b2419edb5c8ee0471b67a40403df940149.tar.gz",  # 2019-06-13
    ],
)

http_archive(
    name = "com_github_googlecloudplatform_google_cloud_cpp",
    strip_prefix = "google-cloud-cpp-0.4.0",
    url = "https://github.com/GoogleCloudPlatform/google-cloud-cpp/archive/v0.4.0.zip",
)

http_archive(
    name = "bazel_skylib",
    sha256 = "2ef429f5d7ce7111263289644d233707dba35e39696377ebab8b0bc701f7818e",
    urls = ["https://github.com/bazelbuild/bazel-skylib/releases/download/0.8.0/bazel-skylib.0.8.0.tar.gz"],
)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# These must be kept up to date with the rules from tensorflow/WORKSPACE.

new_http_archive(
    name = "com_google_benchmark",
    build_file = "cc/benchmark.BUILD",
    strip_prefix = "benchmark-1.3.0",
    urls = ["https://github.com/google/benchmark/archive/v1.3.0.zip"],
)

# This should also be kept up to date with the version used by Tensorflow.
http_file(
    name = "com_github_nlohmann_json_single_header",
    sha256 = "63da6d1f22b2a7bb9e4ff7d6b255cf691a161ff49532dcc45d398a53e295835f",
    urls = [
        "https://github.com/nlohmann/json/releases/download/v3.4.0/json.hpp",
    ],
)

new_http_archive(
    name = "com_github_nlohmann_json",
    build_file = "cc/json.BUILD",
    strip_prefix = "json-3.2.0",
    urls = ["https://github.com/nlohmann/json/archive/v3.2.0.zip"],
)

http_archive(
    name = "org_tensorflow",
    sha256 = "76abfd5045d1474500754566edd54ce4c386a1fbccf22a3a91d6832c6b7e90ad",
    strip_prefix = "tensorflow-1.15.0",
    urls = ["https://github.com/tensorflow/tensorflow/archive/v1.15.0.zip"],
)

http_archive(
    name = "com_google_googletest",
    strip_prefix = "googletest-master",
    urls = ["https://github.com/google/googletest/archive/master.zip"],
)

http_archive(
    name = "wtf",
    build_file = "//cc:wtf.BUILD",
    sha256 = "1837833cd159060f8bd6f6dd87edf854ed3135d07a6937b7e14b0efe70580d74",
    strip_prefix = "tracing-framework-fb639271fa3d56ed1372a792d74d257d4e0c235c",
    urls = ["https://github.com/google/tracing-framework/archive/fb639271fa3d56ed1372a792d74d257d4e0c235c.zip"],
)

load("@com_github_googlecloudplatform_google_cloud_cpp//bazel:google_cloud_cpp_deps.bzl", "google_cloud_cpp_deps")

google_cloud_cpp_deps()

# Have to manually call the corresponding function for gRPC:
# https://github.com/bazelbuild/bazel/issues/1550
load("@com_github_grpc_grpc//bazel:grpc_deps.bzl", "grpc_deps")

grpc_deps()

load("//cc:cuda_configure.bzl", "cuda_configure")
load("//cc:tensorrt_configure.bzl", "tensorrt_configure")

cuda_configure(name = "local_config_cuda")

tensorrt_configure(name = "local_config_tensorrt")

load("@org_tensorflow//tensorflow:workspace.bzl", "tf_workspace")

tf_workspace()