# add basic foundry examples #136

Status: Open — wants to merge 2 commits into base: main
4 changes: 4 additions & 0 deletions docs/tutorials/index.md
@@ -9,6 +9,10 @@
If you have any questions or issues, please [let us know](https://github.com/flu

The following tutorials are provided from their respective directories (and are not documented here):

### Machine Learning

- [Foundry ML](https://github.com/flux-framework/flux-operator/tree/main/examples/machine-learning/foundry)

### Simulations

- [Laghos](https://github.com/flux-framework/flux-operator/tree/main/examples/simulations/laghos)
143 changes: 143 additions & 0 deletions examples/machine-learning/foundry/README.md
@@ -0,0 +1,143 @@
# Foundry

This tutorial shows how to use [Foundry](https://github.com/MLMI2-CSSI/foundry) to download datasets and run a few of its examples.

## Credentials

You'll need to generate a credential file that we will provide to the job. This needs
to be done locally (a Python virtual environment is recommended):

```bash
$ python -m venv env
$ source env/bin/activate
$ pip install foundry_ml
```

Next you'll want to log in with Globus. Yes, this requires an account! Logging in
via your institution should work. Running this command should open a web interface
to authenticate:

```bash
$ python -c "from foundry import Foundry; f = Foundry()"
```
This will generate a credential file in your home directory. Let's copy it
here so we can provide it to the MiniCluster (do NOT add it to git!):

```bash
$ cp ~/.globus-native-apps.cfg .
```

## Kind Cluster

We will want to bind the present working directory (with the examples) to our MiniCluster,
and that is easy to do with kind. Create a cluster with the included kind config,
and make sure you run this command from this directory:

```bash
$ kind create cluster --config kind-config.yaml
```
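The kind config itself isn't rendered in this diff, but as a rough sketch (the paths and the `/data` mount point here are assumptions, not taken from this PR), a config that bind-mounts the working directory into the node looks something like:

```yaml
# sketch of a kind config that bind-mounts the current directory into the node
# (kind requires hostPath to be absolute; /data is an assumed containerPath)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /path/to/flux-operator/examples/machine-learning/foundry
        containerPath: /data
```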

## Create MiniCluster

Since we have several examples, let's create an interactive cluster so we can run them (and watch them run) with `flux submit`.
If you were doing this at scale, you would likely choose one workflow and run it headlessly by removing `interactive: true`
from the [minicluster.yaml](minicluster.yaml) and providing a command. Let's create the namespace and install
the operator:

```bash
$ kubectl create namespace flux-operator
$ kubectl apply -f ../../../examples/dist/flux-operator-dev.yaml
```

And then create the interactive cluster:

```bash
$ kubectl apply -f minicluster.yaml
```

Watch the pods being created:

```bash
$ kubectl get -n flux-operator pods
```

When the broker (index 0) is running, shell in!

```bash
$ kubectl exec -it -n flux-operator flux-sample-0-fzml6 bash
```

You'll want to connect to the broker.

```bash
$ sudo -E $(env) -E HOME=/home/fluxuser -u fluxuser flux proxy local:///run/flux/local bash
```

### Globus Credentials

Export your Globus credentials (I'm not convinced this is necessary, but the testing example does it, so why not):

```bash
$ export GLOBUS_CONFIG=$(cat .globus-native-apps.cfg)
```

### Run Examples

Now let's cd into the examples directory and run a few! We will run these directly on the node, but they could
also be run with `flux submit --watch` and some number of nodes (`-N`) or tasks (`-n`):

```bash
$ cd ./examples
$ ls
```
```console
atom-position-finding bandgap dendrite-segmentation g4mp2-solvation oqmd publishing-guides qmc_ml zeolite
```
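If you prefer the `flux submit` route mentioned above, a sketch might look like this (the script path and resource counts are assumptions, not from this PR):

```bash
# submit one demo as a Flux job (1 node, 1 task) and stream its output
flux submit --watch -N 1 -n 1 python ./atom-position-finding/atom_position_finding.py

# or submit without --watch and check on jobs later
flux jobs -a
```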

#### Atom Position Finding

These commands are run from inside the container:

```bash
$ cd ./atom-position-finding
$ python atom_position_finding.py
```

![atom position finding result](./examples/atom-position-finding/result.png)


#### Bandgap

Note that downloading the data on this one froze my computer the first time, so be careful!

```bash
$ cd ./bandgap
$ python bandgap_demo.py
```

![bandgap result](./examples/bandgap/result.png)

#### QMC ML

Note that downloading the data on this one froze my computer the first time, so be careful!

```bash
$ cd ./qmc_ml
$ python qmc_ml.py
```

![qmc_ml result](./examples/qmc_ml/result.png)


And finally, clean up:

```bash
$ kubectl delete -f minicluster.yaml
```

It's not clear yet how these machine learning runs can best integrate with Flux, beyond submitting a job
to Flux. We will need to think about this. One design, however, that I think could work really nicely here is:

1. Use Foundry for storing data, and download a dataset via the broker pre command.
2. Use `flux filemap` in the batch script (with `batch: true` and `batchRaw: true`) to map the data to nodes.
3. Run some job that uses the data across the nodes (e.g., MPI or similar).
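As a rough sketch of that design (the command flags, `./data` path, and `train.py` script are assumptions; check `flux filemap --help` in your Flux version):

```bash
# 1. the broker pre command has already downloaded the dataset to ./data
# 2. stage the files from rank 0 and fetch them on every other rank
flux filemap map ./data
flux exec -r all -x 0 flux filemap get
# 3. run a job that reads the staged data across the nodes
flux run -N 4 python train.py
```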
@@ -0,0 +1,48 @@
#!/usr/bin/env python
# coding: utf-8

# # Installing Foundry
# First we'll need to install Foundry. We'll also be installing [Matplotlib](https://matplotlib.org/) for our visualizations. If you're using Google Colab, this code block will install this package into the Colab environment.
#
#
# If you are running locally, it will install this module onto your machine if you do not already have it. We also have a [requirements file](https://github.com/MLMI2-CSSI/foundry/tree/main/examples/atom-position-finding) included with this notebook. You can run `pip install -r requirements.txt` in your terminal to set up your environment locally.


# # Importing Packages
# Now we can import Foundry and Matplotlib so we can import the data and visualize it.



from foundry import Foundry
import matplotlib.pyplot as plt

# # Instantiating Foundry
# To instantiate Foundry, you'll need a [Globus](https://www.globus.org) account. Once you have your account, you can instantiate Foundry using the code below. When you instantiate Foundry locally, be sure to have your Globus endpoint turned on (you can do that with [Globus Connect Personal](https://www.globus.org/globus-connect-personal)). When you instantiate Foundry on Google Colab, you'll be given a link in the cell's output and asked to enter the provided auth code.

f = Foundry(index="mdf", no_local_server=True, no_browser=True)

dataset_doi = '10.18126/e73h-3w6n'

# download the data
f.load(dataset_doi, download=True, globus=False)

# load the HDF5 image data into a local object
res = f.load_data()

# using the 'train' split, 'input' or 'target' type, and Foundry keys specified by the dataset publisher,
# we can grab the atom images, metadata, and coordinates we want
imgs = res['train']['input']['imgs']
desc = res['train']['input']['metadata']
coords = res['train']['target']['coords']

n_images = 3
offset = 150
key_list = list(res['train']['input']['imgs'].keys())[offset:offset + n_images]

fig, axs = plt.subplots(1, n_images, figsize=(20, 20))
for i in range(n_images):
    axs[i].imshow(imgs[key_list[i]])
    axs[i].scatter(coords[key_list[i]][:, 0], coords[key_list[i]][:, 1], s=20, c='r', alpha=0.5)

fig.savefig("result.png")
168 changes: 168 additions & 0 deletions examples/machine-learning/foundry/examples/bandgap/bandgap_demo.py
@@ -0,0 +1,168 @@
#!/usr/bin/env python
# coding: utf-8

# <img src="https://raw.githubusercontent.com/MLMI2-CSSI/foundry/main/assets/foundry-black.png" width=450>

# # Foundry Bandgap Data Quickstart for Beginners

# [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MLMI2-CSSI/foundry/blob/main/examples/bandgap/bandgap_demo.ipynb)

# This introduction uses Foundry to:
#
#
# 1. Instantiate and authenticate a Foundry client locally or in the cloud
# 2. Aggregate data from the collected datasets
# 3. Build a simple predictive model

# This notebook is set up to run as a [Google Colaboratory](https://colab.research.google.com/notebooks/intro.ipynb#scrollTo=5fCEDCU_qrC0) notebook, which allows you to run python code in the browser, or as a [Jupyter](https://jupyter.org/) notebook, which runs locally on your machine.
#
# The code in the next cell will detect your environment to make sure that only cells that match your environment will run.
#

# # Environment Set Up
# First we'll need to install Foundry as well as a few other packages. If you're using Google Colab, this code block will install these packages into the Colab environment.
# If you are running locally, it will install these modules onto your machine if you do not already have them. We also have a [requirements file](https://github.com/MLMI2-CSSI/foundry/tree/main/examples/bandgap) included with this notebook. You can run `pip install -r requirements.txt` in your terminal to set up your environment locally.


# We need to import a few packages. We'll be using [Matplotlib](https://matplotlib.org/) to make visualizations of our data, [scikit-learn](https://scikit-learn.org/stable/) to create our model, and [pandas](https://pandas.pydata.org/) and [NumPy ](https://numpy.org/)to work with our data.

from matplotlib.colors import LogNorm
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import warnings
import glob
from matminer.featurizers.conversions import StrToComposition
from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers import composition as cf
from sklearn.model_selection import cross_val_predict, GridSearchCV, ShuffleSplit, KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics


warnings.filterwarnings('ignore')

# # Instantiate and Authenticate Foundry
# Once the installations are complete, we can import Foundry.

from foundry import Foundry


# We'll also need to instantiate it. To do so, you'll need a [Globus](https://www.globus.org) account. Once you have your account, you can instantiate Foundry using the code below. When you instantiate Foundry locally, be sure to have your Globus endpoint turned on (you can do that with [Globus Connect Personal](https://www.globus.org/globus-connect-personal)). When you instantiate Foundry on Google Colab, you'll be given a link in the cell's output and asked to enter the provided auth code.

f = Foundry(no_local_server=True, no_browser=True, index="mdf")


# # Loading the Band Gap Data
# Now that we've installed and imported everything we'll need, it's time to load the data. We'll be loading 2 datasets from Foundry using `f.load` to load the data and then `f.load_data` to load the data into the client. Then we'll concatenate them using pandas.
globus = False

f.load("foundry_mp_band_gaps_v1.1", globus=globus)
res = f.load_data()
X_mp, y_mp = res['train'][0], res['train'][1]


f.load("foundry_assorted_computational_band_gaps_v1.1", globus=globus)
res = f.load_data()
X_assorted, y_assorted = res['train'][0], res['train'][1]


X, y = pd.concat([X_mp, X_assorted]), pd.concat([y_mp, y_assorted])


# Let's see the data!

print(X.head())


# # Add Composition Features
# We need to featurize the composition strings that will serve as our model inputs,
# and pull out the band gap values that will serve as our targets.

n_datapoints = 300
data = StrToComposition(target_col_id='composition_obj')
data = data.featurize_dataframe(X[0:n_datapoints],
                                'composition',
                                ignore_errors=True)
y_subset = y[0:n_datapoints]['bandgap value (eV)']


assert len(y_subset) == len(data)


# # Add Other Features
# Choose the features that we'll use in training.

feature_calculators = MultipleFeaturizer([cf.Stoichiometry(),
                                          cf.ElementProperty.from_preset("magpie"),
                                          cf.ValenceOrbital(props=['avg']),
                                          cf.IonProperty(fast=True)])
feature_labels = feature_calculators.feature_labels()

data = feature_calculators.featurize_dataframe(data,
                                               col_id='composition_obj',
                                               ignore_errors=False)


# # Grid Search and Fit Model
# Set up the grid search model using a random forest regressor as our estimator. Then, fit the model!

quick_demo = False
est = RandomForestRegressor(n_estimators=30 if quick_demo else 150, n_jobs=-1)

model = GridSearchCV(est,
                     param_grid=dict(max_features=range(8, 15)),
                     scoring='neg_mean_squared_error',
                     cv=ShuffleSplit(n_splits=1,
                                     test_size=0.1))
model.fit(data[feature_labels], y_subset)


# # Cross Validation and Scoring
# Perform cross validation to ensure our error values are below the desired thresholds.

cv_prediction = cross_val_predict(model,
                                  data[feature_labels],
                                  y_subset,
                                  cv=KFold(10, shuffle=True))


for scorer in ['r2_score', 'mean_absolute_error', 'mean_squared_error']:
    score = getattr(metrics, scorer)(y_subset, cv_prediction)
    print(scorer, score)


# # Make Plots
# Plot the data for our bandgap analysis.

fig, ax = plt.subplots()

ax.hist2d(pd.to_numeric(y_subset),
          cv_prediction,
          norm=LogNorm(),
          bins=64,
          cmap='Blues',
          alpha=0.8)

ax.set_xlim(ax.get_ylim())
ax.set_ylim(ax.get_xlim())

mae = metrics.mean_absolute_error(y_subset, cv_prediction)
r2 = metrics.r2_score(y_subset, cv_prediction)
# note: the plotted quantity is the band gap, so the units are eV (not eV/atom)
ax.text(0.5, 0.1, 'MAE: {:.2f} eV\n$R^2$: {:.2f}'.format(mae, r2),
        transform=ax.transAxes,
        bbox={'facecolor': 'w', 'edgecolor': 'k'})

ax.plot(ax.get_xlim(), ax.get_xlim(), 'k--')

ax.set_xlabel('DFT band gap (eV)')
ax.set_ylabel('ML band gap (eV)')

fig.set_size_inches(5, 5)
fig.tight_layout()
fig.savefig('result.png', dpi=320)



