
Memory issues for AlphaFlow #89

Closed
3 tasks done
jyaacoub opened this issue Mar 27, 2024 · 3 comments
Labels
bug (Something isn't working) · main hurdle/issue (This is an issue that was a pivotal moment during the project.)

Comments


jyaacoub (Owner) commented Mar 27, 2024

related: #84
Potential Solutions

Input files 12 and up failed for Davis (~178 proteins impacted)

203/229 for KIBA

jyaacoub added the bug label Mar 27, 2024

jyaacoub (Owner, Author) commented Mar 28, 2024

A histogram plot with staggered labels shows the extent of this issue, especially for Davis, which has many sequences above 1000 residues.
[Image: sequence-length histograms for the Davis, KIBA, PDBbind, and Platinum datasets]

CODE FOR PLOTS

#%%
import os
import pandas as pd
import matplotlib.pyplot as plt

# Function to load sequences and their lengths from csv files
def load_sequences(directory):
    lengths = []
    labels_positions = {}  # Maps each file's numeric index to the first sequence length in that file (used to position its label)
    files = sorted([f for f in os.listdir(directory) if f.endswith('.csv') and f.startswith('input_')])
    for file in files:
        file_path = os.path.join(directory, file)
        data = pd.read_csv(file_path)
        # Extract lengths
        current_lengths = data['seqres'].apply(len)
        lengths.extend(current_lengths)
        # Record the label position using the first length in the current file
        labels_positions[int(file.split('_')[1].split('.')[0])] = current_lengths.iloc[0]
    return lengths, labels_positions

p = lambda d: f"/cluster/home/t122995uhn/projects/data/{d}/alphaflow_io"

DATASETS = {d: p(d) for d in ['davis', 'kiba', 'pdbbind']}
DATASETS['platinum'] = "/cluster/home/t122995uhn/projects/data/PlatinumDataset/raw/alphaflow_io"

fig, axs = plt.subplots(len(DATASETS), 1, figsize=(10, 5*len(DATASETS) + len(DATASETS)))

n_bins = 50  # Adjust the number of bins according to your preference

for i, (dataset, d_dir) in enumerate(DATASETS.items()):
    # Load sequences and positions for labels
    lengths, labels_positions = load_sequences(d_dir)
    
    # Plot histogram
    ax = axs[i]
    n, bins, patches = ax.hist(lengths, bins=n_bins, color='blue', alpha=0.7)
    ax.set_title(dataset)
    
    # Add counts to each bin
    for count, x, patch in zip(n, bins, patches):
        ax.text(x + 0.5, count, str(int(count)), ha='center', va='bottom')
    
    # Add red file-index labels; the index value doubles as the y-offset so the labels are staggered vertically
    for label, pos in labels_positions.items():
        ax.text(pos, label, str(label), color='red', ha='center')
    
    # Optional: Additional formatting for readability
    ax.set_xlabel('Sequence Length')
    ax.set_ylabel('Frequency')
    ax.set_xlim([0, max(lengths) + 10])  # Adjust xlim to make sure labels fit

plt.tight_layout()
plt.show()
# %%


jyaacoub (Owner, Author) commented Apr 5, 2024

AlphaFlow low-memory code:

The linked issue suggests setting long_sequence_inference=True on the ModelConfig: bjing2016/alphaflow#17
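A minimal sketch of that suggestion, assuming the AlphaFlow fork exposes OpenFold's config module unchanged (the preset name here is illustrative):

from openfold.config import model_config

cfg = model_config(
    "initial_training",            # preset name; AlphaFlow may use a different one
    train=False,
    long_sequence_inference=True,  # switches on several memory-saving settings at once
)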

Outcome

This only produced 2 additional proteins before running into OOM again.


jyaacoub (Owner, Author) commented May 6, 2024

Solution

See commit jyaacoub/alphaflow@b93e289 for predict_deepspeed.py

To summarize, there are four ways to get around the memory issues, listed in order of least to greatest impact on time complexity (a hedged config sketch follows the list):

  1. --low_pres: Use low-precision parameters (torch.bfloat16 instead of torch.float32). This also improves runtime, since each computation is cheaper, at the risk of reduced accuracy.
  2. --chunk_size: Chunk calculations on the GPU by module. Setting this to 4 or 2 is usually sufficient; on 2x A100s, these settings combined with --low_pres got us to sequence lengths of 1070 and 1167, respectively.
  3. --cpu_offload: Offload parameters to the CPU as soon as they are not in use.
  4. --lma: Low-memory attention using Staats & Rabe's algorithm. This increases time complexity quite a bit and should only be used when absolutely necessary. To change the default LMA chunk sizes we must modify the OpenFold source code (see OS.Environ for LMA default chunk_sizes aqlaboratory/openfold#435).
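As a rough guide, here is how these four options plausibly map onto OpenFold-style config fields. This is a sketch, not the actual predict_deepspeed.py wiring: the field names (globals.chunk_size, globals.offload_inference, globals.use_lma) and the preset name come from OpenFold's config module, and the fork may handle them differently.

# Hedged sketch: a plausible mapping of the four flags onto OpenFold-style
# config fields. The real wiring lives in predict_deepspeed.py
# (jyaacoub/alphaflow@b93e289).
import torch
from openfold.config import model_config

cfg = model_config("initial_training", train=False)

# 1. --low_pres: bfloat16 roughly halves parameter/activation memory vs. float32
dtype = torch.bfloat16

# 2. --chunk_size: chunk module-level computations on the GPU (4 or 2 usually suffices)
cfg.globals.chunk_size = 4

# 3. --cpu_offload: move tensors off the GPU when they are not in use
cfg.globals.offload_inference = True

# 4. --lma: Staats & Rabe low-memory attention (largest slowdown; last resort)
cfg.globals.use_lma = True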

jyaacoub closed this as completed May 6, 2024
jyaacoub added the main hurdle/issue label Jun 15, 2024