Commit

merge remote branch 'main' into experiments
qcampbel committed Aug 19, 2024

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
2 parents 08ccfa9 + f4cecf4 commit 8373afe
Showing 14 changed files with 181 additions and 256 deletions.
7 changes: 5 additions & 2 deletions .env.example
Original file line number Diff line number Diff line change
@@ -1,8 +1,11 @@
# Copy this file to a new file named .env and replace the placeholders with your actual keys.
# REMOVE "pragma: allowlist secret" when you replace with actual keys.
# DO NOT fill your keys directly into this file.

# OpenAI API Key
OPENAI_API_KEY=YOUR_OPENAI_API_KEY_GOES_HERE # pragma: allowlist secret

# Serp API key
SERP_API_KEY=YOUR_SERP_API_KEY_GOES_HERE # pragma: allowlist secret
# PQA API Key to use LiteratureSearch tool (optional) -- it also requires OpenAI key
PQA_API_KEY=YOUR_PQA_API_KEY_GOES_HERE # pragma: allowlist secret

# Optional: add TogetherAI, Fireworks, or Anthropic API key here to use their models
38 changes: 24 additions & 14 deletions README.md
@@ -1,38 +1,48 @@
MD-Agent is a LLM-agent based toolset for Molecular Dynamics.
MDAgent is an LLM-agent-based toolset for molecular dynamics.
It's built using Langchain and uses a collection of tools to set up and execute molecular dynamics simulations, particularly in OpenMM.


## Environment Setup
To use the OpenMM features in the agent, please set up a conda environment, following these steps.
- Create conda environment: `conda env create -n mdagent -f environment.yaml`
- Activate your environment: `conda activate mdagent`
```
conda env create -n mdagent -f environment.yaml
conda activate mdagent
```

If you already have a conda environment, you can install dependencies before you activate it with the following step.
- Install the necessary conda dependencies: `conda env update -n <YOUR_CONDA_ENV_HERE> -f environment.yaml`

If you already have a conda environment, you can install dependencies with the following step.
- Install the necessary conda dependencies: `conda install -c conda-forge openmm pdbfixer mdtraj`


## Installation
```
pip install git+https://github.com/ur-whitelab/md-agent.git
```


## Usage
The first step is to set up your API keys in your environment. An OpenAI key is necessary for this project.
The next step is to set up your API keys in your environment. An API key for an LLM provider is required for this project. Supported LLM providers are OpenAI, TogetherAI, Fireworks, and Anthropic.
Some tools require their own API keys, such as paper-qa for literature searches. We recommend setting up the keys in a .env file. You can use the provided .env.example file as a template.
1. Copy the `.env.example` file and rename it to `.env`: `cp .env.example .env`
2. Replace the placeholder values in `.env` with your actual keys
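
As a quick sanity check after editing `.env`, the sketch below lists keys that still hold the template placeholder. It is only an illustration: `parse_env` and `placeholder_keys` are hypothetical helpers, not part of MDAgent, and the parser assumes simple `KEY=VALUE` lines whose values contain no `#` character.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def placeholder_keys(values):
    """Return keys that are empty or still end with the template suffix."""
    return sorted(k for k, v in values.items() if not v or v.endswith("_GOES_HERE"))


# Illustrative file content; the key values here are made up.
example = """
# OpenAI API Key
OPENAI_API_KEY=my-real-key
PQA_API_KEY=YOUR_PQA_API_KEY_GOES_HERE # pragma: allowlist secret
"""

env = parse_env(example)
print(placeholder_keys(env))
```

Any key reported by `placeholder_keys` still needs a real value (or can be left out if the corresponding tool is optional).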

<!-- ## Using Streamlit Interface
If you'd like to use MDAgent via the streamlit app, make sure you have completed the steps above. Then, in your terminal, run `streamlit run st_app.py` in the project root directory.
You can ask MDAgent to conduct molecular dynamics tasks using one of OpenAI's GPT models:
```
from mdagent import MDAgent
agent = MDAgent(model="gpt-3.5-turbo")
agent.run("Simulate protein 1ZNI at 300 K for 0.1 ps and calculate the RMSD over time.")
```
Note: to distinguish Together models from the rest, add the "together/" prefix to the model name, such as `agent = MDAgent(model="together/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo")`.

From there you may upload files to use during the run. Note: the app is currently limited to uploading .pdb and .cif files, and the max size is defaulted at 200MB.
- To upload larger files, instead run `streamlit run st_app.py --server.maxUploadSize=some_large_number`
- To add different file types, you can add your desired file type to the list in the [streamlit app file](https://github.com/ur-whitelab/md-agent/blob/main/st_app.py). -->
## LLM Providers
By default, we support LLMs through the OpenAI API. However, feel free to use other LLM providers; just make sure to install the package each one requires. Here is the list of packages for the alternative LLM providers we support:
- `pip install langchain-together` to use models from TogetherAI
- `pip install langchain-anthropic` to use models from Anthropic
- `pip install langchain-fireworks` to use models from Fireworks
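
To see which package a given model name would need before running the agent, something like the following sketch can help. The prefixes mirror the checks in `mdagent/utils/makellm.py`; the `PROVIDER_PACKAGES` table and the two helper functions are hypothetical names introduced here for illustration. Note that the import names use underscores while the pip names above use hyphens.

```python
import importlib.util

# Model-name prefix -> import name of the LangChain integration package.
PROVIDER_PACKAGES = [
    ("gpt-3.5-turbo", "langchain_openai"),
    ("gpt-4", "langchain_openai"),
    ("accounts/fireworks", "langchain_fireworks"),
    ("together/", "langchain_together"),
    ("claude", "langchain_anthropic"),
]


def required_package(model):
    """Return the import name needed for `model`, or None if the prefix is unknown."""
    for prefix, package in PROVIDER_PACKAGES:
        if model.startswith(prefix):
            return package
    return None


def is_installed(package):
    """True if a top-level package with this import name can be found."""
    return importlib.util.find_spec(package) is not None


model = "together/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"
package = required_package(model)
print(package, "installed" if is_installed(package) else "missing")
```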


## Contributing

We welcome contributions to MD-Agent! If you're interested in contributing to the project, please check out our [Contributor's Guide](CONTRIBUTING.md) for detailed instructions on getting started, feature development, and the pull request process.
We welcome contributions to MDAgent! If you're interested in contributing to the project, please check out our [Contributor's Guide](CONTRIBUTING.md) for detailed instructions on getting started, feature development, and the pull request process.

We value and appreciate all contributions to MD-Agent.
We value and appreciate all contributions to MDAgent.
4 changes: 2 additions & 2 deletions mdagent/agent/agent.py
@@ -4,7 +4,7 @@
from langchain.agents import AgentExecutor, OpenAIFunctionsAgent
from langchain.agents.structured_chat.base import StructuredChatAgent

from ..tools import get_tools, make_all_tools
from ..tools import get_relevant_tools, make_all_tools
from ..utils import PathRegistry, SetCheckpoint, _make_llm
from .memory import MemoryManager
from .prompt import openaifxn_prompt, structured_prompt
@@ -76,7 +76,7 @@ def _initialize_tools_and_agent(self, user_input=None):
else:
if self.top_k_tools != "all" and user_input is not None:
# retrieve only tools relevant to user input
self.tools = get_tools(
self.tools = get_relevant_tools(
query=user_input,
llm=self.tools_llm,
top_k_tools=self.top_k_tools,
4 changes: 2 additions & 2 deletions mdagent/tools/__init__.py
@@ -1,3 +1,3 @@
from .maketools import get_tools, make_all_tools
from .maketools import get_relevant_tools, make_all_tools

__all__ = ["get_tools", "make_all_tools"]
__all__ = ["get_relevant_tools", "make_all_tools"]
2 changes: 0 additions & 2 deletions mdagent/tools/base_tools/__init__.py
@@ -45,7 +45,6 @@
)
from .simulation_tools.create_simulation import ModifyBaseSimulationScriptTool
from .simulation_tools.setup_and_run import SetUpandRunFunction
from .util_tools.git_issues_tool import SerpGitTool
from .util_tools.registry_tools import ListRegistryPaths, MapPath2Name
from .util_tools.search_tools import Scholar2ResultLLM

@@ -87,7 +86,6 @@
"RDFTool",
"RMSDCalculator",
"Scholar2ResultLLM",
"SerpGitTool",
"SetUpandRunFunction",
"SimulationOutputFigures",
"SmallMolPDB",
2 changes: 0 additions & 2 deletions mdagent/tools/base_tools/preprocess_tools/pdb_get.py
@@ -1,7 +1,6 @@
from typing import Optional

import requests
import streamlit as st
from langchain.tools import BaseTool
from rdkit import Chem
from rdkit.Chem import AllChem
@@ -36,7 +35,6 @@ def get_pdb(query_string: str, path_registry: PathRegistry):
results = r.json()["result_set"]
pdbid = max(results, key=lambda x: x["score"])["identifier"]
print(f"PDB file found with this ID: {pdbid}")
st.markdown(f"PDB file found with this ID: {pdbid}", unsafe_allow_html=True)
url = f"https://files.rcsb.org/download/{pdbid}.{filetype}"
pdb = requests.get(url)
filename = path_registry.write_file_name(
12 changes: 0 additions & 12 deletions mdagent/tools/base_tools/simulation_tools/setup_and_run.py
@@ -7,7 +7,6 @@
from typing import Any, Dict, List, Optional, Type

import requests
import streamlit as st
from langchain.tools import BaseTool
from openff.toolkit.topology import Molecule
from openmm import (
@@ -251,7 +250,6 @@ def __init__(

def setup_system(self):
print("Building system...")
st.markdown("Building system", unsafe_allow_html=True)
self.pdb_id = self.params["pdb_id"]
self.pdb_path = self.path_registry.get_mapped_path(self.pdb_id)
self.pdb = PDBFile(self.pdb_path)
@@ -285,7 +283,6 @@ def setup_system(self):

def setup_integrator(self):
print("Setting up integrator...")
st.markdown("Setting up integrator", unsafe_allow_html=True)
int_params = self.int_params
integrator_type = int_params.get("integrator_type", "LangevinMiddle")

@@ -310,7 +307,6 @@ def setup_integrator(self):

def create_simulation(self):
print("Creating simulation...")
st.markdown("Creating simulation", unsafe_allow_html=True)
self.simulation = Simulation(
self.modeller.topology,
self.system,
@@ -838,12 +834,10 @@ def remove_leading_spaces(text):
file.write(script_content)

print(f"Standalone simulation script written to {directory}/{filename}")
st.markdown("Standalone simulation script written", unsafe_allow_html=True)

def run(self):
# Minimize and Equilibrate
print("Performing energy minimization...")
st.markdown("Performing energy minimization", unsafe_allow_html=True)

self.simulation.minimizeEnergy()
print("Minimization complete!")
@@ -857,19 +851,16 @@ def run(self):
)
self.path_registry.map_path(f"top_{self.sim_id}", top_name, top_description)
print("Initial Positions saved to initial_positions.pdb")
st.markdown("Minimization complete! Equilibrating...", unsafe_allow_html=True)
print("Equilibrating...")
_temp = self.int_params["Temperature"]
self.simulation.context.setVelocitiesToTemperature(_temp)
_eq_steps = self.sim_params.get("equilibrationSteps", 1000)
self.simulation.step(_eq_steps)
# Simulate
print("Simulating...")
st.markdown("Simulating...", unsafe_allow_html=True)
self.simulation.currentStep = 0
self.simulation.step(self.sim_params["Number of Steps"])
print("Done!")
st.markdown("Done!", unsafe_allow_html=True)
if not self.save:
if os.path.exists("temp_trajectory.dcd"):
os.remove("temp_trajectory.dcd")
@@ -950,7 +941,6 @@ def _run(self, **input_args):
openmmsim.create_simulation()

print("simulation set!")
st.markdown("simulation set!", unsafe_allow_html=True)
except ValueError as e:
msg = str(e) + f"This were the inputs {input_args}"
if "No template for" in msg:
@@ -1492,11 +1482,9 @@ def check_system_params(cls, values):
forcefield_files = values.get("forcefield_files")
if forcefield_files is None or forcefield_files is []:
print("Setting default forcefields")
st.markdown("Setting default forcefields", unsafe_allow_html=True)
forcefield_files = ["amber14-all.xml", "amber14/tip3pfb.xml"]
elif len(forcefield_files) == 0:
print("Setting default forcefields v2")
st.markdown("Setting default forcefields", unsafe_allow_html=True)
forcefield_files = ["amber14-all.xml", "amber14/tip3pfb.xml"]
else:
for file in forcefield_files:
2 changes: 0 additions & 2 deletions mdagent/tools/base_tools/util_tools/__init__.py
@@ -1,10 +1,8 @@
from .git_issues_tool import SerpGitTool
from .registry_tools import ListRegistryPaths, MapPath2Name
from .search_tools import Scholar2ResultLLM

__all__ = [
"ListRegistryPaths",
"MapPath2Name",
"Scholar2ResultLLM",
"SerpGitTool",
]
168 changes: 0 additions & 168 deletions mdagent/tools/base_tools/util_tools/git_issues_tool.py

This file was deleted.

79 changes: 39 additions & 40 deletions mdagent/tools/maketools.py
@@ -1,11 +1,12 @@
import os

import streamlit as st
import numpy as np
from dotenv import load_dotenv
from langchain import agents
from langchain.base_language import BaseLanguageModel
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from mdagent.utils import PathRegistry

@@ -73,7 +74,7 @@ def make_all_tools(
all_tools += [
ModifyBaseSimulationScriptTool(path_registry=path_instance, llm=llm),
]
if "OPENAI_API_KEY" in os.environ:
if "OPENAI_API_KEY" in os.environ and "PQA_API_KEY" in os.environ:
all_tools += [Scholar2ResultLLM(llm=llm, path_registry=path_instance)]
if human:
all_tools += [agents.load_tools(["human"], llm)[0]]
@@ -131,46 +132,44 @@ def make_all_tools(
return all_tools


def get_tools(
query,
llm: BaseLanguageModel,
top_k_tools=15,
human=False,
):
ckpt_dir = PathRegistry.get_instance().ckpt_dir
def get_relevant_tools(query, llm: BaseLanguageModel, top_k_tools=15, human=False):
"""
Get most relevant tools for the query using vector similarity search.
Query and tools are vectorized using either OpenAI embeddings or TF-IDF.
If an OpenAI API key is available, it uses embeddings for a more
sophisticated search. Otherwise, it falls back to using TF-IDF for
simpler, term-based matching.
Returns:
- A list of the most relevant tools, or None if no tools are found.
"""

all_tools = make_all_tools(llm, human=human)
if not all_tools:
return None

tool_texts = [f"{tool.name} {tool.description}" for tool in all_tools]

# set vector DB for all tools
vectordb = Chroma(
collection_name="all_tools_vectordb",
embedding_function=OpenAIEmbeddings(),
persist_directory=f"{ckpt_dir}/all_tools_vectordb",
)
# vectordb.delete_collection() #<--- to clear previous vectordb directory
for i, tool in enumerate(all_tools):
vectordb.add_texts(
texts=[tool.description],
ids=[tool.name],
metadatas=[{"tool_name": tool.name, "index": i}],
)

# retrieve 'k' tools
k = min(top_k_tools, vectordb._collection.count())
# convert texts to vectors
if "OPENAI_API_KEY" in os.environ:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
try:
tool_vectors = np.array(embeddings.embed_documents(tool_texts))
query_vector = np.array(embeddings.embed_query(query)).reshape(1, -1)
except Exception as e:
print(f"Error generating embeddings for tool retrieval: {e}")
return None
else:
vectorizer = TfidfVectorizer()
tool_vectors = vectorizer.fit_transform(tool_texts)
query_vector = vectorizer.transform([query])

similarities = cosine_similarity(query_vector, tool_vectors).flatten()
k = min(max(top_k_tools, 1), len(all_tools))
if k == 0:
return None
docs = vectordb.similarity_search(query, k=k)
retrieved_tools = []
for d in docs:
index = d.metadata.get("index")
if index is not None and 0 <= index < len(all_tools):
retrieved_tools.append(all_tools[index])
else:
print(f"Invalid index {index}.")
print("Some tools may be duplicated.")
print(f"Try to delete vector DB at {ckpt_dir}/all_tools_vectordb.")
st.markdown(
"Invalid index. Some tools may be duplicated Try to delete VDB.",
unsafe_allow_html=True,
)
top_k_indices = np.argsort(similarities)[-k:][::-1]
retrieved_tools = [all_tools[i] for i in top_k_indices]

return retrieved_tools
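
The ranking idea in `get_relevant_tools` (vectorize the tool texts and the query, score by cosine similarity, keep the top `k`) can be illustrated with a standard-library-only sketch. This is not the code from the commit: it uses plain term counts instead of TF-IDF or OpenAI embeddings, `cosine` and `top_k` are hypothetical helper names, and the tool descriptions are made up for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, texts, k):
    """Return indices of the k texts most similar to the query, best first."""
    q = Counter(query.lower().split())
    scores = [cosine(q, Counter(t.lower().split())) for t in texts]
    k = min(max(k, 1), len(texts))  # same clamping as in the diff above
    order = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Tool names come from the repository; the descriptions are illustrative.
tool_texts = [
    "RMSDCalculator compute rmsd of a trajectory",
    "SetUpandRunFunction set up and run a simulation",
    "RDFTool compute radial distribution function",
]
print(top_k("run a short simulation", tool_texts, k=2))
```

Clamping `k` to the range from 1 to the number of tools means a request for more tools than exist returns all of them, which matches the behavior exercised by `test_get_relevant_tools_top_k`.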
12 changes: 12 additions & 0 deletions mdagent/utils/makellm.py
@@ -1,6 +1,15 @@
import importlib.util

from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler


def check_package_exists(package_name, model):
    if not importlib.util.find_spec(package_name):
        raise ImportError(
            f"The package required to run model '{model}' is missing: '{package_name}'."
        )


def _make_llm(model, temp, streaming):
    if model.startswith("gpt-3.5-turbo") or model.startswith("gpt-4"):
        from langchain_openai import ChatOpenAI
@@ -13,6 +22,7 @@ def _make_llm(model, temp, streaming):
            callbacks=[StreamingStdOutCallbackHandler()] if streaming else None,
        )
    elif model.startswith("accounts/fireworks"):
        check_package_exists("langchain_fireworks", model)
        from langchain_fireworks import ChatFireworks

        llm = ChatFireworks(
@@ -24,6 +34,7 @@ def _make_llm(model, temp, streaming):
        )
    elif model.startswith("together/"):
        # user needs to add 'together/' prefix to use TogetherAI provider
        check_package_exists("langchain_together", model)
        from langchain_together import ChatTogether

        llm = ChatTogether(
@@ -34,6 +45,7 @@ def _make_llm(model, temp, streaming):
            callbacks=[StreamingStdOutCallbackHandler()] if streaming else None,
        )
    elif model.startswith("claude"):
        check_package_exists("langchain_anthropic", model)
        from langchain_anthropic import ChatAnthropic

        llm = ChatAnthropic(
9 changes: 0 additions & 9 deletions setup.py
@@ -17,19 +17,12 @@
license="MIT",
packages=find_packages(),
install_requires=[
"chromadb",
"google-search-results",
"langchain==0.2.12",
"langchain-anthropic==0.1.22",
"langchain-chroma",
"langchain-community",
"langchain-fireworks==0.1.7",
"langchain-openai==0.1.19",
"langchain-together==0.1.4",
"matplotlib",
"nbformat",
"openai",
"outlines",
"paper-qa==4.0.0rc8 ",
"paper-scraper @ git+https://github.com/blackadad/paper-scraper.git",
"pandas",
@@ -38,8 +31,6 @@
"rdkit",
"requests",
"seaborn",
"streamlit",
"tiktoken",
"scikit-learn",
"scipy==1.14.0",
],
2 changes: 1 addition & 1 deletion tests/test_analysis/test_inertia.py
@@ -57,4 +57,4 @@ def test_plot_moi_multiple_frames(mock_close, mock_savefig, moi_functions):
result = moi_functions.plot_moi()
assert "Plot of moments of inertia over time saved" in result
mock_savefig.assert_called_once()
mock_close.assert_called_once()
assert mock_close.call_count >= 1
96 changes: 96 additions & 0 deletions tests/test_utils/test_top_k_tools.py
@@ -0,0 +1,96 @@
import os
from unittest.mock import MagicMock, patch

import numpy as np
import pytest

from mdagent.tools.maketools import get_relevant_tools


@pytest.fixture
def mock_llm():
    return MagicMock()


@pytest.fixture
def mock_tools():
    Tool = MagicMock()
    tool1 = Tool(name="Tool1", description="This is the first tool")
    tool2 = Tool(name="Tool2", description="This is the second tool")
    tool3 = Tool(name="Tool3", description="This is the third tool")
    return [tool1, tool2, tool3]


@patch("mdagent.tools.maketools.make_all_tools")
@patch("mdagent.tools.maketools.OpenAIEmbeddings")
def test_get_relevant_tools_with_openai_embeddings(
    mock_openai_embeddings, mock_make_all_tools, mock_llm, mock_tools
):
    mock_make_all_tools.return_value = mock_tools
    mock_embed_documents = mock_openai_embeddings.return_value.embed_documents
    mock_embed_query = mock_openai_embeddings.return_value.embed_query
    mock_embed_documents.return_value = np.random.rand(3, 512)
    mock_embed_query.return_value = np.random.rand(512)

    with patch.dict(
        os.environ, {"OPENAI_API_KEY": "test_key"}  # pragma: allowlist secret
    ):
        relevant_tools = get_relevant_tools("test query", mock_llm, top_k_tools=2)
        assert len(relevant_tools) == 2
        assert relevant_tools[0] in mock_tools
        assert relevant_tools[1] in mock_tools


@patch("mdagent.tools.maketools.make_all_tools")
@patch("mdagent.tools.maketools.TfidfVectorizer")
def test_get_relevant_tools_with_tfidf(
    mock_tfidf_vectorizer, mock_make_all_tools, mock_llm, mock_tools
):
    mock_make_all_tools.return_value = mock_tools
    mock_vectorizer = mock_tfidf_vectorizer.return_value
    mock_vectorizer.fit_transform.return_value = np.random.rand(3, 10)
    mock_vectorizer.transform.return_value = np.random.rand(1, 10)

    with patch.dict(os.environ, {}, clear=True):  # ensure OPENAI_API_KEY is not set
        relevant_tools = get_relevant_tools("test query", mock_llm, top_k_tools=2)
        assert len(relevant_tools) == 2
        assert relevant_tools[0] in mock_tools
        assert relevant_tools[1] in mock_tools


@patch("mdagent.tools.maketools.make_all_tools")
def test_get_relevant_tools_with_no_tools(mock_make_all_tools, mock_llm):
    mock_make_all_tools.return_value = []

    with patch.dict(os.environ, {}, clear=True):
        relevant_tools = get_relevant_tools("test query", mock_llm)
        assert relevant_tools is None


@patch("mdagent.tools.maketools.make_all_tools")
@patch("mdagent.tools.maketools.OpenAIEmbeddings")
def test_get_relevant_tools_with_openai_exception(
    mock_openai_embeddings, mock_make_all_tools, mock_llm, mock_tools
):
    mock_make_all_tools.return_value = mock_tools
    mock_embed_documents = mock_openai_embeddings.return_value.embed_documents
    mock_embed_documents.side_effect = Exception("Embedding error")

    with patch.dict(
        os.environ, {"OPENAI_API_KEY": "test_key"}  # pragma: allowlist secret
    ):
        relevant_tools = get_relevant_tools("test query", mock_llm)
        assert relevant_tools is None


@patch("mdagent.tools.maketools.make_all_tools")
def test_get_relevant_tools_top_k(mock_make_all_tools, mock_llm, mock_tools):
    mock_make_all_tools.return_value = mock_tools

    with patch.dict(os.environ, {}, clear=True):
        relevant_tools = get_relevant_tools("test query", mock_llm, top_k_tools=1)
        assert len(relevant_tools) == 1
        assert relevant_tools[0] in mock_tools

        relevant_tools = get_relevant_tools("test query", mock_llm, top_k_tools=5)
        assert len(relevant_tools) == len(mock_tools)
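
A note on the `mock_tools` fixture: calling a `MagicMock` instance returns its `return_value`, which is the same object on every call, and keyword arguments passed in that call are only recorded, not stored as attributes. The tests above still pass because they check only list lengths and membership. If distinct tools with real string names were wanted, one possible approach is sketched below; `make_tool` is a hypothetical helper, not part of the repository.

```python
from unittest.mock import MagicMock

Tool = MagicMock()
a = Tool(name="Tool1", description="first")
b = Tool(name="Tool2", description="second")
print(a is b)  # calling the same mock twice yields the same return_value


def make_tool(name, description):
    """Build a separate mock per tool and set plain string attributes on it."""
    tool = MagicMock()
    tool.name = name  # assigned after creation so it is an ordinary attribute
    tool.description = description
    return tool


tools = [
    make_tool("Tool1", "This is the first tool"),
    make_tool("Tool2", "This is the second tool"),
]
print([t.name for t in tools])
```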
