Best practices around running multiple local models concurrently for longer running jobs. #312
Replies: 2 comments
-
Could you share the Python code you are using? I would expect the Python API to run as fast as the chat feature, because the chat feature uses it directly (see lines 412 to 442 in 17999cf). It's basically calling this in a loop: `response = conversation.prompt(prompt, system, **validated_options)`
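For illustration, here is a minimal sketch of that pattern through the Python API: load the model once, open a conversation, and prompt it repeatedly. The model name matches the plugin mentioned in this thread, but the system prompt and inputs are placeholders.

```python
import llm

# Load the model once and reuse it for every prompt, as the chat command does.
model = llm.get_model("mistral-7b-openorca")
conversation = model.conversation()

# Placeholder inputs; note that a conversation keeps prior exchanges in
# context, which may not be what you want for independent batch items.
for prompt in ["first input", "second input"]:
    response = conversation.prompt(prompt, system="You are a concise analyst.")
    print(response.text())
```

If each item should be independent, `model.prompt(...)` can be called directly without a conversation, though whether the underlying plugin keeps the weights loaded between such calls is exactly the question being discussed here.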
-
Oh interesting, I might be introducing errors on my side then. Below is a simplified mock of the code I'm running (the real version is split over multiple files, but the primary functions are included here). The idea is to load the data stored in a JSON file, spawn a few models, and process the data in batches (analyzing a text field in each object). Based on the IDs I can log, I believe the same models are being reused between runs now. After ~100 items are processed the whole thing grinds to a halt, with minimal signs of capacity strain on my CPU/GPU. This can be alleviated by using timeouts and re-initializing a model when it fails, but the performance is worse than I expected (it's entirely possible this is on my side or a limitation of the Mistral model).

```python
import json
from multiprocessing import Pool

import llm
import llm_utilities  # local helper module that also holds a reference to the model

model_instance = None


def init_process():
    # Load the model once per worker process and make it available globally.
    # (The shared_count and lock arguments from the real code are omitted in this mock.)
    global model_instance
    model_instance = llm.get_model("mistral-7b-openorca")
    llm_utilities.model_instance = model_instance


def process_activity(input_text):
    try:
        global model_instance
        function_output = model_instance.prompt(input_text).text()
        return function_output.strip()
    except Exception as e:
        print(f"Error processing activity: {e}")
        return None


def llm_analysis_parallel(num_processes=4, json_path="___.json"):
    activity_timeout = 60  # seconds to wait for each item before giving up
    all_results = []
    with open(json_path, 'r') as file:
        activities = json.load(file)
    with Pool(num_processes, initializer=init_process) as pool:
        for activity in activities:
            input_text = activity['description_text_string']
            async_result = pool.apply_async(process_activity, args=(input_text,))
            try:
                result = async_result.get(timeout=activity_timeout)
                if result is not None:
                    all_results.append(result)
            except Exception as e:
                print(f"Error retrieving result: {e}")
    return all_results
```
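One variant I plan to try, since the slowdown only shows up after ~100 items, is recycling the workers with the standard library's `maxtasksperchild` option so each process re-runs `init_process` (and reloads the model) after a fixed number of tasks. This is only a sketch reusing `init_process` and `process_activity` from above; the recycle count and the switch to `imap` are arbitrary choices, and it drops the per-item timeout handling:

```python
from multiprocessing import Pool

# Reuses init_process and process_activity from the mock above.
# texts stands in for the description strings pulled out of the JSON file.
texts = ["placeholder description text"]

with Pool(4, initializer=init_process, maxtasksperchild=25) as pool:
    # Each worker is torn down and replaced after 25 tasks, which also
    # reloads its model; results come back in input order via imap.
    results = [r for r in pool.imap(process_activity, texts) if r is not None]
```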
-
Hi @simonw, thank you for this tool; it was very fast to get up and running.
I am working on a project and am attempting to use this with llm-gpt4all to run "mistral-7b-openorca" locally for some processing tasks. I've found that the chat CLI tool (where the model is kept loaded across multiple uses) seems to be significantly faster than the Python API. I looked through the documentation of the Python API but did not see a way to expose similar functionality.
Is this possible? Do you or others have suggestions on how to go about reusing the model between processes (e.g., iterating through an array of inputs with the same system prompt)?
For reference, I am using a MacBook Pro M1 Max with 64GB of RAM, and I only see ~20% of my memory used even when using multiprocessing and loading multiple models.
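For context, what I am hoping to be able to do is roughly the following, with the model loaded once and reused for every input. The system prompt and inputs here are placeholders:

```python
import llm

# Load the model once, then reuse it for every input with the same system prompt.
model = llm.get_model("mistral-7b-openorca")

system_prompt = "Summarize the activity description in one sentence."
inputs = ["first description", "second description"]

results = [model.prompt(text, system=system_prompt).text() for text in inputs]
```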