Best practices around running multiple local models concurrently for longer running jobs. #312
Replies: 2 comments
-
Could you share the Python code you are using? I would expect the Python API to run as fast as the chat feature, because the chat feature uses it directly (see lines 412 to 442 in 17999cf). It's basically calling this in a loop: `response = conversation.prompt(prompt, system, **validated_options)`
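For illustration, here is a minimal sketch of that pattern through the Python API: load the model once, open a conversation, and prompt it repeatedly. The model name matches the plugin mentioned in this thread, but the system prompt and inputs are placeholders.

```python
import llm

# Load the model once and reuse it for every prompt, as the chat command does.
model = llm.get_model("mistral-7b-openorca")
conversation = model.conversation()

# Placeholder inputs; note that a conversation keeps prior exchanges in
# context, which may not be what you want for independent batch items.
for prompt in ["first input", "second input"]:
    response = conversation.prompt(prompt, system="You are a concise analyst.")
    print(response.text())
```

If each item should be independent, `model.prompt(...)` can be called directly without a conversation, though whether the underlying plugin keeps the weights loaded between such calls is exactly the question being discussed here.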
-
Oh interesting, I might be introducing errors on my side then. Below is a simplified mock of the code I'm running (the real version is split over multiple files, but the primary functions are included here). The idea is to load the data stored in a JSON file, spawn a few models, and process the data in batches (analyzing a text field in each object). Based on the IDs I can log, I believe the same models are being reused between runs now. After ~100 items are processed the whole thing grinds to a halt, with minimal signs of capacity strain on my CPU/GPU. This can be alleviated by using timeouts and re-initializing a model when it fails, but the performance is worse than I expected (it's entirely possible this is on my side or a limitation of the Mistral model).

```python
import json
from multiprocessing import Pool

import llm
import llm_utilities  # local helper module that also holds a reference to the model

model_instance = None


def init_process():
    # Load the model once per worker process and make it available globally.
    # (The shared_count and lock arguments from the real code are omitted in this mock.)
    global model_instance
    model_instance = llm.get_model("mistral-7b-openorca")
    llm_utilities.model_instance = model_instance


def process_activity(input_text):
    try:
        global model_instance
        function_output = model_instance.prompt(input_text).text()
        return function_output.strip()
    except Exception as e:
        print(f"Error processing activity: {e}")
        return None


def llm_analysis_parallel(num_processes=4, json_path="___.json"):
    activity_timeout = 60  # seconds to wait for each item before giving up
    all_results = []
    with open(json_path, 'r') as file:
        activities = json.load(file)
    with Pool(num_processes, initializer=init_process) as pool:
        for activity in activities:
            input_text = activity['description_text_string']
            async_result = pool.apply_async(process_activity, args=(input_text,))
            try:
                result = async_result.get(timeout=activity_timeout)
                if result is not None:
                    all_results.append(result)
            except Exception as e:
                print(f"Error retrieving result: {e}")
    return all_results
```
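One variant I plan to try, since the slowdown only shows up after ~100 items, is recycling the workers with the standard library's `maxtasksperchild` option so each process re-runs `init_process` (and reloads the model) after a fixed number of tasks. This is only a sketch reusing `init_process` and `process_activity` from above; the recycle count and the switch to `imap` are arbitrary choices, and it drops the per-item timeout handling:

```python
from multiprocessing import Pool

# Reuses init_process and process_activity from the mock above.
# texts stands in for the description strings pulled out of the JSON file.
texts = ["placeholder description text"]

with Pool(4, initializer=init_process, maxtasksperchild=25) as pool:
    # Each worker is torn down and replaced after 25 tasks, which also
    # reloads its model; results come back in input order via imap.
    results = [r for r in pool.imap(process_activity, texts) if r is not None]
```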
-
Hi @simonw, thank you for this tool; it was very fast to get up and running.
I am working on a project and am attempting to use this with llm-gpt4all to run "mistral-7b-openorca" locally for some processing tasks. I've found that the chat CLI tool (where the model is kept loaded across multiple uses) seems to be significantly faster than the Python API. I looked through the documentation of the Python API but did not see a way to expose similar functionality.
Is this possible? Do you or others have suggestions on how to go about reusing the model between processes (e.g., iterating through an array of inputs with the same system prompt)?
For reference, I am using a MacBook Pro M1 Max with 64GB of RAM, and I only see ~20% of my memory used even when using multiprocessing and loading multiple models.
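For context, what I am hoping to be able to do is roughly the following, with the model loaded once and reused for every input. The system prompt and inputs here are placeholders:

```python
import llm

# Load the model once, then reuse it for every input with the same system prompt.
model = llm.get_model("mistral-7b-openorca")

system_prompt = "Summarize the activity description in one sentence."
inputs = ["first description", "second description"]

results = [model.prompt(text, system=system_prompt).text() for text in inputs]
```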