
Using the autopilot open large git repository #195

Open
LZING opened this issue Jun 26, 2024 · 1 comment
Labels
enhancement New feature or request

Comments


LZING commented Jun 26, 2024

Describe the issue:

I'm running into a problem right now when dealing with large code repositories. Autopilot works great on small repositories, but with large ones I run into a token overflow problem when interacting with the LLM.

So how should we deal with large code repositories? Sending only part of the source code will affect the context.

Version details:

AutoPilot version tested: a6af6713400b4ba64dd17f0324b6ebf5c65dfad9

Node version: v18.16.1

Expected behaviors:

I expected autopilot to read the entire large code repository. If it cannot read the entire repository, it should split the codebase into multiple modules and conduct multiple rounds of dialogue.

LZING added the enhancement (New feature or request) label on Jun 26, 2024

codeautopilot bot commented Jun 26, 2024

Potential solution

To solve the issue of token overflow when dealing with large code repositories, we need to implement a mechanism to split the repository into smaller modules and process each module separately. This approach ensures that we stay within the token limits imposed by the language model and can handle large repositories efficiently.

What is causing this bug?

The bug is caused by the current implementation's inability to handle large repositories efficiently. The existing code processes the entire repository in one go, leading to token overflow issues when interacting with language models. Specifically, the following issues contribute to the bug:

  1. Lack of Splitting Mechanism: The code does not split large repositories into smaller, manageable modules.
  2. Single Round Processing: The code processes the entire repository in a single round of dialogue, which is not feasible for large repositories.
  3. Token Management: There is no mechanism to manage token limits when dealing with large repositories.

Code

To address these issues, we need to make changes to several files: interactiveAgent.js, tokenHelper.js, codeBase.js, and summaries.js.

interactiveAgent.js

Modify the runAgent function to handle modules and manage token limits.

const prompts = require('prompts');
const chalk = require('chalk');
const { splitRepositoryIntoModules } = require('./codeBase');
const { countTokens } = require('./tokenHelper');

const TOKEN_LIMIT = 1000; // Example token limit

async function runAgent(agentFunction, var1, var2, interactive=false) {
  console.log(`Agent ${chalk.yellow(agentFunction.name)} is running.`);

  // Split the repository into smaller modules
  const modules = splitRepositoryIntoModules(var1);

  for (const module of modules) {
    // Check token limits
    if (countTokens(module) > TOKEN_LIMIT) {
      console.error(`Module exceeds token limit: ${module.name}`);
      continue;
    }

    if (!interactive) {
      await agentFunction(module, var2);
    } else {
      let res = await agentFunction(module, var2);
      console.dir(res, { depth: null });

      const proceed = await prompts({
        type: 'select',
        name: 'value',
        message: 'Approve agent\'s reply?',
        choices: [
          { title: 'Approve - continue', value: 'continue' },
          { title: 'Retry - Rerun agent', value: 'retry' },
          { title: 'Abort', value: 'abort' }
        ]
      });

      if (proceed.value === 'continue') continue;
      if (proceed.value === 'retry') await runAgent(agentFunction, module, var2, interactive);
      if (proceed.value === 'abort') process.exit(1);
    }
  }
}

module.exports = { runAgent };

tokenHelper.js

Add a function to split large inputs by token limit and update the countTokens function.

const { get_encoding } = require('@dqbd/tiktoken');

function splitInputByTokenLimit(input, tokenLimit) {
    const encoder = get_encoding("cl100k_base");
    const textDecoder = new TextDecoder();
    const tokens = encoder.encode(input);
    const chunks = [];
    let currentChunk = [];

    tokens.forEach(token => {
        if (currentChunk.length + 1 > tokenLimit) {
            // encoder.decode expects a Uint32Array and returns UTF-8 bytes,
            // so convert the bytes back into a string
            chunks.push(textDecoder.decode(encoder.decode(new Uint32Array(currentChunk))));
            currentChunk = [];
        }
        currentChunk.push(token);
    });

    if (currentChunk.length > 0) {
        chunks.push(textDecoder.decode(encoder.decode(new Uint32Array(currentChunk))));
    }

    encoder.free();
    return chunks;
}

function countTokens(input) {
    const encoder = get_encoding("cl100k_base");
    const tokens = encoder.encode(input);
    const tokenCount = tokens.length;
    encoder.free();
    return tokenCount;
}

function countTokensInChunks(input, tokenLimit) {
    const chunks = splitInputByTokenLimit(input, tokenLimit);
    return chunks.map(chunk => countTokens(chunk));
}

module.exports = { countTokens, countTokensInChunks, splitInputByTokenLimit };

codeBase.js

Modify the codeBaseFullIndex and codeBaseFullIndexInteractive functions to handle modules and manage token limits.

async function codeBaseFullIndex(codeBaseDirectory, maxTokens = 1000) {
    const files = loadFiles(codeBaseDirectory);
    let currentModule = [];
    let currentTokenCount = 0;

    for (const file of files) {
        const fileContent = file.fileContent;
        const fileTokenCount = calculateTokensCost(fileContent);

        if (currentTokenCount + fileTokenCount > maxTokens) {
            await processModule(currentModule, codeBaseDirectory);
            currentModule = [];
            currentTokenCount = 0;
        }

        currentModule.push(file);
        currentTokenCount += fileTokenCount;
    }

    if (currentModule.length > 0) {
        await processModule(currentModule, codeBaseDirectory);
    }
}

async function processModule(module, codeBaseDirectory) {
    for (const file of module) {
        const fileContent = file.fileContent;
        const filePathRelative = file.filePath;
        await generateAndWriteFileSummary(codeBaseDirectory, filePathRelative, fileContent);
    }
}

async function codeBaseFullIndexInteractive(codeBaseDirectory, model) {
    printCostEstimation(codeBaseDirectory, model);

    if (await approveIndexing()) {
        await codeBaseFullIndex(codeBaseDirectory);
    } else {
        console.log('Aborted summarizing the project.');
    }
}

summaries.js

Modify the readAllSummaries function to read summaries in chunks and integrate the chunkSummaries function.

async function readAllSummaries(codeBaseDirectory) {
  const db = getDB(codeBaseDirectory);
  const sql = `
  SELECT path, summary 
  FROM files`;
  const summaries = await new Promise((resolve, reject) => {
    db.all(sql, (err, rows) => {
      if (err) {
        reject(err);
      } else {
        resolve(rows);
      }
    });
  });

  if (typeof summaries === 'undefined' || summaries.length === 0) {
    console.log("No matching files found in the database. Indexing is required.");
    throw new Error("Cannot run without summaries. Indexing is required.");
  }

  let summariesString = "";
  for (const summary of summaries) {
    try {
      summariesString += `File Path: ${summary.path}\nSummary:\n${summary.summary}${summaryStringDelimiter}`;
    } catch (error) {
      console.error("Error reading summary from database:", error);
    }
  }

  const summaryChunks = chunkSummaries(summariesString, maxSummaryTokenCount);
  return summaryChunks;
}

How to replicate the bug

  1. Clone a large code repository.
  2. Run the autopilot tool on the large repository.
  3. Observe the token overflow error when the tool tries to process the entire repository in one go.

By implementing the proposed changes, the tool will be able to handle large repositories by splitting them into smaller modules and processing each module separately, thus avoiding token overflow issues.


Files used for this task:

Changes on modules/interactiveAgent.js

Analysis of modules/interactiveAgent.js

Overview

The interactiveAgent.js file contains a single function, runAgent, which is responsible for running an agent function asynchronously. It supports both interactive and non-interactive modes. In interactive mode, it prompts the user to approve, retry, or abort the agent's reply.

Current Functionality

  • Non-Interactive Mode: Directly runs the agent function with the provided arguments and returns the result.
  • Interactive Mode: Runs the agent function, displays the result, and prompts the user to approve, retry, or abort the operation.

Potential Issues

  1. Handling Large Repositories: The current implementation does not support splitting large repositories into smaller modules or handling multiple rounds of dialogue for each module.
  2. Token Overflow: There is no mechanism to manage token limits when dealing with large repositories.

Required Updates

To address the issue of handling large repositories, the following updates are necessary:

  1. Splitting Repositories: Implement functionality to split large repositories into smaller modules.
  2. Multiple Rounds of Dialogue: Update the interactive agent to handle multiple rounds of dialogue for each module separately.
  3. Token Management: Ensure that token limits are respected by processing each module separately.

Proposed Changes

  1. Modify runAgent to Handle Modules:

    • Add logic to split the repository into smaller modules.
    • Process each module separately in multiple rounds of dialogue.
  2. Token Management:

    • Integrate token counting functionality to ensure that token limits are respected for each module.

Updated Code

Here is a proposed update to the runAgent function to handle large repositories:

const prompts = require('prompts');
const chalk = require('chalk');
const { splitRepositoryIntoModules } = require('./codeBase');
const { countTokens } = require('./tokenHelper');

const TOKEN_LIMIT = 1000; // Example token limit

/**
 * @description Asynchronous function that runs an agent function with given variables.
 * @param {function} agentFunction - The agent function to be executed asynchronously.
 * @param {any} var1 - The first variable to be passed as an argument to the agent function.
 * @param {any} var2 - The second variable to be passed as an argument to the agent function.
 * @param {boolean} interactive=false - A boolean indicating whether or not to prompt the user for approval after running the agent function.
 * @returns {Promise<any>} A Promise that resolves with the return value of the agent function if not in interactive mode, otherwise resolves or rejects based on user input.
*/
async function runAgent(agentFunction, var1, var2, interactive=false) {
  console.log(`Agent ${chalk.yellow(agentFunction.name)} is running.`);

  // Split the repository into smaller modules
  const modules = splitRepositoryIntoModules(var1);

  for (const module of modules) {
    // Check token limits
    if (countTokens(module) > TOKEN_LIMIT) {
      console.error(`Module exceeds token limit: ${module.name}`);
      continue;
    }

    if (!interactive) {
      await agentFunction(module, var2);
    } else {
      let res = await agentFunction(module, var2);
      console.dir(res, { depth: null });

      const proceed = await prompts({
        type: 'select',
        name: 'value',
        message: 'Approve agent\'s reply?',
        choices: [
          { title: 'Approve - continue', value: 'continue' },
          { title: 'Retry - Rerun agent', value: 'retry' },
          { title: 'Abort', value: 'abort' }
        ]
      });

      if (proceed.value === 'continue') continue;
      if (proceed.value === 'retry') await runAgent(agentFunction, module, var2, interactive);
      if (proceed.value === 'abort') process.exit(1);
    }
  }
}

module.exports = { runAgent };

Summary

The proposed changes to interactiveAgent.js include splitting large repositories into smaller modules, handling multiple rounds of dialogue for each module, and ensuring token limits are respected. This should address the issue of token overflow when dealing with large code repositories.

Changes on modules/tokenHelper.js

Analysis Report for modules/tokenHelper.js

Overview

The modules/tokenHelper.js file contains a utility function countTokens that counts the number of tokens in a given input string. This function uses the @dqbd/tiktoken library to encode the input string and then returns the length of the resulting token array.

Current Functionality

The countTokens function currently processes a single input string and returns the number of tokens in that string. This is useful for determining the token count of a given piece of text, which is essential for managing interactions with language models that have token limits.

Identified Issue

The issue described in the ticket is related to handling large code repositories, which can lead to token overflow problems when interacting with language models. The current implementation of countTokens does not account for splitting large inputs into smaller modules or handling multiple rounds of dialogue.

Required Updates

To address the issue, the countTokens function needs to be updated to handle each module separately and ensure that token limits are respected. This involves:

  1. Splitting Large Inputs: Implementing functionality to split large input strings into smaller chunks or modules.
  2. Token Limit Management: Ensuring that each chunk respects the token limits imposed by the language model.

Proposed Changes

  1. Function to Split Large Inputs: Create a new function that splits a large input string into smaller chunks based on a specified token limit.
  2. Update countTokens: Modify the countTokens function to work with the new splitting function and handle each chunk separately.

Example Implementation

New Function to Split Large Inputs

/**
 * Splits a large input string into smaller chunks based on a specified token limit.
 * @param {string} input - The large input string to split.
 * @param {number} tokenLimit - The maximum number of tokens per chunk.
 * @returns {Array<string>} - An array of input chunks.
 */
function splitInputByTokenLimit(input, tokenLimit) {
    const encoder = get_encoding("cl100k_base");
    const textDecoder = new TextDecoder();
    const tokens = encoder.encode(input);
    const chunks = [];
    let currentChunk = [];

    tokens.forEach(token => {
        if (currentChunk.length + 1 > tokenLimit) {
            // encoder.decode expects a Uint32Array and returns UTF-8 bytes,
            // so convert the bytes back into a string
            chunks.push(textDecoder.decode(encoder.decode(new Uint32Array(currentChunk))));
            currentChunk = [];
        }
        currentChunk.push(token);
    });

    if (currentChunk.length > 0) {
        chunks.push(textDecoder.decode(encoder.decode(new Uint32Array(currentChunk))));
    }

    encoder.free();
    return chunks;
}

Updated countTokens Function

/**
 * Counts the number of tokens in the input string.
 * @param {string} input - The input string to tokenize.
 * @returns {number} - The number of tokens in the input string.
 */
function countTokens(input) {
    const encoder = get_encoding("cl100k_base");
    const tokens = encoder.encode(input);
    const tokenCount = tokens.length;
    encoder.free();
    return tokenCount;
}

/**
 * Counts the number of tokens in each chunk of a large input string.
 * @param {string} input - The large input string to tokenize.
 * @param {number} tokenLimit - The maximum number of tokens per chunk.
 * @returns {Array<number>} - An array of token counts for each chunk.
 */
function countTokensInChunks(input, tokenLimit) {
    const chunks = splitInputByTokenLimit(input, tokenLimit);
    return chunks.map(chunk => countTokens(chunk));
}

module.exports = { countTokens, countTokensInChunks, splitInputByTokenLimit };

Conclusion

The proposed changes will allow the tokenHelper.js module to handle large inputs by splitting them into smaller chunks and counting tokens for each chunk separately. This will help manage token limits and prevent overflow issues when interacting with large code repositories.

Changes on modules/codeBase.js

Analysis Report

Overview

The file modules/codeBase.js is responsible for handling various operations related to indexing and summarizing a codebase. The primary functions include:

  1. codeBaseGapFill: Identifies discrepancies between the filesystem and the database.
  2. codeBaseFullIndex: Indexes the entire codebase.
  3. printCostEstimation: Estimates the cost of indexing based on token count.
  4. approveIndexing: Prompts the user for approval to proceed with indexing.
  5. codeBaseFullIndexInteractive: Interactively indexes the codebase after user approval.

Potential Causes of the Bug

The issue described involves token overflow when dealing with large repositories. The current implementation does not handle large repositories efficiently, leading to token overflow problems. Here are the potential causes:

  1. Lack of Splitting Mechanism: The current implementation of codeBaseFullIndex processes the entire codebase in one go, which can easily exceed token limits for large repositories.
  2. Single Round Processing: The codeBaseFullIndexInteractive function processes the entire codebase in a single round of dialogue, which is not feasible for large repositories.

Recommendations for Fixes

To address the issue, the following changes should be implemented:

  1. Splitting Large Repositories: Implement functionality to split large repositories into smaller modules and index them separately. This can be done by modifying the codeBaseFullIndex function to handle smaller chunks of the repository.
  2. Multiple Rounds of Dialogue: Update the codeBaseFullIndexInteractive function to handle multiple rounds of dialogue, processing each module separately.

Implementation Plan

  1. Modify codeBaseFullIndex:

    • Introduce logic to split the repository into smaller modules.
    • Index each module separately to avoid token overflow.
  2. Update codeBaseFullIndexInteractive:

    • Modify the function to handle multiple rounds of dialogue.
    • Process each module in separate rounds to ensure token limits are respected.

Example Code Changes

Here is a high-level example of how the code could be modified:

/**
 * Splits the codebase into smaller modules and indexes them separately.
 * @param {string} codeBaseDirectory - The directory of the codebase to be indexed.
 * @param {number} maxTokens - The maximum number of tokens allowed per module.
 * @returns {Promise<void>}
 */
async function codeBaseFullIndex(codeBaseDirectory, maxTokens = 1000) {
    const files = loadFiles(codeBaseDirectory);
    let currentModule = [];
    let currentTokenCount = 0;

    for (const file of files) {
        const fileContent = file.fileContent;
        const fileTokenCount = calculateTokensCost(fileContent);

        if (currentTokenCount + fileTokenCount > maxTokens) {
            // Process the current module
            await processModule(currentModule, codeBaseDirectory);
            // Reset for the next module
            currentModule = [];
            currentTokenCount = 0;
        }

        currentModule.push(file);
        currentTokenCount += fileTokenCount;
    }

    // Process the last module
    if (currentModule.length > 0) {
        await processModule(currentModule, codeBaseDirectory);
    }
}

async function processModule(module, codeBaseDirectory) {
    for (const file of module) {
        const fileContent = file.fileContent;
        const filePathRelative = file.filePath;
        await generateAndWriteFileSummary(codeBaseDirectory, filePathRelative, fileContent);
    }
}

/**
 * Interactively indexes the full codebase in multiple rounds.
 * @param {string} codeBaseDirectory - The directory of the codebase to be indexed.
 * @param {object} model - The model used for indexing.
 * @returns {Promise<void>}
 */
async function codeBaseFullIndexInteractive(codeBaseDirectory, model) {
    printCostEstimation(codeBaseDirectory, model);

    if (await approveIndexing()) {
        await codeBaseFullIndex(codeBaseDirectory);
    } else {
        console.log('Aborted summarizing the project.');
    }
}

Conclusion

The primary cause of the bug is the lack of handling for large repositories, leading to token overflow. By implementing functionality to split large repositories into smaller modules and processing them in multiple rounds, the issue can be resolved. The provided example code changes outline a potential solution to achieve this.
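Note that the proposed runAgent imports splitRepositoryIntoModules from ./codeBase, but that helper is not defined in any of the listings above. A minimal sketch of what it could look like, assuming files have already been loaded and carry a precomputed tokenCount field (both the function name and that field are assumptions, not part of the existing codebase):

```javascript
/**
 * Hypothetical helper: greedily groups pre-loaded files into modules
 * whose combined token counts stay under a limit. A single file larger
 * than the limit still becomes its own module, mirroring the behavior
 * of the proposed codeBaseFullIndex.
 */
function splitRepositoryIntoModules(files, tokenLimit = 1000) {
    const modules = [];
    let current = { files: [], tokenCount: 0 };

    for (const file of files) {
        // Start a new module when adding this file would exceed the limit
        if (current.files.length > 0 && current.tokenCount + file.tokenCount > tokenLimit) {
            modules.push(current);
            current = { files: [], tokenCount: 0 };
        }
        current.files.push(file);
        current.tokenCount += file.tokenCount;
    }

    if (current.files.length > 0) {
        modules.push(current);
    }
    return modules;
}

module.exports = { splitRepositoryIntoModules };
```

Greedy packing keeps the implementation simple; it does not produce the fewest possible modules, but every module is guaranteed to respect the limit unless a single file already exceeds it.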

Changes on modules/summaries.js

Analysis Report for modules/summaries.js

Overview

The modules/summaries.js file is responsible for handling the summarization process of files within a codebase. It includes functions to read summaries from a database, chunk summaries into manageable sizes, and generate summaries for individual files.

Key Functions

  1. chunkSummaries: Splits an array of summary strings into chunks up to a maximum size.
  2. readAllSummaries: Reads all summary files from the database.
  3. getSummaries: Fetches and validates summaries for a given test.
  4. generateAndWriteFileSummary: Processes a file by generating a summary and writing it to a new file.

Potential Issues

  1. Handling Large Repositories:

    • The current implementation reads all summaries into a single string (readAllSummaries), which could lead to token overflow issues when dealing with large repositories.
    • The chunkSummaries function is designed to split summaries into chunks, but it is not integrated into the main summarization workflow.
  2. Token Counting:

    • The countTokens function is used to ensure that summaries do not exceed the maximum token count. However, this is only applied within the chunkSummaries function and not during the initial reading of summaries.
  3. Single File Summary Size:

    • The generateAndWriteFileSummary function checks if a single file's token count exceeds the maximum allowed (maxTokenSingleFile). If it does, the file is skipped, which might lead to incomplete summaries for large files.

Recommendations

To address the issue of handling large repositories and token overflow, the following modifications are recommended:

  1. Modular Summarization:

    • Modify the readAllSummaries function to read summaries in smaller, manageable chunks rather than concatenating all summaries into a single string.
    • Integrate the chunkSummaries function into the main workflow to ensure summaries are processed in chunks that respect the token limits.
  2. Enhanced Token Management:

    • Ensure that token counting is applied consistently throughout the summarization process, including during the initial reading of summaries.
    • Implement a mechanism to handle large files that exceed the token limit by splitting them into smaller sections or summarizing them in parts.
  3. Multiple Rounds of Dialogue:

    • Update the summarization process to handle multiple rounds of dialogue for large repositories. This can be achieved by processing each module separately and maintaining context across multiple interactions.

Proposed Changes

  1. Modify readAllSummaries:

    • Change the function to read summaries in chunks and process each chunk separately.
  2. Integrate chunkSummaries:

    • Use the chunkSummaries function to split summaries into manageable sizes before further processing.
  3. Update generateAndWriteFileSummary:

    • Implement logic to handle large files by splitting them into smaller sections if they exceed the token limit.
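The large-file handling proposed in item 3 could be sketched as follows. This is a rough illustration only: it approximates the token budget with a characters-per-token ratio (the real code would measure each section with countTokens from tokenHelper.js), and splits on line boundaries so no section breaks mid-line:

```javascript
/**
 * Hypothetical sketch: instead of skipping files whose token count
 * exceeds maxTokenSingleFile, split the content into line-aligned
 * sections that each fit within an approximate token budget, so each
 * section can be summarized separately.
 */
function splitFileIntoSections(fileContent, maxTokensPerSection, charsPerToken = 4) {
    const maxChars = maxTokensPerSection * charsPerToken;
    const lines = fileContent.split("\n");
    const sections = [];
    let current = "";

    for (const line of lines) {
        // Close the current section when adding this line would exceed the budget
        if (current.length > 0 && current.length + line.length + 1 > maxChars) {
            sections.push(current);
            current = "";
        }
        current += (current.length > 0 ? "\n" : "") + line;
    }
    if (current.length > 0) sections.push(current);
    return sections;
}
```

Joining the returned sections with a newline reconstructs the original content, so nothing is lost; each section's partial summaries would then be combined into the file's overall summary.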

Example Code Changes

Here is an example of how the readAllSummaries function can be modified to read summaries in chunks:

async function readAllSummaries(codeBaseDirectory) {
  const db = getDB(codeBaseDirectory);
  const sql = `
  SELECT path, summary 
  FROM files`;
  const summaries = await new Promise((resolve, reject) => {
    db.all(sql, (err, rows) => {
      if (err) {
        reject(err);
      } else {
        resolve(rows);
      }
    });
  });

  if (typeof summaries === 'undefined' || summaries.length === 0) {
    console.log("No matching files found in the database. Indexing is required.");
    throw new Error("Cannot run without summaries. Indexing is required.");
  }

  let summariesString = "";
  for (const summary of summaries) {
    try {
      summariesString += `File Path: ${summary.path}\nSummary:\n${summary.summary}${summaryStringDelimiter}`;
    } catch (error) {
      console.error("Error reading summary from database:", error);
    }
  }

  // Split summaries into chunks
  const summaryChunks = chunkSummaries(summariesString, maxSummaryTokenCount);
  return summaryChunks;
}

By implementing these changes, the summarization process will be better equipped to handle large repositories and avoid token overflow issues.
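The chunkSummaries function that the modified readAllSummaries calls is referenced above but never shown. A minimal sketch, assuming it receives the delimited summaries string built above and packs whole entries into chunks under a token budget; countTokensApprox and the delimiter value are stand-ins (the real code would use countTokens from tokenHelper.js and the module's actual summaryStringDelimiter):

```javascript
const summaryStringDelimiter = "\n---\n"; // assumed delimiter value

// Rough token estimate (~4 characters per token); a stand-in for the
// real countTokens from tokenHelper.js.
function countTokensApprox(text) {
    return Math.ceil(text.length / 4);
}

/**
 * Splits a delimited summaries string into chunks, each holding as many
 * whole summary entries as fit under maxChunkTokens. Entries are never
 * split mid-summary, so each chunk remains coherent on its own.
 */
function chunkSummaries(summariesString, maxChunkTokens) {
    const entries = summariesString
        .split(summaryStringDelimiter)
        .filter(entry => entry.length > 0);
    const chunks = [];
    let current = "";

    for (const entry of entries) {
        const candidate = current + entry + summaryStringDelimiter;
        // Close the current chunk when adding this entry would exceed the budget
        if (current !== "" && countTokensApprox(candidate) > maxChunkTokens) {
            chunks.push(current);
            current = "";
        }
        current += entry + summaryStringDelimiter;
    }
    if (current !== "") chunks.push(current);
    return chunks;
}
```

Each returned chunk can then be sent to the model in its own round of dialogue, which is what keeps the summarization workflow under the token limit.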

Disclaimer: This comment was entirely generated using AI. Be aware that the information provided may be incorrect.

