Skip to content

Analyze and prepare entire codebase for fine tuning or RAG

License

Notifications You must be signed in to change notification settings

Moe230391/codebase-analysis

Repository files navigation

README.md

Codebase Analysis and Processing Script This script is designed to analyze and process a large codebase, specifically a full-stack SaaS application, for the purpose of preparing the data for fine-tuning a language model (LLM). The script performs various tasks such as file type detection, code parsing, dependency graph construction, call graph construction, and data validation.

Table of Contents

Features Use Case Modules Getting Started Configuration Running the Script Output Future Enhancements

Features

File type detection and handling for various programming languages and file formats Code parsing and analysis specific to each language (Python, HTML, CSS, JavaScript, TypeScript, JSX, Vue, etc.) Extraction of code elements such as classes, functions, imports, and comments Named Entity Recognition (NER) to identify important entities in the code Dependency graph construction to understand the relationships between files and modules Call graph construction to analyze the function call flow within the codebase Calculation of code metrics such as cyclomatic complexity, function count, and comment density Data validation and serialization into a structured format (JSONL) for further use

Use Case The primary use case of this script is to process a large codebase and extract meaningful information for fine-tuning an LLM. The script aims to:

Identify and categorize files based on their types and languages Extract relevant code elements, entities, and metrics Build dependency graphs and call graphs to understand the codebase structure and flow Validate and serialize the processed data into a structured format (JSONL) for fine-tuning an LLM

The processed data can be used to train an LLM, enabling it to understand and generate code snippets, assist in code completion, and provide insights into the codebase structure and dependencies. Modules The script consists of six main modules, each responsible for specific tasks:

main_controller.py: The main module that orchestrates the entire processing pipeline. It handles directory traversal, file processing, and data aggregation. file_handling.py: Contains functions for file type detection and reading file contents. It uses libraries like filetype and magic to guess the file type and handles text file reading using aiofiles. data_validation.py: Provides functions for validating the processed data against a predefined JSON schema. It ensures the integrity and consistency of the data before further processing. analysis.py: Implements the core analysis functions for different file types (Python, HTML, CSS, JavaScript, etc.). It extracts relevant information such as classes, functions, imports, comments, and applies NER and code metrics calculation. utilities.py: Contains utility functions for saving data to JSONL format, caching analysis results, loading documentation links, and verifying JSONL files. graph_management.py: Handles the construction and visualization of dependency graphs and call graphs. It uses the networkx library for graph manipulation and matplotlib for visualization.

Getting Started To get started with the script, follow these steps:

Clone the repository and navigate to the project directory. Install the required dependencies by running pip install -r requirements.txt. Configure the script by updating the necessary paths and settings in the main_controller.py file. Prepare the codebase directory and ensure it follows the expected structure. Run the script using python main_controller.py.

Configuration The script requires certain configurations to be set in the main_controller.py file:

root_dir: The path to the root directory of the codebase to be processed. output_dir: The directory where the output files and visualizations will be stored. doc_links_csv: The path to the CSV file containing documentation links for the codebase.

Modify these configurations according to your specific setup and requirements. Running the Script To run the script, execute the following command in the terminal: python main_controller.py

The script will start processing the codebase, and progress will be logged in the console and the codebase_mirror.log file. Output The script generates the following output files and directories:

output_dir/: Directory containing the JSONL files for each processed module. dependency_graph.png: Visualization of the dependency graph. call_graph.png: Visualization of the call graph. codebase_mirror.log: Log file capturing the script's execution details and any errors encountered.

The JSONL files in the output_dir/ follow the naming convention <module_path>.jsonl and contain the processed data for each file in the module. Future Enhancements

Support for additional file types and programming languages to expand the script's applicability to diverse codebases. Integration of more advanced code analysis techniques, such as data flow analysis or type inference, to provide deeper insights into the codebase. Optimization of the processing pipeline to handle large codebases efficiently, possibly through parallel processing or distributed computing. Enhanced visualization capabilities, such as interactive graphs or customizable layouts, to make the output more user-friendly and informative. Integration with popular code editors or IDEs as a plugin to provide real-time insights and assistance while working on the codebase. Customizable configuration options to allow users to tailor the script's behavior and output according to their specific needs.

Feel free to contribute to the project by addressing these enhancements, suggesting improvements, or adding new features. Happy coding!

About

Analyze and prepare entire codebase for fine tuning or RAG

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages