Deep Cloud Native Computing Foundation, or DeepCNCF for short, is an open-source LLM that simplifies navigating the Cloud Native ecosystem by addressing the information overload and fragmentation within the CNCF landscape. Its aim is to effortlessly provide users with detailed, context-aware answers about any CNCF project.
It was developed as part of the AMOS project together with our industry partner Kubermatic. The project consists of a pipeline that gathers the necessary information about CNCF Landscape projects (including documentation, PDFs, YAML files, JSON files, READMEs, and corresponding StackOverflow question/answer pairs), creates a question/answer pair dataset from the collected data using Google Gemma, merges it with the gathered StackOverflow question/answer pairs, and fine-tunes the Google Gemma 2B IT, Google Gemma 7B IT, and Google Gemma-2 9B IT models on the resulting data.
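As an illustration of the question/answer generation step, the sketch below shows how a documentation chunk could be turned into a Q/A pair with the instruction-tuned Gemma model via Hugging Face transformers; the prompt wording and helper function are illustrative assumptions, not the project's actual pipeline code.

```python
# Illustrative sketch of the Q/A generation step (not the project's
# actual code). Assumes access to the gated google/gemma-2b-it
# checkpoint on the Hugging Face Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate_qa_pair(doc_chunk: str) -> str:
    """Ask Gemma to turn one documentation chunk into a Q/A pair."""
    prompt = (
        "Based on the following CNCF project documentation, write one "
        "question a user might ask and a concise answer.\n\n" + doc_chunk
    )
    # Gemma IT models expect their chat template with a single user turn.
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        return_tensors="pt",
        add_generation_prompt=True,
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=256)
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
```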
The repository includes:

- Full data gathering and processing pipeline
- Training pipeline
The pipeline produces the following datasets (a loading sketch follows the list):

- cncf-raw-data-for-llm-training: raw scraped PDF, README, JSON, documentation, and YAML data
- cncf-question-and-answer-dataset-for-llm-training: artificial question/answer pair dataset generated from the raw data using Google Gemma
- stackoverflow_QAs: real question/answer pair dataset gathered from StackOverflow; only a subset of the highest-voted questions is included
- Merged_QAs: merged artificial and real question/answer pair dataset
- Benchmark-Questions: multiple-choice question/answer pair dataset used to benchmark the fine-tuned models
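Assuming these datasets are published on the Hugging Face Hub, they could be loaded as sketched here; the `your-org` namespace is a placeholder, not the project's actual Hub organization.

```python
# Minimal sketch: load the generated Q/A dataset with the Hugging Face
# `datasets` library. "your-org" is a placeholder namespace.
from datasets import load_dataset

qa_data = load_dataset("your-org/cncf-question-and-answer-dataset-for-llm-training")
print(qa_data["train"][0])  # inspect one question/answer record
```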
The following fine-tuned models are available (a sketch of loading an adapter follows the list):

- DeepCNCF: initial model, fine-tuned from the Google Gemma 2B IT model
- DeepCNCFQuantized: quantized version of DeepCNCF
- DeepCNCF2BAdapter: fine-tuned Google Gemma 2B IT model, trained on the whole dataset
- DeepCNCF7BAdapter: fine-tuned Google Gemma 7B IT model, trained on the whole dataset
- DeepCNCF9BAdapter: fine-tuned Google Gemma-2 9B IT model, trained on the whole dataset
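The "Adapter" naming suggests parameter-efficient checkpoints (e.g. LoRA). Assuming they are PEFT adapters hosted on the Hugging Face Hub, attaching one to its Gemma base model could look like this; the `your-org` repository ID is a placeholder.

```python
# Sketch of attaching a fine-tuned adapter to its Gemma base model.
# Assumes PEFT/LoRA checkpoints; "your-org/DeepCNCF2BAdapter" is a
# placeholder repository ID, not a confirmed location.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "your-org/DeepCNCF2BAdapter")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
```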
The repository is structured as follows:

- Deliverables Contains all AMOS-specific homework deliverables, referenced by the sprint number in which they were due.
- Documentation Contains the documentation on how to run the project.
- src Contains all the source code of the project.
- src/hpc_scripts Contains scripts specifically tailored to run on the HPC (High-Performance Computing) cluster of FAU; this is mostly for interacting with LLMs.
- src/scripts Contains all general-purpose scripts (e.g. scraping data from the CNCF Landscape and StackOverflow, data formatting, deploying the model); a sketch of the StackOverflow step follows this list.
- src/landscape_scraper Contains scripts for scraping the webpages of the CNCF landscape.
- test Contains all unit tests and integration tests.
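The StackOverflow gathering step could, for instance, use the public Stack Exchange API to fetch the highest-voted questions for a tag, as in this sketch; the tag, page size, and function name are illustrative, not taken from the actual scripts.

```python
# Hedged sketch of gathering top-voted StackOverflow questions via the
# public Stack Exchange API. Tag, page size, and function name are
# illustrative; the project's actual scraper lives in src/scripts.
import requests

API = "https://api.stackexchange.com/2.3"

def top_questions(tag: str, page_size: int = 50) -> list[dict]:
    """Fetch the highest-voted StackOverflow questions for one tag."""
    resp = requests.get(
        f"{API}/questions",
        params={
            "site": "stackoverflow",
            "tagged": tag,
            "sort": "votes",       # rank by score, highest first
            "order": "desc",
            "pagesize": page_size,
            "filter": "withbody",  # include the question body text
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]

print(len(top_questions("kubernetes")))
```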
If you want to run the data gathering and training pipelines yourself, or use them to gather your own data, follow the steps provided in the Documentation.
Additional information can be found in the Wiki.