Deep Cloud Native Computing Foundation, or DeepCNCF for short, is an open-source LLM that simplifies navigating the Cloud Native ecosystem by addressing the information overload and fragmentation within the CNCF landscape. Its aim is to effortlessly provide users with detailed, context-aware answers about any CNCF project.
It was developed as part of the AMOS project together with our industry partner Kubermatic. The project consists of a pipeline that gathers the necessary information about CNCF Landscape projects (including documentation, PDFs, YAML files, JSON files, READMEs, and corresponding StackOverflow question/answer pairs), creates a question/answer pair dataset from the collected data using Google Gemma, merges it with the gathered StackOverflow question/answer pairs, and fine-tunes the Google Gemma 2B IT, Google Gemma 7B IT, and Google Gemma-2 9B IT models on the resulting data.
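As an illustration of the question/answer generation step, the sketch below shows how a documentation chunk could be turned into a Q/A pair with the instruction-tuned Gemma model via Hugging Face transformers; the prompt wording and helper function are illustrative assumptions, not the project's actual pipeline code.

```python
# Illustrative sketch of the Q/A generation step (not the project's
# actual code). Assumes access to the gated google/gemma-2b-it
# checkpoint on the Hugging Face Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate_qa_pair(doc_chunk: str) -> str:
    """Ask Gemma to turn one documentation chunk into a Q/A pair."""
    prompt = (
        "Based on the following CNCF project documentation, write one "
        "question a user might ask and a concise answer.\n\n" + doc_chunk
    )
    # Gemma IT models expect their chat template with a single user turn.
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        return_tensors="pt",
        add_generation_prompt=True,
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=256)
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
```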
The repository includes:

- Full data gathering and processing pipeline
- Training pipeline
The pipeline produces the following datasets (a loading sketch follows the list):

- cncf-raw-data-for-llm-training: raw scraped PDF, README, JSON, documentation, and YAML data
- cncf-question-and-answer-dataset-for-llm-training: artificial question/answer pair dataset generated from the raw data using Google Gemma
- stackoverflow_QAs: real question/answer pair dataset gathered from StackOverflow; only a subset of the highest-voted questions is included
- Merged_QAs: merged artificial and real question/answer pair dataset
- Benchmark-Questions: multiple-choice question/answer pair dataset used to benchmark the fine-tuned models
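Assuming these datasets are published on the Hugging Face Hub, they could be loaded as sketched here; the `your-org` namespace is a placeholder, not the project's actual Hub organization.

```python
# Minimal sketch: load the generated Q/A dataset with the Hugging Face
# `datasets` library. "your-org" is a placeholder namespace.
from datasets import load_dataset

qa_data = load_dataset("your-org/cncf-question-and-answer-dataset-for-llm-training")
print(qa_data["train"][0])  # inspect one question/answer record
```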
The following fine-tuned models are available (a sketch of loading an adapter follows the list):

- DeepCNCF: initial model, fine-tuned from the Google Gemma 2B IT model
- DeepCNCFQuantized: quantized version of DeepCNCF
- DeepCNCF2BAdapter: fine-tuned Google Gemma 2B IT model, trained on the whole dataset
- DeepCNCF7BAdapter: fine-tuned Google Gemma 7B IT model, trained on the whole dataset
- DeepCNCF9BAdapter: fine-tuned Google Gemma-2 9B IT model, trained on the whole dataset
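The "Adapter" naming suggests parameter-efficient checkpoints (e.g. LoRA). Assuming they are PEFT adapters hosted on the Hugging Face Hub, attaching one to its Gemma base model could look like this; the `your-org` repository ID is a placeholder.

```python
# Sketch of attaching a fine-tuned adapter to its Gemma base model.
# Assumes PEFT/LoRA checkpoints; "your-org/DeepCNCF2BAdapter" is a
# placeholder repository ID, not a confirmed location.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "your-org/DeepCNCF2BAdapter")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
```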
The repository is structured as follows:

- Deliverables Contains all AMOS-specific homework deliverables, referenced by the sprint number in which they were due.
- Documentation Contains the documentation on how to run the project.
- src Contains all the source code of the project.
- src/hpc_scripts Contains scripts specifically tailored to run on the HPC (High-Performance Computing) cluster of FAU; this is mostly for interacting with LLMs.
- src/scripts Contains all general-purpose scripts (e.g. scraping data from the CNCF Landscape and StackOverflow, data formatting, deploying the model); a sketch of the StackOverflow step follows this list.
- src/landscape_scraper Contains scripts for scraping the webpages of the CNCF landscape.
- test Contains all unit tests and integration tests.
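The StackOverflow gathering step could, for instance, use the public Stack Exchange API to fetch the highest-voted questions for a tag, as in this sketch; the tag, page size, and function name are illustrative, not taken from the actual scripts.

```python
# Hedged sketch of gathering top-voted StackOverflow questions via the
# public Stack Exchange API. Tag, page size, and function name are
# illustrative; the project's actual scraper lives in src/scripts.
import requests

API = "https://api.stackexchange.com/2.3"

def top_questions(tag: str, page_size: int = 50) -> list[dict]:
    """Fetch the highest-voted StackOverflow questions for one tag."""
    resp = requests.get(
        f"{API}/questions",
        params={
            "site": "stackoverflow",
            "tagged": tag,
            "sort": "votes",       # rank by score, highest first
            "order": "desc",
            "pagesize": page_size,
            "filter": "withbody",  # include the question body text
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]

print(len(top_questions("kubernetes")))
```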
If you want to run the data gathering and training pipelines yourself, or use them to gather your own data, follow the steps provided in the Documentation.
Additional information can be found in the Wiki.