
"There is no passion to be found playing small - in settling for a life that is less than the one you are capable of living." - Nelson Mandela

Distributed Training Framework

Introduction

This project introduces a cutting-edge approach to distributed deep learning, utilizing the Bittensor network. Our method incentivizes participants by rewarding the generation of optimal weights that contribute significantly to minimizing the overall loss of the base model.

To streamline the process and reduce communication overhead between miners, we integrate Hugging Face as a central hub. This serves as an intermediary, facilitating efficient miner-validator communications without the complexities of direct exchanges.

Key Components:

  • Miners: Each miner trains the base model and produces a weight delta, the difference between the weights of its trained model and the base model. This delta is then uploaded to Hugging Face, from where it can be accessed by validators.
  • Validators: Validators assess the loss reduction achieved by each miner on a randomized test set. They download the weight deltas from Hugging Face and evaluate them by their impact on the model's performance, focusing on metrics such as loss reduction and accuracy. Better-performing miners that improve on the base model are assigned higher scores.
  • Averager: We also introduce an averager node, a centralized node run by the subnet owner. The averager creates the averaged model that becomes the new base model for miners and validators; this is repeated every averaging interval. It performs a weighted average of the parameters, producing the averaged model (see the sketch after this list). Currently, the weights of the weighted average are themselves parameterized, allowing the process to be optimized to find the best averaged model.
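The sketch below is a minimal illustration of the weight-delta and weighted-averaging mechanics described above, not the project's actual implementation; it assumes PyTorch models, and the helper names compute_delta and apply_weighted_average are hypothetical.

# Illustrative sketch only: weight delta (miner side) and weighted averaging
# (averager side) over PyTorch state dicts. Names are hypothetical.
import torch

def compute_delta(trained_model, base_model):
    # Miner side: delta = trained weights - base weights, per parameter tensor.
    base = base_model.state_dict()
    return {name: p - base[name] for name, p in trained_model.state_dict().items()}

def apply_weighted_average(base_model, deltas, weights):
    # Averager side: add a normalized, weighted combination of miner deltas to the base model.
    averaged = {name: p.clone() for name, p in base_model.state_dict().items()}
    total = sum(weights)
    for delta, weight in zip(deltas, weights):
        for name, d in delta.items():
            if averaged[name].is_floating_point():  # skip integer buffers (e.g. BatchNorm counters)
                averaged[name] += (weight / total) * d
    base_model.load_state_dict(averaged)
    return base_model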

Clone the Repo

git clone https://github.com/bit-current/DistributedTraining

Move into the Repo

cd DistributedTraining

Remove Previous Hivetrain Installation

pip uninstall hivetrain

Install Repo + Requirements

pip install -e .

Hugging Face

Continue setting up by following these steps:

1. Create a Hugging Face Account

If you don't already have a Hugging Face account, you'll need to create one:

Visit https://huggingface.co to sign up.

2. Create a Hugging Face Model Repository (For miners only)

Once you have your Hugging Face account, you need to create a model repository:

  • Navigate to your profile by clicking on your username in the top right corner.
  • Click on "New Model" (you may find this button under the "Models" section if you have existing models).
  • Fill in the repository name, description, and set the visibility to public.
  • Click on "Create Model" to establish your new model repository.

3. Generate a Token for the Repository (For miners and validators)

To allow programmatic communication with Hugging Face, you will need to generate an authentication token:

  • From your Hugging Face account, go to "Settings" by clicking on your profile icon.
  • Select the "Access Tokens" tab from the sidebar.
  • Click on "New Token".
  • Name your token and select "write" access so you can upload changes.
  • Click on "Create Token".

4. Create a New .env File to Store Your Hugging Face Token

Create (or open) a .env file in the DistributedTraining directory and store your new token there:

HF_TOKEN="your_huggingface_token_here"

or, from the terminal, run:

echo "HF_TOKEN=your_huggingface_token_here" >> .env
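As a minimal sketch (assuming the python-dotenv and huggingface_hub packages are available), the token stored in .env can be loaded and used to authenticate a session like this:

# Illustrative only: read HF_TOKEN from .env and authenticate with the Hugging Face Hub.
import os
from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv()                         # loads HF_TOKEN from the .env file into the environment
login(token=os.environ["HF_TOKEN"])   # authenticates this session for uploads and downloads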

5. Install git-lfs to handle the upload of large files

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt install git-lfs

Load Wallets and Register to Subnet

btcli regen_coldkey --mnemonic your super secret mnemonic
btcli regen_hotkey --mnemonic your super secret mnemonic
btcli s register --netuid 25

New Arguments

storage.averaged_model_repo_id: The repo used by the averager. Currently this is Hivetrain/averaging_run_1. It changes with each training run; check the Discord channel for updates.
storage.my_repo_id: The repo ID of the repo used only by the miner to upload its trained model weight delta (an illustrative upload sketch follows below).
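For illustration only (the miner code handles uploads itself), a weight delta could be pushed to the miner's repo with huggingface_hub roughly as follows; the file name and repo ID here are placeholders:

# Illustrative only: save a weight delta and upload it to storage.my_repo_id.
import torch
from huggingface_hub import HfApi

delta = compute_delta(trained_model, base_model)    # see the sketch under Key Components
torch.save(delta, "weight_delta.pt")                # hypothetical file name

HfApi().upload_file(
    path_or_fileobj="weight_delta.pt",
    path_in_repo="weight_delta.pt",
    repo_id="your_hf_username/your_repo",           # the value passed as --storage.my_repo_id
)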

Miner Run Command

python neurons/miner.py --netuid 25 --wallet.name wallet_name --wallet.hotkey hotkey_name --storage.my_repo_id your_hf_username/your_repo --storage.averaged_model_repo_id Hivetrain/averaging_run_1

Validator

Validators need at least 1000 TAO to set weights on the main net and 10 TAO on the test net.

python neurons/validator.py --netuid 25 --wallet.name wallet_name --wallet.hotkey hotkey_name --storage.averaged_model_repo_id Hivetrain/averaging_run_1

Bug Reporting and Contributions

  • Reporting Issues: Use the GitHub Issues tab to report bugs, providing detailed steps to reproduce along with relevant logs or error messages.
  • Contributing: Contributions are welcome! Fork the repo, make changes, and submit a pull request. Try to break it in as many ways as possible to help make the system resilient.

Communication and Support

License

Licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Thanks to the PyTorch team for their deep learning library.
  • Gratitude to Bittensor for enabling decentralized computing and finance with TAO rewards.