Commit 922c826

Updated README.md.

mattjhawken committed Dec 9, 2024 · 1 parent bf559c7
Showing 3 changed files with 76 additions and 49 deletions.
README.md: 89 changes (50 additions & 39 deletions)
## Plug-and-Play, Peer-to-Peer Neural Network Scaling for PyTorch

**Tensorlink** is a library designed to simplify the scaling of PyTorch model training and inference, offering tools
to easily distribute models across a network of peers and share computational resources both locally and globally. This
approach enables the training of large models from consumer hardware, eliminating the need for cloud services for
certain ML applications. Tensorlink leverages techniques such as automated model parsing and parallelism to
simplify and enhance the training process, making state-of-the-art models accessible to a wider audience.
preserving model workflows while seamlessly harnessing distributed resources. Tensorlink empowers individuals and
organizations to collaborate, share resources, and scale models dynamically—bringing the power of distributed training
to a broader community.

- `DistributedModel`: A flexible wrapper for `torch.nn.Module` designed to simplify distributed machine learning workflows.
- Provides methods for parsing, distributing, and integrating PyTorch models across devices.
- Supports standard model operations (e.g., `forward`, `backward`, `parameters`).
- Automatically manages partitioning and synchronization of model components across nodes.
- Seamlessly supports both data and model parallelism.

- `DistributedOptimizer`: An optimizer wrapper built for `DistributedModel` to ensure synchronized parameter updates across distributed nodes.
- Compatible with native PyTorch and Hugging Face optimizers.

- Node Types (`tensorlink.nodes`): Tensorlink provides three key node types to enable robust distributed machine learning workflows (a launch sketch follows this list):
- `UserNode`: Handles job submissions and result retrieval, facilitating interaction with `DistributedModel` for training and inference. Required for public network participation.
- `WorkerNode`: Manages active jobs, connections to users, and processes data for model execution.
- `ValidatorNode`: Secures and coordinates training tasks and node interactions, ensuring job integrity on the public network.

- **Public Computational Resources**: By default, Tensorlink nodes are integrated with a smart contract-secured network, enabling:
- Incentive mechanisms to reward contributors for sharing computational power.
- Access to both free and paid machine learning resources.
- Configuration options for private networks, supporting local or closed group machine learning workflows.
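
As a rough illustration of how these node types fit together, here is a minimal launch sketch modeled on `tests/ml/test_dist_train.py`; it assumes all three classes are importable from the top-level package (as `UserNode` is elsewhere in this README) and accept the constructor flags used in that test:

```python
import logging
import time

from tensorlink import UserNode, WorkerNode, ValidatorNode

# Launch one node of each type (flags mirror tests/ml/test_dist_train.py)
validator = ValidatorNode(upnp=True, off_chain_test=False, local_test=False, print_level=logging.DEBUG)
time.sleep(3)
user = UserNode(upnp=True, off_chain_test=False, local_test=False, print_level=logging.DEBUG)
time.sleep(3)
worker = WorkerNode(upnp=True, off_chain_test=False, local_test=False, print_level=logging.DEBUG)
```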

### Limitations in this Release

- Bugs, performance issues, and limited network availability are expected.
- **Model Support**: Tensorlink currently supports scriptable PyTorch models (`torch.jit.script`) and select open-source
Hugging Face models that don't require API keys (a scriptability check is sketched after this list).
- **Why?** Security and serialization constraints for untrusted P2P interactions. We're actively working on custom serialization methods to support all PyTorch model types. Feedback and contributions to accelerate this effort are welcome!
- **Job Constraints**:
- **Model Size**: Due to limited worker availability in this initial release, public jobs are best suited for models under ~1 billion parameters.
- **Future Plans**: We are actively expanding network capacity, and the next update (expected soon) will increase this limit, enabling support for larger models and more complex workflows.
- **Worker Allocation**: Public jobs are currently limited to one worker. Data parallel acceleration is temporarily disabled for public tasks but can be enabled for local jobs or private clusters.
- Internet latency and connection speeds can significantly impact the performance of public jobs, which may become problematic for certain training and inference scenarios.
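
Because model support hinges on scriptability, it's worth verifying that a model converts cleanly before submitting a job. A minimal check using standard TorchScript behavior (`torch.jit.script` raises when a model isn't scriptable):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU())  # stand-in for your model

try:
    torch.jit.script(model)  # raises if the model is not scriptable
    print("Model is scriptable and should be eligible for a Tensorlink job.")
except Exception as err:
    print(f"Model is not scriptable: {err}")
```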

## Training and Inference with Tensorlink

### Installation
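
Assuming the package is published on PyPI under the name `tensorlink` (the exact command is not shown in this excerpt), installation is presumably `pip install tensorlink`.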

This command will download and install Tensorlink along with its dependencies. If you're using a virtual environment
(recommended), ensure it's activated before running the installation command.

*Tensorlink aims to be compatible with all models and optimizers built with PyTorch; however, some compatibility
issues can be expected with the pre-alpha release.*

To get started, you must request a job. Requesting a job provides you with distributed model and optimizer objects.
The optimizer is instantiated with kwargs after the job request, leaving out model parameters. When requesting a job,
ensure that the request follows the instantiation of your model and precedes the training segment of your code:

```python
from tensorlink import UserNode
from transformers import AutoModelForCausalLM
from torch.optim import AdamW
from torch.nn import CrossEntropyLoss

# Initialize the model and loss function
model = AutoModelForCausalLM.from_pretrained("bert-base-uncased")
loss_fn = CrossEntropyLoss()

# Create a Tensorlink user node instance and request a job with your model
user = UserNode()
distributed_model, distributed_optimizer = user.create_distributed_model(
    model=model,
    training=True,
    optimizer_type=AdamW,
)
```

Once the job request is created, you'll be successfully connected to Tensorlink.

Here’s an example of a training loop that uses the distributed model:

```python
import torch
from torch.utils.data import DataLoader

# Select a device for input tensors (not defined elsewhere in this excerpt)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Training loop
epochs = 10
for epoch in range(epochs):
    # Iterate over the tokenized dataset. See tests/ml/useful_scripts.py
    for batch in DataLoader(tokenized_dataset["train"], batch_size=8):
        b_input_ids = batch['input_ids'].to(device)
        b_input_mask = batch['attention_mask'].to(device)
        b_labels = batch['label'].to(device)

        distributed_optimizer.zero_grad()
        outputs = distributed_model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)
        loss = outputs.loss
        loss.backward()
        distributed_optimizer.step()

    print(f"Epoch {epoch + 1}/{epochs} completed")
```
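
This excerpt only demonstrates training, but inference is also a target workload. A hypothetical sketch, assuming `create_distributed_model` accepts `training=False` and returns the same `(model, optimizer)` tuple, might look like this:

```python
import torch

# Hypothetical: request an inference-only job (assumes training=False is accepted)
distributed_model, _ = user.create_distributed_model(
    model=model,
    training=False,
)

# Run a forward pass without gradient tracking
with torch.no_grad():
    outputs = distributed_model(b_input_ids, attention_mask=b_input_mask)
```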

Training progress and network information will be trackable through the Tensorlink/Smartnodes dashboard.
This feature is a work in progress and is currently not available.

## Running a Node

Tensorlink can be configured for use on local or private networks, but its full potential lies in the public network,
where individuals from around the world contribute computational resources. By running a Worker node, you can:

- **Support Innovation:** Contribute to building a global decentralized compute network.
- **Earn Rewards:** Provide resources and receive Smartnodes tokens (SNO) for your contributions.
- **Join the Community:** Be part of an open-source project aiming to redefine distributed computing.


### How to Get Started
- Check the **Releases** section on GitHub for binaries or scripts to set up a node quickly and easily.
- Follow the included documentation to configure your node and start contributing to the Tensorlink network.

## Contributing

We welcome contributions from the community to help us build and enhance Tensorlink! There are many ways to get involved:
setup.py: 18 changes (17 additions & 1 deletion)

```python
from setuptools import setup, find_packages
import os

# Version of the package
VERSION = "0.1.0.post1"
# ... (collapsed lines omitted)


def get_long_description():
    with open("README.md", "r", encoding="utf-8") as f:
        content = f.read()

    # Adjust the content for PyPI: remove the last 4 lines and replace with a markdown link
    content_lines = content.splitlines()
    if len(content_lines) >= 4:
        content_lines = content_lines[:-4]  # Remove the last 4 lines
        content_lines.append(
            '[![Buy Me a Coffee](https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png)]'
            '(https://www.buymeacoffee.com/smartnodes)'
        )
    return "\n".join(content_lines)


# Parse requirements from requirements.txt
def parse_requirements(filename):
    """Load requirements from a pip requirements file."""
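    # The function body is collapsed in this diff; a typical implementation
    # (an assumption, not necessarily the project's exact code) might be:
    with open(filename, "r", encoding="utf-8") as f:
        return [
            line.strip()
            for line in f
            if line.strip() and not line.strip().startswith("#")
        ]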
author="Smartnodes Lab",
author_email="[email protected]",
description=DESCRIPTION,
long_description=open("README.md").read(),
long_description=get_long_description(),
long_description_content_type="text/markdown",
packages=find_packages(), # Automatically find packages in the current directory
include_package_data=True,
Expand Down
tests/ml/test_dist_train.py: 18 changes (9 additions & 9 deletions)

```python
if __name__ == "__main__":
    # Launch Nodes
    validator = ValidatorNode(upnp=True, off_chain_test=False, local_test=False, print_level=logging.DEBUG)
    time.sleep(3)
    user = UserNode(upnp=True, off_chain_test=False, local_test=False, print_level=logging.DEBUG)
    time.sleep(3)
    worker = WorkerNode(upnp=True, off_chain_test=False, local_test=False, print_level=logging.DEBUG)
    time.sleep(5)

    # Bootstrap roles
    val_key, val_host, val_port = validator.send_request("info", None)

    # worker.send_request("connect_node", (val_key, val_host, val_port))
    # time.sleep(1)
    # user.send_request("connect_node", (val_key, val_host, val_port))
    # time.sleep(1)

    device = "cuda" if torch.cuda.is_available() else "cpu"
```
