praxis

Praxis is the process by which a theory, lesson, or skill is enacted, embodied, realized, applied, or put into practice.


(screenshot: Terminal)


description

The Praxis platform is an ever-evolving, local-first, peer-to-peer, flexible, modular, extensible, and decentralized framework for the practice of computational alchemy. With Hivemind integrated directly into its core (and constantly broken), we are building a multi-modal fleet of AI agents that are small and simple, easy to parallelize, fault-tolerant, portable, and performant at a scale of a hundred or a thousand peers. We will achieve this via a remote mixture of experts, user-weighted multipath routing, symbolic decision-making, and prayer (asynchronous).

In short: Praxis is a robust, open-source language model that can be anything and do everything.

features
  • A Mixture of Depths allows us to route just a subset of the tokens in a sequence through a layer, and on to remote peers - reducing the time required for remote computation. All other tokens bypass the layer via a residual connection (see the sketch after this list).
  • We implement Multi-head Latent Attention, originally introduced in DeepSeek-V2.
  • LayerShuffle showed that transformers can maintain coherence even when their layers are shuffled at every forward pass. We take this a step further and implement the PraxisController, which teaches the model to predict an optimal route through expert layers during inference. The ability to work with out-of-order layers is crucial in a decentralized architecture, where some peers may fail, disappear, become overloaded, be undertrained, or otherwise be penalized for one reason or another.
  • As an alternative to LayerShuffle's controller, we have an experiment that implements elements from Graphformer, teaching the model to route through layers as if they were nodes in a graph.
  • In addition to the shuffling, we implement a simplified version of CALM, which allows the model to exit early from computation.
  • We implement RoPE, ALiBi, and NoPE as options for positional encoding, because they're simple, work well at sane context lengths, and require few to no trainable parameters.
  • Differential Attention is used to reduce hallucinations, shrink the parameter count required for attention, and filter out noise in attention maps. Alternatively (and perhaps in addition, in the future), we implement an option for Stickbreaking Attention, which naturally encodes positional information and uses a Sigmoid-based mechanism instead of a Softmax (i.e. parameters "work together", instead of "competing" against each other). We also implement various methods from MEGA, including Exponential Moving Average-based attention gating and Gated Single-Head Attention modules.
  • Parameter-Efficient Expert Retrieval (PEER) from the Mixture of a Million Experts paper. Here, feedforward layers are replaced with a swarm of singleton MLP networks.
  • While simple, a Soft-Merging of Experts with Adaptive Routing class allows us to dynamically route through a dense feedforward layer, while maintaining differentiability and enhancing expressivity.
  • We support Infini-Attention, from Leave No Context Behind, to reduce the O(n^2) memory complexity of transformer attention to O(n). This is the same technique that Google uses in Gemini.
  • We have a Kolmogorov-Arnold Networks experiment, which replaces MLPs with KANs.
  • We implement an optional Byte Latent Tokenizer, which allows us to represent text as patches of byte sequences, instead of discrete tokens. This way, we can remove the tokenizer and represent data in much more interesting ways within the latent space.
  • We support Hyper-Connections, which are an alternative to residual connections.
  • There's also a mobile app and remote controller, called "Axis", built with Godot.
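
For intuition, here is a minimal, self-contained sketch of the Mixture-of-Depths routing idea described above (illustrative only; the MixtureOfDepths class, its router, and the capacity parameter are names invented for this sketch, not Praxis APIs):

import torch
import torch.nn as nn

class MixtureOfDepths(nn.Module):
    """Route only the top-k tokens through a layer; the rest skip it."""

    def __init__(self, layer: nn.Module, hidden_size: int, capacity: float = 0.125):
        super().__init__()
        self.layer = layer                       # any block mapping [batch, seq, hidden] -> same shape
        self.router = nn.Linear(hidden_size, 1)  # scores each token for routing
        self.capacity = capacity                 # fraction of tokens that pass through the layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, hidden = x.shape
        k = max(1, int(seq_len * self.capacity))

        # Score every token, then keep only the top-k per sequence.
        scores = self.router(x).squeeze(-1)                # [batch, seq]
        weights, indices = torch.topk(scores, k, dim=-1)   # [batch, k]

        # Gather the selected tokens and run only those through the layer.
        gather_idx = indices.unsqueeze(-1).expand(-1, -1, hidden)
        selected = torch.gather(x, 1, gather_idx)
        processed = self.layer(selected) * torch.sigmoid(weights).unsqueeze(-1)

        # Add the processed tokens back at their original positions; all other
        # tokens bypass the layer entirely via the residual connection.
        return x.scatter_add(1, gather_idx, processed)

# usage: wrap any layer, e.g. a standard transformer encoder block
block = MixtureOfDepths(nn.TransformerEncoderLayer(256, 8, batch_first=True), hidden_size=256)
print(block(torch.randn(2, 128, 256)).shape)  # torch.Size([2, 128, 256])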
installation, configuration, and usage

install

Set up a virtual environment:

source venv.sh

Or, you may use the VS Code command palette (Ctrl + Shift + P) and choose: Python: Create Environment...

Then, install dependencies:

pip install -e .

run tests

To run the unit tests:

pytest tests -x

contribute to the swarm

To run with default settings:

python run.py

To view all supported command-line arguments:

python run.py --help

recommendations

We recommend using a batch size of at least 16, if possible. We have implemented an oversampling mechanism that periodically multiplies your sequence length, and it scales quadratically with batch size (i.e. at batch sizes of 1, 4, 16, 64, etc.).

We also recommend using an Nvidia GPU.

python run.py --batch-size 16 --device cuda

showcase

do inference

Send a JSON-encoded payload via POST to:

http://localhost:2100/input

The payload may include any of the arguments supported by the Transformers text generation API.

Example request:

import requests

url = "http://localhost:2100/input"
payload = {"prompt": "Once upon a time, ", "do_sample": True, "temperature": 0.7}

response = requests.post(url, json=payload)

print(response.status_code)
print(response.json())

local web chat

A chat and swarm management interface is available here:

http://localhost:2100

(screenshot: Praxis Chat)

mobile app

We're building a mobile app to control your experts! You can find that code in the ./axis directory.

to register with transformers

from transformers import AutoConfig, AutoModel, AutoModelForCausalLM, AutoTokenizer
from praxis import PraxisConfig, PraxisForCausalLM, PraxisModel

AutoConfig.register("praxis", PraxisConfig)
AutoModel.register(PraxisConfig, PraxisModel)
AutoModelForCausalLM.register(PraxisConfig, PraxisForCausalLM)

config = PraxisConfig(
    embed_size=512,
    hidden_size=384,
    depth=6,
    num_heads=8,
    device_map="cuda:0",
)

tokenizer_model = "UNSAFE/praxis-4096"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model)

model = AutoModelForCausalLM.from_config(config)

input_ids = tokenizer.encode("The quick brown fox ", return_tensors="pt")

outputs = model.generate(input_ids, do_sample=True)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# --> The quick brown fox jumped over a lazy dog.
notes, goals, and observations
won't do
  • cryptocurrency
community

limitations

  • You will quickly run into rate limits with the Hugging Face Datasets API. This is because anonymous users tend to be bots, and are subjected to severe restrictions. To alleviate this problem, install huggingface-cli and authenticate with a real user account; Praxis will use those credentials automatically (see the snippet after this list).
  • Praxis is a fluid architecture. Whatever decentralized solution we implement must respect the fact that peers are independent models, and it will need to convert intermediate tensors into a single, standardized format for every remote operation. Data passed to remote peers will likely need reduction/projection, at a minimum.
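
For example, a minimal way to authenticate from Python, assuming the huggingface_hub package is installed (running huggingface-cli login from a terminal is equivalent):

# Authenticate with a real Hugging Face account to avoid anonymous rate limits.
from huggingface_hub import login

login()  # prompts for an access token from https://huggingface.co/settings/tokens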
