We are excited to announce a stable API release of Llama Stack, which enables developers to build RAG applications and agents using tools and safety shields, monitor those agents with telemetry, and evaluate them with scoring functions.
## Context
GenAI application developers need more than just an LLM: they need to integrate tools, connect with their data sources, establish guardrails, and ground the LLM's responses effectively. Currently, developers must piece together various tools and APIs, which complicates the development lifecycle and increases costs. The result is that developers spend more time on these integrations than on the application logic itself. The bespoke coupling of components also makes it challenging to adopt state-of-the-art solutions in the rapidly evolving GenAI space. This is particularly difficult for open models like Llama, since best practices are not widely established in the open.
Llama Stack was created to provide developers with a comprehensive and coherent interface that simplifies AI application development and codifies best practices across the Llama ecosystem. Since our launch in September 2024, we have seen a huge uptick in interest in the Llama Stack APIs from both AI developers and partners building AI services with Llama models. Partners like Nvidia, Fireworks, and Ollama have collaborated with us to develop implementations across various APIs, including inference, memory, and safety.
With Llama Stack, you can easily build a RAG agent that can also search the web, do complex math, and call custom tools. You can use telemetry to inspect agent traces and convert them into evaluation datasets. And with Llama Stack's plugin architecture and prepackaged distributions, you can run your agent anywhere: in the cloud with our partners, in your own environment using virtualenv, conda, or Docker, locally with Ollama, or even on mobile devices with our SDKs. Llama Stack offers unprecedented flexibility while also simplifying the developer experience.
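To make this concrete, here is a minimal sketch using the llama-stack-client Python SDK. The base URL, model identifier, and tool-group name are illustrative assumptions that depend on how your distribution is configured, and the SDK surface may shift between client versions, so treat this as a sketch rather than canonical usage:

```python
# Minimal sketch: an agent that can search the web via a built-in toolgroup.
# Assumptions: a Llama Stack server is running locally (the port is
# configurable), the distribution serves the model named below, and a
# web-search provider is configured with its API key.
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger

client = LlamaStackClient(base_url="http://localhost:8321")

agent = Agent(
    client,
    agent_config={
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "instructions": "You are a helpful assistant.",
        "toolgroups": ["builtin::websearch"],
        "enable_session_persistence": False,
    },
)
session_id = agent.create_session("web-search-demo")

# create_turn streams agent events (inference steps, tool calls, and the
# final answer); EventLogger pretty-prints them as they arrive.
response = agent.create_turn(
    messages=[{"role": "user", "content": "What is new in Llama Stack v0.1.0?"}],
    session_id=session_id,
)
for log in EventLogger().log(response):
    log.print()
```

The same Agent abstraction covers the RAG and custom-tool cases: swap `builtin::websearch` for a RAG toolgroup or one you registered yourself.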
## Release
After iterating on the APIs for the last three months, today we're launching a stable release (V1) of the Llama Stack APIs and the corresponding llama-stack server and client packages (v0.1.0). We now have automated tests that verify every provider implementation, so developers can easily and reliably select distributions or providers based on their specific requirements.
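As a quick illustration of that selection step, the client can introspect a running distribution before you commit to it. The sketch below assumes the v0.1.0-era Python SDK and a local server, so take the exact attribute names as approximate:

```python
# Sketch: list the providers and models a running distribution exposes.
# Base URL and attribute names are assumptions based on the v0.1.0 client.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

for provider in client.providers.list():
    print(provider.api, provider.provider_id, provider.provider_type)

for model in client.models.list():
    print(model.identifier)
```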
There are example standalone apps in llama-stack-apps.
## Key Features of this release
- **Unified API Layer** (a client-side sketch of these surfaces appears after the lists below)
  - Inference: Run LLM models
  - RAG: Store and retrieve knowledge for RAG
  - Agents: Build multi-step agentic workflows
  - Tools: Register tools that can be called by the agent
  - Safety: Apply content filtering and safety policies
  - Evaluation: Test model and agent quality
  - Telemetry: Collect and analyze usage data and complex agentic traces
  - Post Training (coming soon): Fine-tune models for specific use cases
- **Rich Provider Ecosystem**
  - Local Development: Meta's Reference, Ollama
  - Cloud: Fireworks, Together, Nvidia, AWS Bedrock, Groq, Cerebras
  - On-premises: Nvidia NIM, vLLM, TGI, Dell-TGI
  - On-device: iOS and Android support
- **Built for Production**
  - Pre-packaged distributions for common deployment scenarios
  - Backwards compatibility across model versions
  - Comprehensive evaluation capabilities
  - Full observability and monitoring
- **Multiple developer interfaces**
  - CLI: Command line interface
  - Python SDK
  - Swift iOS SDK
  - Kotlin Android SDK
- **Sample Llama Stack applications**
  - Python
  - iOS
  - Android
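Here is the promised sketch of the unified API layer. Each capability is a sibling surface on one client object, so swapping providers does not change call shapes. The model and shield identifiers are illustrative assumptions; use whatever your distribution has registered:

```python
# Sketch of the unified API layer: inference and safety live on one client.
# The identifiers below are assumptions; substitute what your distribution
# has registered (see client.models.list() / client.shields.list()).
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Inference: one call shape regardless of which provider serves the model.
completion = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain RAG in one sentence."}],
)
print(completion.completion_message.content)

# Safety: run the same content through a registered shield; the violation
# field is empty when the content passes.
result = client.safety.run_shield(
    shield_id="meta-llama/Llama-Guard-3-8B",
    messages=[{"role": "user", "content": "Explain RAG in one sentence."}],
    params={},
)
print(result.violation)
```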
## What's Changed
- [4/n][torchtune integration] support lazy load model during inference by @SLR722 in #620
- remove unused telemetry related code for console by @dineshyv in #659
- Fix Meta reference GPU implementation by @ashwinb in #663
- Fixed imports for inference by @cdgamarose-nv in #661
- fix trace starting in library client by @dineshyv in #655
- Add Llama 70B 3.3 to fireworks by @aidando73 in #654
- Tools API with brave and MCP providers by @dineshyv in #639
- [torchtune integration] post training + eval by @SLR722 in #670
- Fix post training apis broken by torchtune release by @SLR722 in #674
- Add missing venv option in --image-type by @terrytangyuan in #677
- Removed unnecessary CONDA_PREFIX env var in installation guide by @terrytangyuan in #683
- Add 3.3 70B to Ollama inference provider by @aidando73 in #681
- docs: update evals_reference/index.md by @eltociear in #675
- [remove import ][1/n] clean up import & in apis/ by @yanxi0830 in #689
- [bugfix] fix broken vision inference, change serialization for bytes by @yanxi0830 in #693
- Minor Quick Start documentation updates. by @derekslager in #692
- [bugfix] fix meta-reference agents w/ safety multiple model loading pytest by @yanxi0830 in #694
- [bugfix] fix prompt_adapter interleaved_content_convert_to_raw by @yanxi0830 in #696
- Add missing "inline::" prefix for providers in building_distro.md by @terrytangyuan in #702
- Fix failing flake8 E226 check by @terrytangyuan in #701
- Add missing newlines before printing the Dockerfile content by @terrytangyuan in #700
- Add JSON structured outputs to Ollama Provider by @aidando73 in #680
- [#407] Agents: Avoid calling tools that haven't been explicitly enabled by @aidando73 in #637
- Made changes to readme and pinning to llamastack v0.0.61 by @heyjustinai in #624
- [rag evals][1/n] refactor base scoring fn & data schema check by @yanxi0830 in #664
- [Post Training] Fix missing import by @SLR722 in #705
- Import from the right path by @SLR722 in #708
- [#432] Add Groq Provider - chat completions by @aidando73 in #609
- Change post training run.yaml inference config by @SLR722 in #710
- [Post training] make validation steps configurable by @SLR722 in #715
- Fix incorrect entrypoint for broken `llama stack run` by @terrytangyuan in #706
- Fix assert message and call to completion_request_to_prompt in remote:vllm by @terrytangyuan in #709
- Fix Groq invalid self.config reference by @aidando73 in #719
- support llama3.1 8B instruct in post training by @SLR722 in #698
- remove default logger handlers when using libcli with notebook by @dineshyv in #718
- move DataSchemaValidatorMixin into standalone utils by @yanxi0830 in #720
- add 3.3 to together inference provider by @yanxi0830 in #729
- Update CODEOWNERS - add sixianyi0721 as the owner by @sixianyi0721 in #731
- fix links for distro by @yanxi0830 in #733
- add --version to llama stack CLI & /version endpoint by @yanxi0830 in #732
- agents to use tools api by @dineshyv in #673
- Add X-LlamaStack-Client-Version, rename ProviderData -> Provider-Data by @ashwinb in #735
- Check version incompatibility by @ashwinb in #738
- Add persistence for localfs datasets by @VladOS95-cyber in #557
- Fixed typo in default VLLM_URL in remote-vllm.md by @terrytangyuan in #723
- Consolidating Memory tests under client-sdk by @vladimirivic in #703
- Expose LLAMASTACK_PORT in cli.stack.run by @terrytangyuan in #722
- remove conflicting default for tool prompt format in chat completion by @dineshyv in #742
- rename LLAMASTACK_PORT to LLAMA_STACK_PORT for consistency with other env vars by @raghotham in #744
- Add inline vLLM inference provider to regression tests and fix regressions by @frreiss in #662
- [CICD] github workflow to push nightly package to testpypi by @yanxi0830 in #734
- Replaced zrangebylex method in the range method by @cheesecake100201 in #521
- Improve model download doc by @SLR722 in #748
- Support building UBI9 base container image by @terrytangyuan in #676
- update notebook to use new tool defs by @dineshyv in #745
- Add provider data passing for library client by @dineshyv in #750
- [Fireworks] Update model name for Fireworks by @benjibc in #753
- Consolidating Inference tests under client-sdk tests by @vladimirivic in #751
- Consolidating Safety tests from various places under client-sdk by @vladimirivic in #699
- [CI/CD] more robust re-try for downloading testpypi package by @yanxi0830 in #749
- [#432] Add Groq Provider - tool calls by @aidando73 in #630
- Rename ipython to tool by @ashwinb in #756
- Fix incorrect Python binary path for UBI9 image by @terrytangyuan in #757
- Update Cerebras docs to include header by @henrytwo in #704
- Add init files to post training folders by @SLR722 in #711
- Switch to use importlib instead of deprecated pkg_resources by @terrytangyuan in #678
- [bugfix] fix streaming GeneratorExit exception with LlamaStackAsLibraryClient by @yanxi0830 in #760
- Fix telemetry to work on reinstantiating new lib cli by @dineshyv in #761
- [post training] define llama stack post training dataset format by @SLR722 in #717
- add braintrust to experimental-post-training template by @SLR722 in #763
- added support of PYPI_VERSION in stack build by @jeffxtang in #762
- Fix broken tests in test_registry by @vladimirivic in #707
- Fix fireworks run-with-safety template by @vladimirivic in #766
- Free up memory after post training finishes by @SLR722 in #770
- Fix issue when generating distros by @terrytangyuan in #755
- Convert `SamplingParams.strategy` to a union by @hardikjshah in #767
- [CICD] Github workflow for publishing Docker images by @yanxi0830 in #764
- [bugfix] fix llama guard parsing ContentDelta by @yanxi0830 in #772
- rebase eval test w/ tool_runtime fixtures by @yanxi0830 in #773
- More idiomatic REST API by @dineshyv in #765
- add nvidia distribution by @cdgamarose-nv in #565
- bug fixes on inference tests by @sixianyi0721 in #774
- [bugfix] fix inference sdk test for v1 by @yanxi0830 in #775
- fix routing in library client by @dineshyv in #776
- [bugfix] fix client-sdk tests for v1 by @yanxi0830 in #777
- fix nvidia inference provider by @yanxi0830 in #781
- Make notebook testable by @hardikjshah in #780
- Fix telemetry by @dineshyv in #787
- fireworks add completion logprobs adapter by @yanxi0830 in #778
- Idiomatic REST API: Inspect by @dineshyv in #779
- Idiomatic REST API: Evals by @dineshyv in #782
- Add notebook testing to nightly build job by @hardikjshah in #785
- [test automation] support run tests on config file by @sixianyi0721 in #730
- Idiomatic REST API: Telemetry by @dineshyv in #786
- Make llama stack build not create a new conda by default by @ashwinb in #788
- REST API fixes by @dineshyv in #789
- fix cerebras template by @yanxi0830 in #790
- [Test automation] generate custom test report by @sixianyi0721 in #739
- cerebras template update for memory by @yanxi0830 in #792
- Pin torchtune pkg version by @SLR722 in #791
- fix the code execution test in sdk tests by @dineshyv in #794
- add default toolgroups to all providers by @dineshyv in #795
- Fix tgi adapter by @yanxi0830 in #796
- Remove llama-guard in Cerebras template & improve agent test by @yanxi0830 in #798
- meta reference inference fixes by @ashwinb in #797
- fix provider model list test by @hardikjshah in #800
- fix playground for v1 by @yanxi0830 in #799
- fix eval notebook & add test to workflow by @yanxi0830 in #803
- add json_schema_type to ParamType deps by @dineshyv in #808
- Fixing small typo in quick start guide by @pmccarthy in #807
- cannot import name 'GreedySamplingStrategy' by @aidando73 in #806
- optional api dependencies by @ashwinb in #793
- fix vllm template by @yanxi0830 in #813
- More generic image type for OCI-compliant container technologies by @terrytangyuan in #802
- add mcp runtime as default to all providers by @dineshyv in #816
- fix vllm base64 image inference by @yanxi0830 in #815
- fix again vllm for non base64 by @yanxi0830 in #818
- Fix incorrect RunConfigSettings due to the removal of conda_env by @terrytangyuan in #801
- Fix incorrect image type in publish-to-docker workflow by @terrytangyuan in #819
- test report for v0.1 by @sixianyi0721 in #814
- [CICD] add simple test step for docker build workflow, fix prefix bug by @yanxi0830 in #821
- add section for mcp tool usage in notebook by @dineshyv in #831
- [ez] structured output for /completion ollama & enable tests by @sixianyi0721 in #822
- add pytest option to generate a functional report for distribution by @sixianyi0721 in #833
- bug fix for distro report generation by @sixianyi0721 in #836
- [memory refactor][1/n] Rename Memory -> VectorIO, MemoryBanks -> VectorDBs by @ashwinb in #828
- [memory refactor][2/n] Update faiss and make it pass tests by @ashwinb in #830
- [memory refactor][3/n] Introduce RAGToolRuntime as a specialized sub-protocol by @ashwinb in #832
- [memory refactor][4/n] Update the client-sdk test for RAG by @ashwinb in #834
- [memory refactor][5/n] Migrate all vector_io providers by @ashwinb in #835
- [memory refactor][6/n] Update naming and routes by @ashwinb in #839
- Fix fireworks client sdk chat completion with images by @hardikjshah in #840
- [inference api] modify content types so they follow a more standard structure by @ashwinb in #841
- fix experimental-post-training template by @SLR722 in #842
- Improved report generation for providers by @hardikjshah in #844
- [client sdk test] add options for inference_model, safety_shield, embedding_model by @sixianyi0721 in #843
- add distro report by @sixianyi0721 in #847
- Update Documentation by @hardikjshah in #838
- Update OpenAPI generator to output discriminator by @ashwinb in #848
- update docs for tools and telemetry by @dineshyv in #846
- Add vLLM raw completions API by @aidando73 in #823
- update doc for client-sdk testing by @sixianyi0721 in #849
- Delete docs/to_situate directory by @raghotham in #851
- Fixed distro documentation by @hardikjshah in #852
- remove getting started notebook by @dineshyv in #853
- More Updates to Read the Docs by @hardikjshah in #856
- Llama_Stack_Building_AI_Applications.ipynb -> getting_started.ipynb by @dineshyv in #854
- update docs for adding new API providers by @dineshyv in #855
- Add Runpod Provider + Distribution by @pandyamarut in #362
- Sambanova inference provider by @snova-edwardm in #555
- Updates to ReadTheDocs by @hardikjshah in #859
- sync readme.md to index.md by @dineshyv in #860
- More updates to ReadTheDocs by @hardikjshah in #861
- make default tool prompt format none in agent config by @dineshyv in #863
- update the client reference by @dineshyv in #864
- update python sdk reference by @dineshyv in #866
- remove logger handler only in notebook by @dineshyv in #868
- Update 'first RAG agent' in gettingstarted doc by @ehhuang in #867
## New Contributors
- @cdgamarose-nv made their first contribution in #661
- @eltociear made their first contribution in #675
- @derekslager made their first contribution in #692
- @VladOS95-cyber made their first contribution in #557
- @frreiss made their first contribution in #662
- @pmccarthy made their first contribution in #807
- @pandyamarut made their first contribution in #362
- @snova-edwardm made their first contribution in #555
- @ehhuang made their first contribution in #867
**Full Changelog**: v0.0.63...v0.1.0