
Commit 7503a09

Merge branch 'main' into patch-1
Signed-off-by: Bill Murdock <[email protected]>
2 parents b1e1a83 + d6f77b1 commit 7503a09

8 files changed

+206
-6
lines changed

.github/workflows/actionlint.yml

+1-1
@@ -30,7 +30,7 @@ jobs:
3030
runs-on: ubuntu-latest
3131
steps:
3232
- name: "Harden Runner"
33-
uses: step-security/harden-runner@91182cccc01eb5e619899d80e4e971d6181294a7 # v2.10.1
33+
uses: step-security/harden-runner@0080882f6c36860b6ba35c610c98ce87d4e2f26f # v2.10.2
3434
with:
3535
egress-policy: audit # TODO: change to 'egress-policy: block' after couple of runs
3636

.github/workflows/docs.yml

+2-2
@@ -33,14 +33,14 @@ jobs:
3333
runs-on: ubuntu-latest
3434
steps:
3535
- name: "Harden Runner"
36-
uses: step-security/harden-runner@91182cccc01eb5e619899d80e4e971d6181294a7 # v2.10.1
36+
uses: step-security/harden-runner@0080882f6c36860b6ba35c610c98ce87d4e2f26f # v2.10.2
3737
with:
3838
egress-policy: audit # TODO: change to 'egress-policy: block' after couple of runs
3939
- name: "Checkout"
4040
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
4141
with:
4242
fetch-depth: 0
4343
- name: "Check Markdown documents"
44-
uses: DavidAnson/markdownlint-cli2-action@db43aef879112c3119a410d69f66701e0d530809 # v17.0.0
44+
uses: DavidAnson/markdownlint-cli2-action@eb5ca3ab411449c66620fe7f1b3c9e10547144b0 # v18.0.0
4545
with:
4646
globs: '**/*.md'

.github/workflows/spellcheck.yml

+2-2
@@ -32,7 +32,7 @@ jobs:
3232
runs-on: ubuntu-latest
3333
steps:
3434
- name: "Harden Runner"
35-
uses: step-security/harden-runner@91182cccc01eb5e619899d80e4e971d6181294a7 # v2.10.1
35+
uses: step-security/harden-runner@0080882f6c36860b6ba35c610c98ce87d4e2f26f # v2.10.2
3636
with:
3737
egress-policy: audit # TODO: change to 'egress-policy: block' after couple of runs
3838

@@ -42,4 +42,4 @@ jobs:
4242
fetch-depth: 0
4343

4444
- name: Spellcheck
45-
uses: rojopolis/spellcheck-github-actions@74c2a1451c617e7dd9532340b199e18d5411b168 # v0.44.0
45+
uses: rojopolis/spellcheck-github-actions@403efe0642148e94ecb3515e89c767b85a32371a # v0.45.0

.github/workflows/stale_bot.yml

+1-1
@@ -24,7 +24,7 @@ jobs:
2424
runs-on: ubuntu-latest
2525
steps:
2626
- name: "Harden Runner"
27-
uses: step-security/harden-runner@91182cccc01eb5e619899d80e4e971d6181294a7 # v2.10.1
27+
uses: step-security/harden-runner@0080882f6c36860b6ba35c610c98ce87d4e2f26f # v2.10.2
2828
with:
2929
disable-sudo: true
3030
egress-policy: block

.markdownlint-cli2.yaml

+1
@@ -3,6 +3,7 @@
33
config:
44
line-length: false
55
no-emphasis-as-header: false
6+
no-emphasis-as-heading: false
67
first-line-heading: false
78
code-block-style: false
89
no-duplicate-header: false

.spellcheck-en-custom.txt

+11
@@ -1,6 +1,7 @@
11
# make spellcheck-sort
22
# Please keep this file sorted:
33
Abhishek
4+
ADR
45
Akash
56
AMDGPU
67
Anil
@@ -42,6 +43,7 @@ disambiguating
4243
ditaa
4344
Docling
4445
docstring
46+
downstreams
4547
dr
4648
Dropdown
4749
env
@@ -75,6 +77,7 @@ hipBLAS
7577
ilab
7678
impactful
7779
Inferencing
80+
init
7881
instantiation
7982
instructlab
8083
io
@@ -96,6 +99,7 @@ LLM
9699
llms
97100
LLVM
98101
lora
102+
Makefiles
99103
Markdownlint
100104
md
101105
Mergify
@@ -110,6 +114,7 @@ mlx
110114
MMLU
111115
modularize
112116
modularized
117+
Murdock
113118
Nakamura
114119
num
115120
NVidia
@@ -122,6 +127,7 @@ OpenStax
122127
optimizers
123128
orchestrator
124129
ots
130+
PaRAGon
125131
Params
126132
Pareja
127133
PEFT
@@ -137,7 +143,10 @@ postprocessing
137143
pre
138144
preprint
139145
preprocessing
146+
prereqs
147+
productize
140148
PR's
149+
Pydantic
141150
pyenv
142151
PyPI
143152
pyproject
@@ -154,6 +163,7 @@ Ren
154163
repo
155164
repos
156165
RHEL
166+
roadmapping
157167
ROCm
158168
RTX
159169
runtime
@@ -180,6 +190,7 @@ subcommand
180190
subcommands
181191
subdirectory
182192
Sudalairaj
193+
supportability
183194
Taj
184195
tatsu
185196
TBD

docs/instructlab-cli-1.0.0.md

+79
@@ -0,0 +1,79 @@
1+
# The Road to 1.0.0
2+
3+
_Or: How I Learned to Stop Worrying and Love to GA_
4+
5+
## Context and Goals
6+
7+
The `instructlab/instructlab` repo started off as `instructlab/cli` - a basic Python Click-based command-line interface designed to prototype an application capable of
8+
running the LAB methodology created by IBM Research. As the project evolved and the organization looked into creating a proper PyPI package for it, the decision was made
9+
to rename the repo to `instructlab/instructlab` to keep the repo name consistent with the PyPI package name. The rest of this document uses "InstructLab" to
10+
refer to this repo and Python package.
11+
12+
Today, InstructLab has gone from a scrappy research project to an upstream community serving as the basis for multiple downstreams, with the goal
13+
of continuing to evolve the community to encourage more participation from additional stakeholders. To that end, we should determine what exactly we should be
14+
roadmapping between now and a proper 1.0.0 release, which demonstrates the following to existing and potential community members:
15+
16+
1. An official goalpost for the community denoting the evolution of InstructLab from a pre-1.0 project into one with the stability and supportability typically expected of 1.0-and-beyond projects.
17+
1. A dedicated set of V1 interfaces, both for internal configs and an API, that can be counted on for continuous usage of InstructLab 1.0 with future provisions made for backwards compatibility for subsequent Y-Streams and Z-Streams.
18+
1. A commitment from the Oversight Committee and Maintainer teams to continue to maintain InstructLab throughout a 1.y cycle and work towards an eventual 2.0.
19+
20+
## MVP for an InstructLab 1.0.0
21+
22+
At a high-level, these are the items the Maintainer teams believe should serve as prereqs for releasing an InstructLab 1.0.0:
23+
24+
### Updating relevant references of "CLI" to "Core"
25+
26+
As noted in the `Context and Goals` section, InstructLab started off as just a CLI - however, we are planning for this package to serve as a more general "Engine" -
27+
being a place where a future REST API can be defined that is used by both the CLI aspect as well as an official GUI for orchestrating the entire LAB workflow. Despite
28+
this, the repo is often still referred to as "the CLI". We as an organization need a better term for this repo, and should update the relevant documentation
29+
and meetings accordingly.
30+
31+
An open community vote made as part of the drafting of this document decided that "Core" would be the new term used. You can see a record of the vote
32+
[here](https://github.com/instructlab/dev-docs/pull/159#issuecomment-2514885516). This name change will begin to go into effect after the merging of this document
33+
and should be completed by the time of a 1.0.0.
34+
35+
### A fully-realized configuration scheme, centered around the usage of system profiles
36+
37+
The InstructLab configuration scheme has transformed in many ways since the project's inception, from the `config.yaml` file that initially served as the user's config,
38+
to the addition of code-based Pydantic defaults, to train profiles, to system profiles. We need to fully decouple this config from the Click library, remove the need for
39+
a `config.yaml` file, and have a consistent config scheme that can be easily extended.
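
As one possible illustration of the layering described above, the sketch below resolves a config from code-based defaults, then a system profile, then explicit overrides, with no dependency on Click and no `config.yaml`. It uses standard-library dataclasses as a stand-in for Pydantic models, and every name in it (`CoreConfig`, `TrainConfig`, `SYSTEM_PROFILES`, `load_config`, the profile names and default values) is hypothetical rather than taken from the InstructLab code base.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class TrainConfig:
    # Code-based defaults (hypothetical values), so no config.yaml is required.
    batch_size: int = 8
    num_epochs: int = 1


@dataclass(frozen=True)
class CoreConfig:
    model_path: str = "models/default.gguf"  # hypothetical default path
    train: TrainConfig = field(default_factory=TrainConfig)


# Hypothetical system profiles: each lists only the values it overrides.
SYSTEM_PROFILES = {
    "laptop": {"train": {"batch_size": 2}},
    "gpu-server": {"train": {"batch_size": 32, "num_epochs": 4}},
}


def apply_overrides(config, overrides):
    """Return a copy of `config` with the nested `overrides` dict applied."""
    changes = {}
    for key, value in overrides.items():
        current = getattr(config, key)  # AttributeError flags unknown keys
        if isinstance(value, dict):
            changes[key] = apply_overrides(current, value)
        else:
            changes[key] = value
    return replace(config, **changes)


def load_config(profile=None, overrides=None):
    """Layer defaults, then a system profile, then explicit overrides."""
    config = CoreConfig()
    if profile is not None:
        config = apply_overrides(config, SYSTEM_PROFILES[profile])
    if overrides:
        config = apply_overrides(config, overrides)
    return config
```

Because `load_config` takes plain dictionaries, a CLI, a REST handler, or a test can all feed it the same way, which is the kind of decoupling this section asks for.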
40+
41+
### An official v1 REST API schema
42+
43+
We need to have a defined v1 REST API schema - while this does not preclude future updates, something mature enough to serve as a v1 API throughout subsequent Y-Streams
44+
for an InstructLab 1.0 is a must for such a milestone.
45+
46+
### Integration of InstructLab with RAG
47+
48+
RAG integration into InstructLab is currently being planned - that work should reach a stable state that adheres to our v1 API standard.
49+
50+
### An upgrade path to subsequent Y-Streams and an eventual 2.0
51+
52+
Any user wishing to install an InstructLab 1.0 must have an upgrade path to 1.1, 1.2, ..., 1.n. Upon being ready for an InstructLab 2.0, we should also be expecting to
53+
provide a path for users wishing to upgrade from our final 1.y stream to 2.0.
54+
55+
### Backwards compatibility across the 1.y stream
56+
57+
Any user going down our upgrade path described above should expect that the release they upgrade to is backwards-compatible with the release they upgrade from.
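
A minimal sketch of how that expectation could be expressed as a check, assuming plain `X.Y.Z` version strings and assuming the policy is "same X-Stream, never moving backwards". The function names are hypothetical and this is not an existing InstructLab API.

```python
def parse_version(text):
    """Parse an 'X.Y.Z' string into a tuple of three integers."""
    parts = text.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected X.Y.Z, got {text!r}")
    major, minor, patch = (int(part) for part in parts)
    return major, minor, patch


def expects_backwards_compat(installed, target):
    """True when `target` is a same-major release at or after `installed`.

    Assumed policy: compatibility is only promised within one X-Stream
    (e.g. 1.y), and only when upgrading forwards.
    """
    old = parse_version(installed)
    new = parse_version(target)
    return new[0] == old[0] and new >= old
```

Under this assumed policy, `expects_backwards_compat("1.0.0", "1.2.0")` is true, while a downgrade or a jump to 2.0.0 is not covered and would need its own documented upgrade path.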
58+
59+
### An official hardware support matrix
60+
61+
We need to have a documented matrix of what hardware footprints we support and to what extent - this includes hardware we know will not work, hardware that we know might
62+
work, and hardware we have confirmed will work with regular CI testing.
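
One way such a matrix could be kept machine-readable is sketched below, with the three tiers from the paragraph above modeled as an enum. The hardware names are placeholders, not real support claims, and `SupportLevel`, `HARDWARE_MATRIX`, and the helper functions are hypothetical.

```python
from enum import Enum


class SupportLevel(Enum):
    UNSUPPORTED = "known not to work"
    MIGHT_WORK = "might work, not covered by CI"
    CONFIRMED = "confirmed with regular CI testing"


# Placeholder entries only; the real matrix would list actual hardware.
HARDWARE_MATRIX = {
    "example-accelerator-a": SupportLevel.CONFIRMED,
    "example-accelerator-b": SupportLevel.MIGHT_WORK,
    "example-accelerator-c": SupportLevel.UNSUPPORTED,
}


def support_level(hardware):
    """Return the documented level, or None if the hardware is not listed."""
    return HARDWARE_MATRIX.get(hardware)


def render_markdown(matrix):
    """Render the matrix as a Markdown table, sorted by hardware name."""
    lines = ["| Hardware | Support |", "| -------- | -------- |"]
    for name in sorted(matrix):
        lines.append(f"| {name} | {matrix[name].value} |")
    return "\n".join(lines)
```

Keeping the matrix as data means the published table and any CI gating can be generated from the same source.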
63+
64+
### A robust CI ecosystem
65+
66+
We should have a CI ecosystem that includes linting as well as unit, functional, and integration/end-to-end (E2E) tests in the InstructLab repo, along with proper documentation and Makefiles that allow developers to easily run subsets of them locally on their machines.
67+
68+
## Q&A
69+
70+
**Q. What about the libraries? Will they go 1.0.0 as well?**
71+
72+
A. It depends - we historically have not aligned the InstructLab and Library releases on a particular version numbering scheme, apart from matching Y-Streams to Y-Streams (e.g., InstructLab 0.20 used SDG 0.4, Training 0.5, and Eval 0.3). At this stage, this document scopes only the prereqs we want for the InstructLab package.
73+
74+
## Conclusions and Decision Outcome
75+
76+
This document will be debated and updated as part of the Pull Request review process. Upon reaching a lazy consensus by the Oversight Committee and Maintainer teams, the author of this document (Nathan Weinberg) will merge the document, denoting the following:
77+
78+
1. The items in the above section `MVP for an InstructLab 1.0.0` will become official prerequisites for the InstructLab CLI Maintainer team to release a `1.0.0` of InstructLab.
79+
2. Any amendments to this list can only be made with a subsequent PR editing this document, subject to the same review process.

docs/rag/rag-initial-code-location.md

+109
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,109 @@
1+
# Code location for RAG
2+
3+
| Created | Dec 5, 2024 |
4+
| -------- | -------- |
5+
| Authors | Bill Murdock |
6+
| Replaces | N/A |
7+
| Replaced by | N/A |
8+
9+
## What
10+
11+
We want a retrieval-augmented generation (RAG) capability that provides outstanding results with minimal effort, is seamlessly integrated with InstructLab, and is also general enough to be used in other applications as well.
12+
13+
## Why
14+
15+
Many InstructLab users want to train a model and then use it for RAG. Often they build something simple themselves for this purpose. Two problems with this approach:
16+
17+
- Building their own RAG is extra work.
18+
- Users who are not experts on RAG might not build a RAG that provides outstanding results.
19+
20+
There is a very simple RAG capability at <https://github.com/instructlab/rag>. It is not tightly integrated with InstructLab and it does not use any advanced RAG capabilities. However, we have a request from a stakeholder to not just unilaterally delete it or replace it with something radically different.
21+
22+
## Goals
23+
24+
Provide a built-in alternative for users who do not want to build their own RAG. Keep the existing capability at <https://github.com/instructlab/rag> somewhere, but potentially somewhere other than it is now (e.g., in a new branch of the existing repository).
25+
26+
## Non-goals
27+
28+
Evaluation of RAG will be addressed in one or more other development documents. That topic is out of scope for this document.
29+
30+
## Decision
31+
32+
- For now, RAG will be located in its own directory, `src/instructlab/rag`, in the core InstructLab repository (<https://github.com/instructlab/instructlab>).
33+
34+
## How
35+
36+
### Phase 1
37+
38+
- RAG will be located in its own directory, `src/instructlab/rag`, in the core InstructLab repository (<https://github.com/instructlab/instructlab>).
39+
- This directory will include all of the following:
40+
- Loading the content from Docling-format JSON files (that are produced by SDG preprocessing).
41+
- Chunking that content to sizes that fit the requirements of the selected embedding model for vector database storage and retrieval.
42+
- Storing those chunks with their vector representations in a vector database.
43+
- End-to-end runtime RAG. The initial version of this includes the following:
44+
- Taking as input a session history (including a current user query) and providing a response (e.g., something along the lines of the [OpenAI chat completion API](https://platform.openai.com/docs/api-reference/chat/create)).
45+
- During that processing, it retrieves relevant search results from the vector database, it converts those into a prompt to send to the response generation model, it prompts that model, and it returns the response from that model.
46+
- This will be invoked from the existing `ilab` CLI, as described in the [RAG ingestion and chat pipelines](https://github.com/instructlab/dev-docs/pull/161) dev doc.
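
The Phase 1 steps above (chunk, store with vectors, retrieve, build a prompt, generate) can be sketched end to end with the standard library alone. This is a toy under stated assumptions, not the planned implementation: `embed` is a bag-of-words stand-in for a real embedding model, `InMemoryVectorStore` stands in for a real vector database, `generate` is any caller-supplied function, and loading Docling-format JSON is left out because its schema is not described here. All names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=50):
    """Split text into chunks of at most `max_words` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: lowercase token counts (stand-in for an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Stand-in for a vector database: stores (chunk, vector) pairs in a list."""

    def __init__(self):
        self._rows = []

    def add(self, chunk):
        self._rows.append((chunk, embed(chunk)))

    def search(self, query, top_k=2):
        query_vector = embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(query_vector, row[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(messages, store, top_k=2):
    """Retrieve context for the latest message and assemble a prompt string."""
    query = messages[-1]["content"]
    context = "\n".join(f"- {chunk}" for chunk in store.search(query, top_k))
    history = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    return f"Use the context to answer.\n\nContext:\n{context}\n\n{history}\nassistant:"


def answer(messages, store, generate):
    """Run retrieval, prompt the model via `generate`, and return its response."""
    return generate(build_prompt(messages, store))
```

The `messages` argument follows the role/content shape of a chat-style session history, so the last entry is treated as the current user query.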
47+
48+
### Future phases
49+
50+
- In the near future, RAG might be moved to the existing <https://github.com/instructlab/rag> repository.
51+
- If so, something will be done with the existing code in <https://github.com/instructlab/rag>, e.g., moving it to a branch of that repository or moving it to a different repository.
52+
- Alternatively, some or all of it might move to a new repository.
53+
- For example, maybe the indexing and retrieval portions move to a separate retrieval repository while the rest of end-to-end runtime RAG might move somewhere else.
54+
- If/when we move ahead with any of these options, *we will open a new ADR for that decision*.
55+
- Also, the capabilities will keep improving and adding more functionality.
56+
57+
## Alternatives
58+
59+
- Put the indexing and run-time RAG code in a new repository.
60+
- Pro: Having a dedicated repository gives the RAG team the most freedom and flexibility to make technical decisions that work for that team.
61+
- Pro: Starting with a new repository provides a blank slate that can be set up in whatever way makes the most sense for that functionality.
62+
- Pro: Having the capability in one repository makes it easier for consumers such as RamaLama to reuse it for their purposes too.
63+
- Con: Creating and configuring a new repository is some work. (This is a fairly small con, but a real one.)
64+
- Con: Integrating a new repository into the continuous integration and delivery capabilities for both upstream InstructLab and downstream consumers is a *lot* of work. This is a much bigger con.
65+
- Con: All that extra work would almost certainly result in slower time to market. This risks missing some market opportunities.
66+
- Put the indexing code in <https://github.com/instructlab/sdg> (SDG) and the run-time RAG code in <https://github.com/instructlab/instructlab> (core)
67+
- Pro: This has the advantage of not adding any new dependencies.
68+
- Pro: The document processing is already in SDG and chat functionality is already in core so this would require the fewest code changes.
69+
- Con: Splitting the RAG functionality across multiple repositories makes it more complicated to reuse in other applications outside of InstructLab.
70+
- Con: Many things we will want to do to add advanced functionality to make RAG more effective will require changes to both indexing and run-time RAG. If those components are split across multiple repositories, that will make delivering such changes more complicated.
71+
- Start by putting the code into existing InstructLab repositories (either of the above options) and then split it off into its own repository later.
72+
- Pro: Gets us integrated into InstructLab sooner.
73+
- Con: Adds extra work to the second phase where we have to split it off into its own repository.
74+
- Con: There is a risk that we never get around to splitting it off and we wind up stuck with the cons of being jammed into other components indefinitely.
75+
- Put the indexing and run-time RAG code in a new repo outside <https://github.com/instructlab/>.
76+
- Pro: This signals that this is not specific to InstructLab but is instead intended to be useful in a variety of applications. That makes it more likely the work could have broader impact.
77+
- Con: If we put this out there as something that is intended to be useful in a variety of applications, the pressure is on us to make sure it is differentiated from other broadly applicable RAG capabilities. Hopefully that will be true eventually, but it probably won't be true for a while. It might make more sense to give this some time to mature as a local component of InstructLab before trying to spin it off as its own thing.
78+
- Con: If we put it out there as its own open source project, that project needs all of the infrastructure of a full open source activity (governing structures, communication tools and protocols, etc.). That's a lot of work to set up. Keeping it inside InstructLab for now lets us keep using the infrastructure that InstructLab has for this purpose.
79+
- Con: If we put it out there as its own open source project, it needs a name. It is a lot of work to come up with a good name and there will be a lot of stakeholders with an interest in whatever name is chosen.
80+
- Keep the indexing and run-time RAG code in <https://github.com/redhat-et/PaRAGon> which is an emerging technologies prototype for this work.
81+
- Mostly the same pros and cons as putting it in a new repo outside InstructLab plus the following:
82+
- Pro: A prototype for the code we want is already there.
83+
- Pro: It already has its own distinctive name (PaRAGon).
84+
- Con: The existing repository has its own simple command-line interface which is useful for the prototype but we don't want it in the capability we release because too many command-line interfaces will confuse users.
85+
- Con: The name PaRAGon seems fine to me, but probably more stakeholders need to weigh in on what a name would be.
86+
- Con: The `redhat-et` label suggests that this is something "owned" by Red Hat which makes sense for the prototype but not so much for something we want a community to own in the long run.
87+
- Put the indexing and run-time RAG code in <https://github.com/instructlab/rag> AND keep the existing RAG functionality in that repository intact.
88+
- Pro: It already exists.
89+
- Pro: It avoids the confusion of having two different RAG repositories in <https://github.com/instructlab/>.
90+
- Con: It creates the confusion of having two different RAG solutions in the same repository. We could mitigate that with developer documentation and marking legacy stuff as "deprecated".
91+
- Put the indexing and run-time RAG code in <https://github.com/instructlab/rag> AND eliminate the existing RAG functionality in that repository.
92+
- Pro: It already exists.
93+
- Pro: It avoids the confusion of having two different RAG repositories in <https://github.com/instructlab/>.
94+
- Pro: It avoids the confusion of having two different RAG solutions since we'd be eliminating the old one.
95+
- Con: There is still some interest in keeping this around.
96+
97+
## Risks
98+
99+
- Putting the RAG functionality in the core repository requires any application that wants to use this functionality to bring in the entire core, which then brings in all of the libraries it depends on, so this becomes an enormous dependency. This discourages reuse in other applications. It *encourages* either of the following behaviors that would be unfortunate:
100+
- Other applications pull directly from <https://github.com/redhat-et/PaRAGon> and in doing so duplicate the ongoing effort to harden that code base.
101+
- Other applications may implement their own RAG solutions or pull from some other upstream unrelated to ours.
102+
- As noted earlier, putting the capability inside <https://github.com/instructlab/> signals that this is a component of InstructLab and not a generally useful feature. That creates a risk that the work could miss out on additional opportunities for impact. We hope to mitigate that risk by spinning it off to its own open source project when it is mature enough, but there is a risk that we will get distracted by other things and never get around to this.
103+
- The flow for document processing for InstructLab winds up being quite complicated in this proposal. Since the existing document processing is in SDG, indexing for RAG starts with a CLI call handled by the core repo, then goes to SDG for some of the document processing, and then comes back to the core `/data` directory, which then calls out to the `core/rag` directory for chunking and vector database indexing. Having the document processing move from core to SDG and back to core and forward to RAG makes that capability more difficult to understand and maintain. This complexity will be partially mitigated when the preprocessing code moves from SDG to core. It will be further mitigated by having a clear, well-documented contract between core and the RAG repository indicating the responsibilities of each.
104+
105+
## References
106+
107+
- <https://github.com/redhat-et/PaRAGon>
108+
- <https://github.com/instructlab>
109+
- <https://github.com/instructlab/rag>
