Merge branch 'main' into patch-1

jwm4 · web-flow · commit 7503a09da406 · 2024-12-17T11:02:01.000-05:00
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
diff --git a/.github/workflows/actionlint.yml b/.github/workflows/actionlint.yml
@@ -30,7 +30,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: "Harden Runner"
-        uses: step-security/harden-runner@91182cccc01eb5e619899d80e4e971d6181294a7 # v2.10.1
+        uses: step-security/harden-runner@0080882f6c36860b6ba35c610c98ce87d4e2f26f # v2.10.2
         with:
           egress-policy: audit # TODO: change to 'egress-policy: block' after couple of runs
 
diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
@@ -33,14 +33,14 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: "Harden Runner"
-        uses: step-security/harden-runner@91182cccc01eb5e619899d80e4e971d6181294a7 # v2.10.1
+        uses: step-security/harden-runner@0080882f6c36860b6ba35c610c98ce87d4e2f26f # v2.10.2
         with:
           egress-policy: audit # TODO: change to 'egress-policy: block' after couple of runs
       - name: "Checkout"
         uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
         with:
           fetch-depth: 0
       - name: "Check Markdown documents"
-        uses: DavidAnson/markdownlint-cli2-action@db43aef879112c3119a410d69f66701e0d530809 # v17.0.0
+        uses: DavidAnson/markdownlint-cli2-action@eb5ca3ab411449c66620fe7f1b3c9e10547144b0 # v18.0.0
         with:
           globs: '**/*.md'
diff --git a/.github/workflows/spellcheck.yml b/.github/workflows/spellcheck.yml
@@ -32,7 +32,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: "Harden Runner"
-        uses: step-security/harden-runner@91182cccc01eb5e619899d80e4e971d6181294a7 # v2.10.1
+        uses: step-security/harden-runner@0080882f6c36860b6ba35c610c98ce87d4e2f26f # v2.10.2
         with:
          egress-policy: audit # TODO: change to 'egress-policy: block' after couple of runs
 
@@ -42,4 +42,4 @@ jobs:
           fetch-depth: 0
 
       - name: Spellcheck
-        uses: rojopolis/spellcheck-github-actions@74c2a1451c617e7dd9532340b199e18d5411b168 # v0.44.0
+        uses: rojopolis/spellcheck-github-actions@403efe0642148e94ecb3515e89c767b85a32371a # v0.45.0
diff --git a/.github/workflows/stale_bot.yml b/.github/workflows/stale_bot.yml
@@ -24,7 +24,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: "Harden Runner"
-        uses: step-security/harden-runner@91182cccc01eb5e619899d80e4e971d6181294a7 # v2.10.1
+        uses: step-security/harden-runner@0080882f6c36860b6ba35c610c98ce87d4e2f26f # v2.10.2
         with:
           disable-sudo: true
           egress-policy: block
diff --git a/.markdownlint-cli2.yaml b/.markdownlint-cli2.yaml
@@ -3,6 +3,7 @@
 config:
   line-length: false
   no-emphasis-as-header: false
+  no-emphasis-as-heading: false
   first-line-heading: false
   code-block-style: false
   no-duplicate-header: false
diff --git a/.spellcheck-en-custom.txt b/.spellcheck-en-custom.txt
@@ -1,6 +1,7 @@
 # make spellcheck-sort
 # Please keep this file sorted:
 Abhishek
+ADR
 Akash
 AMDGPU
 Anil
@@ -42,6 +43,7 @@ disambiguating
 ditaa
 Docling
 docstring
+downstreams
 dr
 Dropdown
 env
@@ -75,6 +77,7 @@ hipBLAS
 ilab
 impactful
 Inferencing
+init
 instantiation
 instructlab
 io
@@ -96,6 +99,7 @@ LLM
 llms
 LLVM
 lora
+Makefiles
 Markdownlint
 md
 Mergify
@@ -110,6 +114,7 @@ mlx
 MMLU
 modularize
 modularized
+Murdock
 Nakamura
 num
 NVidia
@@ -122,6 +127,7 @@ OpenStax
 optimizers
 orchestrator
 ots
+PaRAGon
 Params
 Pareja
 PEFT
@@ -137,7 +143,10 @@ postprocessing
 pre
 preprint
 preprocessing
+prereqs
+productize
 PR's
+Pydantic
 pyenv
 PyPI
 pyproject
@@ -154,6 +163,7 @@ Ren
 repo
 repos
 RHEL
+roadmapping
 ROCm
 RTX
 runtime
@@ -180,6 +190,7 @@ subcommand
 subcommands
 subdirectory
 Sudalairaj
+supportability
 Taj
 tatsu
 TBD
diff --git a/docs/instructlab-cli-1.0.0.md b/docs/instructlab-cli-1.0.0.md
@@ -0,0 +1,79 @@
+# The Road to 1.0.0
+
+_Or: How I Learned to Stop Worrying and Love to GA_
+
+## Context and Goals
+
+The `instructlab/instructlab` repo started off as `instructlab/cli` - a basic Python Click-based command-line interface designed to prototype an application capable of
+running the LAB methodology created by IBM Research. As the project evolved and the organization looked into creating a proper PyPI package for it, the decision was made
+to rename the repo to `instructlab/instructlab` to keep the repo name consistent with the PyPI package name. The rest of this document will use "InstructLab" to
+refer to this repo and Python package.
+
+Today, InstructLab has gone from a scrappy research project to an upstream community serving as the basis for multiple downstreams, with the goal
+of continuing to evolve the community to encourage more participation from additional stakeholders. To that end, it would behoove us to determine what exactly we should be
+roadmapping between now and a proper 1.0.0 release, which demonstrates the following to existing and potential community members:
+
+1. An official goalpost for the community denoting the evolution of InstructLab from a pre-1.0 project, lacking the stability and supportability typically seen in 1.0-and-beyond projects, into one that has them.
+1. A dedicated set of V1 interfaces, both for internal configs and an API, that can be counted on for continuous usage of InstructLab 1.0 with future provisions made for backwards compatibility for subsequent Y-Streams and Z-Streams.
+1. A commitment from the Oversight Committee and Maintainer teams to continue to maintain InstructLab throughout a 1.y cycle and work towards an eventual 2.0.
+
+## MVP for an InstructLab 1.0.0
+
+At a high-level, these are the items the Maintainer teams believe should serve as prereqs for releasing an InstructLab 1.0.0:
+
+### Updating relevant references of "CLI" to "Core"
+
+As noted in the `Context and Goals` section, InstructLab started off as just a CLI - however, we are planning for this package to serve as a more general "Engine" -
+being a place where a future REST API can be defined that is used by both the CLI aspect as well as an official GUI for orchestrating the entire LAB workflow. Despite
+this, the repo is often still referred to as "the CLI". We as an organization need a better term for this repo, and should adapt the relevant documentation
+and meetings accordingly.
+
+An open community vote made as part of the drafting of this document decided that "Core" would be the new term used. You can see a record of the vote
+[here](https://github.com/instructlab/dev-docs/pull/159#issuecomment-2514885516). This name change will begin to go into effect after the merging of this document
+and should be completed by the time of a 1.0.0.
+
+### A fully-realized configuration scheme, centered around the usage of system profiles
+
+The InstructLab configuration scheme has transformed in many ways since the project's inception, from the `config.yaml` file that initially served as the user's config,
+to the addition of code-based Pydantic defaults, to train profiles, to system profiles. We need to fully decouple this config from the Click library, remove the need for
+a `config.yaml` file, and have a consistent config scheme that can be easily extended.
+
+### An official v1 REST API schema
+
+We need to have a defined v1 REST API schema - while this does not preclude future updates, something mature enough to serve as a v1 API throughout subsequent Y-Streams
+for an InstructLab 1.0 is a must for such a milestone.
+
+### Integration of InstructLab with RAG
+
+RAG integration into InstructLab is currently being planned - that work should be in a stable state adhering to our v1 API standard.
+
+### An upgrade path to subsequent Y-Streams and an eventual 2.0
+
+Any user wishing to install an InstructLab 1.0 must have an upgrade path to 1.1, 1.2, ..., 1.n. Once we are ready for an InstructLab 2.0, we should also expect to
+provide a path for users wishing to upgrade from our final 1.y stream to 2.0.
+
+### Backwards compatibility across the 1.y stream
+
+Any user going down our upgrade path described above should expect that the release they upgrade to is backwards-compatible with the release they upgrade from.
+
+### An official hardware support matrix
+
+We need to have a documented matrix of what hardware footprints we support and to what extent - this includes hardware we know will not work, hardware that we know might
+work, and hardware we have confirmed will work with regular CI testing.
+
+### A robust CI ecosystem
+
+We should have a CI ecosystem that includes linting as well as unit, functional, and integration/end-to-end (E2E) tests in the InstructLab repo, along with proper documentation and Makefiles that allow developers to easily run subsets of them locally on their machines.
+
+## Q&A
+
+**Q. What about the libraries? Will they go 1.0.0 as well?**
+
+A. It depends - we historically have not aligned the InstructLab and Library releases on a particular version numbering scheme, apart from matching Y-Streams to Y-Streams (e.g., InstructLab 0.20 used SDG 0.4, Training 0.5, and Eval 0.3). At this stage, this document scopes only the prereqs we want for the InstructLab package.
+
+## Conclusions and Decision Outcome
+
+This document will be debated and updated as part of the Pull Request review process. Upon reaching a lazy consensus by the Oversight Committee and Maintainer teams, the author of this document (Nathan Weinberg) will merge the document, denoting the following:
+
+1. The items in the above section `MVP for an InstructLab 1.0.0` will become official prerequisites for the InstructLab CLI Maintainer team to release a `1.0.0` of InstructLab.
+2. Any amendments to this list can only be made with a subsequent PR editing this document, subject to the same review process.
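Illustrative sketch for the upgrade-path and backwards-compatibility items in `docs/instructlab-cli-1.0.0.md` (not part of this patch). It assumes plain `X.Y.Z` version strings and treats an upgrade as supported when the target is in the same major stream and is not older than the installed release. The names `Version`, `parse_version`, and `is_supported_upgrade` are hypothetical, and the 1.y to 2.0 path is left out because the document does not define it yet.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """Hypothetical X.Y.Z version triple (major, Y-Stream, Z-Stream)."""
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    # Assumes exactly three dot-separated integers, e.g. "1.2.0".
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def is_supported_upgrade(current: str, target: str) -> bool:
    """Return True when target is in the same major stream and not older than current."""
    cur, tgt = parse_version(current), parse_version(target)
    # Tuples compare element by element, so (1, 2, 0) >= (1, 0, 0) holds.
    return tgt.major == cur.major and tgt >= cur
```

Under these assumptions, `is_supported_upgrade("1.0.0", "1.2.0")` is accepted, while a downgrade such as `"1.2.0"` to `"1.1.0"` or a major jump such as `"1.4.0"` to `"2.0.0"` is rejected until a separate migration path is specified.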
diff --git a/docs/rag/rag-initial-code-location.md b/docs/rag/rag-initial-code-location.md
@@ -0,0 +1,109 @@
+# Code location for RAG
+
+| Created  | Dec 5, 2024 |
+| -------- | -------- |
+| Authors | Bill Murdock |
+| Replaces | N/A |
+| Replaced by | N/A |
+
+## What
+
+We want a retrieval-augmented generation (RAG) capability that provides outstanding results with minimal effort, is seamlessly integrated with InstructLab, and is also general enough to be used in other applications as well.
+
+## Why
+
+Many InstructLab users want to train a model and then use it for RAG.  Often they build something simple themselves for this purpose.  Two problems with this approach:
+
+- Building their own RAG is extra work.
+- Users who are not experts on RAG might not build a RAG that provides outstanding results.
+
+There is a very simple RAG capability at <https://github.com/instructlab/rag> .  It is not tightly integrated with InstructLab and it does not use any advanced RAG capabilities.  However, we have a request from a stakeholder to not just unilaterally delete it or replace it with something radically different.
+
+## Goals
+
+Provide a built-in alternative for users who do not want to build their own RAG.  Keep the existing capability at <https://github.com/instructlab/rag> somewhere, but potentially somewhere other than it is now (e.g., in a new branch of the existing repository).
+
+## Non-goals
+
+Evaluation of RAG will be addressed in one or more other development documents.  That topic is out of scope for this document.
+
+## Decision
+
+- For now, RAG will be located in its own directory, `src/instructlab/rag`, in the core InstructLab repository (<https://github.com/instructlab/instructlab>).
+
+## How
+
+### Phase 1
+
+- RAG will be located in its own directory, `src/instructlab/rag`, in the core InstructLab repository (<https://github.com/instructlab/instructlab>).
+- This directory will include all of the following:
+  - Loading the content from Docling-format JSON files (that are produced by SDG preprocessing).
+  - Chunking that content to sizes that fit the requirements of the selected embedding model for vector database storage and retrieval.
+  - Storing those chunks with their vector representations in a vector database.
+  - End-to-end runtime RAG.  The initial version of this includes the following:
+    - Taking as input a session history (including a current user query) and providing a response (e.g., something along the lines of the [OpenAI chat completion API](https://platform.openai.com/docs/api-reference/chat/create)).
+    - During that processing, it retrieves relevant search results from the vector database, it converts those into a prompt to send to the response generation model, it prompts that model, and it returns the response from that model.
+- This will be invoked from the existing `ilab` CLI, as described in the [RAG ingestion and chat pipelines](https://github.com/instructlab/dev-docs/pull/161) dev doc.
+
+### Future phases
+
+- In the near future, RAG might be moved to the existing <https://github.com/instructlab/rag> repository.
+  - If so, something will be done with the existing code in <https://github.com/instructlab/rag>, e.g., moving it to a branch of that repository or moving it to a different repository.
+- Alternatively, some or all of it might move to a new repository.
+  - For example, maybe the indexing and retrieval portions move to a separate retrieval repository while the rest of end-to-end runtime RAG might move somewhere else.
+- If/when we move ahead with any of these options, *we will open a new ADR for that decision*.
+- Also, the capabilities will keep improving as more functionality is added.
+
+## Alternatives
+
+- Put the indexing and run-time RAG code in a new repository.
+  - Pro: Having a dedicated repository gives the RAG team the most freedom and flexibility to make technical decisions that work for that team.
+  - Pro: Starting with a new repository provides a blank slate that can be set up in whatever way makes the most sense for that functionality.
+  - Pro: Having the capability in one repository makes it easier for consumers such as RamaLama to reuse it for their purposes too.
+  - Con: Creating and configuring a new repository is some work.  (This is a fairly small con, but a real one.)
+  - Con: Integrating a new repository into the continuous integration and delivery capabilities for both upstream InstructLab and downstream consumers is a *lot* of work.  This is a much bigger con.
+  - Con: All that extra work would almost certainly result in slower time to market.  This risks missing some market opportunities.
+- Put the indexing code in <https://github.com/instructlab/sdg> (SDG) and the run-time RAG code in <https://github.com/instructlab/instructlab> (core)
+  - Pro: This has the advantage of not adding any new dependencies.
+  - Pro: The document processing is already in SDG and chat functionality is already in core so this would require the fewest code changes.
+  - Con: Splitting the RAG functionality across multiple repositories makes it more complicated to reuse in other applications outside of InstructLab.
+  - Con: Many things we will want to do to add advanced functionality to make RAG more effective will require changes to both indexing and run-time RAG.  If those components are split across multiple repositories, that will make delivering such changes more complicated.
+- Start by putting the code into existing InstructLab repositories (either of the above options) and then split it off into its own repository later.
+  - Pro: Gets us integrated into InstructLab sooner.
+  - Con: Adds extra work to the second phase where we have to split it off into its own repository.
+  - Con: There is a risk that we never get around to splitting it off and we wind up stuck with the cons of being jammed into other components indefinitely.
+- Put the indexing and run-time RAG code in a new repo outside <https://github.com/instructlab/>.
+  - Pro: This signals that this is not specific to InstructLab but is instead intended to be useful in a variety of applications.  That makes it more likely the work could have broader impact.
+  - Con: If we put this out there as something that is intended to be useful in a variety of applications, the pressure is on us to make sure it is differentiated from other broadly applicable RAG capabilities.  Hopefully that will be true eventually, but it probably won't be true for a while.  It might make more sense to give this some time to mature as a local component of InstructLab before trying to spin it off as its own thing.
+  - Con: If we put it out there as its own open source project, that project needs all of the infrastructure of a full open source activity (governing structures, communication tools and protocols, etc.).  That's a lot of work to set up.  Keeping it inside InstructLab for now lets us keep using the infrastructure that InstructLab has for this purpose.
+  - Con: If we put it out there as its own open source project, it needs a name.  It is a lot of work to come up with a good name and there will be a lot of stakeholders with an interest in the name that comes up.
+- Keep the indexing and run-time RAG code in <https://github.com/redhat-et/PaRAGon> which is an emerging technologies prototype for this work.
+  - Mostly the same pros and cons as putting it in a new repo outside InstructLab plus the following:
+  - Pro: A prototype for the code we want is already there.
+  - Pro: It already has its own distinctive name (PaRAGon).
+  - Con: The existing repository has its own simple command-line interface which is useful for the prototype but we don't want it in the capability we release because too many command-line interfaces will confuse users.
+  - Con: The name PaRAGon seems fine to me, but probably more stakeholders need to weigh in on what a name would be.
+  - Con: The `redhat-et` label suggests that this is something "owned" by Red Hat which makes sense for the prototype but not so much for something we want a community to own in the long run.
+- Put the indexing and run-time RAG code in <https://github.com/instructlab/rag> AND keep the existing RAG functionality in that repository intact.
+  - Pro: It already exists.
+  - Pro: It avoids the confusion of having two different RAG repositories in <https://github.com/instructlab/>.
+  - Con: It creates the confusion of having two different RAG solutions in the same repository.  We could mitigate that with developer documentation and marking legacy stuff as "deprecated".
+- Put the indexing and run-time RAG code in <https://github.com/instructlab/rag> AND eliminate the existing RAG functionality in that repository.
+  - Pro: It already exists.
+  - Pro: It avoids the confusion of having two different RAG repositories in <https://github.com/instructlab/>.
+  - Pro: It avoids the confusion of having two different RAG solutions since we'd be eliminating the old one.
+  - Con: There is still some interest in keeping this around.
+
+## Risks
+
+- Putting the RAG functionality in the core repository requires any application that wants to use this functionality to bring in the entire core, which then brings in all of the libraries it depends on, so this becomes an enormous dependency.  This discourages reuse in other applications.  It *encourages* either of the following behaviors that would be unfortunate:
+  - Other applications pull directly from <https://github.com/redhat-et/PaRAGon> and in doing so duplicate the ongoing effort to harden that code base.
+  - Other applications may implement their own RAG solutions or pull from some other upstream unrelated to ours.
+- As noted earlier, putting the capability inside <https://github.com/instructlab/> signals that this is a component of InstructLab and not a generally useful feature.  That creates a risk that the work could miss out on additional opportunities for impact.  We hope to mitigate that risk by spinning it off to its own open source project when it is mature enough, but there is a risk that we will get distracted by other things and never get around to this.
+- The flow for document processing for InstructLab winds up being quite complicated in this proposal.  Since the existing document processing is in SDG, the flow for indexing for RAG winds up being a bit complicated (i.e., it starts with a CLI call handled by the core repo then goes to SDG for some of the document processing and then back to the core `/data` directory which then calls out to the `core/rag` directory for chunking and vector database indexing).  Having the document processing move from core to SDG and back to core and forward to RAG makes that capability more difficult to understand and maintain.  This complexity will be partially mitigated when the preprocessing code moves from SDG to core.  It will be further mitigated by having a clear, well-documented contract between core and the RAG repository indicating the responsibilities of each.
+
+## References
+
+- <https://github.com/redhat-et/PaRAGon>
+- <https://github.com/instructlab>
+- <https://github.com/instructlab/rag>
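Illustrative sketch for the `Phase 1` flow in `docs/rag/rag-initial-code-location.md` (not part of this patch): chunk content, store chunks with vector representations, retrieve results for the current user query, build a prompt, and call a response generation model. Everything here is an assumption made for illustration. The bag-of-words `embed` function stands in for a real embedding model, `InMemoryVectorStore` stands in for a vector database, `generate` is a caller-supplied callable standing in for the model, and the input is plain text rather than Docling-format JSON, whose schema this sketch does not attempt to reproduce.

```python
import math
from collections import Counter
from typing import Callable


def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, Counter]] = []

    def add(self, chunks: list[str]) -> None:
        # Store each chunk next to its vector representation.
        self._rows.extend((chunk, embed(chunk)) for chunk in chunks)

    def search(self, query: str, top_k: int = 2) -> list[str]:
        # Rank stored chunks by similarity to the query vector.
        query_vec = embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(query_vec, row[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(history: list[dict], store: InMemoryVectorStore, top_k: int = 2) -> str:
    """Retrieve context for the latest message in the session history and build a prompt."""
    query = history[-1]["content"]
    context = "\n".join(f"- {chunk}" for chunk in store.search(query, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}"


def respond(history: list[dict], store: InMemoryVectorStore,
            generate: Callable[[str], str], top_k: int = 2) -> str:
    """End-to-end runtime step: retrieve, prompt the model, return its response."""
    return generate(build_prompt(history, store, top_k))
```

A caller would index once with `store.add(chunk_text(document_text))` and then call `respond(history, store, generate)` per turn, where `history` is a list of `{"role": ..., "content": ...}` messages loosely modeled on the chat completion shape the document mentions.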