Commit 9bcde30

Merge pull request #441 from instructlab/mergify/bp/release-v0.6/pr-440
Update CHANGELOG.md for release v0.6.2 (backport #440)
2 parents 8e13b1c + d7628de commit 9bcde30

2 files changed: +25 -0 lines changed

.markdownlint-cli2.yaml (+2)

@@ -7,10 +7,12 @@ config:
   code-block-style: false
   no-duplicate-header: false
   single-trailing-newline: false
+  no-duplicate-heading: false
 globs:
   - "**/*.md"
 ignores:
   - ".github/**"
+  - ".tox/**"
   - "venv/**"
   - ".venv/**"
   - "**/testdata/**"

CHANGELOG.md (+23, all lines added)

@@ -0,0 +1,23 @@
## v0.6.2

### Fixes

* Fixed a bug in our version specifications for the `docling` and `docling_parse` dependencies that was causing new installs of InstructLab to pull in incompatible versions of them. We also fixed a similar bug in the `mypy` version specification, but that one only affects developers of SDG, not users of InstructLab.

## v0.6.1

### Fixes

* Fixed a bug where generating data from a taxonomy with two or more changed knowledge leaf nodes would fail with a message about a destination path that `already exists and is not an empty directory`.

## v0.6.0

### Features

* Small knowledge datasets will automatically get upsampled during final data mixing based on the length of any precomputed skills datasets used during data mixing. This avoids issues where very large precomputed skills datasets were swamping the comparatively small number of knowledge samples, resulting in lower-than-optimal knowledge retention during multiphase training. If a large precomputed dataset isn't in use during mixing (which is how things operate by default), this change is a no-op.
* When chunking PDF documents, we'll now look for the docling models on disk in `$XDG_DATA_HOME/instructlab/sdg/models` (as well as in each `$XDG_DATA_DIRS` entry, under the same `instructlab/sdg/models` subdirectory). If they are not found on disk, they'll automatically be downloaded from HuggingFace.
* When chunking PDF documents with Docling, we'll automatically configure Docling to use `tesserocr` instead of `easyocr` if a working implementation is found. We fall back to `easyocr` if Tesseract is not properly configured for use by `tesserocr`.
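The upsampling described in the first feature above can be illustrated with a minimal sketch. This is not the SDG implementation: the function names `upsample` and `mix_datasets` and the `knowledge_ratio` parameter are hypothetical, and the real mixing code may compute its target size differently. The sketch only shows the idea of growing a small knowledge dataset relative to the length of a precomputed skills dataset, and of doing nothing when no large precomputed dataset is in use.

```python
import math
from itertools import cycle, islice


def upsample(samples, target_len):
    """Repeat samples in order until there are target_len of them.

    Returns an unchanged copy when the dataset is empty or is already
    at least target_len long, so the call is a no-op in those cases.
    """
    if not samples or len(samples) >= target_len:
        return list(samples)
    return list(islice(cycle(samples), target_len))


def mix_datasets(knowledge, precomputed_skills, knowledge_ratio=0.25):
    """Mix knowledge samples with precomputed skills samples.

    knowledge_ratio is a hypothetical knob: the knowledge set is grown
    until it is at least this fraction of the precomputed skills length.
    """
    target_len = math.ceil(len(precomputed_skills) * knowledge_ratio)
    return upsample(knowledge, target_len) + list(precomputed_skills)
```

With 2 knowledge samples and 20 precomputed skills samples, the knowledge set is repeated until it has 5 entries. With no precomputed skills dataset, the knowledge samples pass through unchanged.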
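The two PDF chunking features listed under v0.6.0 (the on-disk model lookup and the OCR engine choice) can be sketched roughly as follows. All function names here are hypothetical and this is not the SDG code. The fallback defaults for `XDG_DATA_HOME` and `XDG_DATA_DIRS` come from the XDG Base Directory specification, and using `tesserocr.get_languages()` as the health check is an assumption about what "working implementation" means.

```python
import os
from pathlib import Path

MODELS_SUBDIR = Path("instructlab") / "sdg" / "models"


def candidate_model_dirs(env=None):
    """List the directories to search, highest priority first."""
    env = os.environ if env is None else env
    # Defaults follow the XDG Base Directory specification.
    data_home = env.get("XDG_DATA_HOME") or str(Path.home() / ".local" / "share")
    data_dirs = env.get("XDG_DATA_DIRS") or "/usr/local/share:/usr/share"
    roots = [data_home] + [d for d in data_dirs.split(":") if d]
    return [Path(root) / MODELS_SUBDIR for root in roots]


def find_models_dir(env=None):
    """Return the first existing models directory, or None.

    None signals that the caller would need to download the models.
    """
    for candidate in candidate_model_dirs(env):
        if candidate.is_dir():
            return candidate
    return None


def _tesserocr_languages():
    """Ask tesserocr which languages it can see (assumed health check)."""
    import tesserocr  # raises ImportError when the package is missing

    _path, languages = tesserocr.get_languages()
    return languages


def choose_ocr_backend(probe=_tesserocr_languages):
    """Prefer tesserocr when it reports languages, else use easyocr."""
    try:
        languages = probe()
    except Exception:  # missing package or broken Tesseract install
        languages = []
    return "tesserocr" if languages else "easyocr"
```

The `probe` parameter exists only so the selection logic can be exercised without Tesseract installed; a caller would normally use the default.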
### Breaking Changes

* Teacher model tokenizers are now loaded from the local teacher model on disk and are no longer downloaded automatically from HuggingFace. The typical workflows in use so far expect the teacher model to exist on disk, and this change enforces that at least its tokenizer exists there.
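For readers adapting to the breaking change above, a local-only tokenizer load might look like the sketch below. It assumes the tokenizer is loaded with the Hugging Face `transformers` library; `load_teacher_tokenizer` is a hypothetical helper and not part of SDG. The `local_files_only=True` argument tells `AutoTokenizer.from_pretrained` not to attempt a download.

```python
from pathlib import Path


def load_teacher_tokenizer(model_path):
    """Load a tokenizer from a local teacher model path only."""
    path = Path(model_path)
    if not path.exists():
        # Fail fast with a clear message instead of attempting a download.
        raise FileNotFoundError(f"Teacher model not found on disk: {path}")

    # Imported lazily so the path check works without transformers installed.
    from transformers import AutoTokenizer

    # local_files_only=True prevents any download attempt.
    return AutoTokenizer.from_pretrained(str(path), local_files_only=True)
```

Checking the path first gives a clearer error than the one the library would raise for a missing local model.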
