Commit b1e1a83

Update sdg-refactor.md
Signed-off-by: Bill Murdock <[email protected]>
1 parent a268c7b commit b1e1a83

1 file changed: +7 -1 lines changed

docs/sdg/sdg-refactor.md

Lines changed: 7 additions & 1 deletion
@@ -2,7 +2,13 @@
 
 ## Goals
 
-We want to modularize the parts of the codebase that deal with the data augmentation phase of the end to end workflow. In order to modularize it effectively, we need to identify and distinguish pre-processing, data generation, and post-processing. Each of these elements need to be located somewhere. This document discusses pros and cons of different options and proposes specific conclusions.
+We want to modularize the parts of the codebase that deal with the data augmentation phase of the end to end workflow. In order to modularize it effectively, we need to identify and distinguish pre-processing, data generation, and post-processing. Each of these elements need to be located somewhere. This document discusses pros and cons of different options and proposes specific conclusions. Specifically, it concludes:
+
+- The synthetic data generation will remain in the SDG repository.
+- The preprocessing that is used for synthetic data generation (e.g., document conversion) will move to the core repository.
+- The postprocessing that is used for synthetic data generation (e.g., data mixing) will move to the core repository.
+
+Ensuring that *only* synthetic data generation is in the SDG repository ensures that this component has a clear, well-defined mission. Furthermore, moving preprocessing and postprocessing to core will make it easier for those capabilities to be used by other components in the future. For example, some of the same preprocessing that is done for SDG (e.g., document conversion) is also useful for indexing content for RAG.
 
 ## Context
 