---
title: Data Preparation
order: 3
---
## 📂 Data Loading
The **Image ML Pod** uses HuggingFace's `datasets` library to load image data. `HFImageFolderDataSet` is a thin wrapper around the library's `ImageFolder` dataset, so datasets stored in a folder structure can be loaded directly from the Kedro catalog.
**Key Features**:
- Compatibility with HuggingFace's `datasets`.
- Easy handling of standard folder-based datasets.
- **Read-Only**: The dataset does not support saving changes.
### Directory Structure
Organize your dataset with the following structure:
```
data_dir/
├── train/
│   ├── class_a/
│   │   ├── image1.jpg
│   │   ├── image2.jpg
│   └── ...
├── validation/
│   ├── class_a/
│   │   ├── image3.jpg
│   │   ├── image4.jpg
│   └── ...
├── test/
│   ├── class_a/
│   │   ├── image5.jpg
│   │   ├── image6.jpg
│   └── ...
```
- **Splits**: `train`, `validation`, and `test` directories.
- **Labels**: Subdirectories represent class labels (e.g., `class_a`, `class_b`).
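
To make the mapping from folders to splits and labels concrete, the layout above can be walked with the standard library alone. This is only an illustrative sketch, not how HuggingFace's `ImageFolder` is implemented; the helper name `scan_image_folder` and the short extension list are assumptions made for the example.

```python
from pathlib import Path

# Assumed extensions for this sketch; not the full list a real loader would accept.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def scan_image_folder(data_dir):
    """Return {split: [(label, path), ...]} for a split/label/image layout.

    Hypothetical helper: every first-level directory is treated as a split
    and every second-level directory as a class label.
    """
    index = {}
    for split_dir in sorted(Path(data_dir).iterdir()):
        if not split_dir.is_dir():
            continue
        samples = []
        for label_dir in sorted(split_dir.iterdir()):
            if not label_dir.is_dir():
                continue
            for image_path in sorted(label_dir.iterdir()):
                # Skip anything that does not look like an image file.
                if image_path.suffix.lower() in IMAGE_EXTENSIONS:
                    samples.append((label_dir.name, image_path))
        index[split_dir.name] = samples
    return index
```

Run against the tree shown above, `scan_image_folder("data_dir")` would produce the keys `train`, `validation`, and `test`, with each sample tagged by the name of its class directory.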
### Using `HFImageFolderDataSet`
To integrate `HFImageFolderDataSet` into your Kedro project, add the following entry to your `catalog.yml`:
```yaml
my_image_dataset:
  type: image_ml_pod.datasets.HFImageFolderDataSet
  data_dir: data/01_raw/images
```
📚 **Reference**: [Hugging Face ImageFolder Dataset](https://huggingface.co/docs/datasets/en/image_dataset#imagefolder)
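
For orientation, a read-only wrapper of this kind could look roughly like the sketch below. It is not the actual `HFImageFolderDataSet` source: the class name `ReadOnlyImageFolder` and its methods are hypothetical, and it deliberately skips Kedro's dataset base class to stay self-contained. The only real library call is `datasets.load_dataset("imagefolder", data_dir=...)`, imported lazily inside `load`.

```python
class ReadOnlyImageFolder:
    """Hypothetical stand-in for a read-only image-folder dataset."""

    def __init__(self, data_dir):
        self.data_dir = str(data_dir)

    def load(self):
        # Imported lazily so the class can be defined without `datasets` installed.
        from datasets import load_dataset

        # With the split/label layout described above, this is expected to
        # return one split per top-level directory.
        return load_dataset("imagefolder", data_dir=self.data_dir)

    def save(self, data):
        # Mirrors the "Read-Only" feature: writing back is not supported.
        raise NotImplementedError("This dataset is read-only; saving is not supported.")

    def describe(self):
        # Minimal description of the configured source directory.
        return {"data_dir": self.data_dir}
```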
---
### Processed Dataset Handling
For saving and loading processed datasets, use `HFDatasetWrapper`. This ensures datasets are stored in a format suitable for training pipelines or inference workflows.
To include it in your Kedro project:
```yaml
my_huggingface_dataset:
  type: image_ml_pod.datasets.HFDatasetWrapper
  dataset_path: data/processed/my_dataset
```
📚 **Reference**: [Hugging Face Dataset Documentation](https://huggingface.co/docs/datasets/package_reference/main_classes.html)
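
As a rough picture of what saving and loading a processed dataset involves, the sketch below round-trips a dataset through disk with the `datasets` library's `save_to_disk` and `load_from_disk`. Whether `HFDatasetWrapper` uses exactly these calls is an assumption, and the helper names `save_processed` and `load_processed` are hypothetical.

```python
def save_processed(dataset, dataset_path):
    """Persist a HuggingFace `Dataset` or `DatasetDict` to `dataset_path`."""
    dataset.save_to_disk(dataset_path)
    return dataset_path


def load_processed(dataset_path):
    """Reload a dataset previously written by `save_processed`."""
    # Lazy import keeps this module importable without `datasets` installed.
    from datasets import load_from_disk

    return load_from_disk(dataset_path)
```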
---
## 🛠 Data Preprocessing
The `data_preprocessing` pipeline provides a modular approach for preparing datasets for training. It includes nodes for:
1. **Loading Data**: Reads images from the source directory or dataset.
2. **Applying Transforms**: Incorporates `torchvision.transforms` for preprocessing tasks such as resizing, normalization, and augmentation.
3. **Saving Processed Data**: Stores processed datasets for use in downstream tasks.
### Customization
- **Template Nodes**: Edit template nodes for specific preprocessing tasks such as custom augmentation or data splitting.
- **Dataset Format**: Use the `set_format` method of the `Dataset` class to control the type returned when examples are accessed, for example `set_format(type="torch")` for `torch.Tensor` or `set_format(type="numpy")` for `numpy.ndarray`. Note: By default, images loaded from `ImageFolder` are returned as PIL images.
- **Single-Image Support**: Modify nodes to accept single-image inputs, enabling reusability for inference workflows.
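
One way to get single-image support is to normalise the input at the top of a node, so the same function serves both batch preprocessing and inference. The sketch below is an assumption about how such a node could be written; `preprocess_images` is a hypothetical name, and `transform` is any callable, such as a `torchvision` transform.

```python
def preprocess_images(images, transform):
    """Apply `transform` to one image or to a list/tuple of images."""
    single = not isinstance(images, (list, tuple))
    batch = [images] if single else list(images)
    processed = [transform(image) for image in batch]
    # Return the same shape the caller passed in: one item or a list.
    return processed[0] if single else processed
```
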
**Example**: Adding a custom resize and normalize transform:
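```python
from torchvision import transforms

transform = transforms.Compose([
    # Resize every image to 224x224 pixels.
    transforms.Resize((224, 224)),
    # Convert the PIL image to a float tensor scaled to [0, 1].
    transforms.ToTensor(),
    # Normalize each channel with the ImageNet mean and standard deviation.
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```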
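
To apply such a transform to a HuggingFace dataset, a common pattern is a batched function that receives a dict of columns and adds a transformed column. The sketch below assumes the image column is called `image` and writes to a hypothetical `pixel_values` column; `make_batch_transform` is likewise an illustrative name, not part of the pod.

```python
def make_batch_transform(transform, image_column="image", output_column="pixel_values"):
    """Wrap a per-image transform into a function over a batch (dict of lists)."""

    def apply(batch):
        # Transform each image in the batch and store the results in a new column.
        batch[output_column] = [transform(image) for image in batch[image_column]]
        return batch

    return apply
```

With `datasets` installed, the result could be registered through `dataset.set_transform(make_batch_transform(transform))`, so the transform runs lazily when examples are accessed. Grayscale images would need converting to RGB first, since the `Normalize` step above expects three channels.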