Skip to content

Commit

Permalink
feat: Improve AbstractBaseWriter with indexing and add docs (#176) (#189
Browse files Browse the repository at this point in the history
)

- **New Features**
- Introduced a unified file writing framework that streamlines saving
image and array data across various formats (e.g., HDF5, NIFTI, NumPy).
- Enhanced file handling with improved validations for naming,
compression levels, and managing existing files, ensuring smooth data
export and logging.

- **Tests**
- Added comprehensive testing to verify file saving operations, error
handling, and file indexing, ensuring robust performance across
different usage scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
  • Loading branch information
jjjermiah authored Feb 6, 2025
1 parent c16ae7c commit e147629
Show file tree
Hide file tree
Showing 15 changed files with 2,376 additions and 12 deletions.
3 changes: 2 additions & 1 deletion docs/.pages
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
nav:
- Home: index.md
- Usage: usage
- CLI Reference: cli
- API Reference: reference
- API Reference: reference
3 changes: 3 additions & 0 deletions docs/usage/.pages
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
nav:
- Home: index.md
- ...
5 changes: 5 additions & 0 deletions docs/usage/Writers/.pages
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
title: Writers

nav:
- Abstract Base Writer: BaseWriter.md
- ImplementingWriters.md
306 changes: 306 additions & 0 deletions docs/usage/Writers/BaseWriter.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,306 @@
# Abstract Base Writer

The `AbstractBaseWriter` class is the foundation for all writers in this library.

It provides a standard interface, reusable methods, and tools that writers can extend
to handle file writing tasks efficiently and consistently.

If you’re building a writer to manage file outputs with custom paths, filenames, or formats,
this is where you start!

For details on implementing the `AbstractBaseWriter` in your custom writer, see the
[Implementing Writers](./ImplementingWriters.md) guide.

## Introduction

### What is the `AbstractBaseWriter`?

The `AbstractBaseWriter` is:

- **A reusable template**: Manage file-writing tasks consistently across different writer implementations.
- **Customizable**: Extend it to handle your file formats and workflows.
- **Safe and robust**: Features context management, filename sanitization, and optional CSV indexing.

### When should you extend `AbstractBaseWriter` for your custom writer?

If you write many files with dynamic paths and filenames, or need
to manage file existence scenarios, you might consider extending `AbstractBaseWriter`
(or even one of its subclasses) to simplify your implementation.

`AbstractBaseWriter` is useful when you need:

- Dynamic paths and filenames (e.g., `{subject_id}/{study_date}.nii.gz`).
- Configurable handling of existing files (`OVERWRITE`, `SKIP`, etc.).
- Logging of saved files via an optional CSV index.
- Thread-safe and multiprocessing-compatible file writing.
- A consistent interface across different types of writers.

---

## Core Concepts

### Root Directory and Filename Format Parameters

**Root Directory**:

- Base folder for all saved files, automatically created if missing (via `create_dirs` parameter)

**Filename Format**:

- A string template defining your file and folder names.
- Uses placeholders like `{key}` to insert context values dynamically.

Example:

```python
writer = ExampleWriter(
root_directory="./data",
filename_format="{person_name}/{date}_{message_type}.txt",
)

# Save a file with context variables
data = "Hello, World!"
writer.save(
data,
person_name="JohnDoe",
date="2025-01-01",
message_type="greeting"
)

# Saved file path:
# ./data/JohnDoe/2025-01-01_greeting.txt
```

### File Existence Modes

**Why It Matters**:

- When your writer saves a file, it needs to decide what to do if a file with the same name already exists.
- This is especially important in batch operations or when writing to shared directories.
- The `AbstractBaseWriter` provides several options to handle this scenario through the use of
an `enum` called `ExistingFileMode`.

It is important to handle these options carefully in your writer's `save()` method to
avoid data loss or conflicts.
<!-- adding this here too early might be confusing
```python
from imgtools.io.writers import ExistingFileMode
writer = ExampleWriter(
root_directory="./data",
filename_format="{person_name}/{date}_{message_type}.txt",
existing_file_mode=ExistingFileMode.OVERWRITE,
)
data = "Hello, World!"
writer.save(
data,
person_name="JohnDoe",
date="2025-01-01",
message_type="greeting"
)
# Should see a log message (assuming log level is set to `DEBUG`)
# [DEBUG] File data/JohnDoe/2025-01-01_greeting.txt exists. Deleting and overwriting.
```
-->

::: imgtools.io.writers.ExistingFileMode
options:
heading_level: 4

## Advanced Concepts

### Lifecycle Management

**Context Manager Support**:

- Writers can be used with `with` statements to ensure proper setup and cleanup.

**What Happens on Exit?**:

- Removes lock files used for the index file.
- Deletes empty directories created during the writing process (if no files were written).

Example:

```python
with TextWriter(root_directory="/data", filename_format="{id}.txt") as writer:
data = "Hello, World!"
writer.save(data, id="1234")
```

### Previewing File Paths and Caching Context

In the simplest usage of a writer, users can pass in the context information as
keyword arguments to each `save()` call.

However, this can become cumbersome when the same context variables are used across multiple
save operations.

**Example:**

In the above example, the `date` and `message_type` context variables are the same for all
students. Instead of passing them in every time, you can store these variables in the writer
itself and update them as needed.

Let's use the following example to illustrate this:

Say we want to save greetings for students in a particular highschool class:

```python
writer = TextWriter(
root_directory="./data",
filename_format="{grade}/{class_subject}/{person_name}/{date}_{message_type}.txt",
)
```

=== "Basic Usage"

We see here that the context variables for `grade`,
`class_subject`, `date`, and `message_type`
are the same for all students.

This can become even worse with more
context variables, allowing for mistakes, and making the code harder to read.

```python

student, message = "Alice", "Hello, Alice!"
writer.save(
message,
person_name=student,
grade="12",
class_subject="math",
date="2025-01-01",
message_type="greeting"
)

student, message = "Bob", "Good morning, Bob!"
writer.save(
message,
person_name=student,
grade="12",
class_subject="math",
date="2025-01-01",
message_type="greeting"
)
```

=== "Setting Context Variables manually"

Instead of passing in the context variables every time,
you can store these variables in the writer and update them as needed
using the `set_context()` method.

Then only pass in the unique context variables for each `.save()` operation.

```python
writer.set_context(
grade="12",
class_subject="math",
date="2025-01-01",
message_type="greeting"
)

student, message = "Alice", "Hello, Alice!"
writer.save(message, person_name=student)

student, message = "Bob", "Good morning, Bob!"
writer.save(message, person_name=student)
```

=== "Setting Context Variables during Initialization"

If majority of the context variables are the same across all save
operations, you can set context when initializing the writer.

Note that here, we must pass as a dictionary to the `context` parameter
instead of individual keyword arguments.

```python
writer = TextWriter(
root_directory="./data",
filename_format="{class_subject}/{person_name}/{date}_{message_type}.txt",
context={"grade": "12", "class_subject": "math", "date": "2025-01-01", "message_type": "greeting"}
)

student, message = "Alice", "Hello, Alice!"
writer.save(message, person_name=student)

student, message = "Bob", "Good morning, Bob!"
writer.save(message, person_name=student)
```

#### Previewing File Paths

Oftentimes, you may want to check if a file exists before performing an expensive computation.
If you set the existence mode to `ExistingFileMode.SKIP`, the `preview_path()` method will return `None` if the
file already exists, allowing you to skip the computation.

This method also caches the additional context variables for future use.

Here's an example of how you might handle this:

```python
# assuming writer is already initialized with `existing_file_mode=ExistingFileMode.SKIP`

# set some context variables
writer.set_context(class_subject="math", date="2025-01-01", message_type="greeting")

if (path := writer.preview_path(person_name="Alice")) is None:
print("File already exists, skipping computation.")
else:
print(f"Proceed with computation for {path}")
...
# perform expensive computation
...
writer.save(content="Hello, world!")
```

### Index File Management

**What is the Index File?**:

- A CSV file used to log details about saved files, like their paths and context variables.
- Helps track what files have been written, especially useful in batch operations.
- Additionally can save all the context variables used for each file, convienient for
saving additional metadata, while improving traceability.

**How It Works**:

- By default, it is named `{root_directory.name}_index.csv`.
- You can customize the filename or provide an absolute path for more control.
- This is something that the `save()` method in the `AbstractBaseWriter` should **optionally**
implement, or let users decide to include the index by calling `add_to_index(path)` after `save()`.

**Key Features**:

- **Customizable Filename**: Use `index_filename` to set a custom name or absolute path.
- **Absolute/Relative Paths**: Control file paths in the index with `absolute_paths_in_index`.
- **Inter-process Locking**: Prevents conflicts in concurrent writing environments.

### Sanitizing Filenames

**Why Sanitize Filenames?**:

- To ensure that filenames are safe and compatible across different operating systems.

**How It Works**:

- Replaces illegal characters (e.g., `<`, `>`, `:`, `"`, `/`, `\`, `|`, `?`, `*`) with underscores.
- Trims leading or trailing spaces and periods to avoid issues.

**When Is It Applied?**:

- Automatically applied when generating filenames, unless disabled by setting `sanitize_filenames=False`.

### Multiprocessing Compatibility

**Why It Matters**:

- In batch operations or high-performance use cases, multiple processes may write files simultaneously.

**Key Features**:

- Supports multiprocessing with inter-process locking to ensure thread-safe file writes.
- Avoids conflicts or data corruption when multiple instances of a writer are running.
Loading

0 comments on commit e147629

Please sign in to comment.