feat: add io/loaders module #83

strixy16 · 2024-12-11T21:27:04Z

Includes features, general, and images specific loading functions

Summary by CodeRabbit

New Features
- Introduced a new module for loading various data types in the READII pipeline.
- Added functions for loading imaging feature sets, dataset configurations, and data files into DataFrames.
- New configuration files for the 4D-Lung and NSCLC_Radiogenomics datasets created.
Bug Fixes
- Enhanced error handling for file loading functions.
Tests
- Added unit tests for configuration loading and feature extraction validation.
- Updated existing tests for clarity and specificity.
- Introduced new fixtures to support testing.
Chores
- Updated .gitignore to exclude test output files.

…ic loading functions

coderabbitai · 2024-12-11T21:27:11Z

Walkthrough

This pull request introduces a new module for loading various data types within the READII pipeline. It includes multiple new functions for loading imaging feature sets, dataset configurations, and data files into pandas DataFrames. Additionally, two new YAML configuration files for specific datasets are added, along with new tests to validate the functionality of the loading functions and configuration handling.

Changes

File Path	Change Summary
`src/readii/io/loaders/__init__.py`	New module created; functions added to `__all__`: `loadFeatureFilesFromImageTypes`, `loadFileToDataFrame`, `loadImageDatasetConfig`, `getImageTypesFromDirectory`.
`src/readii/io/loaders/features.py`	Method added: `loadFeatureFilesFromImageTypes(extracted_feature_dir: str, image_types: list, drop_labels: Optional[bool] = True, labels_to_drop: Optional[list] = None) -> Dict[str, pd.DataFrame]`.
`src/readii/io/loaders/general.py`	Methods updated: `loadImageDatasetConfig(dataset_name: str, config_dir_path: str
`src/readii/io/loaders/images.py`	Method added: `getImageTypesFromDirectory(raw_data_dir: Union[Path
`tests/4D-Lung/4D-Lung.yaml`	New configuration file created outlining parameters for the 4D-Lung dataset.
`tests/NSCLC_Radiogenomics/NSCLC_Radiogenomics.yaml`	New configuration file created for the NSCLC_Radiogenomics dataset with defined sections for clinical variables, exclusion criteria, and image types.
`tests/io/loaders/test_general.py`	New test file created with unit tests for `loadImageDatasetConfig`; tests validate loading for "NSCLC_Radiogenomics" and "4D-Lung" datasets.
`tests/test_feature_extraction.py`	New fixture added: `lung4DMetadataPath`; method renamed: `test_radiomicFeatureExtraction_output` to `test_NSCLC_radiomicFeatureExtraction_output`; new test method added: `test_4DLung_radiomicFeatureExtraction_output`.
`.gitignore`	New entries added to ignore test output files: `tests//procdata/`, `tests//results/`.

Possibly related PRs

refactor(feature-extraction): extract-method refactor to simplify workflow, prepare for class-based feature extraction #69: This PR involves refactoring the feature extraction process, which may relate to the new functions introduced in the main PR for loading feature files and managing image datasets, as both are focused on enhancing the feature extraction workflow.
feat: Add BaseWriter, and example NIFTIWriter with notebooks documenting them #84: The introduction of the BaseWriter and NIFTIWriter classes in this PR relates to the overall data handling and writing processes, which could connect with the new loading functionalities introduced in the main PR, particularly in managing image datasets and configurations.

Suggested reviewers

strixy16

Poem

In the meadow where data flows,
New paths for features and datasets grow.
With YAML files and tests in tow,
The READII pipeline's ready to glow!
Hops of joy, let the coding spree,
For every rabbit, it's a data jubilee! 🐇✨

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Experiment)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

codecov · 2024-12-11T21:30:50Z

Codecov Report

Attention: Patch coverage is 30.92784% with 67 lines in your changes missing coverage. Please review.

Project coverage is 49.04%. Comparing base (7228770) to head (94111ba).
Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
src/readii/io/loaders/features.py	18.18%	27 Missing ⚠️
src/readii/io/loaders/general.py	39.53%	26 Missing ⚠️
src/readii/io/loaders/images.py	17.64%	14 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #83      +/-   ##
==========================================
- Coverage   50.89%   49.04%   -1.86%     
==========================================
  Files          21       25       +4     
  Lines         894      991      +97     
==========================================
+ Hits          455      486      +31     
- Misses        439      505      +66

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

…put can be used for other tests

coderabbitai

Actionable comments posted: 8

🧹 Outside diff range and nitpick comments (7)

src/readii/io/loaders/features.py (2)
23-25: Avoid referencing notebooks in docstrings

Referencing data_setup_for_modeling.ipynb in the docstring may not be helpful for users who are not familiar with that notebook. Consider providing a clear description without external references.

Apply this diff to improve the docstring:
 labels_to_drop : list, optional
     List of labels to drop from the dataframes. The default is ["patient_ID", "survival_time_in_years", "survival_event_binary"].
-    based on code in data_setup_for_modeling.ipynb.
70-70: Remove unused variable e in exception handling

The variable e is assigned but not used in the except clause. You can remove as e to clean up the code.

Apply this diff to fix the issue:
-    except KeyError as e:
+    except KeyError:
🧰 Tools

🪛 Ruff (0.8.2)

70-70: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)
tests/io/loaders/test_general.py (1)
23-24: Use is for comparison to None

Comparisons to None should use is or is not to check for identity.

Apply this diff to fix the comparisons:
-    assert config["outcome_variables"]["event_label"] == None
-    assert config["outcome_variables"]["event_value_mapping"] == None
+    assert config["outcome_variables"]["event_label"] is None
+    assert config["outcome_variables"]["event_value_mapping"] is None
🧰 Tools

🪛 Ruff (0.8.2)

23-23: Comparison to None should be cond is None

Replace with cond is None

(E711)

24-24: Comparison to None should be cond is None

Replace with cond is None

(E711)
src/readii/io/loaders/general.py (1)
39-39: Use raise ... from ... to preserve exception context

When re-raising exceptions in an except block, use raise ... from e to maintain the original traceback and provide better debugging information.

Apply this diff to improve exception handling:
 except yaml.YAMLError as e:
-    raise ValueError(f"Invalid YAML in config file: {e}")  
+    raise ValueError(f"Invalid YAML in config file: {e}") from e

 # ...

 except pd.errors.EmptyDataError:
-    raise ValueError("File is empty")
+    raise ValueError("File is empty") from None

 except (pd.errors.ParserError, ValueError) as e:
-    raise ValueError(f"Error parsing file: {e}")
+    raise ValueError(f"Error parsing file: {e}") from e
Also applies to: 79-79, 81-81

🧰 Tools

🪛 Ruff (0.8.2)

39-39: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)
tests/4D-Lung/4D-Lung.yaml (1)
20-20: Fix YAML formatting for better readability

The image types list needs proper spacing after commas.
-image_types: ["original", "shuffled_full","shuffled_roi","shuffled_non_roi","randomized_sampled_full","randomized_sampled_roi","randomized_sampled_non_roi"]
+image_types: ["original", "shuffled_full", "shuffled_roi", "shuffled_non_roi", "randomized_sampled_full", "randomized_sampled_roi", "randomized_sampled_non_roi"]
🧰 Tools

🪛 yamllint (1.35.1)

[warning] 20-20: too few spaces after comma

(commas)

[warning] 20-20: too few spaces after comma

(commas)

[warning] 20-20: too few spaces after comma

(commas)

[warning] 20-20: too few spaces after comma

(commas)

[warning] 20-20: too few spaces after comma

(commas)
tests/NSCLC_Radiogenomics/NSCLC_Radiogenomics.yaml (2)
20-20: Fix YAML formatting for consistency

Similar to the 4D-Lung config, the image types list needs proper spacing.
-image_types: ["original", "shuffled_full","shuffled_roi","shuffled_non_roi","randomized_sampled_full","randomized_sampled_roi","randomized_sampled_non_roi"]
+image_types: ["original", "shuffled_full", "shuffled_roi", "shuffled_non_roi", "randomized_sampled_full", "randomized_sampled_roi", "randomized_sampled_non_roi"]
🧰 Tools

🪛 yamllint (1.35.1)

[warning] 20-20: too few spaces after comma

(commas)

[warning] 20-20: too few spaces after comma

(commas)

[warning] 20-20: too few spaces after comma

(commas)

[warning] 20-20: too few spaces after comma

(commas)

[warning] 20-20: too few spaces after comma

(commas)

12-17: Document empty configuration sections

The exclusion_variables and train_test_split sections are empty. Consider adding comments to explain why these are empty or if they will be populated in the future.
-exclusion_variables: 
+exclusion_variables: # No exclusion criteria defined for this dataset
 
 train_test_split:
     split: False
-    split_variable: 
-    impute: 
+    split_variable: # Not applicable when split is False
+    impute: # Not applicable when split is False
🧰 Tools

🪛 yamllint (1.35.1)

[error] 12-12: trailing spaces

(trailing-spaces)

[error] 16-16: trailing spaces

(trailing-spaces)

[error] 17-17: trailing spaces

(trailing-spaces)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0e868f6 and 35d4dbc.

⛔ Files ignored due to path filters (1)

tests/4D-Lung/results/features/radiomicfeatures_original_4D-Lung.csv is excluded by !**/*.csv

📒 Files selected for processing (8)

src/readii/io/loaders/__init__.py (1 hunks)
src/readii/io/loaders/features.py (1 hunks)
src/readii/io/loaders/general.py (1 hunks)
src/readii/io/loaders/images.py (1 hunks)
tests/4D-Lung/4D-Lung.yaml (1 hunks)
tests/NSCLC_Radiogenomics/NSCLC_Radiogenomics.yaml (1 hunks)
tests/io/loaders/test_general.py (1 hunks)
tests/test_feature_extraction.py (2 hunks)

✅ Files skipped from review due to trivial changes (1)

src/readii/io/loaders/init.py

🧰 Additional context used

🪛 Ruff (0.8.2)

tests/io/loaders/test_general.py

1-1: from readii.io.loaders.general import * used; unable to detect undefined names

(F403)

13-13: loadImageDatasetConfig may be undefined, or defined from star imports

(F405)

20-20: loadImageDatasetConfig may be undefined, or defined from star imports

(F405)

23-23: Comparison to None should be cond is None

Replace with cond is None

(E711)

24-24: Comparison to None should be cond is None

Replace with cond is None

(E711)

src/readii/io/loaders/general.py

39-39: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

79-79: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

81-81: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

src/readii/io/loaders/features.py

70-70: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

tests/test_feature_extraction.py

117-117: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)

127-127: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)

🪛 yamllint (1.35.1)

tests/4D-Lung/4D-Lung.yaml

[error] 2-2: trailing spaces

(trailing-spaces)

[error] 7-7: trailing spaces

(trailing-spaces)

[error] 8-8: trailing spaces

(trailing-spaces)

[error] 10-10: trailing spaces

(trailing-spaces)

[error] 12-12: trailing spaces

(trailing-spaces)

[error] 16-16: trailing spaces

(trailing-spaces)

[error] 17-17: trailing spaces

(trailing-spaces)

[warning] 20-20: too few spaces after comma

(commas)

[warning] 20-20: too few spaces after comma

(commas)

[warning] 20-20: too few spaces after comma

(commas)

[warning] 20-20: too few spaces after comma

(commas)

[warning] 20-20: too few spaces after comma

(commas)

tests/NSCLC_Radiogenomics/NSCLC_Radiogenomics.yaml

[error] 2-2: trailing spaces

(trailing-spaces)

[error] 12-12: trailing spaces

(trailing-spaces)

[error] 16-16: trailing spaces

(trailing-spaces)

[error] 17-17: trailing spaces

(trailing-spaces)

[warning] 20-20: too few spaces after comma

(commas)

[warning] 20-20: too few spaces after comma

(commas)

[warning] 20-20: too few spaces after comma

(commas)

[warning] 20-20: too few spaces after comma

(commas)

[warning] 20-20: too few spaces after comma

(commas)

🔇 Additional comments (2)

tests/test_feature_extraction.py (1)

49-51: LGTM: New fixture for 4D Lung metadata path

The fixture is well-defined and follows the same pattern as the existing nsclcMetadataPath fixture.

tests/NSCLC_Radiogenomics/NSCLC_Radiogenomics.yaml (1)

6-10: LGTM: Well-defined outcome variables

The outcome variables section is properly configured with clear mappings for survival status.

src/readii/io/loaders/features.py

tests/io/loaders/test_general.py

src/readii/io/loaders/images.py

coderabbitai · 2024-12-12T22:09:51Z

tests/test_feature_extraction.py

+def test_NSCLC_radiomicFeatureExtraction_output(nsclcMetadataPath):
+    """Test output creation from radiomic feature extraction for SEG dataset"""
    actual = radiomicFeatureExtraction(nsclcMetadataPath,
                                       imageDirPath = "tests/",
                                       roiNames = None,
-                                       outputDirPath = "tests/output/")
+                                       outputDirPath = "tests/NSCLC_Radiogenomics/results/")
    expected_path = "tests/output/features/radiomicfeatures_original_NSCLC_Radiogenomics.csv"
-    assert os.path.exists(expected_path)
+    assert os.path.exists(expected_path)
+


🛠️ Refactor suggestion

Fix unused variable warning and consider adding assertions

The test function has been renamed for better clarity, but there are two issues to address:

The actual variable is assigned but never used

The test only verifies file existence without checking the content

Consider this improvement:

def test_NSCLC_radiomicFeatureExtraction_output(nsclcMetadataPath): """Test output creation from radiomic feature extraction for SEG dataset""" - actual = radiomicFeatureExtraction(nsclcMetadataPath, + radiomicFeatureExtraction(nsclcMetadataPath, imageDirPath = "tests/", roiNames = None, outputDirPath = "tests/NSCLC_Radiogenomics/results/") expected_path = "tests/output/features/radiomicfeatures_original_NSCLC_Radiogenomics.csv" - assert os.path.exists(expected_path) + assert os.path.exists(expected_path), "Output file was not created" + # Consider adding assertions to verify the content of the output file + df = pd.read_csv(expected_path) + assert not df.empty, "Output file is empty" + assert "original_shape_MeshVolume" in df.columns, "Expected features are missing"

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools

🪛 Ruff (0.8.2)

117-117: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)

tests/test_feature_extraction.py

tests/4D-Lung/4D-Lung.yaml

jjjermiah

solid, have a PR at #87 for with some minor changes for one of the files and a couple changes in the feature file

generally try to use pathlib over os.path to make reading easier

src/readii/io/loaders/features.py

src/readii/io/loaders/images.py

tests/4D-Lung/4D-Lung.yaml

…87) these are mostly just qc stuff from ruff - pathlib over os.path - formatted docstrings (if looking to add these functions on the mkdocs, the numpydoc parser is not really forgiving) - exception handling is a lot easier to debug when you pass exception context back to the stack trace with `raise .... from e`

coderabbitai

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (5)

tests/io/loaders/test_general.py (3)

4-10: Use path utilities for test directories

The hardcoded paths in fixtures could make tests fragile. Consider using pathlib or os.path utilities to construct paths relative to the test file location.

Here's a more robust implementation:

+import os
+from pathlib import Path
+
 @pytest.fixture
 def nsclcConfigDirPath():
-    return "tests/NSCLC_Radiogenomics"
+    return str(Path(__file__).parent.parent.parent / "NSCLC_Radiogenomics")
 
 @pytest.fixture
 def lung4DConfigDirPath():
-    return "tests/4D-Lung"
+    return str(Path(__file__).parent.parent.parent / "4D-Lung")

12-17: Improve test readability and coverage

The test verifies the happy path but could be improved in two ways:

Break down the long image_types list for better readability
Add error case testing

Here's how to improve readability:

 def test_NSCLC_loadImageDatasetConfig(nsclcConfigDirPath):
     config = loadImageDatasetConfig("NSCLC_Radiogenomics", nsclcConfigDirPath)
+    expected_image_types = [
+        "original",
+        "shuffled_full",
+        "shuffled_roi",
+        "shuffled_non_roi",
+        "randomized_sampled_full",
+        "randomized_sampled_roi",
+        "randomized_sampled_non_roi"
+    ]
     assert config["dataset_name"] == "NSCLC_Radiogenomics"
-    assert config["image_types"] == ["original", "shuffled_full","shuffled_roi","shuffled_non_roi","randomized_sampled_full","randomized_sampled_roi","randomized_sampled_non_roi"]
+    assert config["image_types"] == expected_image_types
     assert config["outcome_variables"]["event_label"] == "Survival Status"
     assert config["outcome_variables"]["event_value_mapping"] == {'Alive': 0, 'Dead': 1}

Would you like me to help generate additional test cases for error scenarios?

1-24: Reduce code duplication by sharing image types list

The same image_types list is duplicated in both test functions. Consider moving it to a shared fixture or constant.

Here's how to implement this:

 from readii.io.loaders.general import loadImageDatasetConfig
 import pytest
 
+@pytest.fixture
+def expected_image_types():
+    return [
+        "original",
+        "shuffled_full",
+        "shuffled_roi",
+        "shuffled_non_roi",
+        "randomized_sampled_full",
+        "randomized_sampled_roi",
+        "randomized_sampled_non_roi"
+    ]
+
 @pytest.fixture
 def nsclcConfigDirPath():
     return "tests/NSCLC_Radiogenomics"

Then update both test functions to use this fixture:

def test_NSCLC_loadImageDatasetConfig(nsclcConfigDirPath, expected_image_types):
    config = loadImageDatasetConfig("NSCLC_Radiogenomics", nsclcConfigDirPath)
    assert config["dataset_name"] == "NSCLC_Radiogenomics"
    assert config["image_types"] == expected_image_types
    # ... rest of the test

def test_lung4D_loadImageDatasetConfig(lung4DConfigDirPath, expected_image_types):
    config = loadImageDatasetConfig("4D-Lung", lung4DConfigDirPath)
    assert config["dataset_name"] == "4D-Lung"
    assert config["image_types"] == expected_image_types
    # ... rest of the test

🧰 Tools

🪛 Ruff (0.8.2)

23-23: Comparison to None should be cond is None

Replace with cond is None

(E711)

24-24: Comparison to None should be cond is None

Replace with cond is None

(E711)

src/readii/io/loaders/general.py (2)

7-15: Enhance exception classes with more descriptive docstrings and default messages.

While custom exceptions are good practice, they could be more informative.

Consider this enhancement:

 class ConfigError(Exception):
-    """Base class for errors in the config module."""
+    """Exception raised for errors during configuration loading and parsing.
+    
+    This exception is raised when there are issues with loading or parsing
+    configuration files, such as invalid YAML syntax or empty configurations.
+    """
-    pass
+    def __init__(self, message="Configuration error occurred"):
+        super().__init__(message)

 class DataFrameLoadError(Exception):
-    """Custom exception for DataFrame loading errors."""
+    """Exception raised for errors during DataFrame loading operations.
+    
+    This exception is raised when there are issues with loading data into
+    DataFrames, such as empty files or parsing errors.
+    """
-    pass
+    def __init__(self, message="Error occurred while loading DataFrame"):
+        super().__init__(message)

91-96: Enhance error handling with more specific error messages.

The current error handling could provide more detailed information about the nature of the error.

Consider this enhancement:

     except pd.errors.EmptyDataError as e:
-        raise DataFrameLoadError("File is empty") from e
+        raise DataFrameLoadError(f"File {file_path} is empty or contains no data") from e

     except (pd.errors.ParserError, ValueError) as e:
-        raise DataFrameLoadError("Error parsing file") from e
+        raise DataFrameLoadError(f"Error parsing file {file_path}: {str(e)}") from e

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35d4dbc and 89bc370.

⛔ Files ignored due to path filters (1)

tests/output/ct_to_seg_match_list_4D-Lung.csv is excluded by !**/*.csv

📒 Files selected for processing (3)

.gitignore (1 hunks)
src/readii/io/loaders/general.py (1 hunks)
tests/io/loaders/test_general.py (1 hunks)

✅ Files skipped from review due to trivial changes (1)

.gitignore

🧰 Additional context used

🪛 Ruff (0.8.2)

tests/io/loaders/test_general.py

23-23: Comparison to None should be cond is None

Replace with cond is None

(E711)

24-24: Comparison to None should be cond is None

Replace with cond is None

(E711)

🔇 Additional comments (1)

src/readii/io/loaders/general.py (1)

1-5: LGTM! Clean and well-organized imports.

The imports follow good practices with standard library imports first, followed by third-party imports. All imports are necessary for the implemented functionality.

tests/io/loaders/test_general.py

src/readii/io/loaders/general.py

coderabbitai · 2024-12-13T16:47:07Z

src/readii/io/loaders/general.py

+    try:
+        if file_extension == '.xlsx':
+            df = pd.read_excel(file_path)
+        elif file_extension == '.csv':
+            df = pd.read_csv(file_path)
+        else:
+            msg = f"Unsupported file format {file_extension}. Please provide a .csv or .xlsx file."
+            raise ValueError(msg)
+


💡 Codebase verification

Based on the gathered information, I can now provide the final response:

Add validation for potential Excel file format exploits

The current implementation only handles basic file extension checks and empty file errors, but lacks specific Excel format validation. Add the following security measures:

Use openpyxl engine explicitly with read_only=True to prevent formula execution

Add file size limits to prevent memory exhaustion attacks

Validate the Excel file structure before processing

Example locations that need attention:

src/readii/io/loaders/general.py:84: Update pd.read_excel() call with security parameters

🔗 Analysis chain

Add validation for Excel file format.

The current implementation doesn't validate Excel file formats, which could lead to security issues.

Let's verify the Excel file handling in the codebase:

Let me gather more context about the Excel file handling implementation.

Let me check for any additional error handling or validation mechanisms in the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash # Search for Excel file handling patterns rg -l "\.xlsx?" . # Check for potential Excel-related security measures ast-grep --pattern 'pd.read_excel($$$)'

Length of output: 199

Script:

#!/bin/bash # Check for any Excel-related validation or security measures rg -A 5 -B 5 "\.xlsx" . # Look for any Excel-related error handling rg -A 5 -B 5 "read_excel" . # Check for any custom Excel validation functions ast-grep --pattern 'def $_($$$) { $$$ xlsx $$$ }'

Length of output: 2634

Script:

#!/bin/bash # Check for custom exceptions and error handling rg -A 10 "class.*Error" . # Look for any validation utilities rg -A 5 "validate|check|verify" . # Check the full context of the file loading function ast-grep --pattern 'def load_$_($$$) { $$$ }'

Length of output: 18408

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)

src/readii/io/loaders/features.py (4)
11-33: Fix docstring inconsistency with function signature.

The docstring indicates that image_types is optional with a default value of ['original'], but the parameter is required in the function signature. Additionally, consider using consistent type references between docstring and type hints (Dict vs dict).

Apply this diff to fix the docstring:
-    image_types : list, optional
-        List of image types to load in. The default is ['original'].
+    image_types : list
+        List of image types to load in.
34-40: Consider making default labels configurable.

The hardcoded default labels (["patient_ID","survival_time_in_years","survival_event_binary"]) are specific to survival analysis. Consider moving these to a configuration file or module-level constant to make them more maintainable and reusable for different types of analyses.

56-59: Include exception details in warning message.

The caught exception variable e is unused. Consider including its details in the warning message for better debugging.

Apply this diff:
-            logger.warning(f"No {image_type} feature csv files found in {extracted_feature_dir}")
+            logger.warning(f"No {image_type} feature csv files found in {extracted_feature_dir}: {str(e)}")
🧰 Tools

🪛 Ruff (0.8.2)

56-56: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

73-76: Include exception details in warning message.

Similarly, include the KeyError details in the warning message.

Apply this diff:
-            logger.warning(f"{feature_file_path} does not have the labels {labels_to_drop} to drop.")
+            logger.warning(f"{feature_file_path} does not have the labels {labels_to_drop} to drop: {str(e)}")
🧰 Tools

🪛 Ruff (0.8.2)

73-73: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 89bc370 and 9b87aae.

⛔ Files ignored due to path filters (1)

pixi.lock is excluded by !**/*.lock

📒 Files selected for processing (2)

src/readii/io/loaders/features.py (1 hunks)
tests/io/loaders/test_general.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

tests/io/loaders/test_general.py

🧰 Additional context used

🪛 Ruff (0.8.2)

src/readii/io/loaders/features.py

56-56: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

73-73: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

🔇 Additional comments (3)

src/readii/io/loaders/features.py (3)

1-10: LGTM! Well-organized imports.

The imports are properly organized and all imported modules are utilized in the code.

41-46: LGTM! Proper directory validation.

The code appropriately validates the directory existence and efficiently retrieves the file list once.

81-85: LGTM! Proper final validation.

The code appropriately validates that at least one feature set was loaded before returning.

src/readii/io/loaders/features.py

…or any dataset

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)

tests/test_metadata.py (1)

52-66: Consider enhancing test parameterization and path handling

While the parameterization is a good start, there are several potential improvements:

Add more test cases to fully utilize parameterization
Use temporary directories for output files
Add cleanup after test execution

Here's a suggested improvement:

 @pytest.mark.parametrize(
     "summary_file_path, modality, output_file_path",
     [
-        ("nsclcSummaryFilePath","SEG","tests/NSCLC_Radiogenomics/procdata/ct_to_seg_match_list_NSCLC_Radiogenomics.csv")
+        ("nsclcSummaryFilePath", "SEG", "ct_to_seg_match_list_NSCLC_Radiogenomics.csv"),
+        ("lung4DSummaryFilePath", "RTSTRUCT", "ct_to_seg_match_list_4D-Lung.csv")
     ]
 )
-def test_matchCTtoSegmentation_output(summary_file_path, modality, output_file_path, request):
+def test_matchCTtoSegmentation_output(summary_file_path, modality, output_file_path, request, tmp_path):
     """Test saving output of summary file"""
     summary_file_path = request.getfixturevalue(summary_file_path)
+    output_path = tmp_path / output_file_path
 
     matchCTtoSegmentation(summary_file_path, 
                           segType = modality,
-                          outputFilePath = output_file_path)
+                          outputFilePath = str(output_path))
     
-    assert Path(output_file_path).exists(), \
+    assert output_path.exists(), \
         "Output does not exist, double check output file is named correctly."

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b87aae and f4cc548.

📒 Files selected for processing (3)

.gitignore (1 hunks)
tests/test_feature_extraction.py (2 hunks)
tests/test_metadata.py (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

.gitignore

🧰 Additional context used

🪛 Ruff (0.8.2)

tests/test_feature_extraction.py

117-117: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)

127-127: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)

🔇 Additional comments (3)

tests/test_feature_extraction.py (2)

47-51: LGTM: Consistent fixture paths

The metadata path fixtures follow a consistent pattern and structure within the test directory.

115-132: 🛠️ Refactor suggestion

Fix unused variables and enhance test coverage

Both test functions have similar issues that need addressing:

The actual variables are assigned but never used
The tests only verify file existence without validating the content

Consider this improvement for both functions:

 def test_NSCLC_radiomicFeatureExtraction_output(nsclcMetadataPath):
     """Test output creation from radiomic feature extraction for SEG dataset"""
-    actual = radiomicFeatureExtraction(nsclcMetadataPath,
+    radiomicFeatureExtraction(nsclcMetadataPath,
                                    imageDirPath = "tests/",
                                    roiNames = None,
                                    outputDirPath = "tests/NSCLC_Radiogenomics/results/")
     expected_path = "tests/NSCLC_Radiogenomics/results/features/radiomicfeatures_original_NSCLC_Radiogenomics.csv"
-    assert os.path.exists(expected_path)
+    assert os.path.exists(expected_path), "Output file was not created"
+    df = pd.read_csv(expected_path)
+    assert not df.empty, "Output file is empty"
+    assert "original_shape_MeshVolume" in df.columns, "Expected features are missing"

 def test_4DLung_radiomicFeatureExtraction_output(lung4DMetadataPath):
     """Test output creation from radiomic feature extraction for RTSTRUCT dataset"""
-    actual = radiomicFeatureExtraction(lung4DMetadataPath,
+    radiomicFeatureExtraction(lung4DMetadataPath,
                                    imageDirPath = "tests/",
                                    roiNames = "Tumor_c40",
                                    outputDirPath = "tests/4D-Lung/results/")
     expected_path = "tests/4D-Lung/results/features/radiomicfeatures_original_4D-Lung.csv"
-    assert os.path.exists(expected_path)
+    assert os.path.exists(expected_path), "Output file was not created"
+    df = pd.read_csv(expected_path)
+    assert not df.empty, "Output file is empty"
+    assert "original_shape_MeshVolume" in df.columns, "Expected features are missing"
+    assert all(df["roi_name"] == "Tumor_c40"), "Unexpected ROI names in output"

🧰 Tools

🪛 Ruff (0.8.2)

117-117: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)

127-127: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)

tests/test_metadata.py (1)

2-2: LGTM! Good modernization using pathlib.Path

Using pathlib.Path instead of os.path is a better approach as it provides more robust cross-platform path handling and is the recommended way in modern Python.

tests/test_metadata.py

…emove them after tests for cleanliness

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

tests/test_feature_extraction.py (1)

49-63: Consider reducing code duplication in fixtures

Both fixtures share identical logic for file copying and cleanup. Consider extracting the common functionality into a helper function.

+def _setup_metadata_path(oldpath: str, newpath: str) -> str:
+    oldpath = Path(oldpath)
+    newpath = Path(newpath)
+    newpath.parent.mkdir(parents=True, exist_ok=True)
+    shutil.copy(oldpath, newpath)
+    return newpath

@pytest.fixture
def nsclcMetadataPath():
-    oldpath = Path("tests/output/ct_to_seg_match_list_NSCLC_Radiogenomics.csv")
-    newpath = Path("tests/NSCLC_Radiogenomics/procdata/ct_to_seg_match_list_NSCLC_Radiogenomics.csv")
-    newpath.parent.mkdir(parents=True, exist_ok=True)
-    shutil.copy(oldpath, newpath)
-    yield newpath.as_posix()
-    newpath.unlink()
+    newpath = _setup_metadata_path(
+        "tests/output/ct_to_seg_match_list_NSCLC_Radiogenomics.csv",
+        "tests/NSCLC_Radiogenomics/procdata/ct_to_seg_match_list_NSCLC_Radiogenomics.csv"
+    )
+    yield newpath.as_posix()
+    newpath.unlink()

@pytest.fixture
def lung4DMetadataPath():
-    oldpath = Path("tests/output/ct_to_seg_match_list_4D-Lung.csv")
-    newpath = Path("tests/4D-Lung/procdata/ct_to_seg_match_list_4D-Lung.csv")
-    newpath.parent.mkdir(parents=True, exist_ok=True)
-    shutil.copy(oldpath, newpath)
-    yield newpath.as_posix()
-    newpath.unlink()
+    newpath = _setup_metadata_path(
+        "tests/output/ct_to_seg_match_list_4D-Lung.csv",
+        "tests/4D-Lung/procdata/ct_to_seg_match_list_4D-Lung.csv"
+    )
+    yield newpath.as_posix()
+    newpath.unlink()

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f4cc548 and 94111ba.

⛔ Files ignored due to path filters (1)

pixi.lock is excluded by !**/*.lock

📒 Files selected for processing (1)

tests/test_feature_extraction.py (3 hunks)

🧰 Additional context used

🪛 Ruff (0.8.2)

tests/test_feature_extraction.py

129-129: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)

139-139: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)

🔇 Additional comments (1)

tests/test_feature_extraction.py (1)

127-143: 🛠️ Refactor suggestion

Enhance test coverage and fix unused variables

Both test functions have similar issues that need to be addressed:

The actual variable is assigned but never used
The tests only verify file existence without validating the content

Apply these improvements to both test functions:

 def test_NSCLC_radiomicFeatureExtraction_output(nsclcMetadataPath):
     """Test output creation from radiomic feature extraction for SEG dataset"""
-    actual = radiomicFeatureExtraction(nsclcMetadataPath,
+    radiomicFeatureExtraction(nsclcMetadataPath,
                                    imageDirPath = "tests/",
                                    roiNames = None,
                                    outputDirPath = "tests/NSCLC_Radiogenomics/results/")
     expected_path = "tests/NSCLC_Radiogenomics/results/features/radiomicfeatures_original_NSCLC_Radiogenomics.csv"
-    assert os.path.exists(expected_path)
+    assert os.path.exists(expected_path), "Output file was not created"
+    df = pd.read_csv(expected_path)
+    assert not df.empty, "Output file is empty"
+    assert "original_shape_MeshVolume" in df.columns, "Expected features are missing"

 def test_4DLung_radiomicFeatureExtraction_output(lung4DMetadataPath):
     """Test output creation from radiomic feature extraction for RTSTRUCT dataset"""
-    actual = radiomicFeatureExtraction(lung4DMetadataPath,
+    radiomicFeatureExtraction(lung4DMetadataPath,
                                    imageDirPath = "tests/",
                                    roiNames = "Tumor_c40",
                                    outputDirPath = "tests/4D-Lung/results/")
     expected_path = "tests/4D-Lung/results/features/radiomicfeatures_original_4D-Lung.csv"
-    assert os.path.exists(expected_path)
+    assert os.path.exists(expected_path), "Output file was not created"
+    df = pd.read_csv(expected_path)
+    assert not df.empty, "Output file is empty"
+    assert "original_shape_MeshVolume" in df.columns, "Expected features are missing"
+    assert all(df["roi_name"] == "Tumor_c40"), "Unexpected ROI names in output"

🧰 Tools

🪛 Ruff (0.8.2)

129-129: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)

139-139: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)

feat: add io/loaders module with features, general, and images specif…

fe6a77b

…ic loading functions

strixy16 marked this pull request as draft December 11, 2024 21:27

strixy16 added 4 commits December 12, 2024 09:58

test: test the loadImageDatasetConfig function

9010f03

feat: add READII config files for the two test datasets

08f8151

tests: add radiomic feature extraction output test for 4D lung so out…

1d3d96a

…put can be used for other tests

feat: test radiomic feature output for 4D Lung

35d4dbc

strixy16 marked this pull request as ready for review December 12, 2024 22:05

strixy16 requested a review from jjjermiah December 12, 2024 22:06

coderabbitai bot reviewed Dec 12, 2024

View reviewed changes

strixy16 added 5 commits December 12, 2024 18:15

test: add ct to seg match file for 4D Lung so tests work

c29d791

refactor: improve import in test_general

186851f

chore: ignore test outputs in new data structure

42f1370

chore: remove output feature file for 4D-lung

b2c61e4

Merge remote-tracking branch 'origin/main' into katys/add-io

4745b40

jjjermiah requested changes Dec 13, 2024

View reviewed changes

coderabbitai bot reviewed Dec 13, 2024

View reviewed changes

bhklab deleted a comment from coderabbitai bot Dec 13, 2024

strixy16 added 3 commits December 13, 2024 11:51

refactor: apply suggest updates for error handling, add logger

8386e71

refactor: made expected image types a fixture

e1aa3d8

build: readii updated to 1.20.0

9b87aae

strixy16 requested a review from jjjermiah December 13, 2024 16:56

coderabbitai bot reviewed Dec 13, 2024

View reviewed changes

src/readii/io/loaders/features.py Show resolved Hide resolved

strixy16 added 2 commits December 13, 2024 13:05

fix: update output paths for ct_to_seg files for new data arrangement

229663b

feat: update test output ignore to have procdata along with results f…

f4cc548

…or any dataset

coderabbitai bot reviewed Dec 13, 2024

View reviewed changes

tests/test_metadata.py Show resolved Hide resolved

jjjermiah approved these changes Dec 13, 2024

View reviewed changes

jjjermiah added 3 commits December 13, 2024 14:02

Merge branch 'main' into katys/add-io

0c7c45a

chore: update lockfile

11def04

feat: update test fixtures to copy metadata files to new paths, and r…

94111ba

…emove them after tests for cleanliness

coderabbitai bot reviewed Dec 13, 2024

View reviewed changes

jjjermiah merged commit 81fbd78 into main Dec 13, 2024
12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add io/loaders module #83

feat: add io/loaders module #83

strixy16 commented Dec 11, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 11, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

codecov bot commented Dec 11, 2024 •

edited

Loading

coderabbitai bot left a comment

coderabbitai bot Dec 12, 2024

jjjermiah left a comment

coderabbitai bot left a comment

coderabbitai bot Dec 13, 2024

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

feat: add io/loaders module #83

feat: add io/loaders module #83

Conversation

strixy16 commented Dec 11, 2024 • edited by coderabbitai bot Loading

Summary by CodeRabbit

coderabbitai bot commented Dec 11, 2024 • edited Loading

Walkthrough

Changes

Possibly related PRs

Suggested reviewers

Poem

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

codecov bot commented Dec 11, 2024 • edited Loading

Codecov Report

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Dec 12, 2024

Choose a reason for hiding this comment

jjjermiah left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Dec 13, 2024

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

strixy16 commented Dec 11, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 11, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

codecov bot commented Dec 11, 2024 •

edited

Loading