Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add io/loaders module #83

Merged
merged 19 commits into from
Dec 13, 2024
Merged

feat: add io/loaders module #83

merged 19 commits into from
Dec 13, 2024

Conversation

strixy16
Copy link
Collaborator

@strixy16 strixy16 commented Dec 11, 2024

Includes features, general, and images specific loading functions

Summary by CodeRabbit

  • New Features

    • Introduced a new module for loading various data types in the READII pipeline.
    • Added functions for loading imaging feature sets, dataset configurations, and data files into DataFrames.
    • New configuration files for the 4D-Lung and NSCLC_Radiogenomics datasets created.
  • Bug Fixes

    • Enhanced error handling for file loading functions.
  • Tests

    • Added unit tests for configuration loading and feature extraction validation.
    • Updated existing tests for clarity and specificity.
    • Introduced new fixtures to support testing.
  • Chores

    • Updated .gitignore to exclude test output files.

@strixy16 strixy16 marked this pull request as draft December 11, 2024 21:27
Copy link
Contributor

coderabbitai bot commented Dec 11, 2024

Walkthrough

This pull request introduces a new module for loading various data types within the READII pipeline. It includes multiple new functions for loading imaging feature sets, dataset configurations, and data files into pandas DataFrames. Additionally, two new YAML configuration files for specific datasets are added, along with new tests to validate the functionality of the loading functions and configuration handling.

Changes

File Path Change Summary
src/readii/io/loaders/__init__.py New module created; functions added to __all__: loadFeatureFilesFromImageTypes, loadFileToDataFrame, loadImageDatasetConfig, getImageTypesFromDirectory.
src/readii/io/loaders/features.py Method added: loadFeatureFilesFromImageTypes(extracted_feature_dir: str, image_types: list, drop_labels: Optional[bool] = True, labels_to_drop: Optional[list] = None) -> Dict[str, pd.DataFrame].
src/readii/io/loaders/general.py Methods updated: `loadImageDatasetConfig(dataset_name: str, config_dir_path: str
src/readii/io/loaders/images.py Method added: `getImageTypesFromDirectory(raw_data_dir: Union[Path
tests/4D-Lung/4D-Lung.yaml New configuration file created outlining parameters for the 4D-Lung dataset.
tests/NSCLC_Radiogenomics/NSCLC_Radiogenomics.yaml New configuration file created for the NSCLC_Radiogenomics dataset with defined sections for clinical variables, exclusion criteria, and image types.
tests/io/loaders/test_general.py New test file created with unit tests for loadImageDatasetConfig; tests validate loading for "NSCLC_Radiogenomics" and "4D-Lung" datasets.
tests/test_feature_extraction.py New fixture added: lung4DMetadataPath; method renamed: test_radiomicFeatureExtraction_output to test_NSCLC_radiomicFeatureExtraction_output; new test method added: test_4DLung_radiomicFeatureExtraction_output.
.gitignore New entries added to ignore test output files: tests/*/procdata/*, tests/*/results/*.

Possibly related PRs

Suggested reviewers

  • strixy16

Poem

In the meadow where data flows,
New paths for features and datasets grow.
With YAML files and tests in tow,
The READII pipeline's ready to glow!
Hops of joy, let the coding spree,
For every rabbit, it's a data jubilee! 🐇✨


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Experiment)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

codecov bot commented Dec 11, 2024

Codecov Report

Attention: Patch coverage is 30.92784% with 67 lines in your changes missing coverage. Please review.

Project coverage is 49.04%. Comparing base (7228770) to head (94111ba).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/readii/io/loaders/features.py 18.18% 27 Missing ⚠️
src/readii/io/loaders/general.py 39.53% 26 Missing ⚠️
src/readii/io/loaders/images.py 17.64% 14 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #83      +/-   ##
==========================================
- Coverage   50.89%   49.04%   -1.86%     
==========================================
  Files          21       25       +4     
  Lines         894      991      +97     
==========================================
+ Hits          455      486      +31     
- Misses        439      505      +66     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@strixy16 strixy16 marked this pull request as ready for review December 12, 2024 22:05
@strixy16 strixy16 requested a review from jjjermiah December 12, 2024 22:06
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Outside diff range and nitpick comments (7)
src/readii/io/loaders/features.py (2)

23-25: Avoid referencing notebooks in docstrings

Referencing data_setup_for_modeling.ipynb in the docstring may not be helpful for users who are not familiar with that notebook. Consider providing a clear description without external references.

Apply this diff to improve the docstring:

 labels_to_drop : list, optional
     List of labels to drop from the dataframes. The default is ["patient_ID", "survival_time_in_years", "survival_event_binary"].
-    based on code in data_setup_for_modeling.ipynb.

70-70: Remove unused variable e in exception handling

The variable e is assigned but not used in the except clause. You can remove as e to clean up the code.

Apply this diff to fix the issue:

-    except KeyError as e:
+    except KeyError:
🧰 Tools
🪛 Ruff (0.8.2)

70-70: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

tests/io/loaders/test_general.py (1)

23-24: Use is for comparison to None

Comparisons to None should use is or is not to check for identity.

Apply this diff to fix the comparisons:

-    assert config["outcome_variables"]["event_label"] == None
-    assert config["outcome_variables"]["event_value_mapping"] == None
+    assert config["outcome_variables"]["event_label"] is None
+    assert config["outcome_variables"]["event_value_mapping"] is None
🧰 Tools
🪛 Ruff (0.8.2)

23-23: Comparison to None should be cond is None

Replace with cond is None

(E711)


24-24: Comparison to None should be cond is None

Replace with cond is None

(E711)

src/readii/io/loaders/general.py (1)

39-39: Use raise ... from ... to preserve exception context

When re-raising exceptions in an except block, use raise ... from e to maintain the original traceback and provide better debugging information.

Apply this diff to improve exception handling:

 except yaml.YAMLError as e:
-    raise ValueError(f"Invalid YAML in config file: {e}")  
+    raise ValueError(f"Invalid YAML in config file: {e}") from e

 # ...

 except pd.errors.EmptyDataError:
-    raise ValueError("File is empty")
+    raise ValueError("File is empty") from None

 except (pd.errors.ParserError, ValueError) as e:
-    raise ValueError(f"Error parsing file: {e}")
+    raise ValueError(f"Error parsing file: {e}") from e

Also applies to: 79-79, 81-81

🧰 Tools
🪛 Ruff (0.8.2)

39-39: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

tests/4D-Lung/4D-Lung.yaml (1)

20-20: Fix YAML formatting for better readability

The image types list needs proper spacing after commas.

-image_types: ["original", "shuffled_full","shuffled_roi","shuffled_non_roi","randomized_sampled_full","randomized_sampled_roi","randomized_sampled_non_roi"]
+image_types: ["original", "shuffled_full", "shuffled_roi", "shuffled_non_roi", "randomized_sampled_full", "randomized_sampled_roi", "randomized_sampled_non_roi"]
🧰 Tools
🪛 yamllint (1.35.1)

[warning] 20-20: too few spaces after comma

(commas)


[warning] 20-20: too few spaces after comma

(commas)


[warning] 20-20: too few spaces after comma

(commas)


[warning] 20-20: too few spaces after comma

(commas)


[warning] 20-20: too few spaces after comma

(commas)

tests/NSCLC_Radiogenomics/NSCLC_Radiogenomics.yaml (2)

20-20: Fix YAML formatting for consistency

Similar to the 4D-Lung config, the image types list needs proper spacing.

-image_types: ["original", "shuffled_full","shuffled_roi","shuffled_non_roi","randomized_sampled_full","randomized_sampled_roi","randomized_sampled_non_roi"]
+image_types: ["original", "shuffled_full", "shuffled_roi", "shuffled_non_roi", "randomized_sampled_full", "randomized_sampled_roi", "randomized_sampled_non_roi"]
🧰 Tools
🪛 yamllint (1.35.1)

[warning] 20-20: too few spaces after comma

(commas)


[warning] 20-20: too few spaces after comma

(commas)


[warning] 20-20: too few spaces after comma

(commas)


[warning] 20-20: too few spaces after comma

(commas)


[warning] 20-20: too few spaces after comma

(commas)


12-17: Document empty configuration sections

The exclusion_variables and train_test_split sections are empty. Consider adding comments to explain why these are empty or if they will be populated in the future.

-exclusion_variables: 
+exclusion_variables: # No exclusion criteria defined for this dataset
 
 train_test_split:
     split: False
-    split_variable: 
-    impute: 
+    split_variable: # Not applicable when split is False
+    impute: # Not applicable when split is False
🧰 Tools
🪛 yamllint (1.35.1)

[error] 12-12: trailing spaces

(trailing-spaces)


[error] 16-16: trailing spaces

(trailing-spaces)


[error] 17-17: trailing spaces

(trailing-spaces)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0e868f6 and 35d4dbc.

⛔ Files ignored due to path filters (1)
  • tests/4D-Lung/results/features/radiomicfeatures_original_4D-Lung.csv is excluded by !**/*.csv
📒 Files selected for processing (8)
  • src/readii/io/loaders/__init__.py (1 hunks)
  • src/readii/io/loaders/features.py (1 hunks)
  • src/readii/io/loaders/general.py (1 hunks)
  • src/readii/io/loaders/images.py (1 hunks)
  • tests/4D-Lung/4D-Lung.yaml (1 hunks)
  • tests/NSCLC_Radiogenomics/NSCLC_Radiogenomics.yaml (1 hunks)
  • tests/io/loaders/test_general.py (1 hunks)
  • tests/test_feature_extraction.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • src/readii/io/loaders/init.py
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/io/loaders/test_general.py

1-1: from readii.io.loaders.general import * used; unable to detect undefined names

(F403)


13-13: loadImageDatasetConfig may be undefined, or defined from star imports

(F405)


20-20: loadImageDatasetConfig may be undefined, or defined from star imports

(F405)


23-23: Comparison to None should be cond is None

Replace with cond is None

(E711)


24-24: Comparison to None should be cond is None

Replace with cond is None

(E711)

src/readii/io/loaders/general.py

39-39: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


79-79: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


81-81: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

src/readii/io/loaders/features.py

70-70: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

tests/test_feature_extraction.py

117-117: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)


127-127: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)

🪛 yamllint (1.35.1)
tests/4D-Lung/4D-Lung.yaml

[error] 2-2: trailing spaces

(trailing-spaces)


[error] 7-7: trailing spaces

(trailing-spaces)


[error] 8-8: trailing spaces

(trailing-spaces)


[error] 10-10: trailing spaces

(trailing-spaces)


[error] 12-12: trailing spaces

(trailing-spaces)


[error] 16-16: trailing spaces

(trailing-spaces)


[error] 17-17: trailing spaces

(trailing-spaces)


[warning] 20-20: too few spaces after comma

(commas)


[warning] 20-20: too few spaces after comma

(commas)


[warning] 20-20: too few spaces after comma

(commas)


[warning] 20-20: too few spaces after comma

(commas)


[warning] 20-20: too few spaces after comma

(commas)

tests/NSCLC_Radiogenomics/NSCLC_Radiogenomics.yaml

[error] 2-2: trailing spaces

(trailing-spaces)


[error] 12-12: trailing spaces

(trailing-spaces)


[error] 16-16: trailing spaces

(trailing-spaces)


[error] 17-17: trailing spaces

(trailing-spaces)


[warning] 20-20: too few spaces after comma

(commas)


[warning] 20-20: too few spaces after comma

(commas)


[warning] 20-20: too few spaces after comma

(commas)


[warning] 20-20: too few spaces after comma

(commas)


[warning] 20-20: too few spaces after comma

(commas)

🔇 Additional comments (2)
tests/test_feature_extraction.py (1)

49-51: LGTM: New fixture for 4D Lung metadata path

The fixture is well-defined and follows the same pattern as the existing nsclcMetadataPath fixture.

tests/NSCLC_Radiogenomics/NSCLC_Radiogenomics.yaml (1)

6-10: LGTM: Well-defined outcome variables

The outcome variables section is properly configured with clear mappings for survival status.

src/readii/io/loaders/features.py Outdated Show resolved Hide resolved
tests/io/loaders/test_general.py Outdated Show resolved Hide resolved
src/readii/io/loaders/images.py Show resolved Hide resolved
Comment on lines 112 to 123
def test_NSCLC_radiomicFeatureExtraction_output(nsclcMetadataPath):
"""Test output creation from radiomic feature extraction for SEG dataset"""
actual = radiomicFeatureExtraction(nsclcMetadataPath,
imageDirPath = "tests/",
roiNames = None,
outputDirPath = "tests/output/")
outputDirPath = "tests/NSCLC_Radiogenomics/results/")
expected_path = "tests/output/features/radiomicfeatures_original_NSCLC_Radiogenomics.csv"
assert os.path.exists(expected_path) No newline at end of file
assert os.path.exists(expected_path)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Fix unused variable warning and consider adding assertions

The test function has been renamed for better clarity, but there are two issues to address:

  1. The actual variable is assigned but never used
  2. The test only verifies file existence without checking the content

Consider this improvement:

 def test_NSCLC_radiomicFeatureExtraction_output(nsclcMetadataPath):
     """Test output creation from radiomic feature extraction for SEG dataset"""
-    actual = radiomicFeatureExtraction(nsclcMetadataPath,
+    radiomicFeatureExtraction(nsclcMetadataPath,
                                    imageDirPath = "tests/",
                                    roiNames = None,
                                    outputDirPath = "tests/NSCLC_Radiogenomics/results/")
     expected_path = "tests/output/features/radiomicfeatures_original_NSCLC_Radiogenomics.csv"
-    assert os.path.exists(expected_path)
+    assert os.path.exists(expected_path), "Output file was not created"
+    # Consider adding assertions to verify the content of the output file
+    df = pd.read_csv(expected_path)
+    assert not df.empty, "Output file is empty"
+    assert "original_shape_MeshVolume" in df.columns, "Expected features are missing"

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.8.2)

117-117: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)

tests/test_feature_extraction.py Outdated Show resolved Hide resolved
tests/4D-Lung/4D-Lung.yaml Show resolved Hide resolved
Copy link
Contributor

@jjjermiah jjjermiah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

solid, have a PR at #87 for with some minor changes for one of the files and a couple changes in the feature file

generally try to use pathlib over os.path to make reading easier

src/readii/io/loaders/features.py Outdated Show resolved Hide resolved
src/readii/io/loaders/features.py Outdated Show resolved Hide resolved
src/readii/io/loaders/features.py Outdated Show resolved Hide resolved
src/readii/io/loaders/images.py Show resolved Hide resolved
tests/4D-Lung/4D-Lung.yaml Show resolved Hide resolved
tests/4D-Lung/4D-Lung.yaml Show resolved Hide resolved
…87)

these are mostly just qc stuff from ruff 

- pathlib over os.path 
- formatted docstrings (if looking to add these functions on the mkdocs,
the numpydoc parser is not really forgiving)
- exception handling is a lot easier to debug when you pass exception
context back to the stack trace with `raise .... from e`
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (5)
tests/io/loaders/test_general.py (3)

4-10: Use path utilities for test directories

The hardcoded paths in fixtures could make tests fragile. Consider using pathlib or os.path utilities to construct paths relative to the test file location.

Here's a more robust implementation:

+import os
+from pathlib import Path
+
 @pytest.fixture
 def nsclcConfigDirPath():
-    return "tests/NSCLC_Radiogenomics"
+    return str(Path(__file__).parent.parent.parent / "NSCLC_Radiogenomics")
 
 @pytest.fixture
 def lung4DConfigDirPath():
-    return "tests/4D-Lung"
+    return str(Path(__file__).parent.parent.parent / "4D-Lung")

12-17: Improve test readability and coverage

The test verifies the happy path but could be improved in two ways:

  1. Break down the long image_types list for better readability
  2. Add error case testing

Here's how to improve readability:

 def test_NSCLC_loadImageDatasetConfig(nsclcConfigDirPath):
     config = loadImageDatasetConfig("NSCLC_Radiogenomics", nsclcConfigDirPath)
+    expected_image_types = [
+        "original",
+        "shuffled_full",
+        "shuffled_roi",
+        "shuffled_non_roi",
+        "randomized_sampled_full",
+        "randomized_sampled_roi",
+        "randomized_sampled_non_roi"
+    ]
     assert config["dataset_name"] == "NSCLC_Radiogenomics"
-    assert config["image_types"] == ["original", "shuffled_full","shuffled_roi","shuffled_non_roi","randomized_sampled_full","randomized_sampled_roi","randomized_sampled_non_roi"]
+    assert config["image_types"] == expected_image_types
     assert config["outcome_variables"]["event_label"] == "Survival Status"
     assert config["outcome_variables"]["event_value_mapping"] == {'Alive': 0, 'Dead': 1}

Would you like me to help generate additional test cases for error scenarios?


1-24: Reduce code duplication by sharing image types list

The same image_types list is duplicated in both test functions. Consider moving it to a shared fixture or constant.

Here's how to implement this:

 from readii.io.loaders.general import loadImageDatasetConfig
 import pytest
 
+@pytest.fixture
+def expected_image_types():
+    return [
+        "original",
+        "shuffled_full",
+        "shuffled_roi",
+        "shuffled_non_roi",
+        "randomized_sampled_full",
+        "randomized_sampled_roi",
+        "randomized_sampled_non_roi"
+    ]
+
 @pytest.fixture
 def nsclcConfigDirPath():
     return "tests/NSCLC_Radiogenomics"

Then update both test functions to use this fixture:

def test_NSCLC_loadImageDatasetConfig(nsclcConfigDirPath, expected_image_types):
    config = loadImageDatasetConfig("NSCLC_Radiogenomics", nsclcConfigDirPath)
    assert config["dataset_name"] == "NSCLC_Radiogenomics"
    assert config["image_types"] == expected_image_types
    # ... rest of the test

def test_lung4D_loadImageDatasetConfig(lung4DConfigDirPath, expected_image_types):
    config = loadImageDatasetConfig("4D-Lung", lung4DConfigDirPath)
    assert config["dataset_name"] == "4D-Lung"
    assert config["image_types"] == expected_image_types
    # ... rest of the test
🧰 Tools
🪛 Ruff (0.8.2)

23-23: Comparison to None should be cond is None

Replace with cond is None

(E711)


24-24: Comparison to None should be cond is None

Replace with cond is None

(E711)

src/readii/io/loaders/general.py (2)

7-15: Enhance exception classes with more descriptive docstrings and default messages.

While custom exceptions are good practice, they could be more informative.

Consider this enhancement:

 class ConfigError(Exception):
-    """Base class for errors in the config module."""
+    """Exception raised for errors during configuration loading and parsing.
+    
+    This exception is raised when there are issues with loading or parsing
+    configuration files, such as invalid YAML syntax or empty configurations.
+    """
-    pass
+    def __init__(self, message="Configuration error occurred"):
+        super().__init__(message)

 class DataFrameLoadError(Exception):
-    """Custom exception for DataFrame loading errors."""
+    """Exception raised for errors during DataFrame loading operations.
+    
+    This exception is raised when there are issues with loading data into
+    DataFrames, such as empty files or parsing errors.
+    """
-    pass
+    def __init__(self, message="Error occurred while loading DataFrame"):
+        super().__init__(message)

91-96: Enhance error handling with more specific error messages.

The current error handling could provide more detailed information about the nature of the error.

Consider this enhancement:

     except pd.errors.EmptyDataError as e:
-        raise DataFrameLoadError("File is empty") from e
+        raise DataFrameLoadError(f"File {file_path} is empty or contains no data") from e

     except (pd.errors.ParserError, ValueError) as e:
-        raise DataFrameLoadError("Error parsing file") from e
+        raise DataFrameLoadError(f"Error parsing file {file_path}: {str(e)}") from e
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35d4dbc and 89bc370.

⛔ Files ignored due to path filters (1)
  • tests/output/ct_to_seg_match_list_4D-Lung.csv is excluded by !**/*.csv
📒 Files selected for processing (3)
  • .gitignore (1 hunks)
  • src/readii/io/loaders/general.py (1 hunks)
  • tests/io/loaders/test_general.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .gitignore
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/io/loaders/test_general.py

23-23: Comparison to None should be cond is None

Replace with cond is None

(E711)


24-24: Comparison to None should be cond is None

Replace with cond is None

(E711)

🔇 Additional comments (1)
src/readii/io/loaders/general.py (1)

1-5: LGTM! Clean and well-organized imports.

The imports follow good practices with standard library imports first, followed by third-party imports. All imports are necessary for the implemented functionality.

tests/io/loaders/test_general.py Outdated Show resolved Hide resolved
src/readii/io/loaders/general.py Show resolved Hide resolved
src/readii/io/loaders/general.py Show resolved Hide resolved
Comment on lines +82 to +90
try:
if file_extension == '.xlsx':
df = pd.read_excel(file_path)
elif file_extension == '.csv':
df = pd.read_csv(file_path)
else:
msg = f"Unsupported file format {file_extension}. Please provide a .csv or .xlsx file."
raise ValueError(msg)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Based on the gathered information, I can now provide the final response:

Add validation for potential Excel file format exploits

The current implementation only handles basic file extension checks and empty file errors, but lacks specific Excel format validation. Add the following security measures:

  • Use openpyxl engine explicitly with read_only=True to prevent formula execution
  • Add file size limits to prevent memory exhaustion attacks
  • Validate the Excel file structure before processing

Example locations that need attention:

  • src/readii/io/loaders/general.py:84: Update pd.read_excel() call with security parameters
🔗 Analysis chain

Add validation for Excel file format.

The current implementation doesn't validate Excel file formats, which could lead to security issues.

Let's verify the Excel file handling in the codebase:


Let me gather more context about the Excel file handling implementation.


Let me check for any additional error handling or validation mechanisms in the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for Excel file handling patterns
rg -l "\.xlsx?" .

# Check for potential Excel-related security measures
ast-grep --pattern 'pd.read_excel($$$)' 

Length of output: 199


Script:

#!/bin/bash
# Check for any Excel-related validation or security measures
rg -A 5 -B 5 "\.xlsx" .

# Look for any Excel-related error handling
rg -A 5 -B 5 "read_excel" .

# Check for any custom Excel validation functions
ast-grep --pattern 'def $_($$$) {
  $$$
  xlsx
  $$$
}'

Length of output: 2634


Script:

#!/bin/bash
# Check for custom exceptions and error handling
rg -A 10 "class.*Error" .

# Look for any validation utilities
rg -A 5 "validate|check|verify" .

# Check the full context of the file loading function
ast-grep --pattern 'def load_$_($$$) {
  $$$
}'

Length of output: 18408

@bhklab bhklab deleted a comment from coderabbitai bot Dec 13, 2024
@bhklab bhklab deleted a comment from coderabbitai bot Dec 13, 2024
@strixy16 strixy16 requested a review from jjjermiah December 13, 2024 16:56
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)
src/readii/io/loaders/features.py (4)

11-33: Fix docstring inconsistency with function signature.

The docstring indicates that image_types is optional with a default value of ['original'], but the parameter is required in the function signature. Additionally, consider using consistent type references between docstring and type hints (Dict vs dict).

Apply this diff to fix the docstring:

-    image_types : list, optional
-        List of image types to load in. The default is ['original'].
+    image_types : list
+        List of image types to load in.

34-40: Consider making default labels configurable.

The hardcoded default labels (["patient_ID","survival_time_in_years","survival_event_binary"]) are specific to survival analysis. Consider moving these to a configuration file or module-level constant to make them more maintainable and reusable for different types of analyses.


56-59: Include exception details in warning message.

The caught exception variable e is unused. Consider including its details in the warning message for better debugging.

Apply this diff:

-            logger.warning(f"No {image_type} feature csv files found in {extracted_feature_dir}")
+            logger.warning(f"No {image_type} feature csv files found in {extracted_feature_dir}: {str(e)}")
🧰 Tools
🪛 Ruff (0.8.2)

56-56: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


73-76: Include exception details in warning message.

Similarly, include the KeyError details in the warning message.

Apply this diff:

-            logger.warning(f"{feature_file_path} does not have the labels {labels_to_drop} to drop.")
+            logger.warning(f"{feature_file_path} does not have the labels {labels_to_drop} to drop: {str(e)}")
🧰 Tools
🪛 Ruff (0.8.2)

73-73: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 89bc370 and 9b87aae.

⛔ Files ignored due to path filters (1)
  • pixi.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • src/readii/io/loaders/features.py (1 hunks)
  • tests/io/loaders/test_general.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/io/loaders/test_general.py
🧰 Additional context used
🪛 Ruff (0.8.2)
src/readii/io/loaders/features.py

56-56: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


73-73: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

🔇 Additional comments (3)
src/readii/io/loaders/features.py (3)

1-10: LGTM! Well-organized imports.

The imports are properly organized and all imported modules are utilized in the code.


41-46: LGTM! Proper directory validation.

The code appropriately validates the directory existence and efficiently retrieves the file list once.


81-85: LGTM! Proper final validation.

The code appropriately validates that at least one feature set was loaded before returning.

src/readii/io/loaders/features.py Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
tests/test_metadata.py (1)

52-66: Consider enhancing test parameterization and path handling

While the parameterization is a good start, there are several potential improvements:

  1. Add more test cases to fully utilize parameterization
  2. Use temporary directories for output files
  3. Add cleanup after test execution

Here's a suggested improvement:

 @pytest.mark.parametrize(
     "summary_file_path, modality, output_file_path",
     [
-        ("nsclcSummaryFilePath","SEG","tests/NSCLC_Radiogenomics/procdata/ct_to_seg_match_list_NSCLC_Radiogenomics.csv")
+        ("nsclcSummaryFilePath", "SEG", "ct_to_seg_match_list_NSCLC_Radiogenomics.csv"),
+        ("lung4DSummaryFilePath", "RTSTRUCT", "ct_to_seg_match_list_4D-Lung.csv")
     ]
 )
-def test_matchCTtoSegmentation_output(summary_file_path, modality, output_file_path, request):
+def test_matchCTtoSegmentation_output(summary_file_path, modality, output_file_path, request, tmp_path):
     """Test saving output of summary file"""
     summary_file_path = request.getfixturevalue(summary_file_path)
+    output_path = tmp_path / output_file_path
 
     matchCTtoSegmentation(summary_file_path, 
                           segType = modality,
-                          outputFilePath = output_file_path)
+                          outputFilePath = str(output_path))
     
-    assert Path(output_file_path).exists(), \
+    assert output_path.exists(), \
         "Output does not exist, double check output file is named correctly."
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b87aae and f4cc548.

📒 Files selected for processing (3)
  • .gitignore (1 hunks)
  • tests/test_feature_extraction.py (2 hunks)
  • tests/test_metadata.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .gitignore
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/test_feature_extraction.py

117-117: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)


127-127: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)

🔇 Additional comments (3)
tests/test_feature_extraction.py (2)

47-51: LGTM: Consistent fixture paths

The metadata path fixtures follow a consistent pattern and structure within the test directory.


115-132: 🛠️ Refactor suggestion

Fix unused variables and enhance test coverage

Both test functions have similar issues that need addressing:

  1. The actual variables are assigned but never used
  2. The tests only verify file existence without validating the content

Consider this improvement for both functions:

 def test_NSCLC_radiomicFeatureExtraction_output(nsclcMetadataPath):
     """Test output creation from radiomic feature extraction for SEG dataset"""
-    actual = radiomicFeatureExtraction(nsclcMetadataPath,
+    radiomicFeatureExtraction(nsclcMetadataPath,
                                    imageDirPath = "tests/",
                                    roiNames = None,
                                    outputDirPath = "tests/NSCLC_Radiogenomics/results/")
     expected_path = "tests/NSCLC_Radiogenomics/results/features/radiomicfeatures_original_NSCLC_Radiogenomics.csv"
-    assert os.path.exists(expected_path)
+    assert os.path.exists(expected_path), "Output file was not created"
+    df = pd.read_csv(expected_path)
+    assert not df.empty, "Output file is empty"
+    assert "original_shape_MeshVolume" in df.columns, "Expected features are missing"

 def test_4DLung_radiomicFeatureExtraction_output(lung4DMetadataPath):
     """Test output creation from radiomic feature extraction for RTSTRUCT dataset"""
-    actual = radiomicFeatureExtraction(lung4DMetadataPath,
+    radiomicFeatureExtraction(lung4DMetadataPath,
                                    imageDirPath = "tests/",
                                    roiNames = "Tumor_c40",
                                    outputDirPath = "tests/4D-Lung/results/")
     expected_path = "tests/4D-Lung/results/features/radiomicfeatures_original_4D-Lung.csv"
-    assert os.path.exists(expected_path)
+    assert os.path.exists(expected_path), "Output file was not created"
+    df = pd.read_csv(expected_path)
+    assert not df.empty, "Output file is empty"
+    assert "original_shape_MeshVolume" in df.columns, "Expected features are missing"
+    assert all(df["roi_name"] == "Tumor_c40"), "Unexpected ROI names in output"
🧰 Tools
🪛 Ruff (0.8.2)

117-117: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)


127-127: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)

tests/test_metadata.py (1)

2-2: LGTM! Good modernization using pathlib.Path

Using pathlib.Path instead of os.path is a better approach as it provides more robust cross-platform path handling and is the recommended way in modern Python.

tests/test_metadata.py Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
tests/test_feature_extraction.py (1)

49-63: Consider reducing code duplication in fixtures

Both fixtures share identical logic for file copying and cleanup. Consider extracting the common functionality into a helper function.

+def _setup_metadata_path(oldpath: str, newpath: str) -> str:
+    oldpath = Path(oldpath)
+    newpath = Path(newpath)
+    newpath.parent.mkdir(parents=True, exist_ok=True)
+    shutil.copy(oldpath, newpath)
+    return newpath

@pytest.fixture
def nsclcMetadataPath():
-    oldpath = Path("tests/output/ct_to_seg_match_list_NSCLC_Radiogenomics.csv")
-    newpath = Path("tests/NSCLC_Radiogenomics/procdata/ct_to_seg_match_list_NSCLC_Radiogenomics.csv")
-    newpath.parent.mkdir(parents=True, exist_ok=True)
-    shutil.copy(oldpath, newpath)
-    yield newpath.as_posix()
-    newpath.unlink()
+    newpath = _setup_metadata_path(
+        "tests/output/ct_to_seg_match_list_NSCLC_Radiogenomics.csv",
+        "tests/NSCLC_Radiogenomics/procdata/ct_to_seg_match_list_NSCLC_Radiogenomics.csv"
+    )
+    yield newpath.as_posix()
+    newpath.unlink()

@pytest.fixture
def lung4DMetadataPath():
-    oldpath = Path("tests/output/ct_to_seg_match_list_4D-Lung.csv")
-    newpath = Path("tests/4D-Lung/procdata/ct_to_seg_match_list_4D-Lung.csv")
-    newpath.parent.mkdir(parents=True, exist_ok=True)
-    shutil.copy(oldpath, newpath)
-    yield newpath.as_posix()
-    newpath.unlink()
+    newpath = _setup_metadata_path(
+        "tests/output/ct_to_seg_match_list_4D-Lung.csv",
+        "tests/4D-Lung/procdata/ct_to_seg_match_list_4D-Lung.csv"
+    )
+    yield newpath.as_posix()
+    newpath.unlink()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f4cc548 and 94111ba.

⛔ Files ignored due to path filters (1)
  • pixi.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • tests/test_feature_extraction.py (3 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/test_feature_extraction.py

129-129: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)


139-139: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)

🔇 Additional comments (1)
tests/test_feature_extraction.py (1)

127-143: 🛠️ Refactor suggestion

Enhance test coverage and fix unused variables

Both test functions have similar issues that need to be addressed:

  1. The actual variable is assigned but never used
  2. The tests only verify file existence without validating the content

Apply these improvements to both test functions:

 def test_NSCLC_radiomicFeatureExtraction_output(nsclcMetadataPath):
     """Test output creation from radiomic feature extraction for SEG dataset"""
-    actual = radiomicFeatureExtraction(nsclcMetadataPath,
+    radiomicFeatureExtraction(nsclcMetadataPath,
                                    imageDirPath = "tests/",
                                    roiNames = None,
                                    outputDirPath = "tests/NSCLC_Radiogenomics/results/")
     expected_path = "tests/NSCLC_Radiogenomics/results/features/radiomicfeatures_original_NSCLC_Radiogenomics.csv"
-    assert os.path.exists(expected_path)
+    assert os.path.exists(expected_path), "Output file was not created"
+    df = pd.read_csv(expected_path)
+    assert not df.empty, "Output file is empty"
+    assert "original_shape_MeshVolume" in df.columns, "Expected features are missing"

 def test_4DLung_radiomicFeatureExtraction_output(lung4DMetadataPath):
     """Test output creation from radiomic feature extraction for RTSTRUCT dataset"""
-    actual = radiomicFeatureExtraction(lung4DMetadataPath,
+    radiomicFeatureExtraction(lung4DMetadataPath,
                                    imageDirPath = "tests/",
                                    roiNames = "Tumor_c40",
                                    outputDirPath = "tests/4D-Lung/results/")
     expected_path = "tests/4D-Lung/results/features/radiomicfeatures_original_4D-Lung.csv"
-    assert os.path.exists(expected_path)
+    assert os.path.exists(expected_path), "Output file was not created"
+    df = pd.read_csv(expected_path)
+    assert not df.empty, "Output file is empty"
+    assert "original_shape_MeshVolume" in df.columns, "Expected features are missing"
+    assert all(df["roi_name"] == "Tumor_c40"), "Unexpected ROI names in output"
🧰 Tools
🪛 Ruff (0.8.2)

129-129: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)


139-139: Local variable actual is assigned to but never used

Remove assignment to unused variable actual

(F841)

@jjjermiah jjjermiah merged commit 81fbd78 into main Dec 13, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants