feat: integrate analyze readii outputs functions #79
Conversation
…ze_readii_outputs
…mpy for seaborn install to work
…o plot them as a heatmap
…nds dataframe based on a column name
…date test to use this
…te-analyze_readii_outputs
…ion and to split data by column values like for train and test
Walkthrough
The changes in this pull request primarily involve the addition of new functionality and dependencies to the project.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant User
    participant CorrelationModule
    participant DataFrame
    participant VisualizationModule
    User->>CorrelationModule: getFeatureCorrelations()
    CorrelationModule->>DataFrame: Compute correlation matrix
    DataFrame-->>CorrelationModule: Return correlation matrix
    CorrelationModule->>VisualizationModule: plotCorrelationHeatmap()
    VisualizationModule-->>User: Display heatmap
```
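For orientation, a minimal sketch of that flow in Python follows. The keyword-argument names are assumptions for illustration and are not confirmed by this PR; only the function names and the test import path appear in the review below:

```python
# Hypothetical usage sketch of the sequence above; argument names are assumed.
import numpy as np
import pandas as pd

from readii.analyze.correlation import getFeatureCorrelations, plotCorrelationHeatmap

rng = np.random.default_rng(0)
vertical = pd.DataFrame(rng.normal(size=(20, 3)), columns=["f1", "f2", "f3"])
horizontal = pd.DataFrame(rng.normal(size=(20, 3)), columns=["g1", "g2", "g3"])

# Compute the correlation matrix, then plot it as a heatmap
corr = getFeatureCorrelations(vertical_features=vertical,
                              horizontal_features=horizontal,
                              method="pearson")
plotCorrelationHeatmap(corr, triangle="lower")
```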
Actionable comments posted: 9
🧹 Outside diff range and nitpick comments (13)
src/readii/io/loaders/features.py (1)

62-62: Remove unused variable `e` in exception handling

The variable `e` is assigned but never used in the `except` block. If you don't need the exception object, you can omit it. Apply this diff to address the issue:

```diff
- except Exception as e:
+ except Exception:
```

Alternatively, if you intend to use the exception message, include it in your error handling.
🧰 Tools
🪛 Ruff (0.8.0)
62-62: Local variable `e` is assigned to but never used. Remove assignment to unused variable `e` (F841)
src/readii/data/split.py (1)

29-29: Simplify dictionary iteration by removing unnecessary `.keys()`

When iterating over a dictionary, you can iterate directly over the keys without calling `.keys()`. This is more Pythonic and slightly more efficient. Apply this diff to simplify the code:

```diff
- for new_value in replacement_value_data.keys():
+ for new_value in replacement_value_data:
- for split_column_name in split_col_data.keys():
+ for split_column_name in split_col_data:
```

Also applies to: 65-65
🧰 Tools
🪛 Ruff (0.8.0)
29-29: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
src/readii/data/select.py (1)

78-78: Simplify dictionary iteration by removing unnecessary `.keys()`

When iterating over a dictionary, iterate directly over the keys without using `.keys()`. This improves readability. Apply this diff to enhance the code:

```diff
- for key in include_col_values.keys():
+ for key in include_col_values:
- for key in exclude_col_values.keys():
+ for key in exclude_col_values:
```

Also applies to: 85-85
🧰 Tools
🪛 Ruff (0.8.0)
78-78: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
src/readii/analyze/correlation.py (2)

44-45: Follow PEP 8 style guidelines: one statement per line

Having multiple statements on one line can reduce code readability. It's recommended to place each statement on its own line. Apply this diff to adhere to style guidelines:

```diff
- if vertical_feature_name: vertical_feature_name = f"_{vertical_feature_name}"
- if horizontal_feature_name: horizontal_feature_name = f"_{horizontal_feature_name}"
+ if vertical_feature_name:
+     vertical_feature_name = f"_{vertical_feature_name}"
+ if horizontal_feature_name:
+     horizontal_feature_name = f"_{horizontal_feature_name}"
```

🧰 Tools
🪛 Ruff (0.8.0)
44-44: Multiple statements on one line (colon) (E701)
45-45: Multiple statements on one line (colon) (E701)
58-58: Preserve exception context when re-raising exceptions

When re-raising an exception, use `raise ... from ...` to maintain the original exception context. This helps with debugging by preserving the traceback of the original exception. Apply this diff to improve exception handling:

```diff
- raise ValueError(f"Error calculating correlation matrix: {e}")
+ raise ValueError(f"Error calculating correlation matrix: {e}") from e
```

🧰 Tools
🪛 Ruff (0.8.0)
58-58: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
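As a standalone illustration of the pattern (not code from this PR), chaining keeps both tracebacks visible:

```python
import json

def parse_config(text: str) -> dict:
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        # "from e" links the original JSONDecodeError to the new ValueError,
        # so the traceback shows the root cause and this wrapper together.
        raise ValueError(f"Invalid config: {e}") from e
```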
src/readii/data/label.py (2)

34-34: Fix typographical error in error message

The word "recognizeable" is misspelled; it should be "recognizable". Apply this diff to correct the typo:

```diff
- raise ValueError("Dataframe doesn't have a recognizeable patient ID column. Must contain patient or case ID.")
+ raise ValueError("Dataframe doesn't have a recognizable patient ID column. Must contain patient or case ID.")
```

241-241: Avoid shadowing built-in functions

The variable `map` shadows the built-in function `map`. Consider renaming it to avoid confusion. Apply this diff to rename the variable:

```diff
-if not all(value == map for value, map in zip(existing_event_values, sorted(event_column_value_mapping.keys()))):
+if not all(value == mapped_value for value, mapped_value in zip(existing_event_values, sorted(event_column_value_mapping.keys()))):
```

src/readii/analyze/plot_correlations.py (1)
254-254: Replace print statement with logging

Use the `logging` module to provide informational messages. Apply this diff to use logging:

```diff
- print("Correlation matrix is symmetric.")
+ import logging
+ logging.info("Correlation matrix is symmetric.")
```

src/readii/io/writers/plots.py (1)
20-20: Update docstring to match parameter type

The `output_dir_path` parameter accepts both `Path` and `str` types. Update the docstring for accuracy. Apply this diff to update the docstring:

```diff
- output_dir_path : str
+ output_dir_path : Union[Path, str]
```

tests/test_correlation.py (2)
3-3: Remove unused import `plotCorrelationHeatmap`

The function `plotCorrelationHeatmap` is imported but not used in this file. Apply this diff to remove the unused import:

```diff
-from readii.analyze.correlation import (
-    getFeatureCorrelations,
-    plotCorrelationHeatmap,
-)
+from readii.analyze.correlation import getFeatureCorrelations
```

🧰 Tools
🪛 Ruff (0.8.0)
3-3: `readii.analyze.correlation.plotCorrelationHeatmap` imported but unused. Remove unused import: `readii.analyze.correlation.plotCorrelationHeatmap` (F401)
6-6: Remove unused import `dropUpToFeature`

The function `dropUpToFeature` is imported but not used. Apply this diff to remove the unused import:

```diff
-from readii.data.process import dropUpToFeature
```

🧰 Tools
🪛 Ruff (0.8.0)
6-6: `readii.data.process.dropUpToFeature` imported but unused. Remove unused import: `readii.data.process.dropUpToFeature` (F401)
src/readii/io/loaders/images.py (2)

42-43: Fix typo in comment

There's a typo in the comment: "banes" should be "names".

```diff
- # Get list of file banes with the specified prefix and suffix in the directory
+ # Get list of file names with the specified prefix and suffix in the directory
```

40-52: Consider sorting image types for consistent output

The function returns image types in arbitrary order based on filesystem traversal. Consider sorting the list before returning:

```diff
  # Initialize an empty list to store the image types
  image_types = []
  for file in raw_data_dir.glob(f"{feature_file_prefix}*{feature_file_suffix}"):
      file_name = file.name
      image_type = file_name.removeprefix(feature_file_prefix).removesuffix(feature_file_suffix)
      image_types.append(image_type)
- return image_types
+ return sorted(image_types)
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

⛔ Files ignored due to path filters (1)
- `pixi.lock` is excluded by `!**/*.lock`

📒 Files selected for processing (13)
- pyproject.toml (2 hunks)
- src/readii/__init__.py (0 hunks)
- src/readii/analyze/correlation.py (1 hunks)
- src/readii/analyze/plot_correlations.py (1 hunks)
- src/readii/data/label.py (1 hunks)
- src/readii/data/select.py (1 hunks)
- src/readii/data/split.py (1 hunks)
- src/readii/io/loaders/features.py (1 hunks)
- src/readii/io/loaders/general.py (1 hunks)
- src/readii/io/loaders/images.py (1 hunks)
- src/readii/io/writers/plots.py (1 hunks)
- src/readii/loaders.py (1 hunks)
- tests/test_correlation.py (1 hunks)

💤 Files with no reviewable changes (1)
- src/readii/__init__.py
🧰 Additional context used
🪛 Ruff (0.8.0)
src/readii/io/loaders/general.py
32-32: Use a context manager for opening files
(SIM115)
src/readii/io/loaders/features.py
10-10: Do not use mutable data structures for argument defaults. Replace with `None`; initialize within function (B006)
12-12: Do not use mutable data structures for argument defaults. Replace with `None`; initialize within function (B006)
62-62: Local variable `e` is assigned to but never used. Remove assignment to unused variable `e` (F841)
tests/test_correlation.py
3-3: `readii.analyze.correlation.plotCorrelationHeatmap` imported but unused. Remove unused import: `readii.analyze.correlation.plotCorrelationHeatmap` (F401)
6-6: `readii.data.process.dropUpToFeature` imported but unused. Remove unused import: `readii.data.process.dropUpToFeature` (F401)
src/readii/data/select.py
78-78: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
85-85: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)

src/readii/data/split.py
29-29: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
65-65: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
src/readii/analyze/correlation.py
44-44: Multiple statements on one line (colon) (E701)
45-45: Multiple statements on one line (colon) (E701)
58-58: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)

src/readii/data/label.py
188-188: Do not use mutable data structures for argument defaults. Replace with `None`; initialize within function (B006)
261-261: Do not use mutable data structures for argument defaults. Replace with `None`; initialize within function (B006)
🔇 Additional comments (3)
src/readii/io/writers/plots.py (1)

9-9: Correct type annotation syntax

When using `Union`, types should be separated by commas, not the `|` operator. Apply this diff to fix the type annotation:

```diff
 def saveSeabornPlotFigure(sns_plot: heatmap,
                           plot_name: str,
-                          output_dir_path: Union[Path|str]):
+                          output_dir_path: Union[Path, str]):
```

Likely invalid or redundant comment.
src/readii/loaders.py (1)

77-79: Verify impact of `continuous=False` parameter

The change ensures integer indices, but we should verify this doesn't affect existing code that might expect float indices. Let's check for any dependencies on the index type:

✅ Verification successful

Setting `continuous=False` appears safe to use. The change to use integer indices is safe because:
- The output segMasks is only used locally within the function
- No downstream code depends on float indices
- The parameter is well documented
- The existing_roi_indices parameter already uses integer values (0 for background)
🏁 Scripts executed

The following scripts were executed for the analysis:

```bash
#!/bin/bash
# Search for code that might depend on the index type from StructureSetToSegmentation
rg -l "StructureSetToSegmentation" | xargs rg -A 5 "\.index|indices"
# Search for test files that might need updating
fd -e py test_ | xargs rg -A 5 "StructureSetToSegmentation"
```

Length of output: 446

```bash
#!/bin/bash
# Let's search for any usage of the output from StructureSetToSegmentation
rg -A 5 "makeMask\("
# Also search for any test cases involving StructureSetToSegmentation
ast-grep --pattern 'class $_ {
  $$$
  test_$_StructureSetToSegmentation$_($$$) {
    $$$
  }
  $$$
}'
```

Length of output: 393

```bash
#!/bin/bash
# Let's check how segMasks is used after being created
rg -B 2 -A 5 "segMasks = makeMask"
# Also check for any direct usage of continuous parameter
rg -B 2 -A 2 "continuous\s*=\s*(True|False)"
```

Length of output: 713
pyproject.toml (1)

14-16: 🛠️ Refactor suggestion

Consider using version range for numpy instead of strict pinning

The strict version pinning of numpy (`==1.26.4`) differs from the project's convention of using version ranges. Since numpy is a core dependency that other packages might depend on, strict pinning could lead to dependency conflicts. Consider using a version range similar to other dependencies:

```diff
- "numpy==1.26.4",
+ "numpy>=1.26.4,<2"
```

Let's verify the compatibility of these dependencies:

✅ Verification successful

Consider using version range for numpy instead of strict pinning. Based on the codebase analysis:
- numpy is widely used across multiple files (9 files import numpy)
- No version-specific numpy functionality is detected
- seaborn is used in correlation analysis and plotting modules
- The codebase follows a consistent pattern of using numpy's standard features through the `np` alias

Since there's no evidence of version-specific numpy requirements and no detected incompatibilities, using a version range would be more maintainable and consistent with the project's dependency style:

```diff
- "numpy==1.26.4",
+ "numpy>=1.26.4,<2"
```
🏁 Scripts executed

The following scripts were executed for the analysis:

```bash
#!/bin/bash
# Description: Check compatibility of numpy with other dependencies
# Check if any other packages in the project have numpy as a dependency
pip index versions seaborn | grep -A 5 "Required-by:"
pip index versions pyradiomics-bhklab | grep -A 5 "Required-by:"
# Check for any known conflicts
pip check numpy==1.26.4 seaborn==0.13.2 pyradiomics-bhklab==3.1.4
```

Length of output: 468

```bash
#!/bin/bash
# Let's try a different approach by examining the codebase for numpy usage patterns
# Find files that import numpy and check their version constraints
rg "import numpy|from numpy" -l | while read -r file; do
  echo "=== $file ==="
  rg "import numpy|from numpy" "$file"
done
# Look for any requirements.txt or similar files that might have dependency info
fd "requirements.*\.txt|setup\.py" --type f
# Check if there are any CI/testing configuration files that might specify dependencies
fd "tox.ini|pytest.ini|\.github/workflows/.*\.ya?ml" --type f
# Look for any dependency management related comments in Python files
rg "# *(?i)(requires?|depends?|numpy)" "\.py$"
```

Length of output: 1985

```bash
#!/bin/bash
# Let's check the docs/requirements.txt file and look for any numpy version constraints
cat docs/requirements.txt
# Also check if seaborn is used in the codebase
rg "import seaborn|from seaborn" -l | while read -r file; do
  echo "=== $file ==="
  rg "import seaborn|from seaborn" "$file"
done
# Look for any version-specific numpy functionality
rg "numpy\.[0-9]|np\.[0-9]" -l
```

Length of output: 764
src/readii/io/loaders/features.py (Outdated)

```python
def loadFeatureFilesFromImageTypes(extracted_feature_dir:str,
                                   image_types:Optional[list]=['original'],
```
Avoid mutable default arguments in function parameters

Using mutable default arguments like lists can lead to unexpected behavior because the default list is shared across all function calls. It's better to use `None` as the default value and initialize the list inside the function. Apply this diff to fix the issue:

```diff
-def loadFeatureFilesFromImageTypes(extracted_feature_dir:str,
-                                   image_types:Optional[list]=['original'],
-                                   drop_labels:Optional[bool]=True,
-                                   labels_to_drop:Optional[list]=["patient_ID","survival_time_in_years","survival_event_binary"])->Dict[str,pd.DataFrame]:
+def loadFeatureFilesFromImageTypes(extracted_feature_dir:str,
+                                   image_types:Optional[list]=None,
+                                   drop_labels:Optional[bool]=True,
+                                   labels_to_drop:Optional[list]=None)->Dict[str,pd.DataFrame]:
```

Within the function, initialize the lists if they are `None`:

```python
if image_types is None:
    image_types = ['original']
if labels_to_drop is None:
    labels_to_drop = ["patient_ID", "survival_time_in_years", "survival_event_binary"]
```

Also applies to: 12-12
🧰 Tools
🪛 Ruff (0.8.0)
10-10: Do not use mutable data structures for argument defaults. Replace with `None`; initialize within function (B006)
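A standalone demonstration of the pitfall (not code from this PR): the default list is created once, at function definition time, and shared across calls:

```python
def append_item(item, bucket=[]):  # the same list object is reused on every call
    bucket.append(item)
    return bucket

print(append_item(1))  # [1]
print(append_item(2))  # [1, 2]  <- state leaks between calls

def append_item_safe(item, bucket=None):
    if bucket is None:
        bucket = []  # a fresh list per call
    bucket.append(item)
    return bucket

print(append_item_safe(1))  # [1]
print(append_item_safe(2))  # [2]
```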
src/readii/data/split.py (Outdated)

```python
if new_value not in dataframe[column_to_change].unique():
    raise ValueError(f"New value {new_value} not found in column {column_to_change}.")
```
Potential logical error in value replacement check

The function checks if the `new_value` exists in the column before replacement. However, you should check for the presence of the old values to be replaced, not the new value. Modify the check to verify the presence of the old values:

```diff
- if new_value not in dataframe[column_to_change].unique():
-     raise ValueError(f"New value {new_value} not found in column {column_to_change}.")
+ old_values = replacement_value_data[new_value]
+ if not set(old_values).intersection(set(dataframe[column_to_change].unique())):
+     raise ValueError(f"Values {old_values} to replace not found in column {column_to_change}.")
```
src/readii/io/loaders/general.py (Outdated)

```python
def loadImageDatasetConfig(dataset_name:str,
                           config_dir_path:str) -> dict:
    """Load the configuration file for a given dataset. Expects the configuration file to be named <dataset_name>.yaml.

    Parameters
    ----------
    dataset_name : str
        Name of the dataset to load the configuration file for.
    config_dir_path : str
        Path to the directory containing the configuration files.

    Returns
    -------
    dict
        Dictionary containing the configuration settings for the dataset.

    Examples
    --------
    >>> config = loadImageDatasetConfig("NSCLC_Radiogenomics", "config/")
    """
    # Make full path to config file
    config_file_path = os.path.join(config_dir_path, f"{dataset_name}.yaml")

    # Check if config file exists
    if os.path.exists(config_file_path):
        # Load the config file
        config = yaml.safe_load(open(config_file_path, "r"))
        return config
    else:
        print(f"Config file {config_file_path} does not exist.")
        return None
```
🛠️ Refactor suggestion

Use context manager for file operations and improve error handling

The function has potential resource leaks and could be more robust in error handling. Consider this implementation:

```diff
 def loadImageDatasetConfig(dataset_name:str,
                            config_dir_path:str) -> dict:
     config_file_path = os.path.join(config_dir_path, f"{dataset_name}.yaml")
     if not os.path.exists(config_file_path):
         raise FileNotFoundError(f"Config file {config_file_path} does not exist.")
-    config = yaml.safe_load(open(config_file_path, "r"))
-    return config
-    else:
-        print(f"Config file {config_file_path} does not exist.")
-        return None
+    try:
+        with open(config_file_path, "r") as f:
+            return yaml.safe_load(f)
+    except yaml.YAMLError as e:
+        raise ValueError(f"Invalid YAML in config file: {e}")
```
🧰 Tools
🪛 Ruff (0.8.0)
32-32: Use a context manager for opening files (SIM115)
src/readii/io/loaders/general.py (Outdated)
```python
def loadFileToDataFrame(file_path:str) -> pd.DataFrame:
    """Load data from a csv or xlsx file into a pandas dataframe.

    Parameters
    ----------
    file_path (str): Path to the data file.

    Returns
    -------
    pd.DataFrame: Dataframe containing the data from the file.
    """
    # Get the file extension
    _, file_extension = os.path.splitext(file_path)

    try:
        # Check if the file is an Excel file
        if file_extension == '.xlsx':
            df = pd.read_excel(file_path)
        # Check if the file is a CSV file
        elif file_extension == '.csv':
            df = pd.read_csv(file_path)
        else:
            raise ValueError("Unsupported file format. Please provide a .csv or .xlsx file.")

        return df

    except Exception as e:
        print(f"An error occurred: {e}")
        return None
```
🛠️ Refactor suggestion

Improve error handling and input validation

The function could be more robust with input validation and specific error handling. Consider this implementation:

```diff
 def loadFileToDataFrame(file_path:str) -> pd.DataFrame:
+    if not file_path:
+        raise ValueError("file_path cannot be empty")
+
+    if not os.path.exists(file_path):
+        raise FileNotFoundError(f"File {file_path} does not exist")
+
     _, file_extension = os.path.splitext(file_path)
     try:
         if file_extension == '.xlsx':
             df = pd.read_excel(file_path)
         elif file_extension == '.csv':
             df = pd.read_csv(file_path)
         else:
             raise ValueError("Unsupported file format. Please provide a .csv or .xlsx file.")
+        if df.empty:
+            raise ValueError("Loaded DataFrame is empty")
+
         return df
-    except Exception as e:
-        print(f"An error occurred: {e}")
-        return None
+    except pd.errors.EmptyDataError:
+        raise ValueError("File is empty")
+    except (pd.errors.ParserError, ValueError) as e:
+        raise ValueError(f"Error parsing file: {e}")
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def loadFileToDataFrame(file_path:str) -> pd.DataFrame: | |
"""Load data from a csv or xlsx file into a pandas dataframe. | |
Parameters | |
---------- | |
file_path (str): Path to the data file. | |
Returns | |
------- | |
pd.DataFrame: Dataframe containing the data from the file. | |
""" | |
# Get the file extension | |
_, file_extension = os.path.splitext(file_path) | |
try: | |
# Check if the file is an Excel file | |
if file_extension == '.xlsx': | |
df = pd.read_excel(file_path) | |
# Check if the file is a CSV file | |
elif file_extension == '.csv': | |
df = pd.read_csv(file_path) | |
else: | |
raise ValueError("Unsupported file format. Please provide a .csv or .xlsx file.") | |
return df | |
except Exception as e: | |
print(f"An error occurred: {e}") | |
return None | |
def loadFileToDataFrame(file_path:str) -> pd.DataFrame: | |
"""Load data from a csv or xlsx file into a pandas dataframe. | |
Parameters | |
---------- | |
file_path (str): Path to the data file. | |
Returns | |
------- | |
pd.DataFrame: Dataframe containing the data from the file. | |
""" | |
if not file_path: | |
raise ValueError("file_path cannot be empty") | |
if not os.path.exists(file_path): | |
raise FileNotFoundError(f"File {file_path} does not exist") | |
# Get the file extension | |
_, file_extension = os.path.splitext(file_path) | |
try: | |
# Check if the file is an Excel file | |
if file_extension == '.xlsx': | |
df = pd.read_excel(file_path) | |
# Check if the file is a CSV file | |
elif file_extension == '.csv': | |
df = pd.read_csv(file_path) | |
else: | |
raise ValueError("Unsupported file format. Please provide a .csv or .xlsx file.") | |
if df.empty: | |
raise ValueError("Loaded DataFrame is empty") | |
return df | |
except pd.errors.EmptyDataError: | |
raise ValueError("File is empty") | |
except (pd.errors.ParserError, ValueError) as e: | |
raise ValueError(f"Error parsing file: {e}") |
good suggestion
…matrix is passed into getFeatureCorrelations
…rscore prefix will be added if there isn't one
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (4)
src/readii/analyze/correlation.py (4)

44-45: Refactor single-line conditional statements for better readability

Multiple statements on one line reduce code readability and can make debugging more difficult. Apply this diff to improve readability:

```diff
-if not vertical_feature_name.startswith("_"): vertical_feature_name = f"_{vertical_feature_name}"
+if not vertical_feature_name.startswith("_"):
+    vertical_feature_name = f"_{vertical_feature_name}"
-if not horizontal_feature_name.startswith("_"): horizontal_feature_name = f"_{horizontal_feature_name}"
+if not horizontal_feature_name.startswith("_"):
+    horizontal_feature_name = f"_{horizontal_feature_name}"
```

🧰 Tools
🪛 Ruff (0.8.0)
44-44: Multiple statements on one line (colon) (E701)
45-45: Multiple statements on one line (colon) (E701)
58-58: Use `raise ... from e` to preserve exception context

When re-raising an exception within an `except` block, using `raise ... from e` helps maintain the original traceback, making debugging easier. Apply this diff to modify the exception handling:

```diff
-except Exception as e:
-    raise ValueError(f"Error calculating correlation matrix: {e}")
+except Exception as e:
+    raise ValueError(f"Error calculating correlation matrix: {e}") from e
```

🧰 Tools
🪛 Ruff (0.8.0)
58-58: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
83-83: Fix typo in docstring

There is a typo in the docstring for the `triangle` parameter. Apply this diff to correct the typo:

```diff
- Which triangle half of the matrixto plot. The default is "lower".
+ Which triangle half of the matrix to plot. The default is "lower".
```
162-170: Refactor duplicated validation logic in correlation functions

The functions `getVerticalSelfCorrelations`, `getHorizontalSelfCorrelations`, and `getCrossCorrelationMatrix` have similar validation code for checking feature counts against the correlation matrix dimensions. Consider creating a shared helper function to handle this validation logic to reduce code duplication and improve maintainability (see the sketch below).

Also applies to: 190-201, 221-226
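One possible shape for such a helper, sketched under the assumption that each function compares an expected feature count against one dimension of the correlation matrix; the name and signature are made up for illustration:

```python
import pandas as pd

def _validateFeatureCount(correlation_matrix: pd.DataFrame,
                          expected_feature_count: int,
                          axis: int = 0) -> None:
    """Raise ValueError if the given axis of the correlation matrix doesn't match the expected feature count."""
    actual = correlation_matrix.shape[axis]
    if actual != expected_feature_count:
        raise ValueError(
            f"Correlation matrix has {actual} features along axis {axis}, "
            f"expected {expected_feature_count}."
        )
```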
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📒 Files selected for processing (3)
- src/readii/analyze/correlation.py (1 hunks)
- src/readii/loaders.py (1 hunks)
- tests/test_correlation.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)
- src/readii/loaders.py

🧰 Additional context used
🪛 Ruff (0.8.0)

src/readii/analyze/correlation.py
44-44: Multiple statements on one line (colon) (E701)
45-45: Multiple statements on one line (colon) (E701)
58-58: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
🔇 Additional comments (1)

tests/test_correlation.py (1)

1-115: Comprehensive and well-structured test suite

The test cases cover various scenarios, including valid inputs, default behaviors, invalid methods, incorrect feature inputs, and feature name handling. The tests are well-organized and effectively validate the functionality of the `getFeatureCorrelations` function.
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##             main      #79       +/-   ##
===========================================
- Coverage   60.92%   38.97%   -21.96%
===========================================
  Files          10       19        +9
  Lines         540      898      +358
===========================================
+ Hits          329      350       +21
- Misses        211      548      +337
```

☔ View full report in Codecov by Sentry.
…in the passed column
Actionable comments posted: 5
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- src/readii/data/label.py (1 hunks)
🔇 Additional comments (2)

src/readii/data/label.py (2)

203-203: Ensure consistency between docstring and default parameter value

The docstring mentions that the default for `event_column_value_mapping` is an empty dictionary, but in the function signature, it is set to `None`. Update the docstring to reflect the actual default value. Apply this diff to fix the docstring:

```diff
- The default is an empty dictionary.
+ The default is `None`.
```

272-272: Ensure consistency between docstring and default parameter value

The docstring states that `outcome_labels` defaults to `["survival_time_in_years", "survival_event_binary"]`, but in the function signature, it is set to `None`. Update the docstring to reflect the current default value. Apply this diff to fix the docstring:

```diff
- The default is ["survival_time_in_years", "survival_event_binary"].
+ The default is `None`.
```
```python
# TODO: add handling for values not in the dictionary
```
Implement handling for unmapped event values

There is a `TODO` comment indicating that handling for event values not in the mapping dictionary is incomplete. This could lead to errors if unexpected values are encountered. Please implement the necessary handling to ensure all event values are properly managed.

Would you like assistance in implementing this feature?
Yes I would like assistance.
@strixy16, glad to assist! Here's an updated version of the `eventOutcomeColumnSetup` function with handling for unmapped event values:

```python
def eventOutcomeColumnSetup(dataframe_with_outcome: DataFrame,
                            outcome_column_label: str,
                            standard_column_label: str,
                            event_column_value_mapping: Optional[dict] = None):
    """Function to set up an event outcome column in a dataframe.

    Parameters
    ----------
    dataframe_with_outcome : DataFrame
        Dataframe containing the outcome column to convert.
    outcome_column_label : str
        Label for the outcome column to convert in the dataframe.
    standard_column_label : str
        Name of the column to save the standardized outcome column as.
    event_column_value_mapping : dict, optional
        Dictionary of event values to convert to numeric values. Keys are the event values, values are the numeric values.
        If provided, any unmapped event values will be assigned new numeric values.
        If not provided, will attempt to convert based on the values in the outcome column. By default, "alive" and "dead"
        will be converted to 0 and 1, respectively, if present in the outcome column.

    Returns
    -------
    dataframe_with_standardized_outcome : DataFrame
        Dataframe with a copy of the specified outcome column converted to numeric values.
    """
    # Get the type of the existing event column
    event_variable_type = type(dataframe_with_outcome[outcome_column_label].iloc[0])

    # Make a copy of the dataframe to work on
    dataframe_with_standardized_outcome = dataframe_with_outcome.copy()

    # Handle numeric event column
    if np.issubdtype(event_variable_type, np.number):
        dataframe_with_standardized_outcome[standard_column_label] = dataframe_with_outcome[outcome_column_label]

    # Handle boolean event column
    elif np.issubdtype(event_variable_type, np.bool_):
        dataframe_with_standardized_outcome[standard_column_label] = dataframe_with_outcome[outcome_column_label].astype(int)

    # Handle string event column
    elif np.issubdtype(event_variable_type, np.str_):
        # Make values of outcome column lowercase
        dataframe_with_standardized_outcome[outcome_column_label] = dataframe_with_outcome[outcome_column_label].str.lower()

        # Get the existing event values in the provided dataframe
        existing_event_values = sorted(dataframe_with_standardized_outcome[outcome_column_label].unique())

        if not event_column_value_mapping:
            # Create a dictionary to map event values to numeric values
            event_column_value_mapping = survivalStatusToNumericMapping(
                event_outcome_column=dataframe_with_standardized_outcome[outcome_column_label]
            )
        else:
            # Convert all dictionary keys in provided mapping to lowercase
            event_column_value_mapping = dict((status.lower(), value)
                                              for status, value in event_column_value_mapping.items())

            # Identify any values not in the mapping
            unmapped_values = [value for value in existing_event_values
                               if value not in event_column_value_mapping]

            if unmapped_values:
                print(f"Unmapped event values found: {unmapped_values}. They will be assigned new numeric values.")
                # Assign next available numeric value to unmapped event values
                max_mapped_value = max(event_column_value_mapping.values(), default=-1)
                for i, value in enumerate(unmapped_values, start=max_mapped_value + 1):
                    event_column_value_mapping[value] = i

        # Map the event column using the updated mapping
        dataframe_with_standardized_outcome[standard_column_label] = \
            dataframe_with_standardized_outcome[outcome_column_label].map(event_column_value_mapping)

    else:
        raise ValueError(f"Unsupported data type for outcome column: {event_variable_type}")

    return dataframe_with_standardized_outcome
```
Explanation:
- Handling Unmapped Values: The function now checks for any event values in `existing_event_values` that are not present in `event_column_value_mapping`.
- Assigning New Numeric Values: Unmapped event values are assigned new numeric codes starting from the next available integer after the maximum value in the existing mapping.
- Mapping the Outcome Column: The event outcome column is then mapped using the updated `event_column_value_mapping`, ensuring all event values are properly managed.
- User Feedback: A message is printed to inform the user about any unmapped event values found and their assignment.

This implementation ensures that any unexpected or unmapped event values are handled gracefully, preventing errors during execution.
Let me know if you have any questions or need further assistance!
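A hypothetical usage of the sketch above, assuming the function and its dependencies (`numpy` as `np`, `DataFrame`, `Optional`, and `survivalStatusToNumericMapping`) are in scope; the column names and values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"status": ["Alive", "Dead", "Lost to follow-up"]})
mapping = {"alive": 0, "dead": 1}  # "lost to follow-up" is intentionally unmapped

result = eventOutcomeColumnSetup(dataframe_with_outcome=df,
                                 outcome_column_label="status",
                                 standard_column_label="survival_event_binary",
                                 event_column_value_mapping=mapping)
print(result["survival_event_binary"].tolist())  # expected: [0, 1, 2]
```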
…s to be a required argument
… to remove mutable default argument
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (5)
src/readii/data/split.py (3)
29-39: Optimize dictionary key iteration

The current implementation uses `.keys()`, which is unnecessary when iterating over a dictionary. Apply this optimization:

```diff
- for new_value in replacement_value_data.keys():
+ for new_value in replacement_value_data:
```

🧰 Tools
🪛 Ruff (0.8.0)
29-29: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
66-70: Optimize dictionary key iteration

Similar to the previous function, `.keys()` is unnecessary when iterating over a dictionary. Apply this optimization:

```diff
- for split_column_name in split_col_data.keys():
+ for split_column_name in split_col_data:
```

🧰 Tools
🪛 Ruff (0.8.0)
66-66: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
96-96: Consider avoiding inplace operations for better safety

Using `inplace=True` with `reset_index` can be problematic in concurrent operations and makes the code less functional. Apply this change for better safety:

```diff
- split_dataframe.reset_index(inplace=True, drop=True)
+ split_dataframe = split_dataframe.reset_index(drop=True)
```

src/readii/io/loaders/features.py (2)
13-31: Update docstring to match current implementation

The docstring contains inconsistencies with the actual implementation:
- It mentions a default value of `['original']` for `image_types`, but this parameter is now required
- It references specific default values for `labels_to_drop` in the description, which could become outdated

Consider updating the docstring:

```diff
 Parameters
 ----------
 extracted_feature_dir : str
     Path to the directory containing the extracted feature csv files
-image_types : list, optional
-    List of image types to load in. The default is ['original'].
+image_types : list
+    List of image types to load in
 drop_labels : bool, optional
     Whether to drop the labels from the dataframes. Use when loading labelled data from data_setup_for_modeling.ipynb. The default is True.
 labels_to_drop : list, optional
-    List of labels to drop from the dataframes. The default is ["patient_ID","survival_time_in_years","survival_event_binary"] based on code
-    in data_setup_for_modeling.ipynb.
+    List of labels to drop from the dataframes. If None, defaults to common patient identifiers and survival metrics.
```
42-53: Consider providing more context in error messages

The error handling could be more informative by including the specific error details and providing guidance on how to resolve the issue.

```diff
 for image_type in image_types:
     try:
         image_type_feature_file = [file for file in feature_file_list if (image_type in file) and (file.endswith(".csv"))][0]
         feature_file_list.remove(image_type_feature_file)
-    except Exception as e:
-        print(f"{e}\n No {image_type} feature csv files found in {extracted_feature_dir}")
+    except IndexError:
+        print(f"No CSV files found for image type '{image_type}' in {extracted_feature_dir}.")
+        print(f"Expected filename pattern: *{image_type}*.csv")
         # Skip to the next image type
         continue
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- src/readii/data/split.py (1 hunks)
- src/readii/io/loaders/features.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
src/readii/data/split.py
29-29: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
66-66: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)

src/readii/io/loaders/features.py
66-66: Local variable `e` is assigned to but never used. Remove assignment to unused variable `e` (F841)
🔇 Additional comments (4)

src/readii/data/split.py (3)

1-6: LGTM! Clean imports and well-typed function signatures. The code follows good practices with explicit type hints and minimal imports.

7-23: LGTM! Well-documented function with clear parameters and return types. The docstring provides comprehensive information about the function's purpose, parameters, and return value.

46-61: LGTM! Clear and comprehensive documentation. The docstring effectively communicates the function's purpose, parameters, and return value.

src/readii/io/loaders/features.py (1)

1-7: LGTM! Imports are well-organized. The imports are logically grouped and all are being used in the code.
…andling in loadImageDatasetConfig
…tus mapping function
…t, add check when converting the values to lowercase
…lumnValues function in split.py
…lesFromImageTypes
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (3)
src/readii/io/loaders/general.py (1)

58-68: Ensure file extension checks are case-insensitive

Currently, the file extension checks are case-sensitive, which may cause issues if the file has an uppercase extension (e.g., `.XLSX`, `.CSV`). Consider normalizing the file extension to lowercase before comparison. Apply this diff to normalize the file extension:

```diff
 _, file_extension = os.path.splitext(file_path)
+file_extension = file_extension.lower()

 try:
     # Check if the file is an Excel file
     if file_extension == '.xlsx':
         df = pd.read_excel(file_path)
     # Check if the file is a CSV file
     elif file_extension == '.csv':
         df = pd.read_csv(file_path)
     else:
         raise ValueError("Unsupported file format. Please provide a .csv or .xlsx file.")
```
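A standalone illustration of the normalization (not code from this PR; the path is made up):

```python
import os

_, ext = os.path.splitext("cohort/features.CSV")
print(ext)          # '.CSV' fails a case-sensitive comparison with '.csv'
print(ext.lower())  # '.csv' matches after normalizing
```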
src/readii/analyze/correlation.py (1)

47-48: Split multiple statements into separate lines

Lines 47 and 48 contain multiple statements on one line, which reduces readability. According to PEP 8 style guidelines, each statement should be on its own line. Apply this diff:

```diff
-if not vertical_feature_name.startswith("_"): vertical_feature_name = f"_{vertical_feature_name}"
+if not vertical_feature_name.startswith("_"):
+    vertical_feature_name = f"_{vertical_feature_name}"
-if not horizontal_feature_name.startswith("_"): horizontal_feature_name = f"_{horizontal_feature_name}"
+if not horizontal_feature_name.startswith("_"):
+    horizontal_feature_name = f"_{horizontal_feature_name}"
```

🧰 Tools
🪛 Ruff (0.8.0)
47-47: Multiple statements on one line (colon) (E701)
48-48: Multiple statements on one line (colon) (E701)
src/readii/data/label.py (1)

4-4: Remove unused import `pandas as pd`

The module `pandas` is imported as `pd` but not used in the code. Please remove the unused import to clean up your code. Apply this diff:

```diff
-import pandas as pd
```

🧰 Tools
🪛 Ruff (0.8.0)
4-4: `pandas` imported but unused. Remove unused import: `pandas` (F401)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📒 Files selected for processing (4)
- src/readii/analyze/correlation.py (1 hunks)
- src/readii/data/label.py (1 hunks)
- src/readii/io/loaders/features.py (1 hunks)
- src/readii/io/loaders/general.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)

src/readii/io/loaders/features.py
70-70: Local variable `e` is assigned to but never used. Remove assignment to unused variable `e` (F841)

src/readii/io/loaders/general.py
39-39: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
76-76: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
78-78: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)

src/readii/data/label.py
4-4: `pandas` imported but unused. Remove unused import: `pandas` (F401)

src/readii/analyze/correlation.py
47-47: Multiple statements on one line (colon) (E701)
48-48: Multiple statements on one line (colon) (E701)
61-61: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
🔇 Additional comments (2)

src/readii/io/loaders/features.py (2)

70-70: Remove unused exception variable `e`

The exception variable `e` is assigned but not used in the `except` block. Remove it to clean up the code. Apply this diff:

```diff
-except KeyError as e:
+except KeyError:
     print(f"{feature_file_path} does not have the labels {labels_to_drop} to drop.")
     # Skip to the next image type
     continue
```

🧰 Tools
🪛 Ruff (0.8.0)
70-70: Local variable `e` is assigned to but never used. Remove assignment to unused variable `e` (F841)
9-12: 🛠️ Refactor suggestion

Fix inconsistency between function signature and docstring for `image_types`

The parameter `image_types` is required in the function signature but is described as optional with a default value in the docstring. Please update the function signature or the docstring to reflect the intended usage. If `image_types` should have a default value, avoid using a mutable default argument: set it to `None` and initialize within the function. Apply this diff:

```diff
 def loadFeatureFilesFromImageTypes(extracted_feature_dir:str,
-                                   image_types:list,
+                                   image_types:Optional[list]=None,
                                    drop_labels:Optional[bool]=True,
                                    labels_to_drop:Optional[list]=None)->Dict[str,pd.DataFrame]:
```

Then initialize `image_types` inside the function:

```python
if image_types is None:
    image_types = ['original']
```
```python
except (pd.errors.ParserError, ValueError) as e:
    raise ValueError(f"Error parsing file: {e}")
```
🛠️ Refactor suggestion

Use exception chaining with `raise ... from e`

To maintain the original exception context, add `from e` when raising a new exception inside an `except` block. Apply this diff:

```diff
 except (pd.errors.ParserError, ValueError) as e:
-    raise ValueError(f"Error parsing file: {e}")
+    raise ValueError(f"Error parsing file: {e}") from e
```
🧰 Tools
🪛 Ruff (0.8.0)
78-78: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
raise ValueError(f"Invalid YAML in config file: {e}") | ||
def loadFileToDataFrame(file_path:str) -> pd.DataFrame: |
🛠️ Refactor suggestion

Use exception chaining with `raise ... from e`

To preserve the original traceback and provide better error context, use exception chaining by adding `from e` when raising a new exception within an `except` block. Apply this diff:

```diff
 except yaml.YAMLError as e:
-    raise ValueError(f"Invalid YAML in config file: {e}")
+    raise ValueError(f"Invalid YAML in config file: {e}") from e
```
🧰 Tools
🪛 Ruff (0.8.0)
39-39: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
```python
    # Calculate correlation between vertical features and horizontal features
    correlation_matrix = features_to_correlate.corr(method=method)
except Exception as e:
    raise ValueError(f"Error calculating correlation matrix: {e}")
```
🛠️ Refactor suggestion

Use exception chaining with `raise ... from e`

When raising a new exception within an `except` block, use `from e` to preserve the original exception context. Apply this diff:

```diff
 except Exception as e:
-    raise ValueError(f"Error calculating correlation matrix: {e}")
+    raise ValueError(f"Error calculating correlation matrix: {e}") from e
```
🧰 Tools
🪛 Ruff (0.8.0)
61-61: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
```python
# Swap the keys and values in the mapping dictionary to use the replaceColumnValues function
# So the numeric values will be the keys and the string event values will be the values
replacement_value_data = dict([num_event, str_event] for str_event, num_event in event_column_value_mapping.items())
```
Correct the key-value order in `replacement_value_data`

The dictionary comprehension for `replacement_value_data` may have the keys and values swapped, which can lead to incorrect replacements. Apply this diff to fix the key-value pair order:

```diff
- replacement_value_data = dict([num_event, str_event] for str_event, num_event in event_column_value_mapping.items())
+ replacement_value_data = {num_event: str_event for str_event, num_event in event_column_value_mapping.items()}
```

Ensure that the keys in `replacement_value_data` are the numeric event codes and the values are the corresponding string event labels.
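A standalone illustration of the corrected swap (values made up):

```python
event_column_value_mapping = {"alive": 0, "dead": 1}
replacement_value_data = {num_event: str_event
                          for str_event, num_event in event_column_value_mapping.items()}
print(replacement_value_data)  # {0: 'alive', 1: 'dead'}
```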
Summary by CodeRabbit

New Features

Bug Fixes
- Updated `loadRTSTRUCTSITK` function for improved data handling.

Tests