feat: integrate analyze readii outputs functions #79

Closed
wants to merge 64 commits into from
95586e5
feat: add function to calculate feature correlation matrix
strixy16 Dec 3, 2024
5193fb4
feat: add function to generate a heatmap plot figure from a correlati…
strixy16 Dec 3, 2024
a0771c6
feat: add init file to analyze directory
strixy16 Dec 3, 2024
cf26afc
feat: add error handling in getFeatureCorrelations
strixy16 Dec 3, 2024
e643349
feat: add general loading file, add loading config and data file func…
strixy16 Dec 3, 2024
f5882da
feat: add file for loading functions related to feature files
strixy16 Dec 3, 2024
5495550
build: add numpy and seaborn for correlation code
strixy16 Dec 3, 2024
decf8e5
refactor: remove so far unused imports
strixy16 Dec 3, 2024
fcc1b9e
feat: started test function for getFeatureCorrelations
strixy16 Dec 3, 2024
a708182
feat: make files for better function organization
strixy16 Dec 3, 2024
d706863
Merge remote-tracking branch 'origin/main' into katys/integrate-analy…
strixy16 Dec 6, 2024
d63a1c5
fix: remove duplicate tool.pixi.dependencies from merge
strixy16 Dec 6, 2024
484c12e
build: add seaborn for correlation plot functions, need to specify nu…
strixy16 Dec 6, 2024
c6b945f
feat: add init files for new directories
strixy16 Dec 6, 2024
fc83d69
feat: add function to calculate feature correlations and a function t…
strixy16 Dec 6, 2024
46f0773
feat: add function to drop a set of features at the beginning of a pa…
strixy16 Dec 6, 2024
fe56257
fix: set continuous setting in StructureSetToSegmentation to False
strixy16 Dec 6, 2024
e618269
build: moved seaborn and numpy to project dependencies
strixy16 Dec 6, 2024
a6ab888
test: make test feature matrix to test correlation functions with, up…
strixy16 Dec 6, 2024
0f1d837
feat: set StructureSetToSegmentation continuous argument to False
strixy16 Dec 6, 2024
5b0dccc
build: lock file from installing on katys mac
strixy16 Dec 6, 2024
0d9c943
Merge branch 'katys/fix_continuous_rtstruct_index' into katys/integra…
strixy16 Dec 6, 2024
5b4e5cb
feat: add functions for selecting subsets of dataframes
strixy16 Dec 9, 2024
b36f3d2
refactor: renamed process to select for specificity
strixy16 Dec 9, 2024
fa4da89
style: rename labelling for consistent filename convention
strixy16 Dec 9, 2024
0256466
feat: add function to extract patient ID label from a dataframe
strixy16 Dec 9, 2024
f0b87c2
feat: add functions to replace column values in a dataset for imputat…
strixy16 Dec 9, 2024
d44e1ce
feat: add function to save out seaborn plot figure to a png
strixy16 Dec 9, 2024
bfdc357
feat: add function to convert numerical days column to years
strixy16 Dec 9, 2024
1e89c17
feat: add function to set up a time outcome column for survival predi…
strixy16 Dec 9, 2024
948b426
feat: add function for survival status mapping from string to numeric…
strixy16 Dec 9, 2024
86f13ec
feat: add function to set patient ID column as index in a dataframe
strixy16 Dec 9, 2024
7842ebe
feat: add function to intersect two dataframes by their patient ID va…
strixy16 Dec 9, 2024
81b884a
feat: add function that takes outcome labels from clinical data and a…
strixy16 Dec 9, 2024
de2dd2c
feat: add function to get a list of image types from a directory of f…
strixy16 Dec 9, 2024
1d49ec1
feat: add function to plot and return a correlation heatmap
strixy16 Dec 9, 2024
8e0868f
feat: add function to plot a histogram of correlation values
strixy16 Dec 9, 2024
45b8fb0
feat: add functions to extract subsets of a full correlation matrix
strixy16 Dec 9, 2024
6b84ef8
style: rename plot to plot_correlations for specificity
strixy16 Dec 9, 2024
61cdedd
feat: add functions for self and cross correlation plotting
strixy16 Dec 9, 2024
e021051
refactor: remove unused imports
strixy16 Dec 9, 2024
730361b
refactor: remove unused scipy import
strixy16 Dec 9, 2024
1f4edf2
build: latest pixi lock file for analysis code addition
strixy16 Dec 9, 2024
de1c752
feat: change continuous to True in loadRTSTRUCTSITK so tests pass for…
strixy16 Dec 10, 2024
2647168
fix: need default vertical and horizontal suffixes when same feature …
strixy16 Dec 10, 2024
253aba2
fix: default feature names will have underscore at the front and unde…
strixy16 Dec 10, 2024
31bf5bf
feat: testing getFeatureCorrelations function
strixy16 Dec 10, 2024
231c390
fix: handle mutable input argument event_column_mapping
strixy16 Dec 10, 2024
40c1cba
fix: add fstring so variable is used properly in error message
strixy16 Dec 10, 2024
550c32a
fix: remove mutable version of outcome_labels input for addOutcomeLabels
strixy16 Dec 10, 2024
0d36600
fix: update error handling of old values to be replaced not existing …
strixy16 Dec 10, 2024
187b1cb
feat: change input image_types list for loadFeatureFilesFromImageType…
strixy16 Dec 10, 2024
b1daaf0
fix: change labels to drop default to None and assign in the function…
strixy16 Dec 10, 2024
6075966
refactor: use context manager for file operations and improve error h…
strixy16 Dec 10, 2024
501e20d
feat: improve error handling and input validation in loadFileToDataframe
strixy16 Dec 10, 2024
5ea0b99
refactor: change assert statements in getFeatureCorrelations to if st…
strixy16 Dec 10, 2024
da16d68
feat: handle NaN values in existing event values list in survival sta…
strixy16 Dec 10, 2024
0c8ccbf
docs: describe handling of NaNs in survival outcome column when mappi…
strixy16 Dec 10, 2024
2ab08e6
refactor: check dtype of event outcome column instead of first elemen…
strixy16 Dec 10, 2024
90839a2
refactor: simplify event column mapping dictionary check with sets
strixy16 Dec 10, 2024
80b81a7
refactor: change out string to numeric replacement with the replaceCo…
strixy16 Dec 10, 2024
b0a892d
feat: check that extracted feature directory exists
strixy16 Dec 10, 2024
edaf74c
refactor: improve error handling for dropping labels in loadFeatureFi…
strixy16 Dec 10, 2024
dc2e86a
feat: validate that any feature sets were loaded before return
strixy16 Dec 10, 2024
5,910 changes: 4,004 additions & 1,906 deletions pixi.lock

Large diffs are not rendered by default.

5 changes: 4 additions & 1 deletion pyproject.toml
@@ -11,7 +11,9 @@ dependencies = [
"matplotlib>=3.9.2,<4",
"med-imagetools>=1.9.2",
"pydicom>=2.3.1",
"pyradiomics-bhklab>=3.1.4,<4",
"pyradiomics-bhklab>=3.1.4,<4",
"numpy==1.26.4",
"seaborn>=0.13.2,<0.14"
]
requires-python = ">=3.10, <3.13"

@@ -191,3 +193,4 @@ publish-test = { cmd = [
], depends-on = [
"build",
], description = "Publish to test PyPI" }

1 change: 0 additions & 1 deletion src/readii/__init__.py
@@ -1,4 +1,3 @@
# read version from installed package
from importlib.metadata import version
__version__ = "1.18.0"

Empty file added src/readii/analyze/__init__.py
Empty file.
230 changes: 230 additions & 0 deletions src/readii/analyze/correlation.py
@@ -0,0 +1,230 @@
import pandas as pd
from typing import Optional
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


def getFeatureCorrelations(vertical_features:pd.DataFrame,
horizontal_features:pd.DataFrame,
method:str = "pearson",
vertical_feature_name:str = '_vertical',
horizontal_feature_name:str = '_horizontal'):
""" Function to calculate correlation between two sets of features.
Parameters
----------
vertical_features : pd.DataFrame
Dataframe containing features to calculate correlations with. Index must be the same as the index of the horizontal_features dataframe.
horizontal_features : pd.DataFrame
Dataframe containing features to calculate correlations with. Index must be the same as the index of the vertical_features dataframe.
method : str
Method to use for calculating correlations. Default is "pearson".
vertical_feature_name : str
Name of the vertical features to use as suffix in correlation dataframe. Default is "_vertical".
horizontal_feature_name : str
Name of the horizontal features to use as suffix in correlation dataframe. Default is "_horizontal".
Returns
-------
correlation_matrix : pd.DataFrame
Dataframe containing correlation values.
"""
# Check that features are dataframes
if not isinstance(vertical_features, pd.DataFrame):
raise TypeError("vertical_features must be a pandas DataFrame")
if not isinstance(horizontal_features, pd.DataFrame):
raise TypeError("horizontal_features must be a pandas DataFrame")


if method not in ["pearson", "spearman", "kendall"]:
raise ValueError("Correlation method must be one of 'pearson', 'spearman', or 'kendall'.")

if not vertical_features.index.equals(horizontal_features.index):
raise ValueError("Vertical and horizontal features must have the same index to calculate correlation. Set the index to the intersection of patient IDs.")

# Add _ to the beginning of feature names if they don't start with _ so they can be used as suffixes
if not vertical_feature_name.startswith("_"): vertical_feature_name = f"_{vertical_feature_name}"
if not horizontal_feature_name.startswith("_"): horizontal_feature_name = f"_{horizontal_feature_name}"

# Join the features into one dataframe
# Use inner join to keep only the rows that have a value in both vertical and horizontal features
features_to_correlate = vertical_features.join(horizontal_features,
how='inner',
lsuffix=vertical_feature_name,
rsuffix=horizontal_feature_name)

try:
# Calculate correlation between vertical features and horizontal features
correlation_matrix = features_to_correlate.corr(method=method)
except Exception as e:
raise ValueError(f"Error calculating correlation matrix: {e}")

🛠️ Refactor suggestion

Use exception chaining with raise ... from e

When raising a new exception within an except block, use from e to preserve the original exception context.

Apply this diff:

 except Exception as e:
-    raise ValueError(f"Error calculating correlation matrix: {e}")
+    raise ValueError(f"Error calculating correlation matrix: {e}") from e
Suggested change
raise ValueError(f"Error calculating correlation matrix: {e}")
raise ValueError(f"Error calculating correlation matrix: {e}") from e
🧰 Tools
🪛 Ruff (0.8.0)

61-61: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)
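
To illustrate the suggestion above, here is a minimal, standard-library-only sketch of what `raise ... from e` preserves. The helper `compute_ratio` is hypothetical and stands in for the correlation call; it is not part of this PR.

```python
def compute_ratio(numerator: float, denominator: float) -> float:
    """Hypothetical helper that wraps a low-level error in a ValueError."""
    try:
        return numerator / denominator
    except ZeroDivisionError as e:
        # "from e" stores the original exception on the new one's __cause__,
        # so the traceback shows both errors and how they are related.
        raise ValueError(f"Error computing ratio: {e}") from e


try:
    compute_ratio(1.0, 0.0)
except ValueError as err:
    chained = err  # keep a reference; "err" is unbound after the except block
    print(type(err.__cause__).__name__)  # ZeroDivisionError
```

With chaining, the traceback reports the original error as the direct cause of the `ValueError` rather than as an error that occurred while handling another one.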


return correlation_matrix


def plotCorrelationHeatmap(correlation_matrix_df:pd.DataFrame,
diagonal:Optional[bool] = False,
triangle:Optional[str] = "lower",
cmap:Optional[str] = "nipy_spectral",
xlabel:Optional[str] = "",
ylabel:Optional[str] = "",
title:Optional[str] = "",
subtitle:Optional[str] = "",
show_tick_labels:Optional[bool] = False
):
"""Function to plot a correlation heatmap.
Parameters
----------
correlation_matrix_df : pd.DataFrame
Dataframe containing the correlation matrix to plot.
diagonal : bool, optional
Whether to only plot half of the matrix. The default is False.
triangle : str, optional
Which triangle half of the matrix to plot. The default is "lower".
xlabel : str, optional
Label for the x-axis. The default is "".
ylabel : str, optional
Label for the y-axis. The default is "".
title : str, optional
Title for the plot. The default is "".
subtitle : str, optional
Subtitle for the plot. The default is "".
show_tick_labels : bool, optional
Whether to show the tick labels on the x and y axes. These would be the feature names. The default is False.
Returns
-------
corr_fig : matplotlib.figure.Figure
Figure object containing a Seaborn heatmap.
"""

if diagonal:
# Set up mask for hiding half the matrix in the plot
if triangle == "lower":
# Mask out the upper right triangle half of the matrix
mask = np.triu(correlation_matrix_df)
elif triangle == "upper":
# Mask out the lower left triangle half of the matrix
mask = np.tril(correlation_matrix_df)
else:
raise ValueError("If diagonal is True, triangle must be either 'lower' or 'upper'.")
else:
# The entire correlation matrix will be visible in the plot
mask = None

# Set a default title if one is not provided
if not title:
title = "Correlation Heatmap"

# Set up figure and axes for the plot
corr_fig, corr_ax = plt.subplots()

# Plot the correlation matrix
corr_ax = sns.heatmap(correlation_matrix_df,
mask = mask,
cmap=cmap,
vmin=-1.0,
vmax=1.0)

if not show_tick_labels:
# Remove the individual feature names from the axes
corr_ax.set_xticklabels(labels=[])
corr_ax.set_yticklabels(labels=[])

# Set axis labels
corr_ax.set_xlabel(xlabel)
corr_ax.set_ylabel(ylabel)

# Set title and subtitle
# Suptitle is the super title, which will be above the title
plt.title(subtitle, fontsize=12)
plt.suptitle(title, fontsize=14)

return corr_fig
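
A note on the triangle masks in plotCorrelationHeatmap: `np.triu(correlation_matrix_df)` returns the correlation values themselves, not booleans. Assuming the plotting library treats the mask by truthiness (nonzero means hidden), a correlation of exactly 0 in the masked triangle would stay visible. The sketch below uses a made-up 3x3 matrix to show the difference between a value-based mask and an explicit boolean mask; it is an illustration under that assumption, not a change made in this PR.

```python
import numpy as np

# Made-up correlation matrix with one exactly-zero off-diagonal entry
corr = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, -0.3],
    [0.5, -0.3, 1.0],
])

# Value-based mask: keeps the upper-triangle values, zeros elsewhere
value_mask = np.triu(corr)

# Explicit boolean mask: True everywhere on and above the diagonal
bool_mask = np.triu(np.ones_like(corr, dtype=bool))

# Cell (0, 1) sits in the upper triangle but holds 0.0
print(bool(value_mask[0, 1]))  # False: this cell would not be hidden
print(bool(bool_mask[0, 1]))   # True: this cell is hidden
```

Both masks include the main diagonal, so the diagonal is hidden whenever a triangle is selected.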



def getVerticalSelfCorrelations(correlation_matrix:pd.DataFrame,
num_vertical_features:int):
""" Function to get the vertical (y-axis) self correlations from a correlation matrix. Gets the top left quadrant of the correlation matrix.
Parameters
----------
correlation_matrix : pd.DataFrame
Dataframe containing the correlation matrix to get the vertical self correlations from.
num_vertical_features : int
Number of vertical features in the correlation matrix.
Returns
-------
pd.DataFrame
Dataframe containing the vertical self correlations from the correlation matrix.
"""
if num_vertical_features > correlation_matrix.shape[0]:
raise ValueError(f"Number of vertical features ({num_vertical_features}) is greater than the number of rows in the correlation matrix ({correlation_matrix.shape[0]}).")

if num_vertical_features > correlation_matrix.shape[1]:
raise ValueError(f"Number of vertical features ({num_vertical_features}) is greater than the number of columns in the correlation matrix ({correlation_matrix.shape[1]}).")

# Get the correlation matrix for vertical vs vertical - this is the top left corner of the matrix
return correlation_matrix.iloc[0:num_vertical_features, 0:num_vertical_features]



def getHorizontalSelfCorrelations(correlation_matrix:pd.DataFrame,
num_horizontal_features:int):
""" Function to get the horizontal (x-axis) self correlations from a correlation matrix. Gets the bottom right quadrant of the correlation matrix.
Parameters
----------
correlation_matrix : pd.DataFrame
Dataframe containing the correlation matrix to get the horizontal self correlations from.
num_horizontal_features : int
Number of horizontal features in the correlation matrix.
Returns
-------
pd.DataFrame
Dataframe containing the horizontal self correlations from the correlation matrix.
"""

if num_horizontal_features > correlation_matrix.shape[0]:
raise ValueError(f"Number of horizontal features ({num_horizontal_features}) is greater than the number of rows in the correlation matrix ({correlation_matrix.shape[0]}).")

if num_horizontal_features > correlation_matrix.shape[1]:
raise ValueError(f"Number of horizontal features ({num_horizontal_features}) is greater than the number of columns in the correlation matrix ({correlation_matrix.shape[1]}).")

# Get the index of the start of the horizontal correlations
start_of_horizontal_correlations = len(correlation_matrix.columns) - num_horizontal_features

# Get the correlation matrix for horizontal vs horizontal - this is the bottom right corner of the matrix
return correlation_matrix.iloc[start_of_horizontal_correlations:, start_of_horizontal_correlations:]



def getCrossCorrelationMatrix(correlation_matrix:pd.DataFrame,
num_vertical_features:int):
""" Function to get the cross correlation matrix subsection for a correlation matrix. Gets the top right quadrant of the correlation matrix so vertical and horizontal features are correctly labeled.
Parameters
----------
correlation_matrix : pd.DataFrame
Dataframe containing the correlation matrix to get the cross correlation matrix subsection from.
num_vertical_features : int
Number of vertical features in the correlation matrix.
Returns
-------
pd.DataFrame
Dataframe containing the cross correlations from the correlation matrix.
"""

if num_vertical_features > correlation_matrix.shape[0]:
raise ValueError(f"Number of vertical features ({num_vertical_features}) is greater than the number of rows in the correlation matrix ({correlation_matrix.shape[0]}).")

if num_vertical_features > correlation_matrix.shape[1]:
raise ValueError(f"Number of vertical features ({num_vertical_features}) is greater than the number of columns in the correlation matrix ({correlation_matrix.shape[1]}).")

return correlation_matrix.iloc[0:num_vertical_features, num_vertical_features:]
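
As a rough sketch of how the functions in this file fit together, the example below joins two small feature tables, computes the full correlation matrix, and slices out the three quadrants with the same `iloc` ranges used above. The data, patient IDs, and column names are invented for illustration, and the pandas calls are used directly rather than through the readii functions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
patient_ids = ["p1", "p2", "p3", "p4", "p5"]  # hypothetical shared index

# Two hypothetical feature sets with partially overlapping column names
vertical = pd.DataFrame(rng.normal(size=(5, 2)), index=patient_ids, columns=["a", "b"])
horizontal = pd.DataFrame(rng.normal(size=(5, 3)), index=patient_ids, columns=["a", "b", "c"])

# Join on the index; suffixes are only applied to column names present in both frames
joined = vertical.join(horizontal, how="inner",
                       lsuffix="_vertical", rsuffix="_horizontal")
corr = joined.corr(method="pearson")

n_vert = vertical.shape[1]
vertical_self = corr.iloc[:n_vert, :n_vert]    # top left quadrant
horizontal_self = corr.iloc[n_vert:, n_vert:]  # bottom right quadrant
cross = corr.iloc[:n_vert, n_vert:]            # top right quadrant

print(cross.shape)          # (2, 3)
print(list(cross.columns))  # ['a_horizontal', 'b_horizontal', 'c']
```

Note that column `c` keeps its original name because it appears in only one frame, so the suffixes alone do not always identify which feature set a column came from; the quadrant slicing relies on column position instead.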