Commit 8d3d173

pubudu committed
Add Hands-on
1 parent cc8a8b2 commit 8d3d173

12 files changed: +101555 -47 lines changed

content/2.NumPy_Data_Types.md

+54-1
@@ -134,7 +134,7 @@ This homogeneity enables:
134134

135135
For bioinformatics applications, this homogeneity helps ensure consistency when processing large datasets of gene expression values, sequence reads, or alignment scores.
136136

137-
## Key NumPy Data Types for Bioinformatics
137+
## Key NumPy Data Types
138138

139139
### Integer Types
140140

@@ -329,6 +329,8 @@ Different bioinformatics tools may expect specific data types:
329329

330330
Being aware of these requirements helps create more robust analysis pipelines.
331331

332+
## Key Takeaways
333+
332334
:::{Keypoints}
333335

334336
NumPy's specialized data types provide significant advantages for bioinformatics applications:
@@ -347,3 +349,54 @@ By choosing appropriate data types, bioinformaticians can:
347349

348350
Understanding the distinctions between Python's general-purpose types and NumPy's specialized numeric types is essential for effective scientific programming in bioinformatics.
349351
:::

## Hands-on

:::{exercise} Hands-on

```python
# What is NumPy and why it's important for bioinformatics
# Performance advantages over Python lists
# Foundation for other scientific libraries

import numpy as np

# Read the CSV file into a NumPy array
## CSV file contains sample group information
data = np.genfromtxt("test_data/Sample_group_info.csv", delimiter=',', dtype='str')

# Print the NumPy array information

def print_array_info(array):
    # Get the shape of the array
    shape = array.shape

    # Get the number of dimensions of the array
    ndim = array.ndim

    # Get the data type of the array
    dtype = array.dtype

    # Get the number of elements in the array
    size = array.size

    print(f"Shape: {shape} \nNumber of dimensions: {ndim} \nData type: {dtype} \nSize: {size}")


print_array_info(data)

# Read the CSV file into a NumPy array with string dtype
## CSV file contains RNA count matrix
count_matrix = np.genfromtxt("test_data/count_matrix.csv", delimiter=',',
                             dtype='str')
print_array_info(count_matrix)

# Remove sample names from the count matrix (cm) - Delete the first row
## Convert the cm to a float32 array
print(count_matrix[0:5, 0:5])
print("___")
cm = np.delete(count_matrix, 0, axis=0).astype("float32")
print(cm[0:5, 0:5])
```

:::
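
The string-to-float conversion in the hands-on above can be tried without the CSV files. The following is a minimal sketch that assumes a tiny made-up array (`raw`, with hypothetical sample names `S1` to `S3`) standing in for the result of `np.genfromtxt(..., dtype='str')`; it is not the lesson's actual data.

```python
import numpy as np

# Hypothetical stand-in for a CSV read with dtype='str': a header row of
# sample names followed by rows of counts stored as text.
raw = np.array([
    ["S1", "S2", "S3"],
    ["10", "0", "7"],
    ["3", "25", "1"],
])

# Drop the header row, then convert the remaining text to float32.
counts = np.delete(raw, 0, axis=0).astype("float32")

print(raw.dtype.kind)   # 'U' means a Unicode string dtype
print(counts.dtype)     # float32
print(counts.shape)     # (2, 3)
print(counts.nbytes)    # 2 * 3 elements * 4 bytes each = 24
```

Converting only after the header row is removed matters here: a sample name such as `"S1"` cannot be parsed as a number, so calling `astype("float32")` on the full array would raise a `ValueError`.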

content/3.Indexing_and_Slicing.md

+21-18
@@ -144,7 +144,7 @@ print(gene_expr[::2, :2])
144144

145145
:::
146146

147-
**Real-world significance in bioinformatics:**
147+
## Real-world significance in bioinformatics
148148

149149
* Indexing:
150150
* Retrieving expression value for a specific gene in a specific condition
@@ -159,23 +159,6 @@ print(gene_expr[::2, :2])
159159
* Analyzing specific regions in protein contact maps
160160
* Extracting protein domains from structure coordinate arrays
161161

162-
:::{Keypoints}
163-
164-
* Efficient indexing and slicing are crucial for bioinformatics workflows
165-
* Key takeaways:
166-
* Indexing for accessing individual elements
167-
* Slicing for extracting regions of interest
168-
* Leverage both for efficient data manipulation in matrices (gene × condition, position × sequence, etc.)
169-
* Combine with boolean operations for filtering
170-
* Remember zero-based indexing
171-
* Common pitfalls:
172-
* Off-by-one errors (especially when converting between biology's 1-based and programming's 0-based systems)
173-
* Overlooking the exclusive upper bound in slicing (end index is not included)
174-
* Forgetting that modifying slices can modify the original array (use .copy() when needed)
175-
* Confusing row-major vs. column-major operations
176-
177-
:::
178-
179162
## Exercises - Array Indexing and Slicing Exercises
180163

181164
:::{exercise}
@@ -373,3 +356,23 @@ print("Condition 2 values < 20:", gene_expr[:, 1][condition2_low])
373356
```
374357

375358
:::

## Key Takeaways

:::{Keypoints}

* Efficient indexing and slicing are crucial for bioinformatics workflows
* Key takeaways:
  * Indexing for accessing individual elements
  * Slicing for extracting regions of interest
  * Leverage both for efficient data manipulation in matrices (gene × condition, position × sequence, etc.)
  * Combine with boolean operations for filtering
  * Remember zero-based indexing
* Common pitfalls:
  * Off-by-one errors (especially when converting between biology's 1-based and programming's 0-based systems)
  * Overlooking the exclusive upper bound in slicing (end index is not included)
  * Forgetting that modifying slices can modify the original array (use .copy() when needed)
  * Confusing row-major vs. column-major operations

:::
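
Two of the pitfalls listed above, the exclusive upper bound and slices that share memory with the original array, can be illustrated with a short sketch. The array below is made up for illustration (a hypothetical 4 genes × 3 conditions matrix), not data from the lesson.

```python
import numpy as np

# Hypothetical matrix: 4 genes (rows) x 3 conditions (columns).
gene_expr = np.arange(12).reshape(4, 3)

# The end index is exclusive: rows 0 and 1 are returned, row 2 is not.
first_two = gene_expr[0:2, :]
print(first_two.shape)   # (2, 3)

# Basic slicing returns a view, so writing to it changes gene_expr too.
view = gene_expr[:, 0]
view[0] = 99
print(gene_expr[0, 0])   # 99

# .copy() gives an independent array; gene_expr is left untouched.
safe = gene_expr[:, 1].copy()
safe[0] = -1
print(gene_expr[0, 1])   # still 1
```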

content/4.Advance_indexing_filtering.md

+74-9
@@ -215,15 +215,6 @@ Boolean masking and `np.where()` operations are highly optimized in NumPy. They:
215215

216216
For large datasets, these techniques are drastically faster than traditional iteration.
217217

218-
:::{Keypoints}
219-
220-
* Boolean masking provides an intuitive way to filter arrays based on conditions
221-
* `np.where()` in its single-argument form finds indices where conditions are true
222-
* `np.where(condition, x, y)` acts as a vectorized if-else statement
223-
* `np.isin()` lets us filter based on membership in a set of values
224-
225-
:::
226-
227218
## Exercises: NumPy Boolean Masking and Advanced Filtering
228219

229220
:::{exercise}
@@ -346,3 +337,77 @@ print(f"A: {a_count}, T: {t_count}, G: {g_count}, C: {c_count}")
346337
```
347338

348339
:::

## Key Takeaways

:::{Keypoints}

* Boolean masking provides an intuitive way to filter arrays based on conditions
* `np.where()` in its single-argument form finds indices where conditions are true
* `np.where(condition, x, y)` acts as a vectorized if-else statement
* `np.isin()` lets us filter based on membership in a set of values

:::
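
The three functions named in the key points can be compared side by side. This is a minimal sketch on made-up arrays (`groups` and `samples` are hypothetical stand-ins for the sample group table), intended only to show what each call returns.

```python
import numpy as np

# Hypothetical group labels and sample names, aligned by position.
groups = np.array(["iweak", "istrong", "iweak", "istrong"])
samples = np.array(["S1", "S2", "S3", "S4"])

# Single-argument form: returns a tuple holding one index array per dimension.
idx_tuple = np.where(groups == "iweak")
idx = idx_tuple[0]
print(idx)      # [0 2]

# Three-argument form: element-wise if-else.
labels = np.where(groups == "iweak", "weak", "strong")
print(labels)   # ['weak' 'strong' 'weak' 'strong']

# Membership test: which of these names belong to the 'iweak' samples?
mask = np.isin(np.array(["S3", "S9", "S1"]), samples[idx])
print(mask)     # [ True False  True]
```

Because the single-argument form returns a tuple, indexing it with `[0]` is what yields a plain index array for a 1-D input.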

## Hands-on

:::{exercise} Hands-on

```python

import numpy as np

# Read the CSV file into a NumPy array
data = np.genfromtxt("test_data/Sample_group_info.csv", delimiter=',', dtype='str')

def print_array_info(array):
    # Get the shape of the array
    shape = array.shape
    # Get the number of dimensions of the array
    ndim = array.ndim
    # Get the data type of the array
    dtype = array.dtype
    # Get the number of elements in the array
    size = array.size
    print(f"Shape: {shape} \nNumber of dimensions: {ndim} \nData type: {dtype} \nSize: {size}")

# Find the indices of the rows whose second column is 'iweak'
## np.where with a single argument returns a tuple of index arrays
iweak_index = np.where(data[:, 1] == 'iweak')
print(iweak_index)
print_array_info(iweak_index[0])

# Find the indices of the rows whose second column is 'iweak'
## Assign the index array to iweak_index (not the tuple returned by np.where)
iweak_index = np.where(data[:, 1] == 'iweak')[0]
print_array_info(iweak_index)

# Find the indices of the rows whose second column is 'istrong'
## Assign the index array to istrong_index (not the tuple returned by np.where)
istrong_index = np.where(data[:, 1] == 'istrong')[0]
print(istrong_index)
print_array_info(istrong_index)

# Load count matrix
count_matrix = np.genfromtxt("test_data/count_matrix.csv", delimiter=',', dtype='str')

# Preview the first 5 rows and 5 columns of the count matrix
print(count_matrix[0:5, 0:5])
print("___")

# Create a boolean mask marking the count matrix columns whose sample group is 'iweak'
cm_iweak_mask = np.isin(count_matrix[0, :], data[iweak_index, 0])
print(cm_iweak_mask[:30])

# Find the indices of the columns in the count matrix where the sample group is 'iweak'
cm_weak_cols = np.where(cm_iweak_mask)[0]
print(cm_weak_cols)
print_array_info(cm_weak_cols)

# Find the indices of the columns in the count matrix where the sample group is 'istrong'
cm_strong_cols = np.where(np.isin(count_matrix[0, :], data[istrong_index, 0]))[0]
print(cm_strong_cols)
print_array_info(cm_strong_cols)

```

:::

content/5.Essential_array_operations.md

+97
@@ -361,3 +361,100 @@ print(f"Row sums: {row_sums}") # [6 15]
361361
```
362362

363363
:::

:::{Keypoints}

* **Reshaping Arrays:** Maintain the total number of elements when reshaping; use -1 for automatic dimension calculation.
* **Concatenation of Arrays:** Combine arrays while matching dimensions, except along the concatenation axis.
* **Statistical Functions:** Utilize NumPy’s statistical functions for data analysis, operating across different axes.
* **Error Handling:** Be aware of shape requirements for concatenation to avoid errors.
:::
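
The key points above can be checked with a few lines. The following is a minimal sketch using small made-up arrays (all variable names are illustrative), showing `-1` in `reshape`, a valid concatenation, and the error raised when shapes do not match.

```python
import numpy as np

a = np.arange(6)

# -1 asks NumPy to infer the missing dimension: 6 elements / 2 rows = 3 columns.
b = a.reshape(2, -1)
print(b.shape)         # (2, 3)

# Concatenating along axis 0 requires the same number of columns.
c = np.ones((1, 3), dtype=a.dtype)
stacked = np.concatenate([b, c], axis=0)
print(stacked.shape)   # (3, 3)

# A column-count mismatch raises ValueError.
try:
    np.concatenate([b, np.ones((2, 2), dtype=a.dtype)], axis=0)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)  # True

# Statistical functions take an axis argument: axis=0 gives one value per column.
col_means = stacked.mean(axis=0)
print(col_means)
```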

## Hands-on

:::{exercise} Hands-on

```python

import numpy as np

# Read the CSV file into a NumPy array
data = np.genfromtxt("test_data/Sample_group_info.csv", delimiter=',', dtype='str')

def print_array_info(array):
    # Get the shape of the array
    shape = array.shape
    # Get the number of dimensions of the array
    ndim = array.ndim
    # Get the data type of the array
    dtype = array.dtype
    # Get the number of elements in the array
    size = array.size
    print(f"Shape: {shape} \nNumber of dimensions: {ndim} \nData type: {dtype} \nSize: {size}")

# Find the indices of the rows whose second column is 'iweak'
## np.where with a single argument returns a tuple of index arrays
iweak_index = np.where(data[:, 1] == 'iweak')
print(iweak_index)
print_array_info(iweak_index[0])

# Find the indices of the rows whose second column is 'iweak'
## Assign the index array to iweak_index (not the tuple returned by np.where)
iweak_index = np.where(data[:, 1] == 'iweak')[0]
print_array_info(iweak_index)

# Find the indices of the rows whose second column is 'istrong'
## Assign the index array to istrong_index (not the tuple returned by np.where)
istrong_index = np.where(data[:, 1] == 'istrong')[0]
print(istrong_index)
print_array_info(istrong_index)

# Load count matrix
count_matrix = np.genfromtxt("test_data/count_matrix.csv", delimiter=',', dtype='str')

# Preview the first 5 rows and 5 columns of the count matrix
print(count_matrix[0:5, 0:5])
print("___")

# Create a boolean mask marking the count matrix columns whose sample group is 'iweak'
cm_iweak_mask = np.isin(count_matrix[0, :], data[iweak_index, 0])
print(cm_iweak_mask[:30])

# Find the indices of the columns in the count matrix where the sample group is 'iweak'
cm_weak_cols = np.where(cm_iweak_mask)[0]
print(cm_weak_cols)
print_array_info(cm_weak_cols)

# Find the indices of the columns in the count matrix where the sample group is 'istrong'
cm_strong_cols = np.where(np.isin(count_matrix[0, :], data[istrong_index, 0]))[0]
print(cm_strong_cols)
print_array_info(cm_strong_cols)

# Remove sample names from the count matrix (cm) - Delete the first row
## Convert the cm to a float32 array
print(count_matrix[0:5, 0:5])
print("___")
cm = np.delete(count_matrix, 0, axis=0).astype("float32")
print(cm[0:5, 0:5])

# Convert cm to log scale (add 1 so that zero counts map to 0)
cm = np.log2(cm + 1)
print(cm)
print_array_info(cm)

# Calculate mean and STD of each gene in iweak samples
iweak_mean = cm[:, cm_weak_cols].mean(1)  ## Mean of iweak samples
iweak_std = cm[:, cm_weak_cols].std(1)    ## STD of iweak samples

print(cm.shape)
print("--------")
print(iweak_mean[:5], iweak_mean.shape)
print("--------")
print(iweak_mean[:5, np.newaxis], iweak_mean[:, np.newaxis].shape)

# Calculate mean and STD of each gene in istrong samples
istrong_mean = cm[:, cm_strong_cols].mean(1)  ## Mean of istrong samples
istrong_std = cm[:, cm_strong_cols].std(1)    ## STD of istrong samples

```

:::
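
The log transform and per-gene statistics in the hands-on above can be followed on a matrix small enough to check by hand. This is a minimal sketch that assumes a made-up 2 genes × 3 samples count matrix; the centering step at the end is an illustrative use of `np.newaxis`, not something the hands-on itself performs.

```python
import numpy as np

# Hypothetical counts chosen so that log2(count + 1) gives whole numbers.
cm = np.array([[0, 1, 3],
               [7, 15, 31]], dtype="float32")

# log2(x + 1) keeps zero counts finite (log2(0 + 1) = 0).
log_cm = np.log2(cm + 1)

# axis=1 collapses the sample axis: one mean and one STD per gene (row).
gene_mean = log_cm.mean(axis=1)
gene_std = log_cm.std(axis=1)    # population STD (ddof=0) by default
print(gene_mean.shape)           # (2,)

# np.newaxis turns shape (2,) into (2, 1) so it broadcasts across columns.
centered = log_cm - gene_mean[:, np.newaxis]
print(centered)
```

Without `np.newaxis`, subtracting a shape `(2,)` array from a shape `(2, 3)` matrix would fail to broadcast, because NumPy aligns shapes from the last axis.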
