- # DataFrame Manipulation & Sorting (25 mins)
- ## Lecture Materials with Exercises
+ # DataFrame Manipulation & Sorting

- ### Introduction (2 minutes)
+ ## Introduction

- **Slide 1: DataFrame Manipulation & Sorting**
- - Now that we can import data, we need to reshape it for analysis
- - Most real-world datasets need significant manipulation before analysis
- - Selecting, adding, removing, and reordering data are fundamental skills
- - These operations build directly on our understanding of DataFrames as labeled, 2D structures
+ **DataFrame Manipulation & Sorting:**

- **Talking Points:**
- - "In real-world data analysis, you'll spend about 80% of your time cleaning and manipulating data, and only 20% on actual analysis."
- - "The skills we're covering today form the backbone of data wrangling in Python."
- - "Think of these operations as transforming raw data into analysis-ready information."
+ * Now that we can import data, we need to reshape it for analysis
+ * Most real-world datasets need significant manipulation before analysis
+ * Selecting, adding, removing, and reordering data are fundamental skills
+ * These operations build directly on our understanding of DataFrames as labeled, 2D structures

- ---
+ :::{discussion}
+
+ * In real-world data analysis, you'll spend about 80% of your time cleaning and manipulating data, and only 20% on actual analysis
+ * The skills we're covering today form the backbone of data wrangling in Python
+ * Think of these operations as transforming raw data into analysis-ready information

- ### 1. Column and Row Selection (7 minutes)
+ :::

- **Slide 2: Different Ways to Select Data**
+ ## Column and Row Selection
+
+ **Different Ways to Select Data:**

| Selection Type | Purpose | Example Syntax |
|----------------|---------|----------------|
| Row and column | Get specific value(s) | `df.loc['index', 'column']` |
| Slicing | Get ranges of data | `df.loc['idx1':'idx2', 'col1':'col2']` |

- **Code Example 1: Basic Selection**
+ ### Basic Selection
+
+ :::{demo}
+
```python
import pandas as pd
import numpy as np
@@ -79,7 +83,79 @@ print("\n8. Selecting subset of rows and columns:")
print(df.loc['emp001':'emp003', ['Name', 'Age', 'Salary']])
```

- **Code Example 2: Advanced Selection with Conditions**
+ Output
+
+ ```none
+ Original DataFrame:
+            Name  Age      City  Salary  Department
+ emp001    Alice   24  New York   65000          HR
+ emp002      Bob   30    Boston   72000       Sales
+ emp003  Charlie   35   Chicago   85000        Tech
+ emp004    David   42   Seattle   92000        Tech
+ emp005      Eva   28     Miami   70000     Finance
+
+ 1. Single column as Series:
+ emp001    24
+ emp002    30
+ emp003    35
+ emp004    42
+ emp005    28
+ Name: Age, dtype: int64
+
+ 2. Alternative syntax for columns without spaces:
+ emp001    24
+ emp002    30
+ emp003    35
+ emp004    42
+ emp005    28
+ Name: Age, dtype: int64
+
+ 3. Selecting multiple columns:
+            Name  Salary
+ emp001    Alice   65000
+ emp002      Bob   72000
+ emp003  Charlie   85000
+ emp004    David   92000
+ emp005      Eva   70000
+
+ 4. Selecting row by index label:
+ Name          Charlie
+ Age                35
+ City          Chicago
+ Salary          85000
+ Department       Tech
+ Name: emp003, dtype: object
+
+ 5. Selecting row by position (third row):
+ Name          Charlie
+ Age                35
+ City          Chicago
+ Salary          85000
+ Department       Tech
+ Name: emp003, dtype: object
+
+ 6. Selecting multiple rows by position:
+            Name  Age     City  Salary  Department
+ emp002      Bob   30   Boston   72000       Sales
+ emp003  Charlie   35  Chicago   85000        Tech
+ emp004    David   42  Seattle   92000        Tech
+
+ 7. Selecting specific value (cell):
+ 72000
+
+ 8. Selecting subset of rows and columns:
+            Name  Age  Salary
+ emp001    Alice   24   65000
+ emp002      Bob   30   72000
+ emp003  Charlie   35   85000
+ ```
+
+ :::
+
+ ### Advanced Selection with Conditions
+
+ :::{demo}
+
```python
# Boolean selection - rows where Age > 30
print("\n9. Boolean selection - employees over 30:")
@@ -98,15 +174,52 @@ print("\n12. Using .isin() - employees in HR or Finance:")
print(df[df['Department'].isin(['HR', 'Finance'])])
```

- **Talking Points:**
- - "Notice that selecting a single column returns a Series, while selecting multiple columns maintains the DataFrame structure."
- - "The `.loc` accessor is used for label-based indexing, while `.iloc` is for position-based indexing."
- - "Boolean selection is incredibly powerful - it lets you filter data based on specific conditions."
- - "These selection methods can be combined in powerful ways to extract exactly the data you need."
+ Output

- **Exercise 1: Selection Practice (2 minutes)**
+ ```none
+ 9. Boolean selection - employees over 30:
+            Name  Age     City  Salary  Department
+ emp003  Charlie   35  Chicago   85000        Tech
+ emp004    David   42  Seattle   92000        Tech
+
+ 10. Multiple conditions - Tech department with salary > 80000:
+            Name  Age     City  Salary  Department
+ emp003  Charlie   35  Chicago   85000        Tech
+ emp004    David   42  Seattle   92000        Tech
+
+ 11. Using query method - same condition:
+            Name  Age     City  Salary  Department
+ emp003  Charlie   35  Chicago   85000        Tech
+ emp004    David   42  Seattle   92000        Tech
+
+ 12. Using .isin() - employees in HR or Finance:
+          Name  Age      City  Salary  Department
+ emp001  Alice   24  New York   65000          HR
+ emp005    Eva   28     Miami   70000     Finance
+ ```
+
+ :::
+
+ :::{discussion}
+
+ * Notice that selecting a single column returns a Series, while selecting multiple columns maintains the DataFrame structure
+ * The `.loc` accessor is used for label-based indexing, while `.iloc` is for position-based indexing
+ * Boolean selection is incredibly powerful - it lets you filter data based on specific conditions
+ * These selection methods can be combined in powerful ways to extract exactly the data you need
+
+ :::
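
The discussion points above can be checked on a small frame. This is a minimal sketch, not part of the lecture's own demo: the three-row `df` and the variable names are made up for illustration. It shows the Series versus DataFrame distinction, and one detail worth knowing when moving between `.loc` and `.iloc`: a label slice includes its end label, while a position slice excludes its end position.

```python
import pandas as pd

# Hypothetical three-row frame, used only for this illustration
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [24, 30, 35]},
    index=["emp001", "emp002", "emp003"],
)

single = df["Age"]      # one label -> Series
multi = df[["Age"]]     # list of labels -> DataFrame, even with one column

by_label = df.loc["emp001":"emp002"]   # label slice: end label is included
by_position = df.iloc[0:2]             # position slice: end position is excluded

over_25 = df[df["Age"] > 25]           # boolean mask keeps rows where it is True

print(type(single).__name__, type(multi).__name__)
print(by_label.index.tolist(), by_position.index.tolist())
print(over_25["Name"].tolist())
```

Both slices return the same two rows here, even though one is written `"emp001":"emp002"` and the other `0:2`.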
+
+ :::{exercise}
+
+ **Selection Practice:**
+
+ Use the `inventory` DataFrame to:
+
+ * Select just the Product_Name and Price columns
+ * Select all products that are in stock (In_Stock is True)
+ * Select all electronics that cost less than 500
+ * Select the 2nd and 3rd products using position-based indexing

- Have students execute:
```python
# Create a dataset of product inventory
products = {
@@ -119,7 +232,13 @@ products = {
    'Units': [15, 28, 0, 10, 45, 0]
}
inventory = pd.DataFrame(products)
+ ```
+
+ :::

+ :::{solution}
+
+ ```python
# Tasks:
# 1. Select just the Product_Name and Price columns
names_prices = inventory[['Product_Name', 'Price']]
@@ -143,13 +262,13 @@ print("\nSecond and third products:")
print(second_third)
```

- **Expected Learning Outcome:** Students should understand the different ways to select data from a DataFrame, including column selection, row selection by label and position, boolean filtering, and combinations of these methods.
+ Understand the different ways to select data from a DataFrame, including column selection, row selection by label and position, boolean filtering, and combinations of these methods

- ---
+ :::

- ### 2. Adding and Removing Columns/Rows (6 minutes)
+ ## Adding and Removing Columns/Rows

- **Slide 3: Modifying DataFrame Structure**
+ **Modifying DataFrame Structure:**

| Operation | Method | Example |
|-----------|--------|---------|
@@ -161,8 +280,25 @@ print(second_third)
| Add row | Using loc | `df.loc['new_index'] = values` |
| Add row | Using append/concat | ` pd.concat([df, new_row]) ` |
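
The two "Add row" entries can be sketched as follows. The frame and the new rows are hypothetical and exist only for this illustration. Note that `DataFrame.append` was removed in pandas 2.0, so `pd.concat` is the option that works across current versions.

```python
import pandas as pd

# Hypothetical starting frame for this illustration
df = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Age": [24, 30]},
    index=["emp001", "emp002"],
)

# Assigning to a label that does not exist yet adds the row in place
df.loc["emp003"] = ["Charlie", 35]

# pd.concat returns a new DataFrame and leaves its inputs unchanged
new_row = pd.DataFrame({"Name": ["David"], "Age": [42]}, index=["emp004"])
combined = pd.concat([df, new_row])

print(combined)
```

The `.loc` assignment changes `df` itself, while `pd.concat` builds a separate result, which matters when the original frame is needed later.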

- **Code Example 3: Adding and Removing Columns**
+ ### Adding and Removing Columns
+
+ :::{demo}
+
```python
+
+ # Create a sample dataset
+ data = {
+     'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
+     'Age': [24, 30, 35, 42, 28],
+     'City': ['New York', 'Boston', 'Chicago', 'Seattle', 'Miami'],
+     'Salary': [65000, 72000, 85000, 92000, 70000],
+     'Department': ['HR', 'Sales', 'Tech', 'Tech', 'Finance']
+ }
+ df = pd.DataFrame(data)
+ df.index = ['emp001', 'emp002', 'emp003', 'emp004', 'emp005']  # Custom index
+ print("Original DataFrame:")
+ print(df)
+
# Continuing with our employee DataFrame
print("Original DataFrame:")
print(df)
@@ -201,7 +337,12 @@ print(df_minimal)
# df.drop(['City', 'Active'], axis=1, inplace=True)
```

- **Code Example 4: Adding and Removing Rows**
+ :::
+
+ ### Adding and Removing Rows
+
+ :::{demo}
+
```python
# 1. Removing a row by index label
df_no_bob = df.drop('emp002')
@@ -232,16 +373,58 @@ print("\n4. DataFrame with another new employee:")
print(df_newer)
```

- **Talking Points:**
- - "Adding columns is a common operation, especially when you need to create derived fields or features."
- - "Notice that we can add columns based on calculations from other columns - this is ideal for metrics and KPIs."
- - "The `drop()` function is powerful but doesn't modify the original DataFrame unless you specify `inplace=True`."
- - "Adding rows is less common but useful for simulation, testing, or creating summary rows."
- - "Always be careful with the `axis` parameter - `axis=0` is for rows, `axis=1` is for columns."
+ :::

- **Exercise 2: Adding and Removing Data (2 minutes)**
+ :::{discussion}
+
+ * Adding columns is a common operation, especially when you need to create derived fields or features
+ * Notice that we can add columns based on calculations from other columns - this is ideal for metrics and KPIs
+ * The `drop()` function is powerful but doesn't modify the original DataFrame unless you specify `inplace=True`
+ * Adding rows is less common but useful for simulation, testing, or creating summary rows
+ * Always be careful with the `axis` parameter - `axis=0` is for rows, `axis=1` is for columns
+
+ :::
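
The points about `drop()` and `axis` can be verified directly. A minimal sketch, assuming a made-up two-row frame and a hypothetical `Monthly` column: it shows that `drop()` returns a new DataFrame by default, and that the `columns=` keyword is an equivalent spelling of `axis=1` that avoids having to remember which axis is which.

```python
import pandas as pd

# Hypothetical frame for this illustration
df = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "City": ["New York", "Boston"], "Salary": [65000, 72000]},
    index=["emp001", "emp002"],
)

# Derived column computed from an existing one
df["Monthly"] = df["Salary"] / 12

no_city = df.drop("City", axis=1)        # axis=1: drop a column
no_bob = df.drop("emp002", axis=0)       # axis=0: drop a row by index label
no_city_kw = df.drop(columns=["City"])   # same result as axis=1, spelled out

# The original is unchanged because inplace=True was not passed
print("City" in df.columns, "City" in no_city.columns)
print(no_bob.index.tolist())
```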
+
+ :::{exercise}
+
+ **Adding and Removing Data:**
+
+ ```python
+ products = {
+     'Product_ID': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006'],
+     'Product_Name': ['Laptop', 'Smartphone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse'],
+     'Category': ['Electronics', 'Electronics', 'Electronics',
+                  'Electronics', 'Accessories', 'Accessories'],
+     'Price': [1200, 800, 350, 250, 75, 25],
+     'In_Stock': [True, True, False, True, True, False],
+     'Units': [15, 28, 0, 10, 45, 0]
+ }
+ inventory = pd.DataFrame(products)
+ ```
+
+ Use the `inventory` DataFrame to:
+
+ * Add a 'Value' column that multiplies Price by Units
+ * Add a 'Status' column: 'Available' if In_Stock is True, 'Out of Stock' otherwise
+ * Remove the In_Stock column (now redundant with Status)
+ * Add a `new_product` row
+
+ ```python
+ new_product = pd.Series({
+     'Product_ID': 'P007',
+     'Product_Name': 'Headphones',
+     'Category': 'Accessories',
+     'Price': 150,
+     'Units': 20,
+     'Value': 3000,
+     'Status': 'Available'
+ })
+ ```
+
+ :::
+
+ :::{solution}

- Have students execute:
```python
# Continue with the inventory DataFrame from Exercise 1
print("Original inventory:")
@@ -263,25 +446,15 @@ print("\nInventory without In_Stock column:")
print(inventory_updated)

# 4. Add a new product row
- new_product = pd.Series({
-     'Product_ID': 'P007',
-     'Product_Name': 'Headphones',
-     'Category': 'Accessories',
-     'Price': 150,
-     'Units': 20,
-     'Value': 3000,
-     'Status': 'Available'
- })
inventory_final = pd.concat([inventory_updated, pd.DataFrame([new_product])])
print("\nInventory with new product:")
print(inventory_final)
```

- **Expected Learning Outcome:** Students should understand how to add and remove columns and rows from a DataFrame using different methods, and how to create calculated columns based on existing data.
+ :::

- ---

- ### 3. DataFrame Sorting (5 minutes)
+ ### DataFrame Sorting (5 minutes)

**Slide 4: Sorting Data**
