Commit 58da61d

author
pubudu
committed
Pandas data import - export
1 parent d9dc4bd commit 58da61d

7 files changed: +959 -553 lines changed

content/10.DataFrame_Manipulation.md

+223-50
@@ -1,24 +1,25 @@
1-
# DataFrame Manipulation & Sorting (25 mins)
2-
## Lecture Materials with Exercises
1+
# DataFrame Manipulation & Sorting
32

4-
### Introduction (2 minutes)
3+
## Introduction
54

6-
**Slide 1: DataFrame Manipulation & Sorting**
7-
- Now that we can import data, we need to reshape it for analysis
8-
- Most real-world datasets need significant manipulation before analysis
9-
- Selecting, adding, removing, and reordering data are fundamental skills
10-
- These operations build directly on our understanding of DataFrames as labeled, 2D structures
5+
**DataFrame Manipulation & Sorting:**
116

12-
**Talking Points:**
13-
- "In real-world data analysis, you'll spend about 80% of your time cleaning and manipulating data, and only 20% on actual analysis."
14-
- "The skills we're covering today form the backbone of data wrangling in Python."
15-
- "Think of these operations as transforming raw data into analysis-ready information."
7+
* Now that we can import data, we need to reshape it for analysis
8+
* Most real-world datasets need significant manipulation before analysis
9+
* Selecting, adding, removing, and reordering data are fundamental skills
10+
* These operations build directly on our understanding of DataFrames as labeled, 2D structures
1611

17-
---
12+
:::{discussion}
13+
14+
* In real-world data analysis, you'll spend about 80% of your time cleaning and manipulating data, and only 20% on actual analysis
15+
* The skills we're covering today form the backbone of data wrangling in Python
16+
* Think of these operations as transforming raw data into analysis-ready information
1817

19-
### 1. Column and Row Selection (7 minutes)
18+
:::
2019

21-
**Slide 2: Different Ways to Select Data**
20+
## Column and Row Selection
21+
22+
**Different Ways to Select Data:**
2223

2324
| Selection Type | Purpose | Example Syntax |
2425
|----------------|---------|----------------|
@@ -29,7 +30,10 @@
2930
| Row and column | Get specific value(s) | `df.loc['index', 'column']` |
3031
| Slicing | Get ranges of data | `df.loc['idx1':'idx2', 'col1':'col2']` |
3132

32-
**Code Example 1: Basic Selection**
33+
### Basic Selection
34+
35+
:::{demo}
36+
3337
```python
3438
import pandas as pd
3539
import numpy as np
@@ -79,7 +83,79 @@ print("\n8. Selecting subset of rows and columns:")
7983
print(df.loc['emp001':'emp003', ['Name', 'Age', 'Salary']])
8084
```
8185

82-
**Code Example 2: Advanced Selection with Conditions**
86+
Output
87+
88+
```none
89+
Original DataFrame:
90+
Name Age City Salary Department
91+
emp001 Alice 24 New York 65000 HR
92+
emp002 Bob 30 Boston 72000 Sales
93+
emp003 Charlie 35 Chicago 85000 Tech
94+
emp004 David 42 Seattle 92000 Tech
95+
emp005 Eva 28 Miami 70000 Finance
96+
97+
1. Single column as Series:
98+
emp001 24
99+
emp002 30
100+
emp003 35
101+
emp004 42
102+
emp005 28
103+
Name: Age, dtype: int64
104+
105+
2. Alternative syntax for columns without spaces:
106+
emp001 24
107+
emp002 30
108+
emp003 35
109+
emp004 42
110+
emp005 28
111+
Name: Age, dtype: int64
112+
113+
3. Selecting multiple columns:
114+
Name Salary
115+
emp001 Alice 65000
116+
emp002 Bob 72000
117+
emp003 Charlie 85000
118+
emp004 David 92000
119+
emp005 Eva 70000
120+
121+
4. Selecting row by index label:
122+
Name Charlie
123+
Age 35
124+
City Chicago
125+
Salary 85000
126+
Department Tech
127+
Name: emp003, dtype: object
128+
129+
5. Selecting row by position (third row):
130+
Name Charlie
131+
Age 35
132+
City Chicago
133+
Salary 85000
134+
Department Tech
135+
Name: emp003, dtype: object
136+
137+
6. Selecting multiple rows by position:
138+
Name Age City Salary Department
139+
emp002 Bob 30 Boston 72000 Sales
140+
emp003 Charlie 35 Chicago 85000 Tech
141+
emp004 David 42 Seattle 92000 Tech
142+
143+
7. Selecting specific value (cell):
144+
72000
145+
146+
8. Selecting subset of rows and columns:
147+
Name Age Salary
148+
emp001 Alice 24 65000
149+
emp002 Bob 30 72000
150+
emp003 Charlie 35 85000
151+
```
152+
153+
:::
154+
155+
### Advanced Selection with Conditions
156+
157+
:::{demo}
158+
83159
```python
84160
# Boolean selection - rows where Age > 30
85161
print("\n9. Boolean selection - employees over 30:")
@@ -98,15 +174,52 @@ print("\n12. Using .isin() - employees in HR or Finance:")
98174
print(df[df['Department'].isin(['HR', 'Finance'])])
99175
```
100176

101-
**Talking Points:**
102-
- "Notice that selecting a single column returns a Series, while selecting multiple columns maintains the DataFrame structure."
103-
- "The `.loc` accessor is used for label-based indexing, while `.iloc` is for position-based indexing."
104-
- "Boolean selection is incredibly powerful - it lets you filter data based on specific conditions."
105-
- "These selection methods can be combined in powerful ways to extract exactly the data you need."
177+
Output
106178

107-
**Exercise 1: Selection Practice (2 minutes)**
179+
```none
180+
9. Boolean selection - employees over 30:
181+
Name Age City Salary Department
182+
emp003 Charlie 35 Chicago 85000 Tech
183+
emp004 David 42 Seattle 92000 Tech
184+
185+
10. Multiple conditions - Tech department with salary > 80000:
186+
Name Age City Salary Department
187+
emp003 Charlie 35 Chicago 85000 Tech
188+
emp004 David 42 Seattle 92000 Tech
189+
190+
11. Using query method - same condition:
191+
Name Age City Salary Department
192+
emp003 Charlie 35 Chicago 85000 Tech
193+
emp004 David 42 Seattle 92000 Tech
194+
195+
12. Using .isin() - employees in HR or Finance:
196+
Name Age City Salary Department
197+
emp001 Alice 24 New York 65000 HR
198+
emp005 Eva 28 Miami 70000 Finance
199+
```
200+
201+
:::
202+
203+
:::{discussion}
204+
205+
* Notice that selecting a single column returns a Series, while selecting multiple columns maintains the DataFrame structure
206+
* The `.loc` accessor is used for label-based indexing, while `.iloc` is for position-based indexing
207+
* Boolean selection is incredibly powerful - it lets you filter data based on specific conditions
208+
* These selection methods can be combined in powerful ways to extract exactly the data you need
209+
210+
:::
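
The points above can be checked on a tiny frame. This is a minimal sketch, not part of the lecture material: the `demo` DataFrame and its variable names are hypothetical, chosen only to illustrate the Series/DataFrame distinction and the difference between `.loc` and `.iloc` slicing.

```python
import pandas as pd

# Hypothetical three-row frame, used only for this illustration
demo = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [24, 30, 35]},
    index=["emp001", "emp002", "emp003"],
)

single = demo["Age"]    # single brackets return a Series
multi = demo[["Age"]]   # a list of columns returns a DataFrame, even for one column

by_label = demo.loc["emp001":"emp002"]  # label slice includes the end label
by_position = demo.iloc[0:2]            # position slice excludes the end position

# Conditions combine with & and |, each wrapped in parentheses
over_25_not_bob = demo[(demo["Age"] > 25) & (demo["Name"] != "Bob")]

print(type(single).__name__, type(multi).__name__)
print(len(by_label), len(by_position))
print(over_25_not_bob)
```

Both slices return the same two rows here, which is easy to miss: `.loc` slices are inclusive of the end label, while `.iloc` follows ordinary Python slicing.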
211+
212+
:::{exercise}
213+
214+
**Selection Practice:**
215+
216+
Use the `inventory` DataFrame to:
217+
218+
* Select just the Product_Name and Price columns
219+
* Select all products that are in stock (In_Stock is True)
220+
* Select all electronics that cost less than 500
221+
* Select the 2nd and 3rd products using position-based indexing
108222

109-
Have students execute:
110223
```python
111224
# Create a dataset of product inventory
112225
products = {
@@ -119,7 +232,13 @@ products = {
119232
'Units': [15, 28, 0, 10, 45, 0]
120233
}
121234
inventory = pd.DataFrame(products)
235+
```
236+
237+
:::
122238

239+
:::{solution}
240+
241+
```python
123242
# Tasks:
124243
# 1. Select just the Product_Name and Price columns
125244
names_prices = inventory[['Product_Name', 'Price']]
@@ -143,13 +262,13 @@ print("\nSecond and third products:")
143262
print(second_third)
144263
```
145264

146-
**Expected Learning Outcome:** Students should understand the different ways to select data from a DataFrame, including column selection, row selection by label and position, boolean filtering, and combinations of these methods.
265+
Understand the different ways to select data from a DataFrame, including column selection, row selection by label and position, boolean filtering, and combinations of these methods
147266

148-
---
267+
:::
149268

150-
### 2. Adding and Removing Columns/Rows (6 minutes)
269+
## Adding and Removing Columns/Rows
151270

152-
**Slide 3: Modifying DataFrame Structure**
271+
**Modifying DataFrame Structure:**
153272

154273
| Operation | Method | Example |
155274
|-----------|--------|---------|
@@ -161,8 +280,25 @@ print(second_third)
161280
| Add row | Using loc | `df.loc['new_index'] = values` |
162281
| Add row | Using concat | `pd.concat([df, new_row])` |
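
The two row-adding entries in the table can be sketched as follows. The `team` frame and the new rows are hypothetical, made up for this illustration. Note that `DataFrame.append` was removed in pandas 2.0, so `pd.concat` is the approach that works on current versions.

```python
import pandas as pd

# Hypothetical frame, used only for this illustration
team = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Age": [24, 30]},
    index=["emp001", "emp002"],
)

# Option 1: assign to a new label with .loc (modifies the frame it is called on)
team_loc = team.copy()
team_loc.loc["emp003"] = ["Charlie", 35]

# Option 2: build a one-row DataFrame and concatenate (returns a new frame)
new_row = pd.DataFrame({"Name": ["David"], "Age": [42]}, index=["emp004"])
team_concat = pd.concat([team, new_row])

print(team_loc)
print(team_concat)
```

`pd.concat` leaves `team` untouched, whereas the `.loc` assignment changes the frame in place, which is why the sketch works on a copy.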
163282

164-
**Code Example 3: Adding and Removing Columns**
283+
### Adding and Removing Columns
284+
285+
:::{demo}
286+
165287
```python
288+
289+
# Create a sample dataset
290+
data = {
291+
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
292+
'Age': [24, 30, 35, 42, 28],
293+
'City': ['New York', 'Boston', 'Chicago', 'Seattle', 'Miami'],
294+
'Salary': [65000, 72000, 85000, 92000, 70000],
295+
'Department': ['HR', 'Sales', 'Tech', 'Tech', 'Finance']
296+
}
297+
df = pd.DataFrame(data)
298+
df.index = ['emp001', 'emp002', 'emp003', 'emp004', 'emp005'] # Custom index
299+
print("Original DataFrame:")
300+
print(df)
301+
166302
# Continuing with our employee DataFrame
167303
print("Original DataFrame:")
168304
print(df)
@@ -201,7 +337,12 @@ print(df_minimal)
201337
# df.drop(['City', 'Active'], axis=1, inplace=True)
202338
```
203339

204-
**Code Example 4: Adding and Removing Rows**
340+
:::
341+
342+
### Adding and Removing Rows
343+
344+
:::{demo}
345+
205346
```python
206347
# 1. Removing a row by index label
207348
df_no_bob = df.drop('emp002')
@@ -232,16 +373,58 @@ print("\n4. DataFrame with another new employee:")
232373
print(df_newer)
233374
```
234375

235-
**Talking Points:**
236-
- "Adding columns is a common operation, especially when you need to create derived fields or features."
237-
- "Notice that we can add columns based on calculations from other columns - this is ideal for metrics and KPIs."
238-
- "The `drop()` function is powerful but doesn't modify the original DataFrame unless you specify `inplace=True`."
239-
- "Adding rows is less common but useful for simulation, testing, or creating summary rows."
240-
- "Always be careful with the `axis` parameter - `axis=0` is for rows, `axis=1` is for columns."
376+
:::
241377

242-
**Exercise 2: Adding and Removing Data (2 minutes)**
378+
:::{discussion}
379+
380+
* Adding columns is a common operation, especially when you need to create derived fields or features
381+
* Notice that we can add columns based on calculations from other columns - this is ideal for metrics and KPIs
382+
* The `drop()` function is powerful but doesn't modify the original DataFrame unless you specify `inplace=True`
383+
* Adding rows is less common but useful for simulation, testing, or creating summary rows
384+
* Always be careful with the `axis` parameter - `axis=0` is for rows, `axis=1` is for columns
385+
386+
:::
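
A short sketch of the `drop()` and `axis` behaviour described above. The `staff` frame and the column names are hypothetical, chosen for this illustration only.

```python
import pandas as pd

# Hypothetical frame, used only for this illustration
staff = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "City": ["New York", "Boston"], "Age": [24, 30]},
    index=["emp001", "emp002"],
)

# Derived column computed from an existing one
staff["Age_Next_Year"] = staff["Age"] + 1

# drop() returns a new DataFrame; staff itself is unchanged
without_city = staff.drop("City", axis=1)    # axis=1 targets columns
without_bob = staff.drop("emp002", axis=0)   # axis=0 targets rows

print(list(staff.columns))         # still contains 'City'
print(list(without_city.columns))
print(list(without_bob.index))
```

The keyword forms `staff.drop(columns="City")` and `staff.drop(index="emp002")` do the same thing and avoid having to remember which axis number is which.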
387+
388+
:::{exercise}
389+
390+
**Adding and Removing Data:**
391+
392+
```python
393+
products = {
394+
'Product_ID': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006'],
395+
'Product_Name': ['Laptop', 'Smartphone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse'],
396+
'Category': ['Electronics', 'Electronics', 'Electronics',
397+
'Electronics', 'Accessories', 'Accessories'],
398+
'Price': [1200, 800, 350, 250, 75, 25],
399+
'In_Stock': [True, True, False, True, True, False],
400+
'Units': [15, 28, 0, 10, 45, 0]
401+
}
402+
inventory = pd.DataFrame(products)
403+
```
404+
405+
Use the `inventory` DataFrame to:
406+
407+
* Add a 'Value' column that multiplies Price by Units
408+
* Add a 'Status' column: 'Available' if In_Stock is True, 'Out of Stock' otherwise
409+
* Remove the In_Stock column (now redundant with Status)
410+
* Add a `new_product` row
411+
412+
```python
413+
new_product = pd.Series({
414+
'Product_ID': 'P007',
415+
'Product_Name': 'Headphones',
416+
'Category': 'Accessories',
417+
'Price': 150,
418+
'Units': 20,
419+
'Value': 3000,
420+
'Status': 'Available'
421+
})
422+
```
423+
424+
:::
425+
426+
:::{solution}
243427

244-
Have students execute:
245428
```python
246429
# Continue with the inventory DataFrame from Exercise 1
247430
print("Original inventory:")
@@ -263,25 +446,15 @@ print("\nInventory without In_Stock column:")
263446
print(inventory_updated)
264447

265448
# 4. Add a new product row
266-
new_product = pd.Series({
267-
'Product_ID': 'P007',
268-
'Product_Name': 'Headphones',
269-
'Category': 'Accessories',
270-
'Price': 150,
271-
'Units': 20,
272-
'Value': 3000,
273-
'Status': 'Available'
274-
})
275449
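# ignore_index=True renumbers the rows so the new product does not reuse index 0
inventory_final = pd.concat([inventory_updated, pd.DataFrame([new_product])], ignore_index=True)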
276450
print("\nInventory with new product:")
277451
print(inventory_final)
278452
```
279453

280-
**Expected Learning Outcome:** Students should understand how to add and remove columns and rows from a DataFrame using different methods, and how to create calculated columns based on existing data.
454+
:::
281455

282-
---
283456

284-
### 3. DataFrame Sorting (5 minutes)
457+
### DataFrame Sorting (5 minutes)
285458

286459
**Slide 4: Sorting Data**
287460
