marp | theme | paginate |
---|---|---|
true | default | true |
- Introduction to Python, Jupyter, and Chatbots
- Basic data types and structures
- Control flow, loops, and functions
- NumPy: Numerical Python
- Fundamental package for scientific computing in Python
- Provides support for large, multi-dimensional arrays and matrices
- Offers a wide range of mathematical functions
- Efficient: Optimized for performance
- Versatile: Supports various data types
- Integrates well with other libraries
- Essential for data analysis and scientific computing
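A minimal sketch of the points above, assuming nothing beyond NumPy itself (the variable names are illustrative only):

```python
import numpy as np

# A 2-D array (matrix) with an explicit data type
m = np.array([[1, 2, 3],
              [4, 5, 6]], dtype=np.float64)

print(m.shape)  # (rows, columns)
print(m.dtype)  # one element type shared by the whole array

# Mathematical functions apply element-wise, without an explicit loop
roots = np.sqrt(m)
col_sums = m.sum(axis=0)  # sum down each column
print(col_sums)
```

Every element of an array shares one data type, which is part of what lets NumPy store the data compactly and process it in compiled code.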
Let's solve the classic "chickens and rabbits in the same cage" problem:
- There are 35 heads and 94 legs in a cage of chickens and rabbits.
- How many chickens and rabbits are there?
We can use linear algebra to solve this system of equations:
- x + y = 35 (total heads)
- 2x + 4y = 94 (total legs)
Where x = number of chickens, y = number of rabbits
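Before reaching for a library, the two equations can be solved by hand with elimination. The sketch below mirrors that arithmetic in plain Python; the variable names are chosen for illustration:

```python
heads, legs = 35, 94

# If every animal were a chicken, there would be 2 * heads legs.
# Each rabbit contributes 2 extra legs, so the surplus counts the rabbits.
rabbits = (legs - 2 * heads) // 2
chickens = heads - rabbits

print(f"Chickens: {chickens}, Rabbits: {rabbits}")

# Substitute back into both equations as a check
assert chickens + rabbits == heads
assert 2 * chickens + 4 * rabbits == legs
```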
We can represent this system of equations in matrix form:
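$$ \begin{bmatrix} 1 & 1 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 35 \\ 94 \end{bmatrix} $$
Or more concisely:
$$ A\vec{x} = \vec{b} $$
Where:
- $A$ is the coefficient matrix
- $\vec{x}$ is the vector of unknowns (chickens and rabbits)
- $\vec{b}$ is the constant vector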
import numpy as np
# Define the coefficient matrix A and the constant vector b
A = np.array([[1, 1], # Coefficients for heads equation
[2, 4]]) # Coefficients for legs equation
b = np.array([35, 94]) # Constants (total heads and legs)
# Solve the system of equations
solution = np.linalg.solve(A, b)
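# solve() returns floats; round before converting so a value like 22.999... is not truncated to 22
print(f"Chickens: {int(round(solution[0]))}")
print(f"Rabbits: {int(round(solution[1]))}")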
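It is worth checking a numerical solution by substituting it back into the system. A short sketch, assuming the same `A` and `b` as above:

```python
import numpy as np

A = np.array([[1, 1],
              [2, 4]])
b = np.array([35, 94])
solution = np.linalg.solve(A, b)

# A @ solution should reproduce b up to floating-point rounding
residual_ok = np.allclose(A @ solution, b)
print(residual_ok)

# The result is a float array, so round to the nearest integer before converting
chickens, rabbits = np.rint(solution).astype(int)
print(chickens, rabbits)
```

`np.allclose` compares with a small tolerance, which is usually the right way to test floating-point results for equality.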
import numpy as np
# From a list
arr1 = np.array([1, 2, 3, 4, 5])
# Using NumPy functions
arr2 = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
arr3 = np.linspace(0, 1, 5) # [0, 0.25, 0.5, 0.75, 1]
arr4 = np.zeros((3, 3)) # 3x3 array of zeros
arr5 = np.ones((2, 4)) # 2x4 array of ones
arr6 = np.random.rand(3, 3) # 3x3 array of random values
- Array operations:
  - `np.reshape()`: Reshape an array
  - `np.concatenate()`: Join arrays
  - `np.split()`: Split an array
- Mathematical operations:
  - `np.sum()`, `np.mean()`, `np.std()`: Basic statistics
  - `np.min()`, `np.max()`: Find minimum and maximum values
  - `np.argmin()`, `np.argmax()`: Find indices of min/max values
- Linear algebra:
  - `np.dot()`: Matrix multiplication
  - `np.linalg.inv()`: Matrix inverse
  - `np.linalg.eig()`: Eigenvalues and eigenvectors
- Array manipulation:
  - `np.transpose()`: Transpose an array
  - `np.sort()`: Sort an array
  - `np.unique()`: Find unique elements
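A few of the functions listed above in action. This is a small sketch on made-up arrays, not an exhaustive tour:

```python
import numpy as np

a = np.arange(6)                 # [0 1 2 3 4 5]
grid = a.reshape(2, 3)           # 2 rows, 3 columns
joined = np.concatenate([a, a])  # 12 elements
parts = np.split(a, 3)           # three arrays of length 2

values = np.array([3, 1, 2, 3, 1])
total, average = values.sum(), values.mean()
idx_of_max = np.argmax(values)   # index of the first maximum
uniq = np.unique(values)         # sorted unique values

M = np.array([[2.0, 0.0],
              [0.0, 4.0]])
M_inv = np.linalg.inv(M)
identity = M @ M_inv             # same as np.dot(M, M_inv) for 2-D arrays
eigenvalues, eigenvectors = np.linalg.eig(M)
print(grid, uniq, eigenvalues, sep="\n")
```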
- GPT, Claude, and other AI assistants
- Use Python's built-in help function:
  import numpy as np
  help(np.array)
- Use IPython/Jupyter Notebook's tab completion and the `?` operator: `np.array?`
Let's compare the speed of evaluating x**2 + 2*x + 1 for every element of a large array:
import numpy as np
import time
# Create large arrays
size = 10000000
data = list(range(size))
np_data = np.array(data)
# Python list comprehension
# time.perf_counter() is a monotonic, high-resolution clock intended for timing code
start = time.perf_counter()
result_py = [x**2 + 2*x + 1 for x in data]
end = time.perf_counter()
print(f"Python time: {end - start:.6f} seconds")
# NumPy vectorized operation
start = time.perf_counter()
result_np = np_data**2 + 2*np_data + 1
end = time.perf_counter()
print(f"NumPy time: {end - start:.6f} seconds")
# NumPy is significantly faster due to its optimized C implementation.
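A single timing run can be noisy. The sketch below repeats the measurement with the standard-library `timeit` module and also confirms that both approaches give the same values. The array size and repeat counts are arbitrary choices for illustration, and actual timings depend on the machine:

```python
import timeit
import numpy as np

size = 100_000  # smaller than the example above so it runs quickly
data = list(range(size))
np_data = np.array(data, dtype=np.int64)  # explicit 64-bit ints avoid overflow in x**2

def python_version():
    return [x**2 + 2*x + 1 for x in data]

def numpy_version():
    return np_data**2 + 2*np_data + 1

# Both approaches should produce identical values
same = python_version() == numpy_version().tolist()
print(f"Same results: {same}")

# Take the best of several repeats to reduce the influence of one-off noise
t_py = min(timeit.repeat(python_version, number=5, repeat=3))
t_np = min(timeit.repeat(numpy_version, number=5, repeat=3))
print(f"Python: {t_py:.4f} s, NumPy: {t_np:.4f} s (5 runs each)")
```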
We'll use NumPy to analyze earthquake data:
import numpy as np
# Load earthquake data. Column 0 (UTC datetime) is skipped; columns 1-4 are read as floats.
# In the resulting array, index 2 is depth and index 3 is magnitude.
earthquakes = np.loadtxt("data/earthquakes.csv", delimiter=",", skiprows=1, usecols=(1, 2, 3, 4), dtype=float)
# Calculate average magnitude and depth
avg_depth = np.mean(earthquakes[:, 2])
avg_magnitude = np.mean(earthquakes[:, 3])
# Find the strongest earthquake
strongest_idx = np.argmax(earthquakes[:, 3])
strongest_magnitude = earthquakes[strongest_idx, 3]
strongest_depth = earthquakes[strongest_idx, 2]
print(f"Average magnitude: M{avg_magnitude:.2f}")
print(f"Average depth: {avg_depth:.2f} km")
print(f"Strongest earthquake: Magnitude {strongest_magnitude:.2f} at depth {strongest_depth:.2f} km")
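To try the same steps without the data file, the sketch below reads a few made-up rows from an in-memory string. The header names and all values are assumptions for illustration; only the column positions match the code above. It also shows boolean masking, a common way to filter rows:

```python
import io
import numpy as np

# Made-up sample rows; the real file's contents may differ
csv_text = """time,latitude,longitude,depth,magnitude
2024-01-01T00:00:00,35.0,-118.0,10.0,4.5
2024-01-02T00:00:00,36.0,-120.0,5.0,6.1
2024-01-03T00:00:00,34.0,-117.0,20.0,3.2
"""

quakes = np.loadtxt(io.StringIO(csv_text), delimiter=",",
                    skiprows=1, usecols=(1, 2, 3, 4))
depth = quakes[:, 2]
magnitude = quakes[:, 3]

# A boolean mask keeps only the rows where the condition holds
strong = quakes[magnitude >= 4.0]
strongest_idx = np.argmax(magnitude)
print(f"{len(strong)} events with M >= 4.0")
print(f"Mean depth of those events: {strong[:, 2].mean():.1f} km")
```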
- Pandas: Python Data Analysis Library
- Built on top of NumPy
- Provides high-performance, easy-to-use data structures and tools
- Essential for data manipulation and analysis
- Handles structured data efficiently
- Powerful data alignment and merging capabilities
- Integrates well with other libraries
- Excellent for handling time series data
- Built-in tools for reading/writing various file formats
- Series: 1D labeled array
- DataFrame: 2D labeled data structure with columns of potentially different types
import numpy as np
import pandas as pd
# Create a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
# Create a DataFrame
df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': pd.date_range('20230101', periods=4),
'C': pd.Series(1, index=range(4), dtype='float32'),
'D': np.array([3] * 4, dtype='int32'),
'E': pd.Categorical(["test", "train", "test", "train"]),
'F': 'foo'
})
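A quick way to see what these constructors produce is to inspect the resulting dtypes. The sketch below rebuilds a trimmed-down version of the objects above:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s.dtype)         # the NaN forces a floating-point dtype
print(s.isna().sum())  # number of missing entries

df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "C": pd.Series(1, index=range(4), dtype="float32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",        # a scalar is broadcast to every row
})
print(df.dtypes)       # one dtype per column
```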
- Data loading and saving:
  - `pd.read_csv()`, `pd.read_excel()`, `pd.read_sql()`
  - `df.to_csv()`, `df.to_excel()`, `df.to_sql()`
- Data inspection:
  - `df.head()`, `df.tail()`: View first/last rows
  - `df.info()`: Summary of DataFrame
  - `df.describe()`: Statistical summary
- Data selection:
  - `df['column']`: Select a column
  - `df.loc[]`: Label-based indexing
  - `df.iloc[]`: Integer-based indexing
- Data manipulation:
  - `df.groupby()`: Group data
  - `df.merge()`: Merge DataFrames
  - `df.pivot()`: Reshape data
- Data cleaning:
  - `df.dropna()`: Drop missing values
  - `df.fillna()`: Fill missing values
  - `df.drop_duplicates()`: Remove duplicate rows
- Time series functionality:
  - `pd.date_range()`: Create date ranges
  - `df.resample()`: Resample time series data
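Several of the methods listed above, applied to a tiny DataFrame. The data is hypothetical and invented purely for this illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for this illustration
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "temp": [20.0, np.nan, 15.0, 15.0, 17.0],
})

first_row = df.iloc[0]              # select by integer position
city_a = df.loc[df["city"] == "A"]  # select by label / boolean mask

filled = df.fillna({"temp": df["temp"].mean()})   # replace NaN with the column mean
deduped = df.drop_duplicates()                    # drops the repeated ("B", 15.0) row
mean_by_city = df.groupby("city")["temp"].mean()  # NaN is skipped by default

regions = pd.DataFrame({"city": ["A", "B"], "region": ["north", "south"]})
merged = df.merge(regions, on="city")             # adds a region column
print(mean_by_city)
```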
- GPT, Claude, and other AI assistants
- Use Python's built-in help function:
  import pandas as pd
  help(pd.DataFrame)
- Use IPython/Jupyter Notebook's tab completion and the `?` operator: `pd.DataFrame?`
- Pandas is built on top of NumPy
- Pandas adds functionality for handling structured data
- Pandas excels at:
  - Handling missing data
  - Data alignment
  - Merging and joining datasets
  - Time series functionality
- NumPy is better for:
  - Large numerical computations
  - Linear algebra operations
  - Performance-critical numerical code
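Two of the differences above, data alignment and missing-data handling, can be seen in a few lines. This is a sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Pandas aligns on index labels before combining
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])
aligned_sum = s1 + s2  # "a" and "d" have no partner, so they become NaN
print(aligned_sum)

# Pandas skips NaN in reductions by default; plain NumPy propagates it
pandas_mean = aligned_sum.mean()                     # mean of 12.0 and 23.0
numpy_mean = np.mean(aligned_sum.to_numpy())         # nan
numpy_nanmean = np.nanmean(aligned_sum.to_numpy())   # NumPy's NaN-aware variant
print(pandas_mean, numpy_mean, numpy_nanmean)
```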
We'll use Pandas to analyze earthquake data this time:
import pandas as pd
# Load earthquake data
df = pd.read_csv("data/earthquakes.csv")
# Calculate average magnitude and depth
avg_depth = df['depth'].mean()
avg_magnitude = df['magnitude'].mean()
# Find the strongest earthquake
strongest_idx = df['magnitude'].idxmax()
strongest_magnitude = df.loc[strongest_idx, 'magnitude']
strongest_depth = df.loc[strongest_idx, 'depth']
print(f"Average magnitude: M{avg_magnitude:.2f}")
print(f"Average depth: {avg_depth:.2f} km")
print(f"Strongest earthquake: Magnitude {strongest_magnitude:.2f} at depth {strongest_depth:.2f} km")
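Because columns are addressed by name, follow-up questions take little extra code. The sketch below uses a few made-up rows read from a string; the `depth` and `magnitude` column names follow the example above, while the other columns and all values are assumptions:

```python
import io
import pandas as pd

# Made-up sample rows; the real file's contents may differ
csv_text = """time,latitude,longitude,depth,magnitude
2024-01-01T00:00:00,35.0,-118.0,10.0,4.5
2024-01-02T00:00:00,36.0,-120.0,5.0,6.1
2024-01-03T00:00:00,34.0,-117.0,20.0,3.2
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"])

strong = df[df["magnitude"] >= 4.0]                    # filter rows by condition
top2 = df.nlargest(2, "magnitude")[["time", "depth", "magnitude"]]
summary = df[["depth", "magnitude"]].describe()        # count, mean, std, quartiles
print(top2)
print(summary)
```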
We'll use Pandas to analyze temperature data:
import matplotlib.pyplot as plt
import pandas as pd
# Load temperature data
df = pd.read_csv("data/global_temperature.csv")
# Convert date column to datetime
df["date"] = pd.to_datetime(df["date"])
# Set date as index
df.set_index("date", inplace=True)
# Find the hottest and coldest days
hottest_day = df["temperature"].idxmax()
coldest_day = df["temperature"].idxmin()
print(f"Hottest day: {hottest_day.date()} ({df.loc[hottest_day, 'temperature']:.1f}°C)")
print(f"Coldest day: {coldest_day.date()} ({df.loc[coldest_day, 'temperature']:.1f}°C)")
# Calculate yearly average temperatures ("Y" = year-end frequency; newer pandas releases prefer the alias "YE")
yearly_avg = df.resample("Y").mean()
# Plot yearly average temperatures
yearly_avg["temperature"].plot(figsize=(12, 6))
plt.title("Yearly Average Temperatures")
plt.ylabel("Temperature (°C)")
plt.show()
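The same resampling idea works at other frequencies. The sketch below computes monthly averages on synthetic daily data, invented so the example runs without a data file; the `temperature` column name follows the example above:

```python
import numpy as np
import pandas as pd

# Synthetic daily values (0, 1, 2, ...), invented for this illustration
dates = pd.date_range("2023-01-01", "2023-03-31", freq="D")
df = pd.DataFrame({"temperature": np.arange(len(dates), dtype=float)},
                  index=dates)

# "MS" groups by calendar month and labels each group with the month's first day
monthly_avg = df["temperature"].resample("MS").mean()
print(monthly_avg)

# idxmax() returns the index label (here a timestamp) of the largest value
hottest_day = df["temperature"].idxmax()
print(hottest_day.date())
```

Resampling requires a datetime index, which is why the earlier example converts the date column with `pd.to_datetime` and calls `set_index` first.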
- NumPy and Pandas are essential tools for data analysis in Python
- NumPy excels at numerical computations and array operations
- Pandas is great for structured data manipulation and analysis
- Both libraries integrate well with other scientific Python tools
- Practice and explore these libraries to become proficient in data analysis!