PyEarth: A Python Introduction to Earth Science

Lecture 2: NumPy and Pandas


Review of Lecture 1:

  • Introduction to Python, Jupyter, and Chatbots
  • Basic data types and structures
  • Control flow, loops, and functions

Introduction to NumPy

  • NumPy: Numerical Python
  • Fundamental package for scientific computing in Python
  • Provides support for large, multi-dimensional arrays and matrices
  • Offers a wide range of mathematical functions

Why NumPy?

  • Efficient: Optimized for performance
  • Versatile: Supports various data types
  • Integrates well with other libraries
  • Essential for data analysis and scientific computing

Linear Algebra and NumPy?

Let's solve the classic "chickens and rabbits in the same cage" problem:

  • There are 35 heads and 94 legs in a cage of chickens and rabbits.
  • How many chickens and rabbits are there?

We can use linear algebra to solve this system of equations:

  1. x + y = 35 (total heads)
  2. 2x + 4y = 94 (total legs)

Where x = number of chickens, y = number of rabbits


Matrix Representation

We can represent this system of equations in matrix form:

$$ \begin{bmatrix} 1 & 1 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 35 \\ 94 \end{bmatrix} $$

Or more concisely:

$$ A\vec{x} = \vec{b} $$

Where:

  • $A$ is the coefficient matrix
  • $\vec{x}$ is the vector of unknowns (chickens and rabbits)
  • $\vec{b}$ is the constant vector

Solving with NumPy

import numpy as np

# Define the coefficient matrix A and the constant vector b
A = np.array([[1, 1],   # Coefficients for heads equation
              [2, 4]])  # Coefficients for legs equation
b = np.array([35, 94])  # Constants (total heads and legs)

# Solve the system of equations
solution = np.linalg.solve(A, b)

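# The solver returns floats; round before converting so that a value
# such as 22.999999 is not truncated to 22
print(f"Chickens: {int(round(solution[0]))}")
print(f"Rabbits: {int(round(solution[1]))}")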
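
One way to check a solver's output is to multiply it back through the coefficient matrix. The sketch below does that for the same A and b; the names `residual_ok`, `chickens`, and `rabbits` are illustrative and not part of the lecture code.

```python
import numpy as np

A = np.array([[1, 1],
              [2, 4]])
b = np.array([35, 94])

solution = np.linalg.solve(A, b)

# Multiply back: A @ solution should reproduce b up to floating-point error
residual_ok = np.allclose(A @ solution, b)

# Round to the nearest integer before converting, since the solver returns floats
chickens, rabbits = (int(v) for v in np.rint(solution))

print(residual_ok, chickens, rabbits)
```

`np.allclose` is used instead of `==` because floating-point results are rarely exact.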

Creating NumPy Arrays

import numpy as np

# From a list
arr1 = np.array([1, 2, 3, 4, 5])

# Using NumPy functions
arr2 = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]
arr3 = np.linspace(0, 1, 5)  # [0, 0.25, 0.5, 0.75, 1]
arr4 = np.zeros((3, 3))  # 3x3 array of zeros
arr5 = np.ones((2, 4))  # 2x4 array of ones
arr6 = np.random.rand(3, 3)  # 3x3 array of random values
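
After creating an array it often helps to inspect its shape and data type. A minimal sketch, reusing three of the constructors above:

```python
import numpy as np

arr2 = np.arange(0, 10, 2)    # the stop value 10 is excluded
arr3 = np.linspace(0, 1, 5)   # the stop value 1 is included
arr4 = np.zeros((3, 3))

print(arr2.shape, arr2.dtype)  # 1D array of integers
print(arr3.shape, arr3.dtype)  # 1D array of floats
print(arr4.ndim, arr4.size)    # number of dimensions, total element count
```

Note the difference between the two range helpers: `np.arange` takes a step size, while `np.linspace` takes the number of points.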

Useful NumPy Functions

  1. Array operations:

    • np.reshape(): Reshape an array
    • np.concatenate(): Join arrays
    • np.split(): Split an array
  2. Mathematical operations:

    • np.sum(), np.mean(), np.std(): Basic statistics
    • np.min(), np.max(): Find minimum and maximum values
    • np.argmin(), np.argmax(): Find indices of min/max values
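
The functions listed above can be tried on small arrays. This is a sketch with made-up values, not data from the course:

```python
import numpy as np

a = np.arange(6)                 # [0 1 2 3 4 5]
grid = np.reshape(a, (2, 3))     # 2 rows, 3 columns
joined = np.concatenate([a, a])  # 12 elements
parts = np.split(a, 3)           # list of three arrays, two elements each

depths = np.array([12.0, 5.0, 30.0, 8.0])  # made-up values
print(np.sum(depths), np.mean(depths), np.std(depths))
print(np.min(depths), np.max(depths))
print(np.argmin(depths), np.argmax(depths))  # positions of the min and max
```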

Useful NumPy Functions (cont.)

  1. Linear algebra:

    • np.dot(): Dot product and matrix multiplication (for 2D matrices, np.matmul() or the @ operator is preferred)
    • np.linalg.inv(): Matrix inverse
    • np.linalg.eig(): Eigenvalues and eigenvectors
  2. Array manipulation:

    • np.transpose(): Transpose an array
    • np.sort(): Sort an array
    • np.unique(): Find unique elements
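
A short sketch of the linear algebra and manipulation functions above, using a small diagonal matrix chosen so the results are easy to verify by eye:

```python
import numpy as np

M = np.array([[2.0, 0.0],
              [0.0, 3.0]])
v = np.array([1.0, 1.0])

product = np.dot(M, v)           # matrix-vector product
M_inv = np.linalg.inv(M)
inverse_ok = np.allclose(M @ M_inv, np.eye(2))

# For a diagonal matrix the eigenvalues are the diagonal entries
eigenvalues, eigenvectors = np.linalg.eig(M)

row = np.array([[1, 2, 3]])
column = np.transpose(row)       # shape (1, 3) becomes (3, 1)
unique_values = np.unique(np.array([3, 1, 3, 2, 1]))  # sorted, no repeats

print(product, inverse_ok, np.sort(eigenvalues))
print(column.shape, unique_values)
```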

How to Find NumPy Functions

  1. GPT, Claude, and other AI assistants
  2. Use Python's built-in help function:
    import numpy as np
    help(np.array)
  3. Use IPython/Jupyter Notebook's tab completion and ? operator:
    np.array?

NumPy vs. Basic Python: Speed Comparison

Let's compare the speed of evaluating x² + 2x + 1 for every element of a large array:

import numpy as np
import time

# Create large arrays
size = 10000000
data = list(range(size))
np_data = np.array(data)

# Python list comprehension
start = time.perf_counter()  # perf_counter is the clock intended for timing code
result_py = [x**2 + 2*x + 1 for x in data]
end = time.perf_counter()
print(f"Python time: {end - start:.6f} seconds")

# NumPy vectorized operation
start = time.perf_counter()
result_np = np_data**2 + 2*np_data + 1
end = time.perf_counter()
print(f"NumPy time: {end - start:.6f} seconds")

# NumPy is significantly faster due to its optimized C implementation. 

Real-world Example: Analyzing Earthquake Data

We'll use NumPy to analyze earthquake data:

import numpy as np

# Load earthquake data; column 0 (UTC datetime) is skipped because it is not numeric
# Of the four columns loaded, index 2 holds depth (km) and index 3 holds magnitude
earthquakes = np.loadtxt("data/earthquakes.csv", delimiter=",", skiprows=1, usecols=(1, 2, 3, 4), dtype=float)

# Calculate average magnitude and depth
avg_depth = np.mean(earthquakes[:, 2])
avg_magnitude = np.mean(earthquakes[:, 3])

# Find the strongest earthquake
strongest_idx = np.argmax(earthquakes[:, 3])
strongest_magnitude = earthquakes[strongest_idx, 3]
strongest_depth = earthquakes[strongest_idx, 2]

print(f"Average magnitude: M{avg_magnitude:.2f}")
print(f"Average depth: {avg_depth:.2f} km")
print(f"Strongest earthquake: Magnitude {strongest_magnitude:.2f} at depth {strongest_depth:.2f} km")
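
Beyond averages and maxima, boolean masks are a common way to select a subset of rows. The sketch below uses a small made-up array in place of `data/earthquakes.csv`; the meaning of the first two columns (latitude, longitude) is an assumption, and only the depth and magnitude positions follow the example above.

```python
import numpy as np

# Made-up rows: latitude, longitude (both assumed), depth (km), magnitude
earthquakes = np.array([
    [35.0, -120.0, 10.0, 4.5],
    [36.0, -121.0, 80.0, 6.1],
    [34.5, -119.5, 5.0, 3.2],
])
depth = earthquakes[:, 2]
magnitude = earthquakes[:, 3]

shallow = depth < 20.0                      # boolean mask, one entry per row
n_shallow = np.count_nonzero(shallow)
shallow_mean_mag = magnitude[shallow].mean()

print(n_shallow, shallow_mean_mag)
```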

Introduction to Pandas

  • Pandas: Python Data Analysis Library
  • Built on top of NumPy
  • Provides high-performance, easy-to-use data structures and tools
  • Essential for data manipulation and analysis

Why Pandas?

  • Handles structured data efficiently
  • Powerful data alignment and merging capabilities
  • Integrates well with other libraries
  • Excellent for handling time series data
  • Built-in tools for reading/writing various file formats

Pandas Data Structures

  1. Series: 1D labeled array
  2. DataFrame: 2D labeled data structure with columns of potentially different types

import numpy as np
import pandas as pd

# Create a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Create a DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': pd.date_range('20230101', periods=4),
    'C': pd.Series(1, index=range(4), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo'
})
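
A quick way to see what these structures hold is to inspect their dtypes and missing values. A minimal sketch, using a shortened version of the DataFrame above:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s.dtype)         # float64, because NaN is a float
print(s.isna().sum())  # number of missing values
print(s.mean())        # NaN is skipped by default

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'F': 'foo',        # a scalar is broadcast to every row
})
print(df.shape)
print(df.dtypes)
```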

Useful Pandas Functions

  1. Data loading and saving:

    • pd.read_csv(), pd.read_excel(), pd.read_sql()
    • df.to_csv(), df.to_excel(), df.to_sql()
  2. Data inspection:

    • df.head(), df.tail(): View first/last rows
    • df.info(): Summary of DataFrame
    • df.describe(): Statistical summary
  3. Data selection:

    • df['column']: Select a column
    • df.loc[]: Label-based indexing
    • df.iloc[]: Integer position-based indexing
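
The difference between label-based and position-based selection is easiest to see with a non-numeric index. The station codes and values below are hypothetical:

```python
import pandas as pd

# Hypothetical table indexed by station code
df = pd.DataFrame(
    {"depth": [10.0, 35.0, 5.0], "magnitude": [4.1, 6.3, 2.8]},
    index=["STA", "STB", "STC"],
)

col = df["magnitude"]                   # one column, returned as a Series
by_label = df.loc["STB", "depth"]       # row label, column label
by_position = df.iloc[1, 0]             # row position, column position
strong = df.loc[df["magnitude"] > 4.0]  # a boolean mask selects rows

print(by_label, by_position)
print(strong)
```

Both `by_label` and `by_position` point at the same cell; `loc` finds it by name and `iloc` by position.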

Useful Pandas Functions (cont.)

  1. Data manipulation:

    • df.groupby(): Group data
    • df.merge(): Merge DataFrames
    • df.pivot(): Reshape data
  2. Data cleaning:

    • df.dropna(): Drop missing values
    • df.fillna(): Fill missing values
    • df.drop_duplicates(): Remove duplicate rows
  3. Time series functionality:

    • pd.date_range(): Create date ranges
    • df.resample(): Resample time series data
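
The cleaning and manipulation functions above are often chained. A sketch on a made-up event table (the region and network names are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up events with one missing magnitude and one duplicated row
events = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "magnitude": [4.0, np.nan, 5.0, 3.0, 3.0],
})

cleaned = events.dropna()                   # drop the row with a missing magnitude
filled = events.fillna({"magnitude": 0.0})  # or replace the missing value instead
deduped = cleaned.drop_duplicates()         # drop the repeated ("south", 3.0) row

mean_by_region = deduped.groupby("region")["magnitude"].mean()

# Hypothetical lookup table, joined on the shared "region" column
regions = pd.DataFrame({"region": ["north", "south"], "network": ["NN", "SS"]})
merged = deduped.merge(regions, on="region")

print(mean_by_region)
print(merged)
```

Whether to drop or fill missing values depends on the analysis; filling with 0.0 here is only to show the call.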

How to Find Pandas Functions

  1. GPT, Claude, and other AI assistants
  2. Use Python's built-in help function:
    import pandas as pd
    help(pd.DataFrame)
  3. Use IPython/Jupyter Notebook's tab completion and ? operator:
    pd.DataFrame?

Pandas vs. NumPy

  • Pandas is built on top of NumPy
  • Pandas adds functionality for handling structured data
  • Pandas excels at:
    • Handling missing data
    • Data alignment
    • Merging and joining datasets
    • Time series functionality
  • NumPy is better for:
    • Large numerical computations
    • Linear algebra operations
    • Performance-critical code
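
Data alignment is the clearest behavioral difference between the two libraries. In this sketch, two Series share the same labels in a different order:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["c", "b", "a"])

# Pandas matches values by index label before adding
aligned_sum = s1 + s2

# The underlying NumPy arrays are added element by element, by position
positional_sum = s1.to_numpy() + s2.to_numpy()

print(aligned_sum)
print(positional_sum)
```

With Pandas, the value labeled "a" in `s1` is added to the value labeled "a" in `s2`; with NumPy, the first elements are added regardless of label.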

Real-world Example: Revisit the Earthquake Data

We'll use Pandas to analyze earthquake data this time:

import pandas as pd

# Load earthquake data
df = pd.read_csv("data/earthquakes.csv")

# Calculate average magnitude and depth
avg_depth = df['depth'].mean()
avg_magnitude = df['magnitude'].mean()

# Find the strongest earthquake
strongest_idx = df['magnitude'].idxmax()
strongest_magnitude = df.loc[strongest_idx, 'magnitude']
strongest_depth = df.loc[strongest_idx, 'depth']

print(f"Average magnitude: M{avg_magnitude:.2f}")
print(f"Average depth: {avg_depth:.2f} km")
print(f"Strongest earthquake: Magnitude {strongest_magnitude:.2f} at depth {strongest_depth:.2f} km")

Real-world Example: Analyzing Temperature Data

We'll use Pandas to analyze temperature data:

import matplotlib.pyplot as plt
import pandas as pd

# Load temperature data
df = pd.read_csv("data/global_temperature.csv")

# Convert date column to datetime
df["date"] = pd.to_datetime(df["date"])

# Set date as index
df.set_index("date", inplace=True)

# Find the hottest and coldest days
hottest_day = df["temperature"].idxmax()
coldest_day = df["temperature"].idxmin()

print(f"Hottest day: {hottest_day.date()} ({df.loc[hottest_day, 'temperature']:.1f}°C)")
print(f"Coldest day: {coldest_day.date()} ({df.loc[coldest_day, 'temperature']:.1f}°C)")

# Calculate yearly average temperatures
# ("Y" is the year-end alias; pandas 2.2 and later prefer "YE")
yearly_avg = df.resample("Y").mean()

# Plot yearly average temperatures
yearly_avg["temperature"].plot(figsize=(12, 6))

plt.title("Yearly Average Temperatures")
plt.ylabel("Temperature (°C)")
plt.show()
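
To experiment with resampling without the data file, a synthetic series works as a stand-in. The sketch below builds two years of a made-up seasonal cycle and resamples it with the start-of-period aliases `"YS"` and `"MS"`, so each average is labeled with the first day of its year or month.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: a made-up seasonal cycle, not real temperatures
dates = pd.date_range("2021-01-01", "2022-12-31", freq="D")
day_of_year = dates.dayofyear.to_numpy()
temperature = 15.0 + 10.0 * np.sin(2 * np.pi * day_of_year / 365.0)
df = pd.DataFrame({"temperature": temperature}, index=dates)

# Resample the daily values into yearly and monthly means
yearly_avg = df["temperature"].resample("YS").mean()
monthly_avg = df["temperature"].resample("MS").mean()

print(yearly_avg)
print(len(monthly_avg))
```

Resampling requires a datetime index, which is why the example above converts the date column and sets it as the index first.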

Conclusion

  • NumPy and Pandas are essential tools for data analysis in Python
  • NumPy excels at numerical computations and array operations
  • Pandas is great for structured data manipulation and analysis
  • Both libraries integrate well with other scientific Python tools
  • Practice and explore these libraries to become proficient in data analysis!