Databases
Carry-over from lecture 6
Whenever documentation says pip install …
, you probably want to add it to your requirements.txt
.
Who has experience with SQL? What have you used it for?
SQL syntax is largely the same across SQL databases:
- SQLite
- DuckDB
- PostgreSQL
- MySQL
- BigQuery
- Oracle
- …
The variations are known as "dialects".
- Mac: Assuming you have Homebrew set up, use the "package manager" option.
- Windows: Not available for Chocolatey yet, so probably easiest to do the "direct download".
-
Download and unzip the access to electricity dataset as a CSV.
-
Query it.
SELECT * FROM read_csv('[path].csv');
SELECT * FROM read_csv(
'[path].csv',
nullstr=['', '..']
);
CREATE TABLE electricity AS SELECT ...;
What's a question we might want to ask?
Column names with spaces need double quotes. Alternatively, normalize_names
.
SELECT
"Country name",
"2022 [YR2022]" - "1990 [YR1990]" AS diff
FROM electricity
ORDER BY diff DESC;
- Tabular
- DuckDB:
read_csv()
- Tables are like DataFrames
- Columns have types
- Column-based operations
SELECT
is like boolean indexingGROUP BY
is likegroupby()
pandas allows you to build up operations over multiple lines; harder to do that in SQL.
Lots of ways to connect to databases from Python, including:
- pandas
- SQLAlchemy
- Object Relational Mapper (ORM)
Allow you to use the same Python syntax across databases
How would you take data from an API and get it into a database?
- In the Google Cloud Console, make sure your Project is selected.
- Open BigQuery.
- Enable the API.
- Open a public dataset.
- Try
Category
ofEconomics
andPrice
ofFree
- Try
SELECT company_name, COUNT(*) AS num_complaints
FROM `bigquery-public-data.cfpb_complaints.complaint_database`
GROUP BY company_name
ORDER BY num_complaints DESC;
Coming soon