Add "Advanced Python DS Ecosystem" course materials #3

Draft · wants to merge 47 commits into base: main
50e8272
Add course overview
ccauet Oct 9, 2023
e1dae85
Add DB notebooks
ccauet Oct 9, 2023
bfa40ee
Fix data path
ccauet Oct 9, 2023
bd76b0d
add polars notebooks
jkuehlem Oct 18, 2023
792d457
Update dependencies
ccauet Oct 20, 2023
af812b0
Add object-oriented programming notebook
Oct 21, 2023
da2a63a
Fix some typos and add links to db notebooks in index.
Oct 23, 2023
426e9d3
streamlit example of stromnetz
sd-p8 Oct 23, 2023
d4057fe
Update scipy and statsmodels
ccauet Oct 23, 2023
d059226
Upgrade scikit-learn
ccauet Oct 23, 2023
37612d3
Update project dependencies
ccauet Oct 23, 2023
db71a8d
Update copyright notice
ccauet Oct 23, 2023
c981687
Add copyright notice
ccauet Oct 23, 2023
7345877
Fix relativ path
ccauet Oct 23, 2023
9cee91a
Linting
ccauet Oct 23, 2023
63833b9
Update requirement.txt
ccauet Oct 23, 2023
3b9387d
Remove an incompatible package by hand
ccauet Oct 23, 2023
4664ba3
Create requirements file by hand
ccauet Oct 23, 2023
e659751
Remove python from req file
ccauet Oct 23, 2023
a93828c
No versions constraints
ccauet Oct 23, 2023
dc5475e
Add compose file to setup aux services
ccauet Oct 23, 2023
538043f
Update port mapping
ccauet Oct 23, 2023
048c534
Update MongoDB client to include port and authentication.
Oct 24, 2023
4ddb681
Add polars notebook links to index.
Oct 24, 2023
ce3d77c
Minor changes in polars notebooks.
Oct 24, 2023
839e152
Update dependencies
ccauet Oct 24, 2023
18f0c25
update streamlit example
sd-p8 Oct 24, 2023
92f778a
add streamlit config
sd-p8 Oct 24, 2023
18d2140
add oop and streamlit to index
sd-p8 Oct 24, 2023
59e0f1b
add exercise streamlit
sd-p8 Oct 24, 2023
38ab2b7
fix path to data
sd-p8 Oct 24, 2023
8c6077e
re-add pathlib
sd-p8 Oct 25, 2023
869b536
add licences
sd-p8 Oct 25, 2023
22572c6
Add 'timeit' to polars notebook.
Oct 30, 2023
177334c
Add the current course index to the first level and update paths acco…
Oct 30, 2023
f99fb8d
Add type hints to OOD notebook.
Oct 30, 2023
8451101
add output of db schema information
sd-p8 Nov 13, 2023
f35f3ff
Fix some typos and remove unnecessary lines of code from polars noteb…
Nov 24, 2023
befd5b2
Remove duplicated index, update paths in ape-index, update polars int…
Nov 24, 2023
8347d06
Add APE to official data-science-learning-paths index.
Nov 24, 2023
5040f00
Rework polars notebooks. Split in exercise and solution and add a 'be…
Nov 24, 2023
ae78aaf
Delete old polars notebook.
Nov 24, 2023
591b4ae
Try to use black format in polars notebooks.
Nov 24, 2023
8e47f5b
Add more information for the DB-API sqlite notebook and update the in…
Nov 24, 2023
dd10538
Add introductory text for ORM notebook.
Nov 24, 2023
55ba215
Add introductory text for NoSQL notebook with MongoDB.
Nov 24, 2023
8a525ad
Minor update in pandas-sql notebook.
Nov 24, 2023
Binary file modified notebooks/data-science-learning-paths-concept.png
14 changes: 14 additions & 0 deletions notebooks/data-science-learning-paths.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -80,6 +80,20 @@
"- **Index notebook**: [📓Machine Learning on Time Series](index/mlts-machine-learning-time-series.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Advanced Python Data Science Ecosystem [APE]\n",
"\n",
"A 2-day advanced course on independent topics, including development of Python packages with Poetry, object-oriented programming, an introduction to databases, dashboards with Streamlit, and an introduction to the Polars library.\n",
"\n",
"- **Level**: Advanced\n",
"- **Duration**: 2 days\n",
"- **Prerequisites**: DAP+MLP\n",
"- **Index notebook**: [📓Advanced Python Data Science Ecosystem](index/ape-advanced-python-ds-ecosystem-2day.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
Expand Down
Binary file added notebooks/db/RDB_example.png
Binary file added notebooks/db/data/firmenlauf_demo.db
5 changes: 5 additions & 0 deletions notebooks/db/data/participants.csv
@@ -0,0 +1,5 @@
First Name;Last Name;Shoe Size;Shirt Size;Distance;Team;
Anna;Einstein;38;38;5;3;
Marius;Fermi;44;60;2;5;
James;Pauli;44;42;10;8;
Selma;Meitner;41;40;10;3;
4 changes: 4 additions & 0 deletions notebooks/db/data/teams.csv
@@ -0,0 +1,4 @@
ID;Size;Shoe Color;;;;
3;16;Red;;;;
5;15;Green;;;;
8;11;Purple;;;;
10 changes: 10 additions & 0 deletions notebooks/db/data/training.csv
@@ -0,0 +1,10 @@
ID;Date(YYYY-MM-DD);Time(mm:ss);Distance(km);Runner;;
1;2023-07-15;39:00;4.5;Anna;;
2;2023-08-05;58:00;3;Marius;;
3;2023-08-07;34:45;1.6;James;;
4;2023-07-08;32:00;4.05;Selma;;
5;2023-07-18;35:00;4.5;Anna;;
6;2023-07-25;30:00;4.5;Anna;;
7;2023-09-07;37:00;5.456;Selma;;
8;2023-07-19;41:51;2.24;James;;
9;2023-07-28;32:06;1.6;James;;
261 changes: 261 additions & 0 deletions notebooks/db/db-pandas-sql.ipynb
@@ -0,0 +1,261 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "9f37e58c-94cf-4930-8ad9-1f724740083f",
"metadata": {},
"source": [
"# Pandas + SQL(Alchemy)\n",
"\n",
"Pandas is a powerful tool for working with data frames, and it can also talk to databases:\n",
"we can load single tables from an existing database into dataframes, or create new tables from dataframes, without specifying any schema.\n",
"SQLAlchemy does this under the hood.\n",
"\n",
"Working with Pandas and SQL always loads tables into a dataframe; we do *not* get Python objects as we did with the ORM in SQLAlchemy.\n",
"\n",
"Be aware that dataframes do *not* know about any relations you might have established with SQLAlchemy!\n",
"\n",
"**Be aware**: Working with SQL+Pandas is usually only a comfortable shortcut for simple use cases and \"quick-and-dirty\" approaches, e.g. when you need a simple lookup of some data. It also works well if your amount of data fits into a dataframe and you plan to load it once from the DB and do everything else in Pandas anyway.\n",
"For more complex tasks involving joining, aggregating, grouping, or selecting on a large data volume, you should rely on the features of your (relational) DB itself and perform these steps via SQLAlchemy."
]
},
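The trade-off described above can be sketched with a small in-memory SQLite table (an editorial illustration with made-up data, not part of the course notebook): an aggregation can run inside the DB so that only the small result is transferred, or the full table can be loaded and aggregated in Pandas.

```python
import sqlite3

import pandas as pd

# Small in-memory demo table (illustrative data only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (runner TEXT, distance REAL)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?)",
    [("Anna", 4.5), ("Anna", 5.0), ("James", 1.6)],
)

# Option 1: aggregate inside the DB, transfer only the aggregated rows
df_sql = pd.read_sql(
    "SELECT runner, SUM(distance) AS total FROM runs GROUP BY runner", conn
)

# Option 2: transfer the whole table, aggregate in Pandas
df_all = pd.read_sql("SELECT * FROM runs", conn)
df_pd = df_all.groupby("runner", as_index=False)["distance"].sum()

conn.close()
```

Both routes give the same totals; the difference is how much data crosses the DB boundary, which is what matters at large volumes.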
{
"cell_type": "markdown",
"id": "92f25dbc-afab-4413-881e-0a8c9c6787c2",
"metadata": {},
"source": [
"# Read from a DB with Pandas"
]
},
{
"cell_type": "markdown",
"id": "0112bdf6-91e2-444a-b344-aefe7f185b7f",
"metadata": {},
"source": [
"## Open Connection\n",
"\n",
"First, we have to establish a connection to the DB we have already filled.\n",
"In this example, we use the SQLite DB-API (`sqlite3`) for this task.\n",
"You can do the same with other systems, e.g. PostgreSQL, using the respective DB-API."
]
},
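A side note (an editorial sketch, not from the course material): a plain `sqlite3` connection is not closed automatically, and sqlite3's own `with conn:` only manages transactions, not closing. Wrapping the connection in `contextlib.closing` guarantees cleanup even if a query raises. The in-memory DB below is a stand-in for a file such as `data/firmenlauf_demo.db`.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# closing() calls conn.close() when the block exits, even on errors.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE demo (x INTEGER)")
    conn.execute("INSERT INTO demo VALUES (1), (2)")
    df_demo = pd.read_sql("SELECT * FROM demo", conn)
```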
{
"cell_type": "code",
"execution_count": null,
"id": "890c3e23-b4e8-4a5c-a8b1-1d9b36b2a688",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import sqlite3"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1ab98ccf-b753-4b3a-a24c-2093aacc3b04",
"metadata": {},
"outputs": [],
"source": [
"connection = sqlite3.connect(\"data/firmenlauf_demo.db\")"
]
},
{
"cell_type": "markdown",
"id": "070ba9db-9c33-4380-8e85-a9dcc43c3583",
"metadata": {},
"source": [
"## Run SQL Queries with Pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "766d3ccf-b612-48a7-96f3-2866b98e7dbb",
"metadata": {},
"outputs": [],
"source": [
"# Load the whole table \"teams\" into a dataframe\n",
"df_teams = pd.read_sql(\"SELECT * FROM teams\", connection)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "927e0dad-675f-4b05-af81-30be0cd7c544",
"metadata": {},
"outputs": [],
"source": [
"df_teams"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4100ce2e-11e1-476d-9375-192ae74c5de0",
"metadata": {},
"outputs": [],
"source": [
"# For better readability, we define the query string separately\n",
"# Note that we have to JOIN two tables explicitly in SQL when combining data from both\n",
"sql_query_runner_shoe_color = \"\"\"\n",
" SELECT runners.first_name, runners.shoe_size, teams.shoe_color \n",
" FROM runners\n",
" JOIN teams\n",
" ON runners.team_id = teams.id\n",
"\"\"\"\n",
"\n",
"df_shoes = pd.read_sql(sql_query_runner_shoe_color, connection)"
]
},
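When part of a query comes from user input, it is safer to pass it via `read_sql`'s `params` argument than to splice it into the SQL string by hand. A self-contained sketch with a throwaway in-memory table (the data is illustrative, not the course DB):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (id INTEGER, shoe_color TEXT)")
conn.executemany("INSERT INTO teams VALUES (?, ?)", [(3, "Red"), (5, "Green")])

# The "?" placeholder is filled in by the DB driver, which escapes the
# value and thus prevents SQL injection.
wanted = "Red"
df_red = pd.read_sql(
    "SELECT * FROM teams WHERE shoe_color = ?", conn, params=(wanted,)
)
conn.close()
```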
{
"cell_type": "code",
"execution_count": null,
"id": "30c95984-57ff-447d-adaf-f18f8c4a1f19",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"df_shoes"
]
},
{
"cell_type": "markdown",
"id": "3a03cd89-70ca-4a38-919f-9c966db4548b",
"metadata": {},
"source": [
"# Add a Table to a DB with Pandas\n",
"\n",
"Let's say we want to add a new table containing the ranking from the actual Firmenlauf and the prize money each team gets.\n",
"We first create a dataframe and then add it as a new table to the DB.\n",
"Note that we cannot add any relationships as we did with SQLAlchemy, since we do not use an ORM here."
]
},
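Two `to_sql` parameters are worth knowing before the next cells (a sketch against a throwaway in-memory DB, assuming nothing beyond the standard Pandas API): `index=False` keeps the dataframe index from becoming an extra column, and `if_exists` controls whether an existing table is replaced or appended to.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
df_a = pd.DataFrame({"rank": [1, 2], "team_id": [4, 3], "prize": [5000, 2000]})
df_b = pd.DataFrame({"rank": [3], "team_id": [2], "prize": [1000]})

# Without index=False, the dataframe index would be stored as a column.
df_a.to_sql("rankings", conn, if_exists="replace", index=False)
# if_exists="append" adds rows instead of dropping the existing table.
df_b.to_sql("rankings", conn, if_exists="append", index=False)

df_back = pd.read_sql("SELECT * FROM rankings", conn)
conn.close()
```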
{
"cell_type": "code",
"execution_count": null,
"id": "4ae82b51-6ab6-4552-bac1-937ceb11a8dc",
"metadata": {},
"outputs": [],
"source": [
"df_ranking = pd.DataFrame({\"rank\": [1, 2, 3, 4], \"team_id\": [4, 3, 2, 1], \"prize\": [5000, 2000, 1000, 500]})"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eaa43902-2fd7-4ceb-a1cb-1804b39a0dcc",
"metadata": {},
"outputs": [],
"source": [
"df_ranking"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d95e6c3f-cd0b-4bba-b674-a443d83e93c7",
"metadata": {},
"outputs": [],
"source": [
"# Write the dataframe as a table to the DB, replacing it if it already exists (in the real world, this could cause data loss!).\n",
"df_ranking.to_sql(\"rankings\", connection, if_exists=\"replace\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eedfced1-3a67-4f97-a253-f3996658f988",
"metadata": {},
"outputs": [],
"source": [
"# Read the newly added table back into a dataframe\n",
"pd.read_sql(\"SELECT * FROM rankings\", connection)"
]
},
{
"cell_type": "markdown",
"id": "1cc2e5d0-c7da-4ae0-a00e-d2c2a247348b",
"metadata": {},
"source": [
"# Show DB schema information"
]
},
{
"cell_type": "markdown",
"id": "650c92e5-d1af-4168-9ff4-15e7be36dea1",
"metadata": {},
"source": [
"The `sqlite_master` table contains all information about the DB schema.\n",
"\n",
"To get a more structured output of the available tables and their columns, we define the following function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b51d21e5-daa5-4060-8043-f20d19a0151e",
"metadata": {},
"outputs": [],
"source": [
"def table_info(c, conn):\n",
"    \"\"\"Print all columns of every table in the DB.\n",
"\n",
"    c : cursor object\n",
"    conn : database connection object\n",
"    \"\"\"\n",
"    tables = c.execute(\"SELECT name FROM sqlite_master WHERE type='table';\").fetchall()\n",
"    for (table_name,) in tables:  # tables is a list of single-item tuples\n",
"        # LIMIT 0 fetches only the header, i.e. the column names\n",
"        table = pd.read_sql_query(f\"SELECT * FROM {table_name} LIMIT 0\", conn)\n",
"        print(table_name)\n",
"        for col in table.columns:\n",
"            print(\"\\t\" + col)\n",
"        print()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b0de9d90-3eff-47be-8556-98cd03f9fde4",
"metadata": {},
"outputs": [],
"source": [
"cur = connection.cursor()\n",
"table_info(cur, connection)"
]
},
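An alternative to the helper above (an editorial sketch, not from the notebook): SQLite's built-in `PRAGMA table_info` lists the columns of a table together with their declared types and primary-key flags, without a detour through Pandas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (id INTEGER PRIMARY KEY, shoe_color TEXT)")

# Each returned row has the shape (cid, name, type, notnull, default_value, pk)
columns = conn.execute("PRAGMA table_info(teams)").fetchall()
for cid, name, col_type, notnull, default, pk in columns:
    print(f"{name}: {col_type}" + (" (primary key)" if pk else ""))

conn.close()
```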
{
"cell_type": "markdown",
"id": "d3eb6810-a3f2-47e3-ad55-5ecf82c8f100",
"metadata": {},
"source": [
"---\n",
"_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © [Point 8 GmbH](https://point-8.de)_"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}