diff --git a/Python Guide/.ipynb_checkpoints/Python Kaggle Guide (Titantic)-checkpoint.ipynb b/Python Guide/.ipynb_checkpoints/Python Kaggle Guide (Titantic)-checkpoint.ipynb new file mode 100644 index 0000000..caf9a09 --- /dev/null +++ b/Python Guide/.ipynb_checkpoints/Python Kaggle Guide (Titantic)-checkpoint.ipynb @@ -0,0 +1,545 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Python Guide to Data Science\n", + "\n", + "This guide is based on Python 3 (any 3.x version is fine).\n", + "An easy way to get Python and the necessary libraries is to install everything through [Anaconda](https://www.continuum.io/downloads). It is a distribution that will provide you with everything you need to start working with Data Science. What you're looking at is an IPython notebook: it lets you write up your process and execute code in the same place. On Kaggle, notebooks are one type of what they call a **kernel**.\n", + "\n", + "---\n", + "This guide will look at the Titanic dataset; we will see whether we can predict what types of people would have survived on the Titanic.\n", + "\n", + "So first let's import some useful libraries that we will use." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**os** is a built-in library for operating-system-related tasks. We mostly use the `os.path.join()` function to build the path to the file we want. Different operating systems write their paths in different ways, and Python does the work for us. For example, Windows might have a path like `\"C:\\Users\\scientist\\Desktop\"` while Linux may have `\"~/Desktop\"`. \n", + "\n", + "**matplotlib** is used to plot any data we have. 
It's a very flexible library from plotting basic scatter plots to doing animations of geographical maps.\n", + "\n", + "**pandas** is used to store our data into something called a dataframe (as you will see shortly). The library allows us to apply functions on the dataframe to allow us to easily extract certain parts of the data, apply functions (ex. mean) on the data, and much more. If you are already aware of this concept, pandas has a good cheatsheet [here](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf).\n", + "\n", + "**numpy** is a scientific computing library that allows for more speedy computations and useful tools such as linear algebra capabilites.\n", + "\n", + "We name each as `np` and `pd` by convention, much faster than writing the full name each time." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "titanic_data = pd.read_csv(os.path.join('..', 'titanic_data', 'train.csv')) # .. means the parent folder" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since there are no errors, the import was successful. You can see we imported 891 observations of data and 12 different variables." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(891, 12)\n" + ] + } + ], + "source": [ + "print(titanic_data.shape)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can view the first `n` or last `n` observations using `dataframe.head(n)` and `dataframe.tail(n)` respectively." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
\n", + "
" + ], + "text/plain": [ + " PassengerId Survived Pclass \\\n", + "0 1 0 3 \n", + "1 2 1 1 \n", + "\n", + " Name Sex Age SibSp \\\n", + "0 Braund, Mr. Owen Harris male 22.0 1 \n", + "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", + "\n", + " Parch Ticket Fare Cabin Embarked \n", + "0 0 A/5 21171 7.2500 NaN S \n", + "1 0 PC 17599 71.2833 C85 C " + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "titanic_data.head(2)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
88989011Behr, Mr. Karl Howellmale26.00011136930.00C148C
89089103Dooley, Mr. Patrickmale32.0003703767.75NaNQ
\n", + "
" + ], + "text/plain": [ + " PassengerId Survived Pclass Name Sex Age SibSp \\\n", + "889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 \n", + "890 891 0 3 Dooley, Mr. Patrick male 32.0 0 \n", + "\n", + " Parch Ticket Fare Cabin Embarked \n", + "889 0 111369 30.00 C148 C \n", + "890 0 370376 7.75 NaN Q " + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "titanic_data.tail(2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also select individual columns." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SexAge
0male22.0
1female38.0
2female26.0
\n", + "
" + ], + "text/plain": [ + " Sex Age\n", + "0 male 22.0\n", + "1 female 38.0\n", + "2 female 26.0" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "titanic_data[['Sex', 'Age']].head(3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Pandas is powerful as it allows us to group data together by a certain variable. We can apply what we learned to see the average `Fare`, `Age`, and proportion of `Survived` by each ticket class. We can see that the as you move to a higher class (ie. 3 -> 1):\n", + "- Fares increase\n", + "- Passengers are older\n", + "- More survived" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
FareAgeSurvived
Pclass
184.15468738.2334410.629630
220.66218329.8776300.472826
313.67555025.1406200.242363
\n", + "
" + ], + "text/plain": [ + " Fare Age Survived\n", + "Pclass \n", + "1 84.154687 38.233441 0.629630\n", + "2 20.662183 29.877630 0.472826\n", + "3 13.675550 25.140620 0.242363" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "titanic_data.groupby('Pclass').mean()[['Fare', 'Age', 'Survived']]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "While this seems pretty good, there's a problem that may not be obvious. Data rarely comes by perfectly, in this case there are missing values all over the data set. " + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PclassAge
887119.0
8883NaN
889126.0
890332.0
\n", + "
" + ], + "text/plain": [ + " Pclass Age\n", + "887 1 19.0\n", + "888 3 NaN\n", + "889 1 26.0\n", + "890 3 32.0" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "titanic_data[['Pclass', 'Age']].tail(4)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.1" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/Python Guide/Python Kaggle Guide (Titantic).ipynb b/Python Guide/Python Kaggle Guide (Titantic).ipynb index ff9bd7d..0c43039 100644 --- a/Python Guide/Python Kaggle Guide (Titantic).ipynb +++ b/Python Guide/Python Kaggle Guide (Titantic).ipynb @@ -7,1254 +7,89 @@ "# Python Guide to Data Science\n", "\n", "This guide is based on Python 3 (any version above 3 is ok).\n", - "An easy way to get the Python and the necessary libraries is to install everything through [Anaconda](https://www.continuum.io/downloads). It is a distribution that will provide you everything you need to start working with Data Science.\n", + "An easy way to get the Python and the necessary libraries is to install everything through [Anaconda](https://www.continuum.io/downloads). It is a distribution that will provide you everything you need to start working with Data Science. This thing you're looking at is an iPython notebook. Essentially you can write your process while executing code at the same time. On Kaggle this is a certain type of what they call a **kernel**.\n", "\n", - "So first let's import some useful libraries that we will use." 
- ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": { - "collapsed": true - }, - "outputs": [], - "source": [ - "import os\n", - "\n", - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "import pandas as pd" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**os** is a built-in library to do operating system related things. We mostly use the `os.path.join()` function to access the file we want. Different operating systems store their files in different ways and python easily does the work for us. Ex. Windows might have a path like `\"C:\\Users\\scientist\\Desktop\"` while linux may have `\"~/Desktop\"`. \n", - "\n", - "**matplotlib** is used to plot any data we have. It's a very flexible library from plotting basic scatter plots to doing animations of geographical maps.\n", - "\n", - "**pandas** is used to store our data into something called a dataframe (as you will see shortly). The library allows us to apply functions on the dataframe to allow us to easily extract certain parts of the data, apply functions (ex. mean) on the data, and much more. If you are already aware of this concept, pandas has a good cheatsheet [here](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf).\n", - "\n", - "**numpy** is a scientific computing library that allows for more speedy computations and useful tools such as linear algebra capabilites.\n", - "\n", - "We name each as `np` and `pd` by convention, much faster than writing the full name each time." - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "titanic_data = pd.read_csv(os.path.join('..', 'titanic_data', 'train.csv')) # .. means the parent folder" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Since there are no errors, the import was successful. You can see we imported 891 observations of data and 12 different variables." 
- ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "(891, 12)\n" - ] - } - ], - "source": [ - "print(titanic_data.shape)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Just entering the variable allows us to see the dataframe. In this case the dataframe is too large and will only show you the first and last few observations of the data." - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
5603Moran, Mr. JamesmaleNaN003308778.4583NaNQ
6701McCarthy, Mr. Timothy Jmale54.0001746351.8625E46S
7803Palsson, Master. Gosta Leonardmale2.03134990921.0750NaNS
8913Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)female27.00234774211.1333NaNS
91012Nasser, Mrs. Nicholas (Adele Achem)female14.01023773630.0708NaNC
101113Sandstrom, Miss. Marguerite Rutfemale4.011PP 954916.7000G6S
111211Bonnell, Miss. Elizabethfemale58.00011378326.5500C103S
121303Saundercock, Mr. William Henrymale20.000A/5. 21518.0500NaNS
131403Andersson, Mr. Anders Johanmale39.01534708231.2750NaNS
141503Vestrom, Miss. Hulda Amanda Adolfinafemale14.0003504067.8542NaNS
151612Hewlett, Mrs. (Mary D Kingcome)female55.00024870616.0000NaNS
161703Rice, Master. Eugenemale2.04138265229.1250NaNQ
171812Williams, Mr. Charles EugenemaleNaN0024437313.0000NaNS
181903Vander Planke, Mrs. Julius (Emelia Maria Vande...female31.01034576318.0000NaNS
192013Masselmani, Mrs. FatimafemaleNaN0026497.2250NaNC
202102Fynney, Mr. Joseph Jmale35.00023986526.0000NaNS
212212Beesley, Mr. Lawrencemale34.00024869813.0000D56S
222313McGowan, Miss. Anna \"Annie\"female15.0003309238.0292NaNQ
232411Sloper, Mr. William Thompsonmale28.00011378835.5000A6S
242503Palsson, Miss. Torborg Danirafemale8.03134990921.0750NaNS
252613Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...female38.01534707731.3875NaNS
262703Emir, Mr. Farred ChehabmaleNaN0026317.2250NaNC
272801Fortune, Mr. Charles Alexandermale19.03219950263.0000C23 C25 C27S
282913O'Dwyer, Miss. Ellen \"Nellie\"femaleNaN003309597.8792NaNQ
293003Todoroff, Mr. LaliomaleNaN003492167.8958NaNS
.......................................
86186202Giles, Mr. Frederick Edwardmale21.0102813411.5000NaNS
86286311Swift, Mrs. Frederick Joel (Margaret Welles Ba...female48.0001746625.9292D17S
86386403Sage, Miss. Dorothy Edith \"Dolly\"femaleNaN82CA. 234369.5500NaNS
86486502Gill, Mr. John Williammale24.00023386613.0000NaNS
86586612Bystrom, Mrs. (Karolina)female42.00023685213.0000NaNS
86686712Duran y More, Miss. Asuncionfemale27.010SC/PARIS 214913.8583NaNC
86786801Roebling, Mr. Washington Augustus IImale31.000PC 1759050.4958A24S
86886903van Melkebeke, Mr. PhilemonmaleNaN003457779.5000NaNS
86987013Johnson, Master. Harold Theodormale4.01134774211.1333NaNS
87087103Balkic, Mr. Cerinmale26.0003492487.8958NaNS
87187211Beckwith, Mrs. Richard Leonard (Sallie Monypeny)female47.0111175152.5542D35S
87287301Carlsson, Mr. Frans Olofmale33.0006955.0000B51 B53 B55S
87387403Vander Cruyssen, Mr. Victormale47.0003457659.0000NaNS
87487512Abelson, Mrs. Samuel (Hannah Wizosky)female28.010P/PP 338124.0000NaNC
87587613Najib, Miss. Adele Kiamie \"Jane\"female15.00026677.2250NaNC
87687703Gustafsson, Mr. Alfred Ossianmale20.00075349.8458NaNS
87787803Petroff, Mr. Nedeliomale19.0003492127.8958NaNS
87887903Laleff, Mr. KristomaleNaN003492177.8958NaNS
87988011Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)female56.0011176783.1583C50C
88088112Shelley, Mrs. William (Imanita Parrish Hall)female25.00123043326.0000NaNS
88188203Markun, Mr. Johannmale33.0003492577.8958NaNS
88288303Dahlberg, Miss. Gerda Ulrikafemale22.000755210.5167NaNS
88388402Banfield, Mr. Frederick Jamesmale28.000C.A./SOTON 3406810.5000NaNS
88488503Sutehall, Mr. Henry Jrmale25.000SOTON/OQ 3920767.0500NaNS
88588603Rice, Mrs. William (Margaret Norton)female39.00538265229.1250NaNQ
88688702Montvila, Rev. Juozasmale27.00021153613.0000NaNS
88788811Graham, Miss. Margaret Edithfemale19.00011205330.0000B42S
88888903Johnston, Miss. Catherine Helen \"Carrie\"femaleNaN12W./C. 660723.4500NaNS
88989011Behr, Mr. Karl Howellmale26.00011136930.0000C148C
89089103Dooley, Mr. Patrickmale32.0003703767.7500NaNQ
\n", - "

891 rows × 12 columns

\n", - "
" - ], - "text/plain": [ - " PassengerId Survived Pclass \\\n", - "0 1 0 3 \n", - "1 2 1 1 \n", - "2 3 1 3 \n", - "3 4 1 1 \n", - "4 5 0 3 \n", - "5 6 0 3 \n", - "6 7 0 1 \n", - "7 8 0 3 \n", - "8 9 1 3 \n", - "9 10 1 2 \n", - "10 11 1 3 \n", - "11 12 1 1 \n", - "12 13 0 3 \n", - "13 14 0 3 \n", - "14 15 0 3 \n", - "15 16 1 2 \n", - "16 17 0 3 \n", - "17 18 1 2 \n", - "18 19 0 3 \n", - "19 20 1 3 \n", - "20 21 0 2 \n", - "21 22 1 2 \n", - "22 23 1 3 \n", - "23 24 1 1 \n", - "24 25 0 3 \n", - "25 26 1 3 \n", - "26 27 0 3 \n", - "27 28 0 1 \n", - "28 29 1 3 \n", - "29 30 0 3 \n", - ".. ... ... ... \n", - "861 862 0 2 \n", - "862 863 1 1 \n", - "863 864 0 3 \n", - "864 865 0 2 \n", - "865 866 1 2 \n", - "866 867 1 2 \n", - "867 868 0 1 \n", - "868 869 0 3 \n", - "869 870 1 3 \n", - "870 871 0 3 \n", - "871 872 1 1 \n", - "872 873 0 1 \n", - "873 874 0 3 \n", - "874 875 1 2 \n", - "875 876 1 3 \n", - "876 877 0 3 \n", - "877 878 0 3 \n", - "878 879 0 3 \n", - "879 880 1 1 \n", - "880 881 1 2 \n", - "881 882 0 3 \n", - "882 883 0 3 \n", - "883 884 0 2 \n", - "884 885 0 3 \n", - "885 886 0 3 \n", - "886 887 0 2 \n", - "887 888 1 1 \n", - "888 889 0 3 \n", - "889 890 1 1 \n", - "890 891 0 3 \n", - "\n", - " Name Sex Age SibSp \\\n", - "0 Braund, Mr. Owen Harris male 22.0 1 \n", - "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", - "2 Heikkinen, Miss. Laina female 26.0 0 \n", - "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", - "4 Allen, Mr. William Henry male 35.0 0 \n", - "5 Moran, Mr. James male NaN 0 \n", - "6 McCarthy, Mr. Timothy J male 54.0 0 \n", - "7 Palsson, Master. Gosta Leonard male 2.0 3 \n", - "8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 \n", - "9 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 \n", - "10 Sandstrom, Miss. Marguerite Rut female 4.0 1 \n", - "11 Bonnell, Miss. Elizabeth female 58.0 0 \n", - "12 Saundercock, Mr. William Henry male 20.0 0 \n", - "13 Andersson, Mr. 
Anders Johan male 39.0 1 \n", - "14 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 \n", - "15 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 \n", - "16 Rice, Master. Eugene male 2.0 4 \n", - "17 Williams, Mr. Charles Eugene male NaN 0 \n", - "18 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 \n", - "19 Masselmani, Mrs. Fatima female NaN 0 \n", - "20 Fynney, Mr. Joseph J male 35.0 0 \n", - "21 Beesley, Mr. Lawrence male 34.0 0 \n", - "22 McGowan, Miss. Anna \"Annie\" female 15.0 0 \n", - "23 Sloper, Mr. William Thompson male 28.0 0 \n", - "24 Palsson, Miss. Torborg Danira female 8.0 3 \n", - "25 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 \n", - "26 Emir, Mr. Farred Chehab male NaN 0 \n", - "27 Fortune, Mr. Charles Alexander male 19.0 3 \n", - "28 O'Dwyer, Miss. Ellen \"Nellie\" female NaN 0 \n", - "29 Todoroff, Mr. Lalio male NaN 0 \n", - ".. ... ... ... ... \n", - "861 Giles, Mr. Frederick Edward male 21.0 1 \n", - "862 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 \n", - "863 Sage, Miss. Dorothy Edith \"Dolly\" female NaN 8 \n", - "864 Gill, Mr. John William male 24.0 0 \n", - "865 Bystrom, Mrs. (Karolina) female 42.0 0 \n", - "866 Duran y More, Miss. Asuncion female 27.0 1 \n", - "867 Roebling, Mr. Washington Augustus II male 31.0 0 \n", - "868 van Melkebeke, Mr. Philemon male NaN 0 \n", - "869 Johnson, Master. Harold Theodor male 4.0 1 \n", - "870 Balkic, Mr. Cerin male 26.0 0 \n", - "871 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 \n", - "872 Carlsson, Mr. Frans Olof male 33.0 0 \n", - "873 Vander Cruyssen, Mr. Victor male 47.0 0 \n", - "874 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 \n", - "875 Najib, Miss. Adele Kiamie \"Jane\" female 15.0 0 \n", - "876 Gustafsson, Mr. Alfred Ossian male 20.0 0 \n", - "877 Petroff, Mr. Nedelio male 19.0 0 \n", - "878 Laleff, Mr. Kristo male NaN 0 \n", - "879 Potter, Mrs. 
Thomas Jr (Lily Alexenia Wilson) female 56.0 0 \n", - "880 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 \n", - "881 Markun, Mr. Johann male 33.0 0 \n", - "882 Dahlberg, Miss. Gerda Ulrika female 22.0 0 \n", - "883 Banfield, Mr. Frederick James male 28.0 0 \n", - "884 Sutehall, Mr. Henry Jr male 25.0 0 \n", - "885 Rice, Mrs. William (Margaret Norton) female 39.0 0 \n", - "886 Montvila, Rev. Juozas male 27.0 0 \n", - "887 Graham, Miss. Margaret Edith female 19.0 0 \n", - "888 Johnston, Miss. Catherine Helen \"Carrie\" female NaN 1 \n", - "889 Behr, Mr. Karl Howell male 26.0 0 \n", - "890 Dooley, Mr. Patrick male 32.0 0 \n", - "\n", - " Parch Ticket Fare Cabin Embarked \n", - "0 0 A/5 21171 7.2500 NaN S \n", - "1 0 PC 17599 71.2833 C85 C \n", - "2 0 STON/O2. 3101282 7.9250 NaN S \n", - "3 0 113803 53.1000 C123 S \n", - "4 0 373450 8.0500 NaN S \n", - "5 0 330877 8.4583 NaN Q \n", - "6 0 17463 51.8625 E46 S \n", - "7 1 349909 21.0750 NaN S \n", - "8 2 347742 11.1333 NaN S \n", - "9 0 237736 30.0708 NaN C \n", - "10 1 PP 9549 16.7000 G6 S \n", - "11 0 113783 26.5500 C103 S \n", - "12 0 A/5. 2151 8.0500 NaN S \n", - "13 5 347082 31.2750 NaN S \n", - "14 0 350406 7.8542 NaN S \n", - "15 0 248706 16.0000 NaN S \n", - "16 1 382652 29.1250 NaN Q \n", - "17 0 244373 13.0000 NaN S \n", - "18 0 345763 18.0000 NaN S \n", - "19 0 2649 7.2250 NaN C \n", - "20 0 239865 26.0000 NaN S \n", - "21 0 248698 13.0000 D56 S \n", - "22 0 330923 8.0292 NaN Q \n", - "23 0 113788 35.5000 A6 S \n", - "24 1 349909 21.0750 NaN S \n", - "25 5 347077 31.3875 NaN S \n", - "26 0 2631 7.2250 NaN C \n", - "27 2 19950 263.0000 C23 C25 C27 S \n", - "28 0 330959 7.8792 NaN Q \n", - "29 0 349216 7.8958 NaN S \n", - ".. ... ... ... ... ... \n", - "861 0 28134 11.5000 NaN S \n", - "862 0 17466 25.9292 D17 S \n", - "863 2 CA. 
2343 69.5500 NaN S \n", - "864 0 233866 13.0000 NaN S \n", - "865 0 236852 13.0000 NaN S \n", - "866 0 SC/PARIS 2149 13.8583 NaN C \n", - "867 0 PC 17590 50.4958 A24 S \n", - "868 0 345777 9.5000 NaN S \n", - "869 1 347742 11.1333 NaN S \n", - "870 0 349248 7.8958 NaN S \n", - "871 1 11751 52.5542 D35 S \n", - "872 0 695 5.0000 B51 B53 B55 S \n", - "873 0 345765 9.0000 NaN S \n", - "874 0 P/PP 3381 24.0000 NaN C \n", - "875 0 2667 7.2250 NaN C \n", - "876 0 7534 9.8458 NaN S \n", - "877 0 349212 7.8958 NaN S \n", - "878 0 349217 7.8958 NaN S \n", - "879 1 11767 83.1583 C50 C \n", - "880 1 230433 26.0000 NaN S \n", - "881 0 349257 7.8958 NaN S \n", - "882 0 7552 10.5167 NaN S \n", - "883 0 C.A./SOTON 34068 10.5000 NaN S \n", - "884 0 SOTON/OQ 392076 7.0500 NaN S \n", - "885 5 382652 29.1250 NaN Q \n", - "886 0 211536 13.0000 NaN S \n", - "887 0 112053 30.0000 B42 S \n", - "888 2 W./C. 6607 23.4500 NaN S \n", - "889 0 111369 30.0000 C148 C \n", - "890 0 370376 7.7500 NaN Q \n", - "\n", - "[891 rows x 12 columns]" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" + "---\n", + "This guide will look at the Titanic dataset, we will see if we can predict what types of people would have survived on the Titanic.\n", + "\n", + "So first let's import some useful libraries that we will use." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**os** is a built-in library to do operating system related things. We mostly use the `os.path.join()` function to access the file we want. Different operating systems store their files in different ways and python easily does the work for us. Ex. 
Windows might have a path like `\"C:\\Users\\scientist\\Desktop\"` while Linux may have `\"~/Desktop\"`. \n", + "\n", + "**matplotlib** is used to plot any data we have. It's a very flexible library, from plotting basic scatter plots to doing animations of geographical maps.\n", + "\n", + "**pandas** is used to store our data into something called a dataframe (as you will see shortly). The library allows us to apply functions on the dataframe to easily extract certain parts of the data, apply functions (ex. mean) on the data, and much more. If you are already aware of this concept, pandas has a good cheatsheet [here](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf).\n", + "\n", + "**numpy** is a scientific computing library that allows for speedier computations and useful tools such as linear algebra capabilities.\n", + "\n", + "We import them as `np` and `pd` by convention; it's much faster than writing the full name each time." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "titanic_data = pd.read_csv(os.path.join('..', 'titanic_data', 'train.csv')) # .. means the parent folder" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since there are no errors, the import was successful. You can see we imported 891 observations of data and 12 different variables." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(891, 12)\n" + ] } ], "source": [ - "titanic_data" + "print(titanic_data.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Alternatively, you can view the first `n` or last `n` observations using `dataframe.head(n)` and `dataframe.tail(n)` respectively." + "You can view the first `n` or last `n` observations using `dataframe.head(n)` and `dataframe.tail(n)` respectively."
] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -1341,7 +176,7 @@ "1 0 PC 17599 71.2833 C85 C " ] }, - "execution_count": 8, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -1352,7 +187,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -1435,7 +270,7 @@ "890 0 370376 7.75 NaN Q " ] }, - "execution_count": 9, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } @@ -1453,7 +288,7 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 7, "metadata": {}, "outputs": [ { @@ -1508,7 +343,7 @@ "2 female 26.0" ] }, - "execution_count": 28, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" } @@ -1521,15 +356,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Pandas is powerful as it allows us to group data together by a certain variable. We can apply what we learned to see the average `Fare`, `Age`, and proportion of `Survived` by each ticket class. We can see that the as you move to a higher class (ie. 3 -> 1):\n", - "- Fares increase\n", - "- Passengers are older\n", - "- More survived" + "To start exploring, let's get a summary of our data."
] }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 12, "metadata": {}, "outputs": [ { @@ -1553,70 +385,145 @@ " \n", " \n", " \n", - " Fare\n", - " Age\n", + " PassengerId\n", " Survived\n", - " \n", - " \n", " Pclass\n", - " \n", - " \n", - " \n", + " Age\n", + " SibSp\n", + " Parch\n", + " Fare\n", " \n", " \n", " \n", " \n", - " 1\n", - " 84.154687\n", - " 38.233441\n", - " 0.629630\n", - " \n", - " \n", - " 2\n", - " 20.662183\n", - " 29.877630\n", - " 0.472826\n", - " \n", - " \n", - " 3\n", - " 13.675550\n", - " 25.140620\n", - " 0.242363\n", + " count\n", + " 891.000000\n", + " 891.000000\n", + " 891.000000\n", + " 714.000000\n", + " 891.000000\n", + " 891.000000\n", + " 891.000000\n", + " \n", + " \n", + " mean\n", + " 446.000000\n", + " 0.383838\n", + " 2.308642\n", + " 29.699118\n", + " 0.523008\n", + " 0.381594\n", + " 32.204208\n", + " \n", + " \n", + " std\n", + " 257.353842\n", + " 0.486592\n", + " 0.836071\n", + " 14.526497\n", + " 1.102743\n", + " 0.806057\n", + " 49.693429\n", + " \n", + " \n", + " min\n", + " 1.000000\n", + " 0.000000\n", + " 1.000000\n", + " 0.420000\n", + " 0.000000\n", + " 0.000000\n", + " 0.000000\n", + " \n", + " \n", + " 25%\n", + " 223.500000\n", + " 0.000000\n", + " 2.000000\n", + " 20.125000\n", + " 0.000000\n", + " 0.000000\n", + " 7.910400\n", + " \n", + " \n", + " 50%\n", + " 446.000000\n", + " 0.000000\n", + " 3.000000\n", + " 28.000000\n", + " 0.000000\n", + " 0.000000\n", + " 14.454200\n", + " \n", + " \n", + " 75%\n", + " 668.500000\n", + " 1.000000\n", + " 3.000000\n", + " 38.000000\n", + " 1.000000\n", + " 0.000000\n", + " 31.000000\n", + " \n", + " \n", + " max\n", + " 891.000000\n", + " 1.000000\n", + " 3.000000\n", + " 80.000000\n", + " 8.000000\n", + " 6.000000\n", + " 512.329200\n", " \n", " \n", "\n", "" ], "text/plain": [ - " Fare Age Survived\n", - "Pclass \n", - "1 84.154687 38.233441 0.629630\n", - "2 20.662183 29.877630 0.472826\n", - "3 13.675550 25.140620 0.242363" + " 
PassengerId Survived Pclass Age SibSp \\\n", + "count 891.000000 891.000000 891.000000 714.000000 891.000000 \n", + "mean 446.000000 0.383838 2.308642 29.699118 0.523008 \n", + "std 257.353842 0.486592 0.836071 14.526497 1.102743 \n", + "min 1.000000 0.000000 1.000000 0.420000 0.000000 \n", + "25% 223.500000 0.000000 2.000000 20.125000 0.000000 \n", + "50% 446.000000 0.000000 3.000000 28.000000 0.000000 \n", + "75% 668.500000 1.000000 3.000000 38.000000 1.000000 \n", + "max 891.000000 1.000000 3.000000 80.000000 8.000000 \n", + "\n", + " Parch Fare \n", + "count 891.000000 891.000000 \n", + "mean 0.381594 32.204208 \n", + "std 0.806057 49.693429 \n", + "min 0.000000 0.000000 \n", + "25% 0.000000 7.910400 \n", + "50% 0.000000 14.454200 \n", + "75% 0.000000 31.000000 \n", + "max 6.000000 512.329200 " ] }, - "execution_count": 18, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "titanic_data.groupby('Pclass').mean()[['Fare', 'Age', 'Survived']]" + "titanic_data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "While this seems pretty good, there's a problem that may not be obvious. Data rarely comes by perfectly, in this case there are missing values all over the data set. " + "Pandas is powerful as it allows us to group data together by a certain variable. We can apply what we learned to see the average `Fare`, `Age`, and proportion of `Survived` by each ticket class. We can see that as you move to a higher class (i.e. 
3 -> 1):\n", + "- Fares increase\n", + "- Passengers are older\n", + "- More survived" ] }, { "cell_type": "code", - "execution_count": 32, - "metadata": { - "scrolled": true - }, + "execution_count": 8, + "metadata": {}, "outputs": [ { "data": { @@ -1639,50 +546,97 @@ " \n", " \n", " \n", - " Pclass\n", + " Fare\n", " Age\n", + " Survived\n", " \n", - " \n", - " \n", " \n", - " 887\n", - " 1\n", - " 19.0\n", + " Pclass\n", + " \n", + " \n", + " \n", " \n", + " \n", + " \n", " \n", - " 888\n", - " 3\n", - " NaN\n", + " 1\n", + " 84.154687\n", + " 38.233441\n", + " 0.629630\n", " \n", " \n", - " 889\n", - " 1\n", - " 26.0\n", + " 2\n", + " 20.662183\n", + " 29.877630\n", + " 0.472826\n", " \n", " \n", - " 890\n", - " 3\n", - " 32.0\n", + " 3\n", + " 13.675550\n", + " 25.140620\n", + " 0.242363\n", " \n", " \n", "\n", "" ], "text/plain": [ - " Pclass Age\n", - "887 1 19.0\n", - "888 3 NaN\n", - "889 1 26.0\n", - "890 3 32.0" + " Fare Age Survived\n", + "Pclass \n", + "1 84.154687 38.233441 0.629630\n", + "2 20.662183 29.877630 0.472826\n", + "3 13.675550 25.140620 0.242363" ] }, - "execution_count": 32, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "titanic_data[['Pclass', 'Age']].tail(4)" + "titanic_data.groupby('Pclass').mean()[['Fare', 'Age', 'Survived']]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "While this seems pretty good, there's a problem that may not be obvious. Data rarely comes by perfectly, in this case there are missing values all over the data set. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 891 entries, 0 to 890\n", + "Data columns (total 12 columns):\n", + "PassengerId 891 non-null int64\n", + "Survived 891 non-null int64\n", + "Pclass 891 non-null int64\n", + "Name 891 non-null object\n", + "Sex 891 non-null object\n", + "Age 714 non-null float64\n", + "SibSp 891 non-null int64\n", + "Parch 891 non-null int64\n", + "Ticket 891 non-null object\n", + "Fare 891 non-null float64\n", + "Cabin 204 non-null object\n", + "Embarked 889 non-null object\n", + "dtypes: float64(2), int64(5), object(5)\n", + "memory usage: 83.6+ KB\n" + ] + } + ], + "source": [ + "titanic_data.info()" ] } ], diff --git a/R Guide/R Kaggle Guide (Titanic).Rmd b/R Guide/R Kaggle Guide (Titanic).Rmd new file mode 100644 index 0000000..c8188bc --- /dev/null +++ b/R Guide/R Kaggle Guide (Titanic).Rmd @@ -0,0 +1,96 @@ +--- +title: "R Kaggle Guide (Titanic)" +author: "UWaterloo Data Science Club" +date: "August 18, 2017" +output: pdf_document +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + +This guide is based on R 3.3.2. We recommend downloading R [here](https://r-project.org/) along with [R Studio](https://www.rstudio.com/products/rstudio/download/), a set of integrated tools that will make your life a lot easier. This guide assumes that you have some sort of programming experience. + +This guide is written in something called R markdown, which allows us to describe our process while showing and executing code (kind of like a notebook). This notebook process is type of what Kaggle calls a **kernel**. When working in R Studio, pressing `ctrl + enter` will run the current line of code. + +--- +This guide will look at the Titanic dataset, we will see if we can predict what types of people would have survived on the Titanic. 
+
+So first we will import some useful libraries. R is an old language and some confusing quirks have accumulated over time; the tidyverse stack is a set of libraries that makes common operations more consistent and powerful.
+```{R}
+library("tidyverse")
+```
+
+Note that the conflict messages indicate that two different libraries have a function with the same name. We don't need to worry about this for now. Now we can import our data into a dataframe.
+
+```{R, echo=FALSE}
+titanic_data <- read_csv("../titanic_data/train.csv") # .. indicates the parent folder
+```
+
+The code output tells us how each column was imported, such as what data type is stored. To understand more about the options you have, you can type `?read_csv` in the console, or `?` before any function name. If you don't know exactly what the function name is, you can use `??`, which will return all manual pages relevant to your query.
+
+Normally we won't worry too much about data types, but notice how certain columns like `Survived` and `Pclass` were imported as integers? The problem is that we use the integers to differentiate the values, but there isn't any inherent order to the numbers. Instead we can convert integers, characters, etc. to categories, which are called **factors** in R.
+
+The `$` lets us select specific variables in a dataframe.
+
+```{R}
+titanic_data$Survived <- as.factor(titanic_data$Survived)
+titanic_data$Pclass <- as.factor(titanic_data$Pclass)
+titanic_data$Sex <- as.factor(titanic_data$Sex)
+titanic_data$Embarked <- as.factor(titanic_data$Embarked)
+```
+
+We can observe the first `n` entries of our dataframe using the `head()` function; likewise, to observe the last `n` entries we can use `tail()`. If there are too many variables, the output will omit some to save space.
+
+```{R}
+head(titanic_data, 5)
+```
+After a quick look, let's get a summary of our data.
+```{R}
+summary(titanic_data)
+```
+The `NA's` in some columns indicate the number of missing values.
One could either remove the rows with missing values, or try to fill in the data based on the surrounding data. Since our dataset is fairly small, the latter is preferred. This is called **imputation**.
+
+Let's get a closer look at who these people with missing embarked locations are. We can use the `filter()` function to select rows that satisfy certain criteria. Note that we do not have to use `$` to indicate that `Embarked` is from `titanic_data`; that is inferred because we pass the data we're looking at as the first argument to `filter()`.
+
+**NOTE:** `NA == NA` will return `NA`. While this may be confusing, think of it this way.
+```{R}
+alice.age <- NA # We don't know Alice's age
+bob.age <- NA # We don't know Bob's age
+alice.age == bob.age # Are Alice and Bob the same age? We don't know!
+```
+
+That's why we use `is.na()` to test for missing values instead.
+
+**NOTE:** The code below might start to look a little convoluted. We'll soon look at some syntactic sugar to make everything easier to read.
+
+```{R}
+filter(titanic_data, is.na(Embarked))[c('Name', 'Fare', 'Ticket', 'Cabin')]
+```
+It seems the passengers had the same ticket, hence the identical fare. Let's visualize how much passengers paid, split by class and the location where they embarked. We add a dashed red line at $80 (the fare these two passengers paid) for comparison.
+```{R}
+ggplot(filter(titanic_data, !is.na(Embarked)),
+       aes(x = Embarked, y = Fare, fill = Pclass)) +
+  geom_boxplot() +
+  scale_y_continuous() +
+  labs(title = "Ticket Price from Embark Location",
+       y = "Fare [$]") +
+  geom_hline(aes(yintercept = 80),
+             colour = "red", linetype = "dashed", lwd = 1)
+```
+
+The red line is aligned with the median fare paid at location C. Thus we will fill in the missing embarked locations with C.
+
+```{R}
+titanic_data$Embarked[is.na(titanic_data$Embarked)] <- 'C'
+```
+
+## INCOMPLETE SECTION
+
+Another method of imputation is prediction.
It would be naive to use a simple method such as the mean, because we have other data that hints at the age of a passenger. We can build a model to estimate the age from the other information we have.
+
+```{R}
+model <- lm(Age ~ Survived + Pclass * Fare, titanic_data)
+summary(model)
+```
+
diff --git a/R Guide/R_Kaggle_Guide__Titanic_.pdf b/R Guide/R_Kaggle_Guide__Titanic_.pdf
new file mode 100644
index 0000000..47e5443
Binary files /dev/null and b/R Guide/R_Kaggle_Guide__Titanic_.pdf differ
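Editor's note on the incomplete section above: the `lm()` fit is where the R guide stops, and the remaining step of model-based imputation is to predict `Age` for the rows where it is `NA` and write the estimates back. Below is a minimal sketch of that fit/predict/fill pattern in Python (the language of the notebook half of this repo). The tiny dataframe is a made-up stand-in for `train.csv`, and plain NumPy least squares stands in for R's `lm`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic training frame; Age is missing in two rows.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 2],
    "Fare":     [7.25, 71.28, 26.0, 8.05, 53.1, 13.0],
    "Age":      [22.0, 38.0, np.nan, 35.0, np.nan, 27.0],
})

def design(frame):
    # Design matrix: an intercept plus the columns that hint at Age.
    return np.column_stack([
        np.ones(len(frame)),
        frame["Survived"],
        frame["Pclass"],
        frame["Fare"],
    ])

known = df[df["Age"].notna()]  # rows we can fit on
coef, *_ = np.linalg.lstsq(design(known), known["Age"].to_numpy(), rcond=None)

# Predict Age for the missing rows and write the estimates back.
mask = df["Age"].isna()
df.loc[mask, "Age"] = design(df[mask]) @ coef

print(int(df["Age"].isna().sum()))  # 0 -- every Age is now filled in
```

On the real data the same fit/predict/write-back pattern applies unchanged; a library such as scikit-learn's `LinearRegression` would be the more idiomatic choice for the fit, but least squares keeps the sketch dependency-light.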