Python & R Guide Update #4
base: master
Awesome communication throughout the text; my comments are mostly just style changes. Also, it's actually easier to output it as markdown/HTML.
@@ -0,0 +1,545 @@ | |||
{ |
This file is in `.ipynb_checkpoints`, which you should add to the `.gitignore`.
@@ -0,0 +1,96 @@ | |||
--- | |||
title: "R Kaggle Guide (Titanic)" | |||
author: "UWaterloo Data Science Club" |
You're welcome to use your name here :) You should take credit for the tutorial.
This guide looks at the Titanic dataset. We will see if we can predict which types of passengers survived the Titanic. | ||
First, we will import some useful libraries. R is an old language and has accumulated confusing inconsistencies over time; the tidyverse is a set of libraries that makes common operations more consistent and powerful. | ||
```{R} |
You can add chunk options like the ones below to suppress warning messages.
{R include=FALSE, warning=FALSE}
The `$` operator lets us select a specific variable in a dataframe. | ||
```{R} | ||
titanic_data$Survived <- as.factor(titanic_data$Survived) |
Is there a reason to use this notation instead of dplyr's `mutate()`?
titanic_data %>%
  mutate(Survived = as.factor(Survived),
         Pclass = as.factor(Pclass),
         ...)
We can view the first `n` entries of our dataframe with the `head()` function; likewise, to view the last `n` entries we can use `tail()`. If there are too many variables, the output will omit some of them to save space. | ||
```{R} | ||
head(titanic_data, 5) |
I'd love for you to introduce the `%>%` operator, just because it's the preferred way of doing things.
Perhaps explain what it does, and show that you can do both `head(df, 5)` and `df %>% head(5)`.
## INCOMPLETE SECTION | ||
Another method of imputation is through prediction. It would be naive to use a simple method such as the mean, because we have other data that hint at the age of a passenger. We can build a model to estimate age from the other information we have. |
Please include a section on missing data mechanisms.
More information can be found in the missing data section of The Elements of Statistical Learning.
Basically, the fact that a value is missing can itself be predictive, so we can always include `is.na(feature)` as a new indicator feature.
R
Python
Use `pandas.DataFrame.info()` to determine the number of missing values in each column.
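For the Python guide, something like the rough sketch below might work. It assumes pandas is available and uses a small made-up DataFrame in place of the real Titanic data; the `Age_missing` column name is just a hypothetical choice. It shows `info()`, a more direct per-column count with `isna().sum()`, and the indicator-feature idea from the comment above.

```python
import pandas as pd

# Toy stand-in for the Titanic data; these values are invented for illustration.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Survived": [0, 1, 1, 0],
})

# info() prints the non-null count per column; the number of rows minus
# that count is the number of missing values in the column.
df.info()

# isna().sum() reports the missing count per column directly.
missing_counts = df.isna().sum()
print(missing_counts)

# Missingness indicator: the pandas counterpart of R's is.na(feature).
# "Age_missing" is a hypothetical name for the new feature.
df["Age_missing"] = df["Age"].isna().astype(int)
print(df)
```

Adding the indicator before imputing `Age` keeps the "was missing" signal available to the model even after the gaps are filled in.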