
Python & R Guide Update #4


Open: wants to merge 4 commits into base: master


clayton-halim (Member) commented Aug 18, 2017:

R

  • Created guide:
    • import
    • imputation
    • a little bit of plotting.

Python

  • Removed the large output in the Python guide.
  • Used pandas.DataFrame.info() to determine the number of missing values in each column.
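
For context, here is a minimal sketch of the `info()` approach mentioned in the Python notes. The toy DataFrame and its column names are made up for illustration and are not the guide's actual data; `isna().sum()` is shown alongside as one way to get the per-column count directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic training set.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Cabin": [None, "C85", None, "C123"],
})

# info() prints the non-null count per column;
# missing values = total rows - non-null count.
df.info()

# isna().sum() reports the number of missing values per column directly.
missing_counts = df.isna().sum()
print(missing_counts)
```

`info()` keeps the output short, which fits the goal of removing large output from the notebook, while `isna().sum()` avoids the mental subtraction.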

clayton-halim changed the title from "Started R guide" to "Python & R Guide Update" on Aug 18, 2017
jxnl self-requested a review on August 19, 2017, 04:09
jxnl (Collaborator) left a comment:

Awesome communication with all the text; my comments are mostly just style changes. Also, it would actually be easier to output this as markdown/HTML.

@@ -0,0 +1,545 @@
{
Collaborator review comment:

This file is in `.ipynb_checkpoints`, which you should add to the `.gitignore`.

@@ -0,0 +1,96 @@
---
title: "R Kaggle Guide (Titanic)"
author: "UWaterloo Data Science Club"
Collaborator review comment:

You're welcome to use your name here :) You should take credit for the tutorial.

This guide looks at the Titanic dataset to see whether we can predict what types of people would have survived on the Titanic.

First, we will import some useful libraries. R is an old language that has accumulated inconsistencies over time; the tidyverse is a set of libraries that makes common functions more consistent and powerful.
```{R}
Collaborator review comment:

You can add chunk options like the ones below to suppress warning messages.

{R include=FALSE, warning=FALSE}

The `$` lets us select specific variables in a dataframe.

```{R}
titanic_data$Survived <- as.factor(titanic_data$Survived)
Collaborator review comment:

Is there a reason to use this notation instead of dplyr's `mutate()`?

titanic_data %>%
   mutate(Survived=as.factor(Survived),
          Pclass=as.factor(Pclass), 
   ...)

We can observe the first `n` entries of our dataframe by using the `head()` function; likewise, to observe the last `n` entries we can use `tail()`. If there are too many variables, the output will omit some of them to save space.

```{R}
head(titanic_data, 5)
Collaborator review comment:

I'd love for you to introduce the `%>%` operator, just because it's the preferred way of doing things.

Perhaps explain what it does, and show that you can do both:

head(df, 5) and df %>% head(5)


## INCOMPLETE SECTION

Another method of imputation is through prediction. It would be naive to use simple methods such as mean because we have other data that hint towards the age of a passenger. We can make a model to estimate the age from the other information we have.
Collaborator review comment:

Please include a section on missing data mechanisms.

More information can be found in the missing data section of The Elements of Statistical Learning.

Basically, missingness can itself be predictive, and we can always include `is.na(feature)` as a new indicator variable.
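
A minimal sketch of the indicator-variable idea above, written in pandas for the Python side of the guide (in R the same flag would come from `is.na()`). The toy DataFrame and the `Age_missing` column name are assumptions for illustration, and median imputation is only a placeholder for whatever method the guide ends up using.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "Age" stands in for any feature with gaps.
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 35.0]})

# Record missingness BEFORE imputing; once the gaps are filled,
# the information about which rows were missing is lost.
df["Age_missing"] = df["Age"].isna().astype(int)

# Placeholder imputation: fill gaps with the median of the observed ages.
df["Age"] = df["Age"].fillna(df["Age"].median())

print(df)
```

A model can then use `Age_missing` as an ordinary feature, so any signal carried by the missingness itself is not thrown away by the imputation step.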
