Add data in place of feline-data_v2.csv, closes #717 #908

Merged
merged 2 commits into from
Jan 7, 2025
80 changes: 25 additions & 55 deletions episodes/04-data-structures-part1.Rmd
@@ -164,78 +164,50 @@ No matter how
complicated our analyses become, all data in R is interpreted as one of these
basic data types. This strictness has some really important consequences.

A user has added details of another cat. This information is in the file
`data/feline-data_v2.csv`.
A user has provided details of another cat. We can add an additional row to our cats table using `rbind()`.

```{r, eval=FALSE}
file.show("data/feline-data_v2.csv")
```

```{r, eval=FALSE}
coat,weight,likes_catnip
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1
```{r}
additional_cat <- data.frame(coat = "tabby", weight = "2.3 or 2.4", likes_catnip = 1)
additional_cat
cats2 <- rbind(cats, additional_cat)
cats2
```

Load the new cats data like before, and check what type of data we find in the
`weight` column:
Let's check what type of data we find in the
`weight` column of our new `cats2` object:

```{r}
cats <- read.csv(file="data/feline-data_v2.csv")
typeof(cats$weight)
typeof(cats2$weight)
```

Oh no, our weights aren't the double type anymore! If we try to do the same math
we did on them before, we run into trouble:

```{r}
cats$weight + 2
cats2$weight + 2
```

What happened?
The `cats` data we are working with is something called a *data frame*. Data frames
The `cats` (and `cats2`) data we are working with is something called a *data frame*. Data frames
are one of the most common and versatile types of *data structures* we will work with in R.
A given column in a data frame cannot be composed of different data types.
In this case, R does not read everything in the data frame column `weight` as a *double*, therefore the entire
In this case, R cannot store everything in the data frame column `weight` as a *double* anymore once we add the row for the additional cat (because its weight is `2.3 or 2.4`), therefore the entire
column data type changes to something that is suitable for everything in the column.
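
A minimal sketch of the coercion described above, for readers who want to reproduce it outside the lesson. The `cats` data frame is rebuilt by hand from the three rows of the CSV listing in this diff (an assumption; the lesson loads it from a file), and the sketch assumes R 4.0 or later, where `data.frame()` keeps strings as character rather than factor.

```r
# Hand-built stand-in for the lesson's `cats` data frame (assumption:
# same three rows as the CSV listing shown in this diff).
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_catnip = c(1, 0, 1)
)
typeof(cats$weight)   # "double"

# The new row stores the weight as text, so it cannot be a double.
additional_cat <- data.frame(coat = "tabby", weight = "2.3 or 2.4", likes_catnip = 1)
cats2 <- rbind(cats, additional_cat)

# The whole column is coerced to a type that can hold every value.
typeof(cats2$weight)  # "character"

# Arithmetic on the coerced column now fails; capture the error message
# instead of stopping the script.
result <- tryCatch(cats2$weight + 2, error = function(e) conditionMessage(e))
result
```

Note that the original `cats` object is untouched by `rbind()`; only the new `cats2` object carries the character column.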

When R reads a csv file, it reads it in as a *data frame*. Thus, when we loaded the `cats`
csv file, it is stored as a data frame. We can recognize data frames by the first row that
is written by the `str()` function:

```{r}
str(cats)
str(cats2)
```

*Data frames* are composed of rows and columns, where each column has the
same number of rows. Different columns in a data frame can be made up of different
data types (this is what makes them so versatile), but everything in a given
column needs to be the same type (e.g., vector, factor, or list).
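
As a rough illustration of the paragraph above, the following sketch checks the two properties it names: one type per column, and the same number of rows in every column. The `cats` data frame is again a hand-built stand-in (an assumption, not the lesson's file-backed object), and the `coat` result assumes R 4.0 or later.

```r
# Hand-built stand-in for the lesson's `cats` data frame.
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_catnip = c(1, 0, 1)
)

# Each column has a single type, but types may differ between columns.
col_types <- sapply(cats, typeof)
col_types

# Every column holds exactly as many elements as the data frame has rows.
all(lengths(cats) == nrow(cats))  # TRUE
```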

Let's explore more about different data structures and how they behave.
For now, let's remove that extra line from our cats data and reload it,
while we investigate this behavior further:

feline-data.csv:

```
coat,weight,likes_catnip
calico,2.1,1
black,5.0,0
tabby,3.2,1
```

And back in RStudio:

```{r, eval=FALSE}
cats <- read.csv(file="data/feline-data.csv")
```

```{r, include=FALSE}
cats <- cats_orig
```
Let's explore more about different data structures and how they behave. For now, we will focus on our original data frame `cats` (and we can forget about `cats2` for the rest of this episode).

### Vectors and Type Coercion

@@ -389,8 +361,7 @@ Create a new script in RStudio and copy and paste the following code. Then
move on to the tasks below, which help you to fill in the gaps (\_\_\_\_\_\_).

```
# Read data
cats <- read.csv("data/feline-data_v2.csv")
Using the object `cats2`:

# 1. Print the data
_____
@@ -402,15 +373,15 @@ _____(cats)
# The correct data type is: ____________.

# 4. Correct the 4th weight data point with the mean of the two given values
cats$weight[4] <- 2.35
cats2$weight[4] <- 2.35
# print the data again to see the effect
cats

# 5. Convert the weight to the right data type
cats$weight <- ______________(cats$weight)
cats2$weight <- ______________(cats2$weight)

# Calculate the mean to test yourself
mean(cats$weight)
mean(cats2$weight)

# If you see the correct mean value (and not NA), you did the exercise
# correctly!
@@ -420,7 +391,7 @@ mean(cats$weight)

#### 1\. Print the data

Execute the first statement (`read.csv(...)`). Then print the data to the
Print the data to the
console

::::::::::::::: solution
@@ -435,8 +406,8 @@ Show the content of any variable by typing its name.
Two correct solutions:

```
cats
print(cats)
cats2
print(cats2)
```

:::::::::::::::::::::::::
@@ -445,7 +416,7 @@ print(cats)

The data type of your data is as important as the data itself. Use a
function we saw earlier to print out the data types of all columns of the
`cats` table.
`cats2` `data.frame`.

::::::::::::::: solution

@@ -462,15 +433,14 @@ here.
> ### Solution to Challenge 1.2
>
> ```
> str(cats)
> str(cats2)
> ```

#### 3\. Which data type do we need?

The shown data type is not the right one for this data (weight of
a cat). Which data type do we need?

- Why did the `read.csv()` function not choose the correct data type?
- Fill in the gap in the comment with the correct data type for cat weight!

::::::::::::::: solution
@@ -549,8 +519,8 @@ auto-complete function: Type "`as.`" and then press the TAB key.
> There are two functions that are synonymous for historic reasons:
>
> ```
> cats$weight <- as.double(cats$weight)
> cats$weight <- as.numeric(cats$weight)
> cats2$weight <- as.double(cats2$weight)
> cats2$weight <- as.numeric(cats2$weight)
> ```

::::::::::::::::::::::::::::::::::::::::::::::::::
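
A hedged sketch of why the exercise corrects the fourth value (step 4) before converting the type (step 5). It uses a standalone character vector `weight` as a stand-in for `cats2$weight`; that name and its values are assumptions for illustration, not the lesson's object.

```r
# Stand-in for cats2$weight after the extra row was added (all values are text).
weight <- c("2.1", "5", "3.2", "2.3 or 2.4")

# Converting straight away turns the unparseable entry into NA (R also warns).
direct <- suppressWarnings(as.numeric(weight))
mean(direct)            # NA

# Step 4: replace the ambiguous entry; the vector is still character.
weight[4] <- 2.35
typeof(weight)          # "character"

# Step 5: every element now parses, so the conversion loses nothing.
weight <- as.numeric(weight)
mean(weight)            # 3.1625
```

Swapping `as.numeric()` for `as.double()` in step 5 gives the same result, since the two are synonymous.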