---
author:
  - name: Rafael H. M. Pereira
    orcid: "0000-0003-2125-7465"
    affiliations:
      - name: "Ipea - Institute for Applied Economic Research"
    url: "https://www.urbandemographics.org/about/"
---

# Access to Opportunities in Brazil

This chapter presents a rich dataset with estimates of access to opportunities in Brazilian cities. The dataset is created and made available by the [Access to Opportunities Lab](https://www.ipea.gov.br/acessooportunidades/en/) (AOP-Lab) at the Institute for Applied Economic Research (Ipea). The AOP-Lab has a core research agenda investigating urban accessibility conditions in Brazilian cities and developing computational tools for spatial accessibility and urban planning, with a focus on social inequality and sustainable urban mobility. These tools include, among others, the {r5r} [@pereira2021r5r] and {accessibility} [@pereira2022accessibility] packages in R.


## Overview of the data

The AOP-Lab generates high spatial resolution estimates of access to employment, public health, education, and social welfare services by transportation mode for Brazil's largest cities. All data produced by the lab is made publicly available, including not only accessibility estimates, but also information on the spatial distribution of population, economic activities and public services[^aop_papers]. The data is spatially aggregated into a hexagonal grid indexed by the [H3](https://h3geo.org/) geospatial indexing system, originally developed by Uber [@brodsky2018h3]. Each hexagonal cell covers around 0.11 km², an area similar to that covered by a city block, allowing for high spatial resolution analyses and outputs. 

[^aop_papers]: The methods used to generate these datasets are presented in detail in two separate publications in Portuguese, one for population and land use data [@pereira2022distribuicao] and another for accessibility data [@pereira2022estimativas].
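
To get a feel for this cell size, the sketch below converts the approximate area of 0.11 km² into an edge length. It assumes, purely for illustration, that each cell is a perfectly regular hexagon; H3 cells are only approximately regular, so the result should be read as a rough figure. The variable names are ours and are not part of any package.

```{r}
# approximate cell area reported above, in km²
cell_area_km2 <- 0.11

# area of a regular hexagon with edge s: A = (3 * sqrt(3) / 2) * s^2
# solving for s gives the edge length (assumes a perfectly regular hexagon)
edge_km <- sqrt(2 * cell_area_km2 / (3 * sqrt(3)))

# distance between two opposite sides of the hexagon
width_km <- sqrt(3) * edge_km

# edge length and width in meters
round(c(edge_m = edge_km * 1000, width_m = width_km * 1000))
```

Under this assumption, each cell has an edge of roughly 200 meters, which is consistent with the comparison to a city block.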

All accessibility estimates are generated by different transport modes (walking, cycling, public transport and automobile), times of day (peak and off-peak), population groups (aggregated by income, race, sex and age) and types of activity (jobs, schools, health services and social assistance centers). These accessibility estimates have been produced for the years 2017, 2018 and 2019 and for the 20 largest Brazilian cities, and the data is being expanded to cover more cities from 2022 onwards.


The following tables summarize the data currently available. @tbl-tabela_dados_access describes the urban accessibility dataset.

```{r}
#| echo: false
#| label: tbl-tabela_dados_access
#| tbl-cap: Accessibility indicators calculated by the AOP-Lab
tabela_dados_access <- data.table::data.table(
  `Indicator (code)` = c(
    "Minimum travel time (TMI)",
    "Active cumulative accessibility measure (CMA)",
    "Passive cumulative accessibility measure (CMP)"
  ),
  Description = c(
    "Time to the nearest opportunity",
    "Number of accessible opportunities within a given travel time threshold",
    "Number of people that can access the grid cell within a given travel time threshold"
  ),
  `Type of opportunities` = c(
    "Health, education and social assistance reference centers (CRAS)",
    "Jobs, health, education and CRAS",
    "-"
  ),
  `Travel time thresholds` = c(
    "Walk (60 minutes); bicycle, public transport and car (120 minutes)",
    "Walk and bicycle (15, 30, 45 and 60 minutes); public transport and car (15, 30, 60, 90 and 120 minutes)",
    "Walk and bicycle (15, 30, 45 and 60 minutes); public transport and car (15, 30, 60, 90 and 120 minutes)"
  )
)

knitr::kable(tabela_dados_access)
```
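
To make the three indicators more concrete, the sketch below computes them for a toy example with three grid cells. All values are hypothetical and the object names (prefixed with `toy_`) are ours. This is only an illustration of the definitions in the table, assuming a simple origin-destination travel time matrix, and not the processing pipeline actually used by the AOP-Lab.

```{r}
# toy example with three grid cells (hypothetical values, not AOP data)
# travel times in minutes: rows are origins, columns are destinations
toy_tt <- matrix(
  c( 0, 20, 50,
    20,  0, 35,
    50, 35,  0),
  nrow = 3,
  byrow = TRUE
)
toy_jobs <- c(0, 100, 300)   # opportunities located in each cell
toy_pop <- c(500, 200, 50)   # residents of each cell
toy_threshold <- 30          # travel time threshold, in minutes

# 1 if the destination can be reached from the origin within the threshold
toy_reachable <- (toy_tt <= toy_threshold) * 1

# CMA: number of opportunities reachable from each origin
toy_cma <- as.vector(toy_reachable %*% toy_jobs)

# CMP: number of people who can reach each destination
toy_cmp <- as.vector(t(toy_reachable) %*% toy_pop)

# TMI: travel time to the nearest cell with at least one opportunity
toy_tmi <- apply(toy_tt[, toy_jobs > 0, drop = FALSE], 1, min)

data.frame(cell = 1:3, CMA = toy_cma, CMP = toy_cmp, TMI = toy_tmi)
```

Note that the active measure takes the perspective of the origin (what can I reach from here?), while the passive measure takes the perspective of the destination (who can reach this place?).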

@tbl-tabela_dados_aop describes the dataset containing the sociodemographic characteristics of the population and the spatial distribution of opportunities. Note that the data is soon to be updated with information from the 2022 census and with more recent data on land use and public services.

```{r}
#| echo: false
#| label: tbl-tabela_dados_aop
#| tbl-cap: Data on the sociodemographic characteristics of the population and the spatial distribution of activities aggregated by AOP, by year of reference and data source
tabela_dados_pop <- data.table::data.table(
  Data = c(
    "Sociodemographic characteristics of the population",
    "Education services",
    "Health services",
    "Economic activity",
    "Social welfare services"
  ),
  Information = c(
    "Number of people by sex, age and race; average income per capita",
    "Number of public schools by education level (early childhood, primary and secondary education)",
    "Number of health facilities that serve the Unified Health System (SUS) by complexity level (low, medium and high complexity)",
    "Number of formal jobs by education level of workers (primary, secondary and tertiary education)",
    "Number of CRAS"
  ),
  Years = c(
    "2010",
    "2017, 2018, 2019",
    "2017, 2018, 2019",
    "2017, 2018, 2019",
    "2017, 2018, 2019"
  ),
  Source = c(
    "Demographic Census from the Brazilian Institute for Geography and Statistics (IBGE)",
    "School Census from the Anísio Teixeira National Institute for Educational Studies and Research (Inep)",
    "National Registry of Health Facilities (CNES) from the Ministry of Health",
    "Annual Relation of Social Information (RAIS) from the Ministry of Economy",
    "Unified Social Assistance System (SUAS) Census from the Ministry of Citizenship"
  )
)

knitr::kable(tabela_dados_pop)
```

All datasets created by AOP-Lab are available for download on the AOP-Lab [website](https://www.ipea.gov.br/acessooportunidades/en/dados/) or through the `{aopdata}` R package [@pereira2022aopdata].

In the next section, we provide a few examples illustrating how to download the data with accessibility estimates in Brazil, and how these data can be used to analyze and visualize urban accessibility levels and inequalities using the R programming language.


## Downloading urban accessibility data for Brazil

The easiest way to get urban accessibility data for Brazilian cities is to use the `{aopdata}` package [@pereira2022aopdata]. The package allows one to download estimates of accessibility to jobs, public health facilities, public schools and social assistance services for various years and multiple cities.

This data can be downloaded with the `read_access()` function, which works similarly to `read_population()` and `read_landuse()`. Besides the city (`city` parameter) and the reference year (`year`), however, one must also specify the transport mode (`mode`) and the period of the day (`peak`) to select the accessibility data to download. The peak period is between 6 am and 8 am, while the off-peak period is between 2 pm and 4 pm.

With the code below, we show how to download accessibility estimates that refer to the peak period in São Paulo in 2019. In this example, we download accessibility estimates both by car and by public transport and merge them into a single `data.frame`. Please note that this function returns a table that also includes sociodemographic and land use data.

```{r}
#| warning: false
#| message: false
access_pt <- aopdata::read_access(
  city = "São Paulo",
  mode = "public_transport",
  year = 2019,
  peak = TRUE,
  geometry = TRUE,
  showProgress = FALSE
)

access_car <- aopdata::read_access(
  city = "São Paulo",
  mode = "car",
  year = 2019,
  peak = TRUE,
  geometry = TRUE,
  showProgress = FALSE
)

data_sp <- rbind(access_pt, access_car)

names(data_sp)
```


The data dictionary (code book) can be accessed online[^aop_data_dictionary] or with the command `aopdata::aopdata_dictionary(lang = "en")` in an R session. The names of the accessibility estimates columns, such as `CMAEF30`, `TMISB` and `CMPPM60`, result from a combination of three components, as follows.

[^aop_data_dictionary]: Available at <https://ipeagit.github.io/aopdata/articles/data_dic_en.html>.


1. The **type of accessibility measure**, which is indicated by the first 3 letters of the code. The data includes three types of measures:

   - `CMA` - active cumulative accessibility;
   - `CMP` - passive cumulative accessibility; and
   - `TMI` - minimum travel time to the nearest opportunity.

2. The **type of activity** for which the accessibility levels were calculated, indicated by the following two letters, in the middle of the column name. The data includes accessibility estimates to various types of activities:

   - `TT` - all jobs;
   - `TB` - low education jobs;
   - `TM` - middle education jobs;
   - `TA` - high education jobs;
   - `ST` - all public health facilities;
   - `SB` - low complexity public health facilities;
   - `SM` - medium complexity public health facilities;
   - `SA` - high complexity public health facilities;
   - `ET` - all public schools;
   - `EI` - early childhood public schools;
   - `EF` - primary public schools;
   - `EM` - secondary public schools;
   - `MT` - total number of enrollments in public schools;
   - `MI` - number of enrollments in early childhood public schools;
   - `MF` - number of enrollments in primary public schools;
   - `MM` - number of enrollments in secondary public schools; and
   - `CT` - all CRAS.
 
In the case of the passive cumulative measure, the characters in the middle of the column name indicate the **population group** to which the accessibility estimates refer:

   - `PT` - the entire population;
   - `PH` - male population;
   - `PM` - female population;
   - `PB` - white population;
   - `PN` - black population;
   - `PA` - yellow population;
   - `PI` - indigenous population;
   - `P0005I` - population from 0 to 5 years old;
   - `P0614I` - population from 6 to 14 years old;
   - `P1518I` - population from 15 to 18 years old;
   - `P1924I` - population from 19 to 24 years old;
   - `P2539I` - population from 25 to 39 years old;
   - `P4069I` - population from 40 to 69 years old; and
   - `P70I` - population aged 70 years old and over.
 
3. The **travel time threshold** used to estimate the accessibility levels, which is indicated by the number at the end of the column name. This component only applies to the active and passive cumulative measures. The data includes accessibility estimates calculated with cutoffs of 15, 30, 45, 60, 90 and 120 minutes, depending on the transport mode.

**Examples:**

- <span style="color: red;">CMA</span><span style="color: black;">EF</span><span style="color: blue;">30</span>: number of accessible primary public schools within 30 minutes of travel;
- <span style="color: red;">TMI</span><span style="color: black;">SB</span>: minimum travel time to the closest low complexity public health facility; and
- <span style="color: red;">CMP</span><span style="color: black;">PM</span><span style="color: blue;">60</span>: number of women that can access a certain grid cell within 60 minutes of travel.
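
The naming scheme above can be sketched as a small helper that splits a column name into its three components. The function `parse_access_column()` is a hypothetical helper written for this illustration and is not part of `{aopdata}`. It assumes well-formed column names that follow the scheme described above and that use one of the listed travel time thresholds.

```{r}
# splits a column name into measure, activity or population group, and
# travel time threshold (illustrative helper; not part of {aopdata})
parse_access_column <- function(col) {
  threshold_pattern <- "(120|15|30|45|60|90)$"

  # the first three letters identify the accessibility measure
  measure <- substr(col, 1, 3)
  rest <- substring(col, 4)

  if (measure == "TMI") {
    # minimum travel time columns carry no threshold
    threshold <- NA_integer_
    group <- rest
  } else {
    # cumulative measures end with the threshold, in minutes
    threshold <- as.integer(regmatches(rest, regexpr(threshold_pattern, rest)))
    group <- sub(threshold_pattern, "", rest)
  }

  data.frame(
    column = col,
    measure = measure,
    group = group,
    threshold = threshold
  )
}

example_columns <- c("CMAEF30", "TMISB", "CMPPM60", "CMPP0005I60")
do.call(rbind, lapply(example_columns, parse_access_column))
```

A helper like this can be handy, for instance, to list all the thresholds available for a given measure among the columns returned by `read_access()`.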
 
The full description of the columns can also be found in the function documentation, running the `?read_access` command in R. The following sections show examples illustrating how to create spatial visualizations and charts out of the accessibility dataset.


## Map of travel time to access the nearest hospital

In this example, we compare the access time from each grid cell to the nearest public hospital by car and by public transport. To analyze the minimum travel time (`TMI`) to high complexity public hospitals (`SA`), we use the `TMISA` column. With the code below, we load the data visualization libraries and configure the maps showing the spatial distribution of access time by both transport modes. Because public transport trips are usually much longer than car trips, we truncate the travel time distribution to 60 minutes.

```{r}
#| warning: false
#| message: false
#| label: fig-time_to_hospital
#| fig-cap: Travel time to the closest high complexity public hospital in São Paulo
library(ggplot2)
library(patchwork)

# truncates travel times to 60 minutes
data_sp$TMISA <- ifelse(data_sp$TMISA > 60, 60, data_sp$TMISA)

ggplot(subset(data_sp, !is.na(mode))) +
  geom_sf(aes(fill = TMISA), color = NA, alpha = 0.9) +
  scale_fill_viridis_c(
    option = "cividis",
    direction = -1,
    breaks = seq(0, 60, 10),
    labels = c(seq(0, 50, 10), "60+")
  ) +
  labs(fill = "Time\n(minutes)") +
  facet_wrap(
    ~ mode,
    labeller = as_labeller(
      c(car = "Car", public_transport = "Public transport")
    )
  ) +
  theme_void()
```

## Map of employment accessibility

The accessibility dataset also makes it very easy to compare the number of accessible opportunities when considering different travel time thresholds. Using the code below, for example, we illustrate how to visualize, side-by-side, the spatial distribution of employment accessibility by public transport trips of up to 60 and 90 minutes.

```{r}
#| label: fig-accessible_jobs
#| fig-cap: Job accessibility by public transport in São Paulo 
# determine min and max values for the legend
limit_values <- c(0, max(access_pt$CMATT90, na.rm = TRUE) / 1000000)

fig60 <- ggplot(subset(access_pt, !is.na(mode))) +
  geom_sf(aes(fill = CMATT60 / 1000000), color = NA, alpha = 0.9) +
  scale_fill_viridis_c(option = "inferno", limits = limit_values) +
  labs(subtitle = "Up to 60 minutes", fill = "Jobs\n(millions)") +
  theme_void()

fig90 <- ggplot(subset(access_pt, !is.na(mode))) +
  geom_sf(aes(fill = CMATT90 / 1000000), color = NA, alpha = 0.9) +
  scale_fill_viridis_c(option = "inferno", limits = limit_values) +
  labs(subtitle = "Up to 90 minutes", fill = "Jobs\n(millions)") +
  theme_void()

fig60 + fig90 + plot_layout(guides = "collect")
```

## Accessibility inequalities

Finally, the `{aopdata}` accessibility dataset can be used to analyze accessibility inequalities across different Brazilian cities in several different ways. In this section, we present three examples of this type of analysis.
 
### Inequality in travel time to access opportunities

In this first example, we compare the average travel time to the nearest high complexity public hospital for people of different income levels. To do this, we calculate, for each income group, the average travel time to reach the nearest high complexity hospital, weighted by the population of each grid cell. Weighting the travel time by population is necessary because each cell has a different population size, thus contributing differently to the average accessibility of the population as a whole.

Before performing the calculation, we should note that some grid cells cannot reach any high complexity hospital within two hours of travel. In such cases, the columns with minimum travel time information assume an infinite value (`Inf`). To deal with this situation in our example, we replace all `Inf` values by a travel time of 120 minutes.
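
The toy example below, with four hypothetical grid cells, illustrates both steps: replacing `Inf` values and weighting by population. The values and object names are made up for illustration and do not come from the AOP dataset.

```{r}
# toy example with four grid cells (hypothetical values, not AOP data)
toy_time <- c(10, 20, Inf, 40)          # minimum travel time, in minutes
toy_residents <- c(1000, 500, 100, 0)   # population of each cell

# cells that cannot reach any opportunity are assigned 120 minutes
toy_time[is.infinite(toy_time)] <- 120

# simple average: every cell counts the same, even the unpopulated one
mean(toy_time)

# population-weighted average: cells count in proportion to their residents
weighted.mean(toy_time, w = toy_residents)
```

In this example, the simple average is 47.5 minutes, whereas the population-weighted average is 20 minutes, because most residents live in the cells with the shortest travel times.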

```{r}
#| label: fig-time_to_hospital_by_income
#| fig-cap: Average travel time by public transport to the nearest high complexity hospital in São Paulo
# copies access data into a new data.frame
ineq_pt <- data.table::as.data.table(access_pt)

# replaces Inf values with 120
ineq_pt[, TMISA := ifelse(is.infinite(TMISA), 120, TMISA)]

# calculates the average travel time by income decile
ineq_pt <- ineq_pt[
  ,
  .(avrg = weighted.mean(x = TMISA, w = P001, na.rm = TRUE)),
  by = R003
]
ineq_pt <- subset(ineq_pt, !is.na(avrg))

ggplot(ineq_pt) +
  geom_col(aes(y = avrg, x = factor(R003)), fill = "#2c9e9e", color = NA) +
  scale_x_discrete(
    labels = c("D1\npoorest", paste0("D", 2:9), "D10\nwealthiest")
  ) +
  labs(x = "Income decile", y = "Travel time (minutes)") +
  theme_minimal()
```

### Inequality in the number of accessible opportunities

Another way of examining accessibility inequalities is by comparing the number of opportunities that can be reached by different population groups considering the same transport modes and travel time limits. In this case, we analyze the total number of jobs accessible by people of different income deciles by public transport in up to 60 minutes. To do this, we look at the active cumulative access (`CMA`) to total jobs (`TT`) in up to 60 minutes, which is represented by the column `CMATT60` in the dataset.

```{r}
#| label: fig-accessible_jobs_by_income
#| fig-cap: Distribution of job accessibility by public transport in up to 60 minutes of travel in São Paulo
ggplot(subset(access_pt, !is.na(R003))) +
  geom_boxplot(
    aes(x = factor(R003), y = CMATT60 / 1000000, color = factor(R003))
  ) +
  scale_color_brewer(palette = "RdBu") +
  labs(
    color = "Income\ndecile",
    x = "Income decile",
    y = "Accessible jobs (millions)"
  ) +
  scale_x_discrete(
    labels = c("D1\npoorest", paste0("D", 2:9), "D10\nwealthiest")
  ) +
  theme_minimal()
```

### Inequality in accessibility between transport modes

Finally, we can also compare how the use of different transport modes can lead to different accessibility levels and how the discrepancy between modes varies across cities. In the example below, we compare the number of jobs that one can access in up to 30 minutes of walking and driving. To do this, we first download accessibility estimates by both transport modes for all cities included in the `{aopdata}` package.

```{r}
#| message: false
data_car <- aopdata::read_access(
  city = "all",
  mode = "car",
  year = 2019,
  showProgress = FALSE
)

data_walk <- aopdata::read_access(
  city = "all",
  mode = "walk",
  year = 2019,
  showProgress = FALSE
)
```
 
Next, we calculate, for each city and transport mode, the weighted average number of jobs accessible by trips of up to 30 minutes (`CMATT30`). We then join these estimates together into a single table and calculate the ratio between car and walk accessibility levels.

```{r}
avg_car <- data_car[
  ,
  .(access_car = weighted.mean(CMATT30, w = P001, na.rm = TRUE)),
  by = name_muni
]

avg_walk <- data_walk[
  ,
  .(access_walk = weighted.mean(CMATT30, w = P001, na.rm = TRUE)),
  by = name_muni
]

# merges the data and calculates the ratio between access by car and on foot
avg_access <- merge(avg_car, avg_walk)
avg_access[, ratio := access_car / access_walk]

head(avg_access)
```

Finally, we can analyze the results using a chart:

```{r}
#| label: fig-car_walk_ratio
#| fig-cap: Ratio between job accessibility levels by car and on foot considering trips of up to 30 minutes in the 20 biggest Brazilian cities
ggplot(avg_access, aes(x = ratio, y = reorder(name_muni, ratio))) +
  geom_bar(stat = "identity") +
  geom_text(aes(x = ratio + 3, label = paste0(round(ratio), "x"))) +
  labs(y = NULL, x = "Ratio between car and walk accessibility") +
  theme_classic()
```

As expected, @fig-car_walk_ratio shows that car trips lead to much higher accessibility levels than equally long walking trips. This difference, however, varies greatly across cities. In São Paulo and Brasília, a 30-minute car trip allows one to access, on average, 54 times more jobs than a 30-minute walk. In Belém, the city in our sample with the smallest difference, one can access 17 times more jobs by car than on foot, which is still a substantial difference, but much smaller than in other cities.