[
["index.html", "Applied Spatial Statistics with R Preface Spatial Analysis and Spatial Statistics Why this Text? Plan Audience Requisites Words of Appreciation Versioning", " Applied Spatial Statistics with R Antonio Paez 2020-03-01 Preface “Patterns cannot be weighed or measured. Patterns must be mapped.” — Fritjof Capra, The Web of Life: A New Scientific Understanding of Living Systems Spatial Analysis and Spatial Statistics The field of spatial statistics has experienced phenomenal growth in the past 20 years. From being a niche subdiscipline in quantitative geography, statistics, regional science, and ecology at the beginning of the 1990s, it is now a mainstay in applications in a multitude of fields, including medical imaging, remote sensing, civil engineering, geology, statistics and probability, spatial epidemiology, end ecology, to name just a few disciplines. The growth in research and applications in spatial statistics has been in good measure fueled by the explosive growth in geotechnologies: technologies for sensing and describing the natural, social, and built environments on Earth. An outcome of this is that spatial data are, to an unprecedented level, within the reach of multitudes. Hardware and software have become cheaper and increasingly powerful, and we have transitioned from a data poor environment (in all respects, but particularly in terms of spatial data) to a data rich environment. Twenty years ago, for instance, technical skills in spatial analysis included tasks such as digitizing. As a Masters student, I spent many boring hours digitizing paper maps before I could do any analysis on the single-seat (and relatively expensive) Geographic Information System (GIS) available in my laboratory. In that place at that time I was more or less a freak: altough there was an institutional push to adopt GIS, relatively few in my academic environment saw the value of spending hours digitizing maps, something that nowadays we would consider relatively low-level technical work. Surely, the time of a Masters student, let alone a professional researcher or business analyst, is more valuable than that. Indeed, very little time is spent anymore in such tasks low-level tasks, as data increasingly are collected and disseminated in native digital formats. Instead, there is a huge appetite for what could be called the brainware of spatial analysis, the intelligence counterpart of the hardware, software, and data provided by geotechnologies. The contribution of brainware to spatial analysis is to make sense of vast amounts of data, in effect transforming them into information. This information in turn can be useful to understand basic scientific questions (e.g., changes in land cover), to support public policy (e.g., what is the value capture of public infrastructure), and to inform business decisions (e.g., what levels of demand can be expected given the distribution of retail outlets). There are numerous forms of spatial analysis, including normative techniques (such as spatial optimization; see Tong and Murray 2012) and geometric and cartographic analysis (for instance, map algebra; Tomlin 1990). Among these, spatial statistics is one of the key elements in the family of toolboxes for spatial analysis. So what is spatial statistics? Very quickly, I will define spatial statistics as the application of statistical techniques to data that have geographical references - in other words, to the statistical analysis of maps. 
Like statistics more generally, spatial statistics is interested in hypothesis testing and inference. What distinguishes it as a branch of the broader field of statistics is its explicit interest in situations where data are not independent of each other (like throws of a die) but rather display systematic associations. These associations, when seen through the lens of cartography, can manifest themselves as patterns of similarities (birds of a feather flock together) or dissimilarities (repulsion due to spatial competition among firms) - as two common examples of spatial patterns. Spatial statistics covers a broad array of techniques for the analysis of spatial patterns, including tools for testing whether patterns are random or not, and a wide variety of modelling approaches as well. These tools enhance the brainware of analysts by allowing them to identify and possibly model patterns for inferring processes and/or for making spatial predictions. Why this Text? The objective of this book is to introduce selected topics in applied spatial statistics. The foundation for the book is the notes that I have developed over several years of teaching applied spatial statistics at McMaster University. This course is a specialist course for senior-level undergraduate geographers and students in other disciplines who are often working towards specializations in GIS. Over the years, my colleagues at McMaster and I have used at least three different textbooks for teaching spatial statistics. I have personally used McGrew and Monroe (2009) to introduce fundamental statistical concepts to geographers. McGrew and Monroe (currently on a third edition with Lembo) do a fine job of introducing statistics as a tool for decision making, and therefore offer a very valuable resource to learn matters of inference, for instance. Many of the examples in the book are geographical; however, the book is relatively limited in its coverage of spatial statistics (particularly models for spatial processes), which is a limitation for teaching a specialist course on this topic. My text of choice early on (approximately between 2003 and 2010) was the excellent book Interactive Spatial Data Analysis by Bailey and Gatrell (1995). A notable aspect of Bailey and Gatrell was that the book was accompanied by a software application to implement the techniques they discussed. I started using their book as a graduate student around 1998, but even then the limitations of the software that accompanied the book were apparent - in particular the absence of updates or a central repository for code (the book had a sleeve to store a \\(3\\frac{1}{2}\\)-inch floppy disk to install the software). Despite the regrettable obsolescence of the software, the book provided then, and still does, a very accessible yet rigorous treatment of many topics of interest in spatial statistics. Bailey and Gatrell’s book was, I believe, the first attempt to bridge, on the one hand, the need to teach mid- and upper-level university courses in spatial statistics, and on the other, the challenges of doing so with very specialized texts on this topic, including the excellent but demanding Spatial Econometrics (Anselin 1988), Advanced Spatial Statistics (Griffith 1988), Spatial Data Analysis in the Social and Environmental Sciences (Haining 1990), not to mention Statistics for Spatial Data (Cressie 1993). 
More recently, as Bailey and Gatrell aged, my book of choice for teaching spatial statistics became O’Sullivan and Unwin’s Geographical Information Analysis (O’Sullivan and Unwin 2010). This book updated a number of topics that were not covered by Bailey and Gatrell. To give one example, much work happened in the mid- to late-nineties with the development of local forms of spatial analysis, including pioneering research by Getis and Ord on concentration statistics (Getis and Ord 1992), Anselin’s Local Indicators of Spatial Association (Anselin 1995), and Brunsdon, Fotheringham, and Charlton’s research on geographically weighted regression (Brunsdon, Fotheringham, and Charlton 1996). These and related local forms of spatial analysis have become hugely influential in the intervening years, and are duly covered by O’Sullivan and Unwin in a way that merges well with a course focusing on spatial statistics - although other specialist texts also exist that delve into this topic in much more depth (e.g., Fotheringham and Brunsdon 1999; Lloyd 2010). These resources, and many more, have proved invaluable for my teaching for the past few years, and I am sure that their influence will be evident in the present book. Other excellent sources are also available, including Applied Spatial Data Analysis with R (Bivand, Pebesma, and Gómez-Rubio 2008), Spatial Data Analysis in Ecology and Agriculture Using R (Plant 2012), An Introduction to R for Spatial Analysis & Mapping (Brunsdon and Comber 2015), Spatial Point Patterns: Methodology and Applications with R (Baddeley, Rubak, and Turner 2016), and Geocomputation with R (Lovelace, Nowosad, and Muenchow 2019). This is in addition to other resources available online, such as M. Gimond’s Intro to GIS and Spatial Analysis and R. Hijmans’s Spatial Data Analysis and Modeling with R. So, if there are some excellent resources for teaching and learning spatial statistics, why am I moved to unleash on the world yet another text on this topic? I am convinced that there is richness in variety. As demand for training in spatial statistics grows, there is potential for different sources to satisfy different needs. Some books are geared towards specialized topics (e.g., point pattern analysis; Baddeley, Rubak, and Turner 2016) and cover their subject matter in much more depth than I could in an undergraduate course. For this reason, they are more useful as a reference or a tool for learning for researchers and graduate students. Other books focus more heavily on mapping in R than a course on spatial statistics can comfortably accommodate (e.g., Brunsdon and Comber 2015; Lovelace, Nowosad, and Muenchow 2019). And yet other books are geared towards specific disciplines (e.g., ecology and agriculture; Plant 2012). Bivand et al. (2008) is an excellent reference. At the time of their writing, much work was devoted to issues of spatial data representation. As a consequence, a good portion of their book is concerned with the critical issue of handling spatial data, including data classes and import/export operations which, while essential, happen for most practitioners at a more basic level. My approach can be seen as complementary to some of the texts above. I have tried to write a text that introduces key concepts of data handling and mapping in R as they are needed to learn and practice spatial statistical analysis. This I have tried to do in as intuitive a way as I could. 
Readers will see that the computational part of the book - everything that usually lives “under the hood” so to speak - is laid bare in the open. The code is extensively documented as it is introduced (with some repetition for pedagogical purposes). Once a reader has seen and used some commands, we proceed to introduce more sophisticated computational approaches, which are in turn documented extensively when they first appear. I like to think of this approach as introducing coding by stealth, with a gentle ramp for those students who may not have extensive experience in computer-speak. The computational aspects of the book constitute the “how to”. How to calculate a summary statistic. How to create a plot. How to map a variable. The how is an essential foundation for then exercising the brainware. By introducing the how-to in a relatively gentle way, I have been able to concentrate on introducing (again, in what I hope is an intuitive way!) key concepts in spatial statistics. The text is not meant to be used as a reference, although some readers may find that it works that way, in particular with respect to the implementation of techniques. Rather, the text is more suitable to be read linearly - indeed as a course on the topic of spatial statistics. Readers who have familiarized themselves with the text can possibly find it useful as a reference, but I do not recommend using it as a reference in the first place. Lastly, the focus of the text is on applied spatial statistics. There is, inevitably, a component of math, but I have tried, to the extent of my ability, to make the underlying math as intuitive and accessible as possible. As noted above, there is also an important computational component - in particular, as per the title, using the R statistical language. As McElreath (2016) notes, in addition to the pedagogical value of teaching statistics using a coding approach, much of statistics has in fact become so computational that coding skills are increasingly indispensable. I tend to agree with this, and there are reasons to believe that one of the strengths of this approach as well is to make statistical work as open, clear, and reproducible as possible (see Rey 2009). Plan My aim with this book is to introduce key concepts and techniques in the statistical analysis of spatial data in an intuitive way. While there are more advanced treatments of every single one of these topics, this book should be appealing to undergraduate students or others who are approaching the topic for the first time. The book is organized thematically following the canonical approach seen, for instance, in Bailey and Gatrell (1995), Bivand et al. (2008), and O’Sullivan and Unwin (2010). This approach is to conceptualize data by their unit of support. Accordingly, data are seen as being represented by: Discrete processes in space (e.g., points and events). Aggregations into zones for statistical purposes (e.g., demographic variables in census areas). Discrete measurements in space of an underlying continuous process (e.g., weather stations monitoring temperature). The book is organized in such a way that each chapter covers a topic that builds on previous material. All chapters, starting with Chapter 3, are followed by an activity. I have used the materials presented in this text (in a variety of incarnations) for teaching spatial data analysis in different settings. Primarily, these notes have been used in the course GEOG 4GA3 Applied Spatial Statistics at McMaster University. 
This course is a full (Canadian) academic term, which typically means 13 weeks of classes. The course is organized as a 2-hour-per-week class, with a GIS-lab component which uses a complementary set of notes. For this reason, each chapter is designed to cover very approximately the material that I typically cover in a 50-minute lecture in a traditional classroom-lecturing setting. In this case, the activities that follow each chapter could be assigned as homework, optional materials, or as lab materials. For instructors who do not have a lab component, the activities could easily be adapted as lab exercises. More recently, I have experimented with delivery of contents in a flipped classroom format (see here for a discussion of flipped classrooms). Briefly, a flipped classroom changes the role of the instructor and the delivery of contents. In a flipped classroom, the instructor minimizes lecturing time, opting instead for offering study materials in advance (often the materials are online and may have an interactive component). This frees the instructor from the tyranny of lecturing, so that in-class time can be dedicated instead to hands-on activities. The instructor is no longer a magical source of wisdom, but rather a partner in the learning process. Under this scenario, students are responsible for reading the chapter or chapters required in advance of a class. The class then is dedicated to the activity that follows the chapter, with students working individually or in small groups on the activity. I have broken a 50-minute session of this type down as follows: 10 minutes for a short mini-lecture and to discuss any questions about the preceding reading/study materials, followed by 30 minutes to complete the activity; during this time I engage individually or in small groups with the students as they work; and before the end of the 50-minute session, a 10-minute recap, where I summarize the key aspects of the lesson, clearly identify the threshold concepts covered, and indicate how this relates to the next lesson. Increasingly I see this format as a form of apprenticeship, where the students learn by doing, and see links (which I have yet to explore) to experiential learning. In addition to the two formats above (traditional classroom-lecture and flipped classroom), I have also used portions of these notes to teach short courses in different places (e.g., the University of Western Australia and the Gran Sasso Science Institute in Italy). The materials can, with some relatively minor modifications, be used in this way. As I continue to work on these notes, I hope to be able to add optional (or bonus) chapters that could be used 1) to extend a course on spatial statistics beyond the 13-week horizon of the Canadian term, and/or 2) to offer more advanced material to interested readers; see here for an example on spatial filtering. Audience The notes were designed for a course in geography, but in fact, could be easily adjusted for an audience of earth scientists, environmental scientists, econometricians, planners, or students in other disciplines who have an interest in and work with georeferenced datasets. The prerequisite is an introductory college/university-level course on multivariate statistics, ideally covering the fundamentals of probability, hypothesis testing, and multivariate linear regression analysis. Requisites To fully benefit from this text, up-to-date copies of R and RStudio are highly recommended. 
Many examples in the text use datasets that have been packaged for convenience as an R package. To install the package (geog4ga3) use the following command (which requires devtools): library(devtools) devtools::install_github("paezha/Spatial-Statistics-Course", subdir = "geog4ga3") The source files for the chapters and activities can be obtained from the following GitHub repository: https://github.com/paezha/paezha.github.io/tree/master/applied_spatial_statistics It is also possible to suggest edits to the text, and it only requires a GitHub account (sign-up at github.com). After logging into GitHub, you can click on the ‘edit me’ icon (the fourth icon in the top left corner of the toolbar at the top). Words of Appreciation I would like to express my gratitude to the Paul R. MacPherson Institute for Leadership, Innovation and Excellence in Teaching. The Institute supported, through its Student Partners program, my work with some amazing student partners. As part of this program, I worked with Mr. Rajveer Ubhi in the Fall of 2018 and Winter of 2019 organizing all the materials for the text, documenting the code, and ensuring that it satisfied student needs. I also had the opportunity to work with Ms. Megan Coad and Ms. Alexis Polidoro in the Fall of 2019 and Winter of 2020. As former students of the course, Ms. Coad and Ms. Polidoro helped to develop a set of mini-lectures to accompany the materials, continued to document the code, and tested the activities. In the Winter of 2020 they will also accompany me in the classroom to work directly with new students when we offer the course again. Dr. Anastasios Dardas helped develop illustrative applications that allowed us to understand the value of interactivity in delivering many of the contents. Working with these wonderful individuals has been a pleasure, and I am grateful for their contributions to this effort. Versioning These notes were developed using the following version of R: ## _ ## platform x86_64-w64-mingw32 ## arch x86_64 ## os mingw32 ## system x86_64, mingw32 ## status ## major 3 ## minor 6.2 ## year 2019 ## month 12 ## day 12 ## svn rev 77560 ## language R ## version.string R version 3.6.2 (2019-12-12) ## nickname Dark and Stormy Night References "],
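If you want to compare your own setup against the session information listed in the Versioning section, the commands below are one way to do it (a minimal sketch; R.version.string is part of base R and sessionInfo() is in the utils package, both available in every R session):
# Print the version of R currently running, to compare with the version listed above
R.version.string
# A fuller report of the platform, locale, and any attached packages
sessionInfo()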
["preliminaries-installing-r-and-rstudio.html", "Chapter 1 Preliminaries: Installing R and RStudio 1.1 Introduction 1.2 Learning Objectives 1.3 R: The Open Statistical Computing Project 1.4 Packages in R", " Chapter 1 Preliminaries: Installing R and RStudio 1.1 Introduction Statistical analysis is the study of the properties of a dataset. There are different aspects of statistical analysis, and they often require that we work with data that are messy. According to Wickham and Grolemund (2016), computer-assisted data analysis includes the steps outlined in Figure 1.1. First, the data are imported to a suitable software application. This can include data from primary sources (suppose that you collected coordinates using a GPS) or from secondary sources (the Census of Canada). Data will likely be text tables, or an Excel file, among other possible formats. Before data can be analyzed, they need to be tidied. This means that the data need to be arranged in such a way that they match the process that you are interested in. For instance, a travel survey can be organized so that each row is a traveler, or as an alternative so that each row is a trip. Once that data are tidy, Exploratory Data Analysis (EDA) and/or its geographical extension Exploratory Spatial Data Analysis (ESDA) can be conducted. This involves transforming the raw data into information. Examples of transformations include calculating the mean and the standard deviation. Visualization is also part of this exploratory exercise. In EDA this could be creating a histogram or a scatterplot. Mapping is a key visualization technique in spatial statistics. Modeling is a process that further extracts information from the data, typically by looking at relationships between multiple variables. All of the tasks mentioned above, and many more, can be handled easily in a variety of software applications. For this course, you will use the statistical computing language R. Figure 1.1: The process of doing data analysis (from Wickham and Grolemund, 2016) 1.2 Learning Objectives In this reading, you will learn: How to install R. About the RStudio Interactive Development Environment. About packages in R. 1.3 R: The Open Statistical Computing Project 1.3.1 What is R? R is an open-source language for statistical computing. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland, in New Zealand, as a way to offer their students an accessible, no-cost tool for their courses. R is now maintained by the R Development Core Team, and is developed by hundreds of contributors around the globe. R is an attractive alternative to other software applications for data analysis (e.g., Microsoft Excel, STATA) due to its open-source character (i.e., it is free), its flexibility, and large and dedicated user community. The presence of a very active community of developers and users, especially in an open context, means if there is something you want to do (for instance, linear regression), it is very likely that someone has already developed functionality for it in R. A good way to think about R is as a core package, to which a library, consisting of additional packages, can be attached to increase its functionality. R can be downloaded for free at: https://cran.rstudio.com/ R comes with a built-in console (a user graphical interface), but better alternatives to the basic interface exist, including RStudio, an Integrated Development Environment, or IDE for short. 
RStudio can also be downloaded for free, by visiting the website: https://www.rstudio.com/products/rstudio/download/ R requires you to work using the command line, which is going to be unfamiliar to many of you accustomed to user-friendly graphical interfaces. Do not fear. People worked for a long time using the command line, or even more cumbersomely, using punched cards in early computers. Graphical user interfaces are convenient, but they have a major drawback, namely their inflexibility. A program that functions based on graphical user interfaces allows you to do only what is hard-coded in the user interface. The command line, as we will see, is somewhat more involved, but provides much more flexibility in operation, and it frees you from the constraints inherent in a point-and-click system. Go ahead. Install R and RStudio on your computer. (If you are at McMaster working in the GIS lab, you will find that these have already been installed there). Before introducing some basic functionality in R, let's quickly take a tour of the RStudio IDE. 1.3.2 The RStudio IDE The RStudio IDE provides a very complete interface to interact with the language R, and do much more in addition. It consists of a window with several panes. Some panes include, in addition, several tabs. There are the usual drop-down menus for common operations, such as creating new files, saving, common commands for editing, etc. See Figure 1.2 below. Figure 1.2: The RStudio IDE The editor pane allows you to open and work with text and other files, where you can write instructions that can be passed on to the program. Writing something in the editor does not execute any instructions; it merely records them for possible future use. In fact, much of what is written in the editor will not be instructions, but rather comments, discussion, and other text that is useful to understand code. The console pane is where instructions are passed on to the program. When an instruction is typed (or copied and pasted) there, R will understand that it needs to do something. The instructions must be written in a way that R understands, otherwise errors will occur. If you have typed instructions in the editor, you can use “ctrl-Enter” (in Windows) or “cmd-Enter” (in Mac) to send them to the console and execute them. The Environment tab is where all data currently in memory are reported. The History tab acts like a log: it keeps track of the instructions that have been executed in the console. The last pane includes a number of useful tabs. The Files tab allows you to navigate your computer, change the working directory, see what files are where, and so on. The Plots tab is where plots are rendered, when instructions require R to do so. The Packages tab allows you to manage packages, which as mentioned above, are pieces of code that can augment the functionality of R. The Help tab is where you can consult the documentation for functions and packages, see examples, and so on. The Viewer tab is for displaying local web content, for instance, to preview a Notebook (more on Notebooks soon). This brief introduction should have allowed you to install both R and RStudio. The next thing that you will need is a library of packages. 1.4 Packages in R According to Wickham (2015) packages are the basic units of reproducible code in the R multiverse. Packages allow a developer to create a self-contained unit of code that often is meant to achieve some task. 
For instance, there are packages in R that specialize in statistical techniques, such as cluster analysis, visualization, or data manipulation. Some packages can be miscellaneous tools, or contain mostly datasets. Packages are a very convenient way of maintaining code, data, and documentation, and also of sharing all these resources. Packages can be obtained from different sources (including making them!). One of the reasons why R has become so successful is the relative facility with which packages can be distributed. A package that I use frequently is called tidyverse (Wickham 2017). The tidyverse is a collection of functions for data manipulation, analysis, and visualization. This package can be downloaded and installed in your personal library of R packages by using the function install.packages, as follows: install.packages("tidyverse") The function install.packages retrieves packages from the Comprehensive R Archive Network, or CRAN for short. CRAN is a collection of sites (accessible via the internet) that carry identical materials for distribution of R. There are other ways of distributing packages. For instance, throughout this book you will make use of a package called geog4ga3 that contains a collection of datasets and functions used in the readings or activities. This package is not on CRAN, but instead can be obtained from GitHub, a repository and versioning system. To retrieve packages from GitHub you need a function called install_github, which in turn is part of the package devtools. To download and install the package geog4ga3, you first need to download and install devtools as follows: install.packages("devtools") Once a package has been downloaded and installed, it needs to be loaded into a session to be available to use. I find it useful to think of packages that I download as “books” that I place in my personal “bookshelf”. Some “books” I obtain from the central library (i.e., CRAN), while others are shared by friends, and some I have even written myself. Once the “books” are in my “bookshelf” they are part of my own personal library. This means that they are available for use. Next time I want to use a “book” from my library, I need to retrieve it from the bookshelf. This is similar to taking the book and opening it on my desk: now all the magic contained in the package is available for use! Similarly, once the book is in my library, I do not need to obtain it again from its source - a package, once installed, does not need to be installed again (it might need updates, but that is a different matter). This analogy suggests that I can have many packages in my library, only some of which I may need at any specific time for a task. To retrieve a package (i.e., a book) from the library, so that we can use it, the function library is invoked as in this example: library(devtools) This allows you to use all the functions in the package devtools. In particular, at this point you want to use a function that allows you to retrieve other packages! With the functionality of devtools::install_github you can download and install the companion package for the book by running the following instruction: install_github("paezha/Spatial-Statistics-Course", subdir="geog4ga3") This will install the package (i.e., put it in your library) so that you can also benefit from its functionality. References "],
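As a follow-up to the installation steps above, here is one way to combine them so that nothing is re-installed unnecessarily (a minimal sketch; requireNamespace() is part of base R, and the install_github() call is the same one shown in the chapter):
# Install devtools from CRAN only if it is not already in the library
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
# Install the companion package from GitHub only if it is not already in the library
if (!requireNamespace("geog4ga3", quietly = TRUE)) {
  devtools::install_github("paezha/Spatial-Statistics-Course", subdir = "geog4ga3")
}
# Load the companion package for the current session
library(geog4ga3)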
["basic-operations-and-data-structures-in-r.html", "Chapter 2 Basic Operations and Data Structures in R 2.1 Learning Objectives 2.2 RStudio IDE 2.3 Some Basic Operations 2.4 Data Classes in R 2.5 Data Types in R 2.6 Indexing and Data Transformations 2.7 Visualization 2.8 Creating a Simple Map", " Chapter 2 Basic Operations and Data Structures in R NOTE: You can download the source files for this book from here. The source files are in the format of R Notebooks. Notebooks are pretty neat, because the allow you execute code within the notebook, so that you can work interactively with the notes. The preceding chapter showed you how to install R and RStudio, and explained some key concepts, such as packages as fundamental units of reproducible code, and the concept of your library (where the packages that you install are stored). Now that you have installed R and RStudio we can begin with an overview of basic operations and data structures in this computing language. Please note that this document you are reading, called an R Notebook, is an example of what is called literate programming, a style of document that uses code to illustrate a discussion, as opposed to the traditional programming style that uses natural language to discuss/document the code. By focusing on natural language as opposed to computer-speak, literate programming flips around the usual manner of technical writing to make documents more intuitive and accessible. Whenever you see a chunk of code in an R Notebook, you can run it (by clicking the ‘play’ icon on the top right corner) to see the results. Try it! print("Hello, Geography 4GA3") ## [1] "Hello, Geography 4GA3" If, instead of the R Notebook, you are reading this chapter as a WebBook or in print, you can type or copy and paste the instructions on your console. Try it! The chunk of code above instructed R (and through R the computer) to print (or display on the screen) some text. 2.1 Learning Objectives In this practice, you will learn: Basic operations in R. Data classes, data types, and data transformations. More about the use of packages in R. Basic visualization. 2.2 RStudio IDE If you are reading this, you probably already read the introductory chapter that instructed you to install R and RStudio. We can now proceed to discuss some basic concepts of operations and data types. 2.3 Some Basic Operations R can perform many types of operations. Some simple operations are arithmetic. Other are logical. And so on. For instance, R can be instructed to conduct sums, as follows: # `R` understands numbers and arithmetic operators such as `+` for addition 2 + 2 ## [1] 4 R can be instructed to do multiplications: # The sign to instruct `R` to multiply is `*` 2 * 3 ## [1] 6 And sequences of operations, possibly using brackets to indicate their order. Compare the following two expressions: 2 * 3 + 5 ## [1] 11 2 * (3 + 5) ## [1] 16 Other operations produce logical results (values of true and false): # Is the statement true? 3 > 2 ## [1] TRUE # Is this true? 3 < 2 ## [1] FALSE And of course, you can combine operations in an expression: 2 * 3 + 5 < 2 * (3 + 5) ## [1] TRUE As you can see, R can be used as a calculator, but it is much more powerful than that. We can also create variables. You can think of a variable as a box with a name, whose contents can change. Variables are used to keep track of important stuff in your calculations, and to automate operations. To create a variable, a value is assigned to a name, using this notation <-. 
You can read this x <- 2 as “assign the value of 2 to a variable called x”. For instance: # `<-` means "put the value of 2 in the object called `x`" x <- 2 # `<-` means "put the value of 3 in the object called `y`" y <- 3 # `<-` means "put the value of 5 in the object called `z`" z <- 5 Check your “Global Environment”, the tab where the contents of your “Workspace” are displayed for you. You can also simply type the name of the variable in the Console to see its contents. Now that we have some variables with values, we can express operations as follows (same as above): x * y + z ## [1] 11 x * (y + z) ## [1] 16 However, if we wanted, we could change the values of any of x, y, and/or z and repeat the operations. This allows us to automate some instructions: x <- 4 x * y + z ## [1] 17 The famous mathematician Henri Poincaré once wrote that “[m]athematics is the art of giving the same name to different things”. Working with a computer language is a lot like that: giving the same name to different values allows us to explore with ease “what would happen if…”. It is a very powerful tool to help us understand the world. 2.4 Data Classes in R As you saw above, R can work with different data classes. Some data are numbers. Other data are logical (i.e., take values of TRUE or FALSE). These are some data classes: Numerical Character Logical Factor The existence of different data classes is very useful, since it allows you to store information in different forms. For instance, you may want to save some text: name <- "Hamilton" Or numerical information: population <- 551751 If you wish to check what class an object is, you can use the function class: class(name) ## [1] "character" class(population) ## [1] "numeric" 2.5 Data Types in R R can work with different data types, including scalars (essentially matrices with only one element), vectors (one-dimensional arrays of values) and matrices (more generally). print('This is a scalar') ## [1] "This is a scalar" 1 ## [1] 1 print('This is a vector') ## [1] "This is a vector" # c() is a function to concatenate, that is, to put values in a vector c(1,2,3,4) ## [1] 1 2 3 4 print('This is a matrix') ## [1] "This is a matrix" # matrix() creates a two-dimensional array with `nrow` rows, and `ncol` columns matrix(c(1,2,3,4),nrow = 2, ncol=2) ## [,1] [,2] ## [1,] 1 3 ## [2,] 2 4 The command c() is used to concatenate the arguments, that is, to join them in a single object. The objects must be of the same class: they must be all numeric, or all character, or all logical, and so on. We cannot combine different data classes in a vector. 
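A quick way to see this rule in action (a minimal sketch; c() and class() were introduced above):
# When classes are mixed, c() silently coerces everything to the most flexible class; here, character
mixed <- c(1, "Hamilton", TRUE)
class(mixed)
## [1] "character"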
The command matrix() creates a matrix with the specified number of rows and columns. An important data type in R is a data frame. A data frame is a table consisting of rows and columns - commonly a set of vectors that have been collected for convenience. A data frame is used to store data in digital format. (If you have used Excel or another spreadsheet software before, data frames will be familiar to you: they look a lot like a sheet in a spreadsheet.) A data frame can accommodate large amounts of information (several billion individual items). The data can be numeric, character, logical, and so on. Each grid cell in a data frame has an address that can be identified based on the row and column it belongs to. R can use these addresses to perform mathematical operations. R labels columns by name and rows numerically (or, less commonly, by name). To illustrate a data frame, let us first create the following vectors, which include names (character class), populations (numeric class), average salaries (numeric class), and coordinates (numeric class) of some cities: # c() is a function to concatenate, that is, to put values in a vector Name <- c('Hamilton','Waterloo','Toronto') Population <- c(551751, 219153, 2731571) AvgSalary <- c(45692, 57625, 48920) Latitude <- c(43.255203, 43.4668, 43.6532) Longitude <- c(-79.843826, -80.51639, -79.3832) Again, note that <- is an assignment. In other words, it assigns the item on the right to the name on the left. After you execute the chunk of code above, you will notice that new values appear in your Environment. These are five vectors of length 3 (shown as [1:3] in the Environment). You can also see the class of each vector: one is composed of alphanumeric information (chr, for ‘character’) and four are numeric (num). These vectors can be collected in a data frame. This is done for convenience, so we know that all these data belong together in some way. Please note that to create a data frame, the vectors must have the same length. In other words, you cannot create a table with elements that have different numbers of rows (other data types allow you to do this, but not data frames). We will now create a data frame. We will call it “Cities”. There are rules for names (for example, they cannot begin with a number), but in most cases it helps if the names are intuitive and easy to remember. The function used to create a data frame is data.frame() and the arguments are the vectors that we wish to collect there. Cities <- data.frame(Name, Population, AvgSalary, Latitude, Longitude) After running the chunk above, you now have a new object in your environment, namely a data frame called Cities. If you double-click on Cities in the Environment tab, you will see that this data frame has five columns (labeled Name, Population, AvgSalary, Latitude, and Longitude), and three rows. You can enter data into a data frame and then use the many built-in functions of R to perform various types of analysis. At this point, you may notice that Name, which was an alphanumeric vector, was converted to a factor in the data frame. A factor (data class) is a way to store nominal/categorical variables that may have two or more levels. Nominal variables are like labels. In the present case, the factor variable has three levels, corresponding to three cities. If we had information for multiple years, each city might appear more than once, for each year that information was available. 2.6 Indexing and Data Transformations Data frames store information that is related in a compact way. To perform operations effectively, it is useful to understand the way R locates information in a data frame. As noted before, each grid cell has an address, or in other words an index, that can be referenced in several convenient ways. For instance, assume that you wish to reference the first value of the data frame, that is, row 1 of column Name. To do this, you would use the following instruction: # To index elements in a data frame we use square brackets `[]`; the first number in the square brackets is the row, and the second number (separated by a comma) is the column Cities[1,1] ## [1] Hamilton ## Levels: Hamilton Toronto Waterloo This will recall the element in the first row and first column of Cities. It also tells you what the levels of this variable are. 
As an alternative, you could type: Cities$Name[1] ## [1] Hamilton ## Levels: Hamilton Toronto Waterloo As you see, this has the same effect. The dollar sign $ is used to reference columns in a data frame. Therefore, R will call the first element of Name in data frame Cities. Cities[2,1] is identical to Cities$Name[2]. Try changing the code in the chunk and executing. If you type Cities$Name, R will recall the full column. Indexing is useful to conduct operations. Suppose, for instance, that you wished to calculate the total population of two cities, say Hamilton and Waterloo. You can execute the following instructions: # The dollar sign `$` is used to make reference to a column in the data frame. The square brackets index the row in the column. Cities$Population[1] + Cities$Population[2] ## [1] 770904 (More involved indexing is also possible, for example, if we use logical operators. Do not worry too much about the details at this point, just verify that the results are identical.) # The indexing now is a logical statement. The double equal sign `==` is used to make logical comparisons. `R` will find the rows for which `Cities$Name=='Hamilton'` is true in the first element of the sum, and the rows for which `Cities$Name=='Waterloo'` is true in the second element of the sum. Cities$Population[Cities$Name=='Hamilton'] + Cities$Population[Cities$Name=='Waterloo'] ## [1] 770904 Suppose that you wanted to calculate the total population of the cities in your data frame. To do this, you would use the function sum(): # `sum()` is a function to add all elements in a numerical vector - which could be a column in a data frame sum(Cities$Population) ## [1] 3502475 You have already seen how R allows you to store in memory the results of some instruction, by means of an assignment <-. You can also perform many other useful operations. For instance, calculate the maximum value for a set of values: # `max()` finds the maximum value in a numerical vector max(Cities$Population) ## [1] 2731571 And, if you wanted to find which city is the one with the largest population, you would use a logical statement as an index: # `R` will find all rows for which the statement `Cities$Population==max(Cities$Population)` is true, that is, all the rows with a population identical to the maximum population! Cities$Name[Cities$Population==max(Cities$Population)] ## [1] Toronto ## Levels: Hamilton Toronto Waterloo As you see, Toronto is the largest city (by population) in this dataset. Using indexing in imaginative ways provides a way to do fairly sophisticated data analysis. Likewise, the function for finding the minimum value for a set of values is min(): # `min()` finds the minimum value in a numerical vector min(Cities$Population) ## [1] 219153 Try calculating the average of the population of the cities, using the command mean(). Use the empty chunk below for this (the result should be 1167492; a possible answer appears below), or do this in your console in RStudio: Finding the maximum and minimum, aggregating (calculating the sum of a series of values), and finding the average are examples of transformations applied to the data. They give insights into aspects of the dataset that are not necessarily evident from the raw data, especially if the number of observations (or cases) is large. Imagine trying to visually scan a spreadsheet with ten thousand observations to find the maximum value stored there! 
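One possible answer to the mean() exercise above (a minimal sketch; mean() is part of base R):
# `mean()` calculates the average of the values in a numerical vector
mean(Cities$Population)
## [1] 1167492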
2.7 Visualization The data frame, in essence a table, informative as it is, is not usually the best way to learn from the data. Transformations (or descriptive statistics as discussed above) are helpful to understand important properties of a dataset. In addition, visualization is often a valuable complement to data analysis. Say we are interested in finding which city has the largest population and which city has the smallest population in a dataset. We could achieve this by using instructions similar to the ones before, for example: # `paste()` is similar to `print()`, except that it converts everything to characters before printing. We use this function because the contents of `Name` in the data frame `Cities` are not characters, but levels of a factor. paste('The city with the largest population is',Cities$Name[Cities$Population==max(Cities$Population)]) ## [1] "The city with the largest population is Toronto" paste('The city with the smallest population is', Cities$Name[Cities$Population==min(Cities$Population)]) ## [1] "The city with the smallest population is Waterloo" Another, perhaps more convenient, way of understanding these data is by visualizing them, using for instance a bar chart. We will proceed to create a bar chart, using a package called ggplot2. This package implements a grammar of graphics, and is a very flexible way of creating plots in R. Since ggplot2 is a package, we first must ensure that it is installed. You can install it using the command install.packages() as follows: # Once you have installed a package, it does not need to be installed again! It already is in your library and you only need to load it with `library()` install.packages("ggplot2") As an alternative to the install.packages() function, you can use the Packages tab in RStudio. Simply navigate to the tab, click Install, and select ggplot2 from there. Note that you need to install the package only once! Essentially, install.packages() adds it to your library of packages, where it will remain available. Once the package is installed, it becomes available, but to use it you must load it in memory (similar to opening a “book” on your desktop as you work). For this, we use the command library(), which is used to load a package, that is, to activate it for use. Assuming that you already have installed ggplot2, we proceed to load it: library(ggplot2) Now all commands from the ggplot2 package are available to you. The package ggplot2 works by layering a series of objects, beginning with a blank plot, to which we can add things. The command to create a plot is ggplot(). This command accepts different arguments. For instance, we can pass data to it in the form of a data frame. We can also indicate different aesthetic values, that is, the things that we wish to plot. None of this is plotted, though, until we indicate which kind of geom or geometric object we wish to plot. For a bar chart, we would use the following instructions: # The function `ggplot()` creates an object for plotting, using a data frame as indicated by the input argument `data =`. Furthermore, we can specify how to map elements in the data frame to things in the plot. In this example, we wish to map the names of cities to the x-axis of the plot, and the population to the y-axis of the plot. Accordingly, we set the aesthetic values `aes()` to `x = Name` and `y = Population`. 
The geometric object that we wish to plot is bars, so we use `geom_bar()` with the argument `stat = "identity"` so the data are not transformed before plotting: ggplot(data = Cities, aes(x = Name, y = Population)) + geom_bar(stat = "identity") Since this is the first time that we use ggplot(), it is informative to break down these instructions. We are asking ggplot2 to create a plot that will use the data frame Cities. Furthermore, we tell it to use the values of Name on the x-axis, and the values of Population on the y-axis. Run the following chunk: ggplot(data = Cities, aes(x = Name, y = Population)) Notice how ggplot2 creates a blank plot, and has yet to actually render any of the population information there. We layer elements on a plot by using the + sign. It is only when we tell the package to add some geometric element that it renders something on the plot. In the previous case, we told ggplot2 to draw bars (by using the geom_bar() function). The argument of geom_bar was stat = 'identity', to indicate that the data for the y-axis were to be used ‘as-is’ without further statistical transformations. There are many different geoms that can be used in ggplot2. You can always consult the help/tutorial files by typing ??ggplot2 in the console. See: ??ggplot2 2.8 Creating a Simple Map We will see how maps are used in spatial statistical analysis. The simplest one that can be created is a so-called dot map. A dot map simply displays the locations of events of interest, as points. A dot map is, in fact, simply a scatterplot of the coordinates of events. We can use ggplot2 to create a simple dot map of the cities in our sample dataset. For this, we create a ggplot2 object, and for the x and y aesthetics we use the coordinates. The geometric element that we want to render is a point: # The longitude is mapped to the x-axis of the plot and the latitude is mapped to the y-axis of the plot. The function `geom_point()` is used to draw points: ggplot(data = Cities, aes(x = Longitude, y = Latitude)) + geom_point() This is a dot map that simply shows the locations of the cities. We can add labels by means of the geometric element text: # `geom_text()` is used to write text on the plot, still using the longitude and latitude information: ggplot(data = Cities, aes(x = Longitude, y = Latitude)) + geom_point() + geom_text(aes(label = Name)) The dot map above tells us the locations of the cities in our data frame and their names. We can include more information in the plot in different ways. For example, a proportional symbol map changes the size of the symbols (the points) to add information to the plot. To create a proportional symbol map, we add to the aesthetics the instruction to use some variable for the size of the symbols: # The `size` of the points will be proportional to the `Population` values in the data frame ggplot(data = Cities, aes(x = Longitude, y = Latitude)) + geom_point(aes(size = Population)) + geom_text(aes(label = Name)) Furthermore, we can fix the position of the labels by adding a vertical justification to the text (vjust), and to prevent the text from being cut off we can also expand the limits of the plot (expand_limits()): ggplot(data = Cities, aes(x = Longitude, y = Latitude)) + geom_point(aes(size = Population)) + geom_text(aes(label = Name), vjust = 2) + expand_limits(x = c(-80.7, -79.2), y = c(43.2, 43.7)) The example above has guided you in the creation of a relatively simple proportional symbols map! 
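If you would like to keep the map as an image file, one option is ggplot2's ggsave() function, which writes the most recently displayed plot to disk (a hedged sketch; the file name used here is only an example):
# Save the last plot that was displayed; width and height are given in inches by default
ggsave("cities_proportional_symbols.png", width = 6, height = 4)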
You can see that creating a plot is simply a matter of giving R (through ggplot2) a series of instructions: Create a ggplot2 object using a dataset, which will render stuff at locations given by variable1 and variable2: ggplot(data = dataset, aes(x = variable1, y = variable2)) Add stuff to the plot. For instance, to add points use geom_point, to add lines use geom_line, and so on. Check the ggplot2 Cheat Sheet for more information on how to use this package. A last note. Many other visualization alternatives (for instance, Excel) provide point-and-click functions for creating plots. In contrast, ggplot2 in R requires that the plot be created by meticulously instructing the package what to do. While this is more laborious, it also means that you have complete control over the creation of plots, which in turn allows you to create more flexible and inventive visuals. Below are some of the figures that I have created using R in recent years, including diagrams, thematic maps, and raster data. Figure 2.1: Example of visualization: diagram of catchment areas for accessibility analysis (from Paez, Higgins, and Vivona 2018) Figure 2.2: Example of visualization: accessibility to family doctors in Hamilton (from Paez, Higgins, and Vivona 2018) Figure 2.3: Example of visualization: water sources (triangles) and households (circles) in a region in central Kenya (from Paez et al. 2020) This concludes your overview of basic operations and data structures in R. You will have an opportunity to learn more about creating maps in R with your reading. "],
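As a compact illustration of the layering pattern summarized at the close of this chapter, the sketch below reuses the Cities data frame; the line layer has no geographical meaning and is included only to show how layers stack:
# First create the plot with coordinates mapped to the axes, then add layers with `+`
ggplot(data = Cities, aes(x = Longitude, y = Latitude)) +
  geom_line() +                       # a line layer joining the three cities (illustration only)
  geom_point(aes(size = Population))  # a point layer drawn on top of the line, sized by population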
["introduction-to-mapping-in-r.html", "Chapter 3 Introduction to Mapping in R 3.1 Learning Objectives 3.2 Suggested Readings 3.3 Preliminaries 3.4 Packages 3.5 Exploring Dataframes and a Simple Proportional Symbols Map 3.6 Improving on the Proportional Symbols Map 3.7 Some Simple Spatial Analysis 3.8 Other Resources", " Chapter 3 Introduction to Mapping in R NOTE: You can download the source files for this book from here. The source files are in the format of R Notebooks. Notebooks are pretty neat, because the allow you execute code within the notebook, so that you can work interactively with the notes. Spatial statistics is a sub-field of spatial analysis that has grown in relevance in recent years as a result of 1) the availability of information that is geo-coded, in other words, that has geographical references; and 2) the availability of software to analyze such information. A key technology fuelling this trend is that of Geographical Information Systems (GIS). GIS are, in simplest terms, digital mapping for the 21st century. In most cases, however, GIS go beyond cartographic functions to also enable and enhance our ability to analyze data. There are many available packages for geographical information analysis. Some are very user friendly, and widely available in many institutional contexts, such as ESRI’s Arc software. Others are fairly specialized, such as Caliper’s TransCAD, which implements many operations of interest for transportation engineering and planning. Others packages have the advantage of being more flexible and/or free. Such is the case of the R statistial computing language. R has been adopted by many in the spatial analysis community, and a number of specialized libraries have been developed to support mapping and spatial data analysis functions. The objective of this note is to provide an introduction to mapping in R. Maps are one of the fundamental tools of spatial statistics and spatial analysis, and R allows for many GIS-like functions. If you wish to work interactively with this chapter you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. In the previous reading/practice you created a simple proportional symbols map. In this reading/practice you will learn how to create more sophisticated maps in R. 3.1 Learning Objectives In this reading, you will: Revisit how to install and load a package. Learn how to invoke a data and view the data structure. Learn how to easily create maps using R. Think about how statistical maps help us understand patterns. 3.2 Suggested Readings Bivand RS, Pebesma E, Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapters 2-3. Springer: New York Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 3. Sage: Los Angeles 3.3 Preliminaries It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: # The function `ls()` lists all objects in the Environment, that is, your current workspace; `rm()` removes all objects listed in the argument `list = ` rm(list = ls()) 3.4 Packages According to Wickham (2015) packages are the basic units of reproducible code in the R multiverse. Now that your workspace is clear, you can proceed to load a package. 
In this case, the package is the one used for this book/course, called geog4ga3: #The function 'library' is used to load the package we want to work with. In this case, it is the book's companion package geog4ga3 library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' The package includes a few datasets that will be used throughout the book: #The function 'data' is used to load a dataset from an installed package. In this case, we are loading 'snow_deaths' data("snow_deaths") 3.5 Exploring Dataframes and a Simple Proportional Symbols Map If you correctly loaded the library, you can now access the dataframes in the package geog4ga3. For this section, you will need two dataframes, namely snow_pumps and snow_deaths: #The function 'head' will display the first few rows of the dataframe snow_deaths head(snow_deaths) ## long lat Id Count ## 0 -0.1379301 51.51342 1 3 ## 1 -0.1378831 51.51336 2 2 ## 2 -0.1378529 51.51332 3 1 ## 3 -0.1378120 51.51326 4 1 ## 4 -0.1377668 51.51320 5 4 ## 5 -0.1375369 51.51318 6 2 These data are from the famous London cholera example by John Snow (not the one from Game of Thrones, but the British physician). John Snow is considered the father of spatial epidemiology, and his study mapping the outbreak is credited with helping find its cause. This study investigates the cholera outbreak of Soho, London, in 1854. The dataframe snow_deaths includes the geocoded addresses of cholera deaths in long and lat, and the number of deaths (the Count) recorded at each address, as well as unique identifiers for the addresses (Id). A second dataframe snow_pumps includes the geocoded locations of water pumps in Soho: head(snow_pumps) ## long lat Id Count ## 01 -0.1366679 51.51334 251 1 ## 1100 -0.1395862 51.51388 252 1 ## 250 -0.1396710 51.51491 253 1 ## 310 -0.1316299 51.51235 254 1 ## 410 -0.1335944 51.51214 255 1 ## 510 -0.1359191 51.51154 256 1 As in your previous reading, it is possible to map the cases using ggplot2. Begin by loading the package tidyverse: #'tidyverse' is a collection of R packages designed for data science and used in everyday data analyses library(tidyverse) Now, you can create a blank ggplot2 object from which you can render the points for deaths and the pumps. #The function 'ggplot' is used for data visualization - it creates a graph. The function 'geom_point' tells R you want to create a plot of points. 'data = snow_deaths' tells R you want to use the 'snow_deaths' dataframe. 'aes' stands for the aesthetics of your graph, where 'x = long' sets the x axis to 'long', 'y = lat' sets the y axis to 'lat', 'color = "blue"' colours the points blue, and 'shape = 16' assigns the shape of the points - in this case, '16' is a circle and '17' is a triangle ggplot() + geom_point(data = snow_deaths, aes(x = long, y = lat), color = "blue", shape = 16) + geom_point(data = snow_pumps, aes(x = long, y = lat), color = "black", shape = 17) This map is a decent example of how to visually represent some of the contents of the dataframe. Here, information is displayed using different colours and symbols to represent pumps and deaths from the London cholera example. Though this map provides useful insights, it is not of the greatest quality. We will illustrate other ways of creating maps below, including interactive maps. 
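Before moving to interactive maps, one optional refinement of the static map above (a hedged sketch, assuming the long, lat, and Count columns shown in the dataframes): because snow_deaths records a Count of deaths at each address, mapping Count to the size aesthetic turns the dot map into a proportional symbols map.
# Size the death symbols by the number of deaths at each address; alpha makes overlapping points easier to see
ggplot() +
  geom_point(data = snow_deaths, aes(x = long, y = lat, size = Count), color = "blue", shape = 16, alpha = 0.5) +
  geom_point(data = snow_pumps, aes(x = long, y = lat), color = "black", shape = 17)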
3.6 Improving on the Proportional Symbols Map A package that extends the functionality of mapping in R is leaflet. A key feature of the leaflet package is the ability to make maps interactive for the user. We will see next how to enhance our proportional symbol map using this package. First you need to load the package (you need to install it first if you have not already): # 'leaflet' is a package used for visualizing data on a map in R. 'magrittr' is a package that provides the pipe operator #install.packages('leaflet') # Run only if you have not yet installed `leaflet`! #install.packages('magrittr') # Run only if you have not yet installed `magrittr`! library(leaflet) library(magrittr) The first step is to create a leaflet object, which will be saved in m: # Here, we create a `leaflet` object and assign it to the variable 'm'. The 'setView' function sets the view of the map, where 'lng = -0.136' sets the longitude, 'lat = 51.513' sets the latitude, and the map zoom is set to 16. The '%>%' is a pipe operator that passes the output from the left hand side of the operator to the first argument of the right hand side of the operator. In this case we are telling `R` that we want to center the map on the set longitude and latitude, with a zoom level of 16, which corresponds roughly to a neighborhood m <- leaflet(data = snow_deaths) %>% setView(lng = -0.136, lat = 51.513, zoom = 16) The map looks like this at this point: m The map is empty! This is because we have not yet added any geographical information to plot. We can begin by adding a basemap as follows: # We are adding a basemap or background map of the study location by means of the `addTiles` function to the 'm' variable m <- m %>% addTiles() m The map now shows the neighborhood in Soho where the cholera outbreak happened. Now, at long last, we can add the cases of cholera deaths to the map. For this, we indicate the coordinates (preceded by ~), and set an option for clustering by means of the clusterOptions in the following fashion: # We are adding the cholera deaths to the map using 'group = Deaths'. The '~' symbol tells R to take the 'long' and 'lat' values from the dataframe that was passed to the leaflet object, and 'clusterOptions = markerClusterOptions()' clusters a large number of markers on the map - in this case it is clustering the deaths into icons with numbers m <- m %>% addMarkers(~long, ~lat, clusterOptions = markerClusterOptions(), group = "Deaths") m The map now displays the locations of cholera deaths. If you zoom in, the clusters will rearrange accordingly. Try it! The other information that we have available is the location of the water pumps, which we can add to the map above (notice that the Broad Street Pump is already shown in the basemap!): m %>% addMarkers(data = snow_pumps, ~long, ~lat, group = "Pumps") An alternative and quicker way to obtain the same map is to chain all of these steps using pipe operators (%>%). These operators make writing code a lot faster, easier to read, and more intuitive! Recall that a pipe operator will take the output of the preceding function, and pass it on as the first argument of the next: m_test <- leaflet() %>% setView(lng = -0.136, lat = 51.513, zoom = 16) %>% addTiles() %>% addMarkers(data = snow_deaths, ~long, ~lat, clusterOptions = markerClusterOptions(), group = "Deaths") %>% addMarkers(data = snow_pumps, ~long, ~lat, group = "Pumps") m_test The above results in a much nicer map. Is this map informative? 
What does it tell you about the incidence of cholera and the location of the pumps? 3.7 Some Simple Spatial Analysis Despite the simplicity of this map, we can begin to do some spatial analysis. For instance, we could create a heatmap. You have probably seen heatmaps in many different situations before, as they are a popular visualization tool. Heatmaps are created based on a spatial analytical technique called kernel analysis. We will cover this technique in more detail later on. For the time being, it can be illustrated by taking advantage of the leaflet.extras package, which contains a heatmap function. Load the package as follows: #install.packages("leaflet.extras") # Run only if you have not installed `leaflet.extras` yet! library(leaflet.extras) Next, create a second leaflet object for this example, and call it m2. Notice that we are using the same setView parameters: m2 <- leaflet(data = snow_deaths) %>% setView(lng = -0.136, lat = 51.513, zoom = 16) %>% addTiles() Then, add the heatmap. The function used to do this is addHeatmap. We specify the coordinates and the variable for the intensity (i.e., each case in the dataframe is representative of Count deaths at the address). Two parameters are important here, the blur and the radius. If you are working with the R notebook version of the book, experiment changing these parameters: # The 'addHeatmap' function creates a heat map. We specify the coordinates, same as in the block of code above. The 'intensity' argument sets a numeric weight for each point, 'blur' specifies the amount of blur to apply, and 'radius' sets the radius of each point on the heatmap m2 %>% addHeatmap(lng = ~long, lat = ~lat, intensity = ~Count, blur = 40, max = 1, radius = 25) Lastly, you can also add markers for the pumps as follows: m2 %>% addHeatmap(lng = ~long, lat = ~lat, intensity = ~Count, blur = 40, max = 1, radius = 25) %>% addMarkers(data = snow_pumps, ~long, ~lat, group = "Pumps") And everything together: m2_test <- leaflet(data = snow_deaths) %>% setView(lng = -0.136, lat = 51.513, zoom = 16) %>% addTiles() %>% addHeatmap(lng = ~long, lat = ~lat, intensity = ~Count, blur = 40, max = 1, radius = 25) %>% addMarkers(data = snow_deaths, ~long, ~lat, clusterOptions = markerClusterOptions(), group = "Deaths") %>% addMarkers(data = snow_pumps, ~long, ~lat, group = "Pumps") m2_test A heatmap (essentially a kernel density of spatial points; more on this in a later chapter) makes it very clear that most cases of cholera happened in the neighborhood of one (possibly contaminated) water pump! At the time, Snow noted with respect to this geographical pattern that: “It will be observed that the deaths either very much diminished, or ceased altogether, at every point where it becomes decidedly nearer to send to another pump than to the one in Broad street. It may also be noticed that the deaths are most numerous near to the pump where the water could be more readily obtained.” Snow’s analysis helped to convince officials to close the pump, after which the cholera outbreak subsided. This illustrates how even some relatively simple spatial analysis can help to inform public policy and even save lives. You can read more about this case here. In this practice you have learned how to implement some simple mapping and spatial statistical analysis using R. In future readings we will further explore the potential of R for both. 3.8 Other Resources For more information on the functionality of leaflet, please check Leaflet for R References "],
["activity-1-statistical-maps-i.html", "Chapter 4 Activity 1: Statistical Maps I 4.1 Housekeeping Questions 4.2 Learning Objectives 4.3 Preliminaries 4.4 Creating a simple thematic map 4.5 Activity", " Chapter 4 Activity 1: Statistical Maps I Remember, you can download the source file for this activity from here. 4.1 Housekeeping Questions Answer the following questions: What are the office hours of your instructor this term? How are assignments graded? What is the policy for late assignments in this course? 4.2 Learning Objectives In this activity you will: Discuss statistical maps and what makes them interesting. 4.3 Preliminaries In the practice that preceded this activity, you used ggmap to create a proportional symbol map, a mapping technique used in spatial statistics for visualization of geocoded event information. As well, you implemented a simple technique called kernel analysis to the map to explore the distribution of events in the case of the cholera outbreak of Soho in London in 1854. Geocoded events are often called point patterns, so with the cholera data you were working with a point pattern. In this activity, we will map another type of spatial data, called areal data. Areas are often administrative or political jurisdictions. For this activity you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity: library(tidyverse) library(sf) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' 4.4 Creating a simple thematic map If you successfully loaded package geog4ga3 a dataset called HamiltonDAs should be available for analysis: data(HamiltonDAs) Check the class of this object: class(HamiltonDAs) ## [1] "sf" "data.frame" As you can see, this is an object of class sf, which stands for simple features. Objects of this class are used in the R package sf (see here) to implement standards for spatial objects. You can examine the contents of the dataset by means of head (which will show the top rows): head(HamiltonDAs) ## Simple feature collection with 6 features and 7 fields ## geometry type: MULTIPOLYGON ## dimension: XY ## bbox: xmin: 563306.2 ymin: 4777681 xmax: 610844.5 ymax: 4793682 ## epsg (SRID): 26917 ## proj4string: +proj=utm +zone=17 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs ## ID GTA06 VAR1 VAR2 VAR3 VAR4 VAR5 ## 1 2671 5030 0.74650172 0.2596975 0.6361925 0.2290084 0.7223464 ## 2 2716 5077 0.78107142 0.4413119 0.5690740 0.8997258 0.4163702 ## 3 2710 5071 0.78824936 0.4632757 0.4197216 0.1619401 0.3052948 ## 4 2745 5108 0.82064933 0.6365193 0.9504535 0.4992477 0.6046399 ## 5 2810 5177 0.09131849 0.4455965 0.3539603 0.4919869 0.6366968 ## 6 2740 5103 0.22257665 0.6288826 0.1341962 0.6635202 0.4429712 ## geometry ## 1 MULTIPOLYGON (((605123.4 47... ## 2 MULTIPOLYGON (((606814 4784... ## 3 MULTIPOLYGON (((605293 4785... ## 4 MULTIPOLYGON (((607542.7 47... ## 5 MULTIPOLYGON (((564681.8 47... ## 6 MULTIPOLYGON (((574373.4 47... 
Or obtain the summary statistics by means of summary: summary(HamiltonDAs) ## ID GTA06 VAR1 VAR2 VAR3 ## 2299 : 1 4050 : 1 Min. :0.0000 Min. :0.0000 Min. :0.0000 ## 2300 : 1 4051 : 1 1st Qu.:0.3680 1st Qu.:0.3800 1st Qu.:0.3521 ## 2301 : 1 4052 : 1 Median :0.5345 Median :0.4937 Median :0.5699 ## 2302 : 1 4053 : 1 Mean :0.5241 Mean :0.4966 Mean :0.5548 ## 2303 : 1 4054 : 1 3rd Qu.:0.6938 3rd Qu.:0.6091 3rd Qu.:0.7378 ## 2304 : 1 4055 : 1 Max. :1.0000 Max. :1.0000 Max. :1.0000 ## (Other):291 (Other):291 ## VAR4 VAR5 geometry ## Min. :0.0000 Min. :0.0000 MULTIPOLYGON :297 ## 1st Qu.:0.2989 1st Qu.:0.2998 epsg:26917 : 0 ## Median :0.5476 Median :0.4810 +proj=utm ...: 0 ## Mean :0.5325 Mean :0.5001 ## 3rd Qu.:0.7894 3rd Qu.:0.6915 ## Max. :1.0000 Max. :1.0000 ## The above will include a column for the geometry of the spatial features. The dataframe includes all Dissemination Areas (or DAs for short) for the Hamilton Census Metropolitan Area in Canada. DAs are a type of geography used by the Census of Canada, in fact the smallest geography that is publicly available. To create a simple map we can use ggplot2, which we previously used to map points. Now, the geom for objects of class sf can be used to plot areas. To create such a map, we layer a geom object of type sf on a ggplot2 object. For instance, to plot the DAs: #head(HamiltonDAs) ggplot(HamiltonDAs) + geom_sf(fill = "gray", color = "black", alpha = .3, size = .3) We selected the fill color “gray” and the line color “black” for the polygons, with a transparency alpha = 0.3 (alpha = 0 is completely transparent, alpha = 1 is completely opaque, try it!), and line size 0.3. This map only shows the DAs, which is nice. However, as you saw in the summary of the dataframe above, in addition to the geometric information, a set of (generic) variables is also included, called VAR1, VAR2,…, VAR5. Thematic maps can be created using these variables. The next chunk of code plots the DAs and adds thematic information. The fill argument is used to select a variable to color the polygons. The function cut_number is used to classify the values of the variable in \\(k\\) groups of equal size, in this case 5 (notice that the lines of the polygons are still black). The scale_fill_brewer function can be used to select different palettes or coloring schemes: ggplot(HamiltonDAs) + geom_sf(aes(fill = cut_number(HamiltonDAs$VAR1, 5)), color = "black", alpha = 1, size = .3) + scale_fill_brewer(palette = "Reds") + coord_sf() + labs(fill = "Variable") Now that you have seen how to create a thematic map with polygons (areal data), you are ready for the following activity. 4.5 Activity NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to organize data, create a plot, and so on in support of analysis and interpretation. These tasks are indicated by a star (*). Create thematic maps for variables VAR1 through VAR5 in the dataframe HamiltonDAs. Remember that you can introduce new chunks of code. Imagine that these maps were found, and for some reason the variables were not labeled. They may represent income, or population density, or something else. Which of the five maps you just created is most interesting? Rank the five maps from most to least interesting. Explain the reasons for your ranking. "],
["mapping-in-r-continued.html", "Chapter 5 Mapping in R: Continued 5.1 Learning Objectives 5.2 Suggested Readings 5.3 Preliminaries 5.4 Summarizing a Dataframe 5.5 Factors 5.6 Subsetting Data 5.7 Pipe Operator 5.8 More on Visualization", " Chapter 5 Mapping in R: Continued NOTE: You can download the source files for this book from here. The source files are in the format of R Notebooks. Notebooks are pretty neat, because the allow you execute code within the notebook, so that you can work interactively with the notes. In the preceding chapters, you were introduced to the following concepts: Basic operations in R. These include arithmetic and logical operations, among others. Data classes in R. Data can be numeric, characters, logical values, etc. Data types in R. Ways to store data, for instance as vector, matrix, dataframes, etc. Indexing. Ways to retrieve information from a data frame by referring to its location therein. Creating simple maps in R. Please review the previous practices if you need a refresher on these concepts. If you wish to work interactively with this chapter you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. 5.1 Learning Objectives In this reading, you will learn: How to quickly summarize the descriptive statistics of a dataframe. More about factors. Factors are a class of data that is used for categorical data. For instance, a parcel may be categorizes as developed or undeveloped; a plot of land may be zoned for commercial, residential, or industrial use; a sample may be mineral x or y. These are not quantities but rather reflect a quality of the entity that is being described. How to subset a dataset. Sometimes you want to work with only a subset of a dataset. This can be done using indexing with logical values, or using specialized functions. More on the use of pipe operators. A pipe operator allows you to pass the results of a function to another function. It makes writing instructions more intuitive and simple. You have already seen pipe operators earlier: they look like this %>%. You will add layers to a ggplot object to improve a map. 5.2 Suggested Readings Bivand RS, Pebesma E, Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapters 2-3. Springer: New York. Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 3. Sage: Los Angeles. O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapters 1-3. John Wiley & Sons: New Jersey. 5.3 Preliminaries As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity: library(tidyverse) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' Now that your workspace is clear, you can proceed to invoke the sample dataset. You can do this by means of the function data. data("missing_df") The dataframe missing_df includes \\(n = 65\\) observations (Note: text between $ characters is mathematical notation in LaTeX). 
These observations are geocoded using a false origin and coordinates normalized to the unit-square (the extent of their values is between zero and one). The coordinates are x and y. In addition, there are three variables associated with the locations (VAR1, VAR2, VAR3). The variables are generic. Feel free to think of them as housing prices, concentrations in ppb of some contaminant, or any other variable that will help clarify your understanding. Finally, a factor variable states whether the variables were measured for a location: if the status is “FALSE”, the values of the variables are missing. 5.4 Summarizing a Dataframe Obtaining a set of descriptive statistics for a dataframe is very simple thanks to the function summary. For instance, the summary of missing_df is: # `summary()` reports basic descriptive statistics of columns in a data frame summary(missing_df) ## x y VAR1 VAR2 ## Min. :0.01699 Min. :0.01004 Min. : 50.0 Min. : 50.0 ## 1st Qu.:0.22899 1st Qu.:0.19650 1st Qu.: 453.3 1st Qu.: 570.1 ## Median :0.41808 Median :0.50822 Median : 459.1 Median : 574.4 ## Mean :0.49295 Mean :0.46645 Mean : 458.8 Mean : 562.1 ## 3rd Qu.:0.78580 3rd Qu.:0.74981 3rd Qu.: 465.4 3rd Qu.: 594.2 ## Max. :0.95719 Max. :0.98715 Max. :1050.0 Max. :1050.0 ## NA's :5 NA's :5 ## VAR3 Observed ## Min. : 50.0 FALSE: 5 ## 1st Qu.: 630.3 TRUE :60 ## Median : 640.0 ## Mean : 638.1 ## 3rd Qu.: 646.0 ## Max. :1050.0 ## NA's :5 This function reports the minimum, maximum, mean, median, and quartile values of a numeric variable. When variables are characters or factors, their frequency is reported. For instance, in missing_df, there are five instances of FALSE and sixty instances of TRUE. 5.5 Factors A factor describes a category. You can examine the class of a variable by means of the function class. From the summary, it is clear that several variables are numeric. However, for Observed, it is not evident if the variable is a character or factor. Use of class reveals that it is indeed a factor: class(missing_df$Observed) ## [1] "factor" Factors are an important data type because they allow us to store information that is not measured as a quantity. For example, the quality of the cut of a diamond is categorized as Fair < Good < Very Good < Premium < Ideal. Sure, we could store this information as numbers from 1 to 5. However, the quality of the cut is not a quantity, and should not be treated like one. In the dataframe missing_df, the variable Observed could have been coded as 1’s (for missing) and 2’s (for observed), but this does not mean that “observed” is twice the amount of “missing”! In this case, the numbers would not be quantities but labels. Factors in R allow us to work directly with the labels. Now, you may be wondering what it means when the status of a datum’s Observed variable is coded as FALSE. If you check the summary again, there are five cases of NA in the variables VAR1 through VAR3. NA essentially means that the value is missing. Likely, the five NA values correspond to the five missing observations. We can check this by subsetting the data. 5.6 Subsetting Data We subset data when we wish to work only with parts of a dataset. We can do this by indexing. 
For example, we could retrieve the part of the dataframe that corresponds to the FALSE values in the Observed variable: missing_df[missing_df$Observed == FALSE,] ## x y VAR1 VAR2 VAR3 Observed ## 61 0.34 0.83 NA NA NA FALSE ## 62 0.29 0.52 NA NA NA FALSE ## 63 0.13 0.32 NA NA NA FALSE ## 64 0.62 0.10 NA NA NA FALSE ## 65 0.88 0.85 NA NA NA FALSE Data are indexed by means of the square brackets [ and ]. The indices correspond to the rows and columns. The logical statement missing_df$Observed == FALSE selects the rows that meet the condition, whereas leaving a blank for the columns simply means “all columns”. As you can see, the five NA values correspond, as anticipated, to the locations where Observed is FALSE. Using indices is only one of many ways of subsetting data. Base R also has a subset command, which is implemented as follows: subset(missing_df, Observed == FALSE) ## x y VAR1 VAR2 VAR3 Observed ## 61 0.34 0.83 NA NA NA FALSE ## 62 0.29 0.52 NA NA NA FALSE ## 63 0.13 0.32 NA NA NA FALSE ## 64 0.62 0.10 NA NA NA FALSE ## 65 0.88 0.85 NA NA NA FALSE And the package dplyr (part of the tidyverse) has a function called filter: filter(missing_df, Observed == FALSE) ## x y VAR1 VAR2 VAR3 Observed ## 1 0.34 0.83 NA NA NA FALSE ## 2 0.29 0.52 NA NA NA FALSE ## 3 0.13 0.32 NA NA NA FALSE ## 4 0.62 0.10 NA NA NA FALSE ## 5 0.88 0.85 NA NA NA FALSE The three approaches give the same result, but subset and filter are somewhat easier to write. You could nest any of the above approaches as part of another function. For instance, if you wanted to do a summary of the selected subset of the data, you would: summary(filter(missing_df, Observed == FALSE)) ## x y VAR1 VAR2 VAR3 ## Min. :0.130 Min. :0.100 Min. : NA Min. : NA Min. : NA ## 1st Qu.:0.290 1st Qu.:0.320 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA ## Median :0.340 Median :0.520 Median : NA Median : NA Median : NA ## Mean :0.452 Mean :0.524 Mean :NaN Mean :NaN Mean :NaN ## 3rd Qu.:0.620 3rd Qu.:0.830 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA ## Max. :0.880 Max. :0.850 Max. : NA Max. : NA Max. : NA ## NA's :5 NA's :5 NA's :5 ## Observed ## FALSE:5 ## TRUE :0 ## ## ## ## ## Or: summary(missing_df[missing_df$Observed == FALSE,]) ## x y VAR1 VAR2 VAR3 ## Min. :0.130 Min. :0.100 Min. : NA Min. : NA Min. : NA ## 1st Qu.:0.290 1st Qu.:0.320 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA ## Median :0.340 Median :0.520 Median : NA Median : NA Median : NA ## Mean :0.452 Mean :0.524 Mean :NaN Mean :NaN Mean :NaN ## 3rd Qu.:0.620 3rd Qu.:0.830 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA ## Max. :0.880 Max. :0.850 Max. : NA Max. : NA Max. : NA ## NA's :5 NA's :5 NA's :5 ## Observed ## FALSE:5 ## TRUE :0 ## ## ## ## ## Nesting functions makes it difficult to read the code, since functions are evaluated from the innermost to the outermost function, whereas we are used to reading from left to right. Fortunately, R implements (as part of package magrittr which is required by tidyverse) a so-called pipe operator that simplifies things and allows for code that is more intuitive to read. 5.7 Pipe Operator A pipe operator is written this way: %>%. Its objective is to pass forward the output of a function to a second function, so that they can be chained to create more complex instructions that are still relatively easy to read. For instance, instead of nesting the subsetting instructions in the summary function, you could do the subsetting first, and pass the results of that to the summary for further processing. 
This would look like this: # Remember, the pipe operator `%>%` passes the value of the left-hand side to the function on the right-hand side subset(missing_df, Observed == FALSE) %>% summary() ## x y VAR1 VAR2 VAR3 ## Min. :0.130 Min. :0.100 Min. : NA Min. : NA Min. : NA ## 1st Qu.:0.290 1st Qu.:0.320 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA ## Median :0.340 Median :0.520 Median : NA Median : NA Median : NA ## Mean :0.452 Mean :0.524 Mean :NaN Mean :NaN Mean :NaN ## 3rd Qu.:0.620 3rd Qu.:0.830 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA ## Max. :0.880 Max. :0.850 Max. : NA Max. : NA Max. : NA ## NA's :5 NA's :5 NA's :5 ## Observed ## FALSE:5 ## TRUE :0 ## ## ## ## ## The code above is read as “subset missing_df and pass the results to summary”. Pipe operators make writing and reading code somewhat more natural. 5.8 More on Visualization Observations in the sample dataset are georeferenced, and so they can be plotted. Since they are based on false origins and are normalized, we cannot map them to the surface of the Earth. However, we can still visualize their spatial distribution. This can be done by using ggplot2. For instance, for missing_df: # `coord_fixed()` forces the plot to use a ratio of 1:1 for the units in the x- and y-axis; in this case, since the values we are mapping to those axes are coordinates, we wish to represent them using the same scale, i.e., one unit in x looks identical to one unit in y (as an experiment, repeat the plot without fixing the coordinates) ggplot() + geom_point(data = missing_df, aes(x = x, y = y), shape = 17, size = 3) + coord_fixed() The above simply plots the coordinates, so that we can see the spatial distribution of the observations. (Notice the use of coord_fixed to maintain the aspect ratio of the plot to 1, i.e. the relationship between width and height). You have control of the shape of the markers, as well as their size. You can consult the shapes available here. Experiment with different shapes and sizes if you wish. The dataframe missing_df includes more attributes that could be used in the plot. For instance, if you wished to create a thematic map showing VAR1 you would do the following: ggplot() + geom_point(data = missing_df, aes(x = x, y = y, color = VAR1), shape = 17, size = 3) + coord_fixed() The shape and size assignments happen outside of aes, and so are applied equally to all observations. In some cases, you might want to let other aesthetic attributes vary with the values of a variable in the dataframe. For instance, if we let the sizes change with the value of the variable: ggplot() + geom_point(data = missing_df, aes(x = x, y = y, color = VAR1, size = VAR1), shape = 17) + coord_fixed() ## Warning: Removed 5 rows containing missing values (geom_point). Note how there is a warning, saying that five observations were removed because data were missing! These are likely the five locations where Observed == FALSE! To make it clearer which observations these are, you could set the shape to vary according to the value of Observed, as follows: ggplot() + geom_point(data = missing_df, aes(x = x, y = y, color = VAR1, shape = Observed), size = 3) + coord_fixed() Now it is easy to see the locations of the five observations where Observed == FALSE, which appear as grey circles (grey is the default color for missing values of VAR1). 
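If you prefer to pick the symbols yourself instead of relying on the default shape scale, you can add a manual shape scale. This is only a sketch of one option, assigning shapes 16 (circles) and 17 (triangles) to the two levels of Observed:
# `scale_shape_manual()` assigns specific shape numbers to the levels of the variable mapped to `shape` (here, FALSE and TRUE, in that order)
ggplot() + geom_point(data = missing_df, aes(x = x, y = y, color = VAR1, shape = Observed), size = 3) + scale_shape_manual(values = c(16, 17)) + coord_fixed()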
You can change the coloring scheme by means of scale_color_distiller (you can check the different color palettes available here): ggplot() + geom_point(data = missing_df, aes(x = x, y = y, color = VAR1, shape = Observed), size = 3) + scale_color_distiller(palette = "RdBu") + coord_fixed() You may notice that with this coloring scheme some observations become very light and difficult to distinguish from the background. This can be solved in many different ways (for instance, by changing the color of the background!). A simple fix is to add a layer with hollow symbols, as follows: ggplot() + geom_point(data = missing_df, aes(x = x, y = y, color = VAR1), shape = 17, size = 3) + geom_point(data = missing_df, aes(x = x, y = y), shape = 2, size = 3) + scale_color_distiller(palette = "RdBu") + coord_fixed() Finally, you could try subsetting the data to have greater control of the appearance of your plot, for instance: ggplot() + geom_point(data = subset(missing_df, Observed == TRUE), aes(x = x, y = y, color = VAR1), shape = 17, size = 3) + geom_point(data = subset(missing_df, Observed == TRUE), aes(x = x, y = y), shape = 2, size = 3) + geom_point(data = subset(missing_df, Observed == FALSE), aes(x = x, y = y), shape = 18, size = 4) + scale_color_distiller(palette = "RdBu") + coord_fixed() These are examples of creating and improving the aspect of simple symbol maps, which are often used to represent observations in space. References "],
["activity-2-statistical-maps-ii.html", "Chapter 6 Activity 2: Statistical Maps II 6.1 Housekeeping Questions 6.2 Learning objectives 6.3 Suggested reading 6.4 Preliminaries 6.5 Activity", " Chapter 6 Activity 2: Statistical Maps II Remember, you can download the source file for this activity from here. 6.1 Housekeeping Questions Answer the following questions: How many examinations are there in this course? What is the date of the first examination? Where is the office of your instructor? 6.2 Learning objectives In this activity you will: Learn about patterns and processes, including random patterns. Understand the general approach to retrieve a process from a pattern. Discuss the importance of discriminating random patterns. 6.3 Suggested reading O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapters 1-3. John Wiley & Sons: New Jersey. 6.4 Preliminaries For this activity you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity: library(tidyverse) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' Now that your workspace is clear, you can proceed to invoke the datasets required for this activity: data("missing_df") data("PointPattern1") data("PointPattern2") data("PointPattern3") The datasets include the following dataframe which will be used in the first part of the activity: missing_df This dataframe includes \\(n = 65\\) observations (Note: text between $ characters is mathematical notation in LaTeX). These observations are geocoded using a false origin and coordinates normalized to the unit-square (the extent of their values is between zero and one). The coordinates are x and y. In addition, there are three variables associated with the locations (VAR1, VAR2, VAR3). The variables are generic. Feel free to think of them as if they were housing prices or concentrations in ppb of some contaminant. Finally, a factor variable states whether the variables were measured for a location: if the status is “FALSE”, the values of the variables are missing. The following dataframes will be used in the second part of the activity: PointPattern1 PointPattern2 PointPattern3 The dataframes PointPattern* are locations of some generic events. The coordinates x and y are also based on a false origin and are normalized to the unit-square. Feel free to think of these events as cases of flu, the location of trees of a certain species, or the location of fires. 6.5 Activity NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to organize data, create a plot, and so on in support of analysis and interpretation. These tasks are indicated by a star (*). 1.* Create thematic maps for variables VAR1 through VAR3 in the dataframe missing_df. 2.* Plot all three point patterns. Suppose that you were tasked with estimating the value of a variable for the locations where those were not measured. 
For instance, you could be a realtor who needs to assess the value of a property when the only information available is the published values of other properties in the region. Alternatively, you could be an environmental scientist who needs to estimate the concentration of a contaminant at a site, based on previous measurements at other sites in the region. Propose one or more ways to guess those missing values, and explain your reasoning. The approach does not need to be the same for all variables! Imagine that you are a public health official and you need to plan services to the public. If you were asked to guess where the next event would emerge, where would your guess be in each map? Explain your answer. "],
["maps-as-processes-null-landscapes-spatial-processes-and-statistical-maps.html", "Chapter 7 Maps as Processes: Null Landscapes, Spatial Processes, and Statistical Maps 7.1 Learning Objectives 7.2 Suggested Readings 7.3 Preliminaries 7.4 Random Numbers 7.5 Null Landscapes 7.6 Stochastic Processes 7.7 Simulating Spatial Processes 7.8 Processes and Patterns", " Chapter 7 Maps as Processes: Null Landscapes, Spatial Processes, and Statistical Maps NOTE: You can download the source files for this book from here. The source files are in the format of R Notebooks. Notebooks are pretty neat, because the allow you execute code within the notebook, so that you can work interactively with the notes. In last practice your learning objectives were: How to obtain a descriptive summary of a dataframe. Factors and how to use them. How to subset a dataframe. Pipe operators and how to use them. How to improve your maps. Please review the previous practices if you need a refresher on these concepts. If you wish to work interactively with this chapter you will need the following: An R markdown notebook version of this document (the source file). 7.1 Learning Objectives In this chapter, you will learn: How to generate random numbers with different properties. About Null Landscapes. About stochastic processes. How to create new columns in a dataframe using a formula. How to simulate a spatial process. 7.2 Suggested Readings Bivand RS, Pebesma E, Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Analysing Spatial Data (pp. 169-171). Springer: New York. O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 4. John Wiley & Sons: New Jersey. 7.3 Preliminaries As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity: library(tidyverse) 7.4 Random Numbers Colloquially, we understand random as something that happens in an unpredictable manner. The same word in statistics has a precise meaning, as the outcome of a process that cannot be predicted with any form of certainty. The question whether random processes exist is philosophically interesting. In the early stages of the invention of science, there was much optimism that humans could one day understand every aspect of the universe. This notion is well illustrated by Laplace’s Demon, a hypothetical entity that could predict the state of the universe in the future based on an all-encompassing knowledge of the state of the universe at any past point in time (see here). There are two important limitations to this perspective. First, there is the assumption that the mechanisms of operation of phenomena are well understood (in the case of Laplace’s Demon, it was somewhat naively assumed that classical Newtonian mechanics were sufficient). And secondly, the assumption that all relevant information is available to the observer. There are many processes in reality that are not fully understood, which make Laplace’s Demon an interesting, but unreliable source on predicting the state of the universe. Furthermore, there are often constraints in terms of how much and how acurately information can be collected with respect to any given phenomenon. 
Types of Processes A process can be deterministic. However, when limited knowledge or limited information prevents us from being able to make certain predictions, we assume that the process is random. It is important to note that “random” does not mean that just any outcome is possible. For instance, if you flip a coin, there are only two possible outcomes. If you roll a die, there are only six possible outcomes. The concentration of a pollutant cannot be negative. The height of a human adult cannot be zero or 10 meters. And so on. What is random is which of the possible outcomes actually occurs, since there is no mechanism we can use to predict the outcome with certainty. Over time, many formulas have been devised to describe different types of random processes. A random probability distribution function describes the probability of observing different outcomes. For instance, a formula for processes similar to coin flips was discovered by Bernoulli in 1713 (see here). The following function reports a random binomial variable. The number of observations n is how many random numbers we require. The size is the number of trials. For instance, if the experiment was flipping a coin, it would be how many times we get heads in size flips. The probability of success prob is the probability of getting heads in any given toss. Execute the chunk repeatedly to see what happens. #This function simulates the outcome of flipping a coin. Here, we are simulating the result for flipping heads, which has a probability of 0.5. The value of `n` is the number of experiments and `size` is the number of trials in each experiment rbinom(n = 1, size = 1, prob = 0.5) ## [1] 0 It can be noted that although there are only two outcomes, we do not have control over the result of the process, making the result random. If you tried this “experiment” repeatedly, you would find that “heads” (1s) and “tails” (0s) appear each about 50% of the time. A way to implement this is to increase n - think of this as recruiting more people to do coin flips at the same time: n <- 1000 # Number of people tossing the coin one time. coin_flips <- rbinom(n = n, size = 1, prob = 0.5) sum(coin_flips)/n ## [1] 0.505 What happens if you change the size to 0, and why? The binomial function is an example of a discrete probability distribution function, because it can take only one of a discrete (limited) number of values (i.e., 0 and 1). Other random probability distribution functions are for continuous variables, variables that can take any value within a predefined range. The most famous of these distributions is the normal distribution, which you may know also as the bell curve. This probability distribution is attributed to Gauss (see here). The normal distribution is defined by a centering parameter (the mean of the distribution) and a spread parameter (the standard deviation). In the normal distribution, 68% of values are within one standard deviation from the mean, 95% of values are within two standard deviations from the mean, and 99.7% of values are within three standard deviations from the mean. The following function reports a value taken at random from a normal distribution with mean zero and standard deviation sd of one. Execute this chunk repeatedly to see what happens: # This function generates random numbers based on the normal distribution conditional on the given arguments, i.e., the mean and the standard deviation `sd`. 
rnorm(1, mean = 0, sd = 1) ## [1] -0.6578215 Let’s say that the average height of men in Canada is 170.7 cm and the standard deviation is 7 cm. The height of a random person in this population would be: rnorm(1, mean = 170.7, sd = 7) ## [1] 180.092 And the distribution of heights of n men in this population would be: #Creating a data frame using the random numbers generated from n=1000 people. The results in the data frame are then plotted using ggplot. The end result is a distribution of heights of 1000 men. You are able to see which heights are most common out of the sample. n <- 1000 height <- rnorm(n, mean = 170.7, sd = 7) height <- data.frame(height) # `geom_histogram()` is a geometric object in `ggplot2` that represents the frequency of values in a vector as a bar chart ggplot(data = height, aes(x = height)) + geom_histogram() ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. Men shorter than 150 cm would be extremely rare, as would men taller than 190 cm. 7.5 Null Landscapes So what do random variables have to do with maps? Random variables can be used to generate purely random maps. These are called null landscapes or neutral landscapes in spatial ecology (With and King 1997) (Paper is available to download). The concept of null landscapes is quite useful. They provide a benchmark against which to compare statistical maps. Let’s see how to generate a null landscape of events. Suppose that there is a landscape with coordinates in the unit square, that is divided into very small discrete units of land. Each of these units of land can be the location of an event. For example, a tree might be present; or a case of a disease. Let’s first create a landscape. For this, we will use the expand.grid function to find all combinations of two sets of coordinates in the unit interval, using small partitions: # expand.grid creates a set of coordinates by obtaining all the combinations of the input variables. Here, our landscape ranges in the x-axis from 0 to 1, increasing by 0.05, and the y-axis also from 0 to 1, increasing by 0.05 coords <- expand.grid(x = seq(from = 0, to = 1, by = 0.05), y = seq(from = 0, to = 1, by = 0.05)) Now, let’s generate a binomial random variable to go with these coordinates. # `nrow()` returns the number of rows that are present in a data frame. Here, it returns the number of rows in the data frame `coords` events <- rbinom(n = nrow(coords), size = 1, prob = 0.5) We will collect the coordinates and the random variable in a dataframe for plotting: # `data.frame()` collects the inputs in a data frame; they must have the same number of rows null_pattern <- data.frame(coords, events) We can plot the null landscape we just generated as follows: ggplot() + geom_point(data = filter(null_pattern, events == 1), aes(x = x, y = y), shape = 15) + coord_fixed() By changing the probability prob in the function rbinom you can make the event more or less likely, i.e., frequent. If you are working with the notebook version of this document you can try changing the parameters to see what happens. A continuous random variable can also be used to generate a null landscape. For instance, imagine that a group of individuals are asked to stand in formation, and that they arrange themselves purely at random. What would a map of their heights look like? 
First, we will generate a random variable using the same parameters we mentioned above for the height of men in Canada: # heights will be random numbers generated based on the average height of men (170.7 cm) and a standard deviation of 7 cm, with one value for each location in the null landscape "coords" created previously. heights <- rnorm(n = nrow(coords), mean = 170.7, sd = 7) The random values that were generated can be collected in a dataframe with the coordinates for the purpose of plotting: null_trend <- data.frame(coords, heights) One possible map of heights when the individuals stand in formation at random would look like this: # Our plot is created based on the dataframe of coords and heights. The value of `x` is plotted to the x-axis, the value of `y` is plotted to the y-axis, and the color of the points depends on the values of `heights`. We can change the _scale_ of colors by means of `scale_color_distiller()`. There, the palette `Spectral` displays higher values of `heights` (taller men) in red, while lower values of `heights` (i.e., shorter men) appear in blue. More generally, we can control the scale of aesthetic aspects of the plot by means of scale_*something* (scale_shape, scale_size, etc.) ggplot() + geom_point(data = null_trend, aes(x = x, y = y, color = heights), shape = 15) + scale_color_distiller(palette = "Spectral") + coord_fixed() These two examples illustrate only two of many possible techniques to generate null landscapes. We will discuss other strategies to work with null landscapes later in the course. 7.6 Stochastic Processes Some processes are random, such as the ones used above to create null landscapes. These processes take values with some probability, but cannot be predicted with any certainty. Let’s illustrate using again a unit square: # Remember that `expand.grid()` will find all combinations of values in the inputs coords <- expand.grid(x = seq(from = 0, to = 1, by = 0.05), y = seq(from = 0, to = 1, by = 0.05)) Here is an example of a random pattern of events: # Create a random variable and join to the coordinates to generate a null landscape events <- rbinom(n = nrow(coords), size = 1, prob = 0.5) null_pattern <- data.frame(coords, events) # Plot the null landscape you just created ggplot() + geom_point(data = subset(null_pattern, events == 1), aes(x = x, y = y), shape = 15) + coord_fixed() A systematic or deterministic process is one that contains no elements of randomness, and can therefore be predicted with complete certainty. For instance (note the use of xlim to set the extent of x axis in the plot): # Copy the coordinates to a new object deterministic_point_pattern <- coords # `mutate()` adds new variables to a data frame while preserving existing variables. Here, we create a new column in our data frame, called `events`, that will take the value of `x` (the position of an observation along the x-axis) and will `round()` it, i.e., values of 0.5 or less round to zero, and values greater than 0.5 round to one deterministic_point_pattern <- mutate(deterministic_point_pattern, events = round(x)) # Plot the new landscape: `filter()` keeps the rows in a dataframe that meet a condition (for example, that the value of `events` is 1), and discards the rest ggplot() + geom_point(data = filter(deterministic_point_pattern, events == 1), aes(x = x, y = y), shape = 15) + xlim(0, 1) + coord_fixed() In the process above, we used the function round() and the coordinate x. 
The function gives a value of one for all points with x > 0.5, and a value of zero to all points with x <= 0.5. The pattern is fully deterministic: if I know the value of the x coordinate I can predict whether an event will be present. A stochastic process, on the other hand, is a process that is neither fully random nor fully deterministic, but rather a combination of the two. Let’s illustrate: # Copy the coordinates to a new object stochastic_point_pattern <- coords # Here, we combine the function `round()`, which does a deterministic operation, and `rbinom()` to generate a random number stochastic_point_pattern <- mutate(stochastic_point_pattern, events = round(x) - round(x) * rbinom(n = nrow(coords), size = 1, prob = 0.5)) # Plot the new landscape ggplot() + geom_point(data = subset(stochastic_point_pattern, events == 1), aes(x = x, y = y), shape = 15) + xlim(0, 1) + coord_fixed() The process above has a deterministic component (the probability of an event is zero if x <= 0.5), and a random component (the probability of a coordinate being an event is 0.5 when x > 0.5). The landscape is not fully random, but it is not fully deterministic either. Instead, it is the result of a stochastic process, a process that combines deterministic and random elements. 7.7 Simulating Spatial Processes Null landscapes are interesting as a benchmark. More interesting are landscapes that emerge as the outcome of a non-random process - either a systematic/deterministic or stochastic process. Here we will see more ways to introduce a systematic element into a null landscape to simulate spatial processes. Let’s begin with the point pattern, using the same landscape that we used above. We will first copy the coordinates of the landscape to a new dataframe, that we will call pattern1: # Copy the coordinates to a new object, called `pattern1` pattern1 <- coords Next, we will use the function mutate from the dplyr package that is part of the tidyverse. This function adds a column to a data frame that can be calculated using a formula. For instance, we will now make the probability prob of the random binomial number generator a function of the coordinates: # Remember, mutate adds a new column to a data frame. In this example, mutate creates a new column, `events`, using random binomial values; however, notice that the `prob` is not 0.5! Instead, it depends on `x`, the position of the event on the x-axis pattern1 <- mutate(pattern1, events = rbinom(n = nrow(pattern1), size = 1, prob = (x))) Plot this pattern: ggplot() + geom_point(data = subset(pattern1, events == 1), aes(x = x, y = y), shape = 15) + coord_fixed() Since the probability of a “success” in the binomial experiment is proportional to the value of x (the coordinate of the event), now the events are clustered to the right of the plot. The underlying process in this case can be described in simple terms as “the probability of an event increases in the east direction”. In a real process, this could possibly be the result of wind conditions, soil fertility, or other environmental factors that follow a trend. 
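A quick numerical check can help confirm this visual impression. This is only a sketch (the exact counts will change every time the simulation is run), but since prob = x, substantially more events should fall in the eastern half of the region than in the western half:
# Count the simulated events on each side of x = 0.5; with `prob = x`, the half with x > 0.5 should contain noticeably more events
pattern1 %>% group_by(east_half = x > 0.5) %>% summarize(n_events = sum(events))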
Let’s see what happens when we make this probability a function of the y coordinate: # Overwrite `events`; now the probability of success in the random binomial number generator is a function of `y`, the position of the event on the y-axis pattern1 <- mutate(pattern1, events = rbinom(n = nrow(pattern1), size = 1, prob = (y))) # Plot the new events ggplot() + geom_point(data = subset(pattern1, events == 1), aes(x = x, y = y), shape = 15) + coord_fixed() Since the probability of a “success” in the binomial experiment is proportional to the value of y (the coordinate of the event), now the events are clustered towards the top. The probability could be the interaction of the two coordinates: # Now the probability is the product of `x` and `y` pattern1 <- mutate(pattern1, events = rbinom(n = nrow(pattern1), size = 1, prob = (x * y))) # Plot ggplot() + geom_point(data = subset(pattern1, events == 1), aes(x = x, y = y), shape = 15) + coord_fixed() Which of course means that the events cluster in the top-right corner. A somewhat more sophisticated example could make the probability a function of distance from the center of the region: # Copy the coordinates to the object `pattern1` pattern1 <- coords # In this case, `mutate()` creates a new variable, `distance`, which is the straight line distance from the center of the region (at coordinates x = 0.5 and y = 0.5). Now the probability of success in the random binomial number generator depends on this `distance` pattern1 <- mutate(pattern1, distance = sqrt((0.5 - x)^2 + (0.5 - y)^2), events = rbinom(n = nrow(pattern1), size = 1, prob = 1 - exp(-0.5 * distance))) Don’t worry too much about the formula that I selected to generate this process; later we will see different tools to describe spatial processes. In this particular example, I selected a function that makes the probability increase with distance from the center of the region. Plot this pattern: ggplot() + geom_point(data = subset(pattern1, events == 1), aes(x = x, y = y), shape = 15) + coord_fixed() As you would expect, there are few events near the center, and the number of events tends to increase away from the center. To conclude this practice, let’s revisit the example of the people standing in formation. Now, taller people are asked to stand towards the back of the formation (assuming that the back is in the positive direction of the y-axis). As a result of this instruction, the sorting is no longer random, since taller people tend to stand towards the back. However, people are not able to assess the height of each other exactly, so there will be some random variation in the distribution of heights. We can simulate this by making the height a function of position. First, we copy the coordinates to a new dataframe for our trend experiment: trend1 <- coords Again we use mutate to add a column to a data frame that can be calculated using a formula. 
For instance, we will now make the heights a function of the y coordinate: a deterministic trend of taller people towards the back, plus some random person-to-person variation: trend1 <- mutate(trend1, heights = 160 + 20 * y + rnorm(n = nrow(trend1), mean = 0, sd = 7)) If people have a preference for standing next to people of about their same height, and shorter people have a preference for standing near the front, this is a possible map of heights in the formation: ggplot() + geom_point(data = trend1, aes(x = x, y = y, color = heights), shape = 15) + scale_color_distiller(palette = "Spectral") + coord_fixed() As expected, shorter people are towards the “front” (bottom of the plot) and taller people towards the back. It is not a fully deterministic arrangement, since there is still some randomness, but a trend can be clearly appreciated. 7.8 Processes and Patterns O’Sullivan and Unwin (2010) make an important distinction between processes and patterns. A process is like a recipe, a sequence of events or steps, that leads to an outcome, that is, a pattern. You can think of the simulation procedures above as having two components: the process is the formula, function, or algorithm used to simulate a pattern. For instance, a random process could be based on the binomial distribution, whereas a stochastic process would have some deterministic elements in addition to a random component. The pattern is the outcome of the process. In the case of spatial processes, the outcome is typically a statistical map. The procedures in the preceding sections illustrate just a few different ways to simulate spatial processes with the aim of generating statistical maps that display spatial patterns. There are in fact many more ways to simulate spatial processes, and articles (e.g., Geyer and Møller 1994) - and even books (e.g., Moller and Waagepetersen 2003) - have been written on this topic! Simulation is a very valuable tool in spatial statistics, as we shall see in later chapters. It is important to note, however, that in the vast majority of cases we do not actually know the process; that is precisely what we wish to infer. Understanding process generation in a statistical sense, as well as null landscapes, is a useful tool that can help us to infer processes in applications with empirical (as opposed to simulated) data. In this sense, spatial statistics is often a tool used to make decisions about spatial patterns: are they random? And, if they are not random, can we infer the underlying process? References "],
["activity-3-maps-as-processes.html", "Chapter 8 Activity 3: Maps as Processes 8.1 Practice Questions 8.2 Learning Objectives 8.3 Suggested Reading 8.4 Preliminaries 8.5 Activity", " Chapter 8 Activity 3: Maps as Processes Remember, you can download the source file for this activity from here. 8.1 Practice Questions Answer the following questions: What is a Geographic Information System? What distinguishes a statistical map from other types of mapping techniques? What is a null landscape? 8.2 Learning Objectives In this activity, you will: Simulate landscapes using various types of processes. Discuss the difference between random and non-random landscapes. Think about ways to decide whether a landscape is random. 8.3 Suggested Reading O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 4. John Wiley & Sons: New Jersey. 8.4 Preliminaries For this activity you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity: library(tidyverse) In the practice that preceded this activity, you learned how to simulate null landscapes and spatial processes. 8.5 Activity NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to organize data, create a plot, and so on in support of analysis and interpretation. These tasks are indicated by a star (*). 1.* Simulate and plot a landscape using a random, stochastic, or deterministic process. It is your choice whether to simulate a point pattern or a continuous variable. Identify the key parameters that make a landscape more or less random. Repeat several times changing those parameters. Recreate any one of the maps you created and share the map with a fellow student. Ask them to guess whether the map is random or non-random. Repeat step 2 several times (depending on time, between two and four times). Propose one or more ways to decide whether a landscape is random, and explain your reasoning. The approach does not need to be the same for point patterns and continuous variables! "],
["point-pattern-analysis-i.html", "Chapter 9 Point Pattern Analysis I 9.1 Learning Objectives 9.2 Suggested Readings 9.3 Preliminaries 9.4 Point Patterns 9.5 Processes and Point Patterns 9.6 Intensity and Density 9.7 Quadrats and Density Maps 9.8 Defining the Region for Analysis", " Chapter 9 Point Pattern Analysis I NOTE: You can download the source files for this book from here. The source files are in the format of R Notebooks. Notebooks are pretty neat, because the allow you execute code within the notebook, so that you can work interactively with the notes. In last practice your learning objectives were: How to generate random numbers with different properties. About Null Landscapes. How to create new columns in a dataframe using a formula. How to simulate a spatial process. Please review the previous practices if you need a refresher on these concepts. If you wish to work interactively with this chapter you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. 9.1 Learning Objectives In this practice, you will learn: A formal definition of point pattern. Processes and point patterns. The concepts of intensity and density. The concept of quadrats and how to create density maps. More ways to control the look of your plots, in particular faceting and adding lines. 9.2 Suggested Readings Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 3. Longman: Essex. Baddeley A, Rubak E, Turner R (2016) Spatial Point Pattern: Methodology and Applications with R, Chapter 1, 1.1 - 1.2. CRC: Boca Raton. Bivand RS, Pebesma E, Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 7. Springer: New York. Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, 6.1 - 6.6. Sage: Los Angeles. O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey. 9.3 Preliminaries As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity: library(tidyverse) library(spatstat) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' Load the data that you will use for this practice: data("PointPatterns") Quickly check the contents of this dataframe: summary(PointPatterns) ## x y Pattern ## Min. :0.0169 Min. :0.005306 Pattern 1:60 ## 1st Qu.:0.2731 1st Qu.:0.289020 Pattern 2:60 ## Median :0.4854 Median :0.550000 Pattern 3:60 ## Mean :0.5074 Mean :0.538733 Pattern 4:60 ## 3rd Qu.:0.7616 3rd Qu.:0.797850 ## Max. :0.9990 Max. :0.999808 The dataframe contains the x and y coordinates of four different patterns of points, each with \\(n=60\\) events. 9.4 Point Patterns Previously you created different types of maps and learned about different kinds of processes (i.e., random, stochastic, deterministic). A map that you have seen in several occasions is one where the coordinates of an event of interest are available. The simplest kind of data of this type is called a point pattern. 
This occurs when only the coordinates are available. A point pattern is given by a set of events of interest that are observed in a region \\(R\\). A region has an infinite number of points, essentially coordinates \\((x_i, y_i)\\) on the plane. The number of points is infinite, because there is a point defined by, say, coordinates (1,1), and also a point for coordinates (1.1,1), and for coordinates (1.01,1), and so on. Any location that can be described by a set of coordinates contained in the region is a point. Not all points are events, however. An event is defined as a point where something of interest happened. This could be the location of a tree, the site of a crime, the epicenter of an earthquake, the address where a case of a disease was reported, and so on. There might be one such occurrence, or more. Each event is denoted by: \\[ \\textbf{s}_i \\] with coordinates: \\[ (x_i,y_i). \\] Sometimes other attributes of the events have been measured as well. For example, the event could be an address where cholera was reported (as in John Snow’s famous map). In addition to the address (which can be converted into the coordinates of the event), the number of cases could be recorded. Other examples could be the height and diameter of trees, the magnitude of an earthquake, etc. It is important, for reasons that will be discussed later, that the point pattern is a complete enumeration. What this means is that every event that happened has been recorded! Interpretation of most analyses becomes dubious if the events are only sampled, that is, if only a few of them have been observed and recorded. 9.5 Processes and Point Patterns Point patterns are interesting in many applications. In these applications, a key question of interest is whether the pattern is random. Imagine a point pattern that records crimes in a region. The pattern might be random, in which case there is no way to anticipate where the next occurrence of criminal activity will be. Non-random patterns, on the other hand, are likely the outcome of some meaningful process. For instance, crimes might cluster as a consequence of some common environmental variable (e.g., concentration of wealth). On the contrary, they might repel each other (e.g., the location of a crime draws the attention of law enforcement, and therefore the next occurrence of a crime tends to happen away from it). Deciding whether the pattern is random or not is the initial step towards developing hypotheses about the underlying process. Consider for example the following patterns. To create the following figure, you can use faceting by means of ggplot2::facet_wrap(): # This uses function "ggplot" to plot data "PointPatterns" loaded into the data frame earlier, by means of X and Y coordinates # the function `facet_wrap()` is used to create multiple plots according to one (or more) variables in the dataset. Here, it is used to create individual plots for each of the four patterns in the dataframe, but put them all in a single figure ggplot() + geom_point(data = PointPatterns, aes(x = x, y = y)) + facet_wrap(~ Pattern) + coord_fixed() As you can see, faceting is a convenient way to simultaneously plot different parts of a dataframe (in the present case, the different Patterns). In the preceding activity, you were asked to generate ideas regarding possible ways of deciding whether a map of events (i.e., a point pattern) is random. In this chapter we will formalize a specific way to do so, by considering the intensity of the process. 
9.6 Intensity and Density The intensity of a spatial point process is the expected number of events per unit area. This is conventionally denoted by the Greek letter \\(\\lambda\\). In most cases the process is not known, so its intensity cannot be directly measured. In its place, the density of the point pattern is taken as the empirical estimate of the intensity of the underlying process. The density of the point pattern is calculated very simply as the number of events divided by the area of the region, that is: \\[ \\hat{\\lambda} = \\frac{\\#(S \\in R)}{a} = \\frac{n}{a}, \\] where \\(\\#(S \\in R)\\) is the number of events \\(S\\) observed in the region \\(R\\) (that is, \\(n\\)), and \\(a\\) is the area of the region. Notice the use of the “hat” symbol on top of the Greek lambda. This symbol is called a “hat” (or caret). The hat notation is used to indicate an estimated value of an unobserved parameter of a process as opposed to the true (but usually unknown) value. In the present case this is the intensity of the spatial point process. Consider one of the point patterns in your sample dataset, say “Pattern 1”. If we filter for “Pattern 1” we can then summarize it: filter(PointPatterns, Pattern == "Pattern 1") %>% summary() ## x y Pattern ## Min. :0.0285 Min. :0.005306 Pattern 1:60 ## 1st Qu.:0.3344 1st Qu.:0.236509 Pattern 2: 0 ## Median :0.5247 Median :0.500262 Pattern 3: 0 ## Mean :0.5531 Mean :0.500248 Pattern 4: 0 ## 3rd Qu.:0.8417 3rd Qu.:0.761218 ## Max. :0.9888 Max. :0.999808 We see that there are \\(n = 60\\) points in this dataset. Since the region is the unit square (check how the values of the coordinates range from approximately zero to approximately 1), the area of the region is 1. This means that for “Pattern 1”: \\[ \\hat{\\lambda} = \\frac{60}{1} = 60 \\] This is the overall density of the point pattern. 9.7 Quadrats and Density Maps The overall density of a point process (calculated above) can be mapped by means of the geom_bin2d function of the ggplot2 package. This function divides two-dimensional space into bins and reports the number of events or the density of the events in the bins. Let’s give this a try: # `geom_bin2d()` creates a tessellation and counts the number of events in each of the "tiles" in the tessellation. It then assigns colors based on the count of events. The `binwidth` determines the size of the squares in the tessellation, in this case squares of size 1 by 1...which corresponds to the size of the region! ggplot() + geom_bin2d(data = subset(PointPatterns, Pattern == "Pattern 1"), aes(x = x, y = y), binwidth = c(1, 1)) + coord_fixed() Let’s see step-by-step how this plot is made. ggplot() creates a plot object. geom_bin2d is called to plot a map of counts of events in the space defined by the bins. The dataframe used for plotting the bins is PointPatterns, subset so that only the points in “Pattern 1” are used. The coordinates x and y are used to plot (in aes(), we indicate that x in the dataframe corresponds to the x axis in the plot, and y in the dataframe corresponds to the y axis in the plot) The size of the bin is defined as 1-by-1 (binwidth = c(1, 1)) coord_fixed is applied to ensure that the aspect ratio of the plot is one (one unit of x is the same length as one unit of y in the plot). The map of the overall density of the process above is not terribly interesting. It only reports what we already knew, that globally the density of the point pattern is 60. It would be more interesting to see how the density varies across the region. We do this by means of the concept of quadrats. 
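Before doing that, note that the overall density of each of the four patterns can be computed in one step with the dplyr verbs loaded as part of the tidyverse (each pattern has 60 events in the unit square, so all four have the same overall density of 60, which is precisely why a more local view is needed). This is only a sketch; the column name density is chosen for illustration:
# Count the events in each pattern and divide by the area of the region (1 for the unit square)
PointPatterns %>%
  group_by(Pattern) %>%
  summarize(n = n()) %>%
  mutate(density = n / 1)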
Imagine that instead of calculating the overall (or global) intensity of the point pattern, we subdivided the region into a set of smaller subregions. For instance, we could draw horizontal and vertical lines to create smaller squares: # `geom_vline()` draws vertical lines that cross the x-axis at the points indicated; `geom_hline()` draws horizontal lines that cross the y-axis at the points indicated ggplot() + geom_vline(xintercept = seq(from = 0, to = 1, by = 0.25)) + geom_hline(yintercept = seq(from = 0, to = 1, by = 0.25)) + geom_point(data = filter(PointPatterns, Pattern == "Pattern 1"), aes(x = x, y = y)) + coord_fixed() Notice how we used to create the vertical lines (geom_vline) and horizontal lines (geom_hline), from 0 to 1 every 0.25 units of distance respectively. This creates a tessellation that divides the original region into 16 smaller squares, or subregions. Each of the smaller squares used to subdivide the region is called a quadrat. To make things more interesting, instead of calculating the overall density, we can calculate the density for each quadrat. Now the size of the quadrats will be \\(0.25\\times 0.25\\). Here we visualize the density of the quadrats: ggplot() + geom_bin2d(data = filter(PointPatterns, Pattern == "Pattern 1"), aes(x = x, y = y), binwidth = c(0.25, 0.25)) + geom_point(data = filter(PointPatterns, Pattern == "Pattern 1"), aes(x = x, y = y)) + scale_fill_distiller(palette = "RdBu") + coord_fixed() You can, of course, change the size of the quadrats. Let’s take a look at the four point patterns (by means of faceting), after creating a variable to easily control the size of the quadrat. Let’s call this variable q_size: # `q_size` controls the size of the quadrats; experiment changing this parameter q_size <- 0.5 ggplot() + geom_bin2d(data = PointPatterns, aes(x = x, y = y), binwidth = c(q_size, q_size)) + geom_point(data = PointPatterns, aes(x = x, y = y)) + facet_wrap(~ Pattern) + scale_fill_distiller(palette = "RdBu") + coord_fixed() Notice the differences in the density maps? Try changing the size of the quadrat to 1. What happens, and why? Next, try a smaller quadrat size, say 0.25. What happens, and why? Try even smaller quadrat sizes, but greater than zero. What happens now? The package spatstat (Baddeley, Rubak, and Turner 2016) includes numerous functions for the analysis of point patterns. A relevant function for us at this stage, is quadratcount(), which returns the number of events per quadrat. To use this function, we need to convert the point patterns to a type of object used by spatstat denominated ppp (for plannar point pattern). This is simple, thanks to a utility function in spatstat called as.ppp. This function takes as arguments (inputs) a set of coordinates, and data to define a window. To benefit from the functionality of spatstat we will convert our data frame with spatial patterns into ppp objects. First, define the window by means of the owin function, and using the 0 to 1 interval for our region: # `owin()` creates a window for `ppp` objects, which becomes the _region_ under study. Here, we define a window that is the unit square and we will discuss the importance of an appropriate definition of the region later. The windows in `spatstat` need not be squares or rectangles, and can actually be irregular shapes Wnd <- owin(c(0,1), c(0,1)) Now, a ppp object can be created: # `as.ppp()` will take an object and convert it to a `ppp` object. Here, it does a fairly good job of guessing the contents of the data frame! 
The second argument to create the `ppp` object is a window, that is, an `owin` object ppp1 <- as.ppp(PointPatterns, Wnd) If you examine these new ppp objects, you will see that they pack the same basic information (i.e., the coordinates), but also the range of the region and so on: summary(ppp1) ## Marked planar point pattern: 240 points ## Average intensity 240 points per square unit ## ## Coordinates are given to 8 decimal places ## ## Multitype: ## frequency proportion intensity ## Pattern 1 60 0.25 60 ## Pattern 2 60 0.25 60 ## Pattern 3 60 0.25 60 ## Pattern 4 60 0.25 60 ## ## Window: rectangle = [0, 1] x [0, 1] units ## Window area = 1 square unit As you can see, the ppp object includes the four patterns, calculates the frequency of each (the number of events), and their respective overall intensities. Objects of the class ppp can be plotted using base R plotting functions: plot(ppp1) To plot each pattern separately we can split the different patterns using the function split.ppp(). Notice how $ works for indexing the patterns here, just as it does for indexing columns in a data frame: plot(split.ppp(ppp1)$`Pattern 1`) Once the patterns are in ppp form, quadratcount can be used to compute the counts of events. To calculate the count separately for each pattern, you need to use again split.ppp() (if you don’t index a pattern, it will apply the function to all of them). The other two arguments are the number of quadrats in the horizontal (nx) and the vertical (ny) directions: quadratcount(split(ppp1), nx = 4, ny = 4) ## List of spatial objects ## ## Pattern 1: ## x ## y [0,0.25) [0.25,0.5) [0.5,0.75) [0.75,1] ## [0.75,1] 3 5 1 6 ## [0.5,0.75) 2 3 4 6 ## [0.25,0.5) 5 4 2 3 ## [0,0.25) 2 4 4 6 ## ## Pattern 2: ## x ## y [0,0.25) [0.25,0.5) [0.5,0.75) [0.75,1] ## [0.75,1] 14 2 2 6 ## [0.5,0.75) 0 0 4 6 ## [0.25,0.5) 6 3 1 2 ## [0,0.25) 4 6 2 2 ## ## Pattern 3: ## x ## y [0,0.25) [0.25,0.5) [0.5,0.75) [0.75,1] ## [0.75,1] 2 11 5 7 ## [0.5,0.75) 1 1 6 4 ## [0.25,0.5) 1 10 3 2 ## [0,0.25) 2 1 2 2 ## ## Pattern 4: ## x ## y [0,0.25) [0.25,0.5) [0.5,0.75) [0.75,1] ## [0.75,1] 4 5 6 3 ## [0.5,0.75) 3 3 4 2 ## [0.25,0.5) 3 3 4 2 ## [0,0.25) 5 4 6 3 Compare the counts of the quadrats for each pattern. They should replicate what you observed in the density plots before. 9.8 Defining the Region for Analysis It is important when conducting the type of analysis described above (and more generally any analysis with point patterns), to define a region for analysis that is consistent with the pattern of interest. Consider for instance what would happen if the region was defined, instead of in the unit square, as a bigger region. Create a second window: # This new window measure 3 units in the x-axis, and also 3 units in the y-axis (from -1 to 2) Wnd2 <- owin(c(-1,2), c(-1,2)) Create a second ppp object using this new window: # Here, we use the same events as before, but place them in the larger window we just created ppp2 <- as.ppp(PointPatterns, Wnd2) Repeat the plot but using the new ppp object: plot(split.ppp(ppp2)$`Pattern 1`) Repeat but now using an even bigger region. Create a third window: Wnd3 <- owin(c(-2, 3), c(-2, 3)) And also a third ppp object using the third window: ppp3 <- as.ppp(PointPatterns, Wnd3) Now the plot looks like this: plot(split.ppp(ppp3)$`Pattern 1`) Which of the three regions that you saw above is more appropriate? What do you think is the effect of selecting an inappropriate region for the analysis? This concludes this chapter. 
The next activity will illustrate how quadrats are a useful tool to explore the question of whether a map is random. References "],
["activity-4-point-pattern-analysis-i.html", "Chapter 10 Activity 4: Point Pattern Analysis I 10.1 Practice questions 10.2 Learning objectives 10.3 Suggested reading 10.4 Preliminaries 10.5 Activity", " Chapter 10 Activity 4: Point Pattern Analysis I Remember, you can download the source file for this activity from here. 10.1 Practice questions Answer the following questions: What is a random process? What is a deterministic process? What is a stochastic process? What is a pattern? What is the usefulness of a null landscape? 10.2 Learning objectives In this activity, you will: Use the concept of quadrats to analyze a real dataset. Learn about a quadrat-based test for randomness in point patterns. Learn how to use the p-value of a statistical test to make a decision. Think about the distribution of events in a null landscape. Think about ways to decide whether a landscape is random. 10.3 Suggested reading O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey. 10.4 Preliminaries For this activity you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently in the workspace. Load the libraries you will use in this activity. In addition to tidyverse, you will need spatstat, a package designed for the analysis of point patterns (you can learn about spatstat here and here): library(tidyverse) library(spatstat) library(maptools) # Needed to convert a `Spatial Polygons` object into an `owin` object library(sf) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' In the practice that preceded this activity, you learned about the concepts of intensity and density, about quadrats, and also how to create density maps. Begin by loading the data that you will use in this activity: data("Fast_Food") data("Gas_Stands") data("Paez_Mart") Next, the geospatial data for the city boundary of Toronto needs to be loaded. For this example, the boundary is provided as an sf object, a format widely used in R for spatial analysis that can also be plotted directly with ggplot2: data("Toronto") If you inspect your workspace, you will see that the following dataframes are there: Fast_Food Gas_Stands Paez_Mart These are the locations of a selection of fast food restaurants, and also of gas stands in Toronto (data are from 2008). Paez Mart, on the other hand, is a project to cover Toronto with convenience stores. The points are the planned locations of the stores. Also, there should be an object of class sf. This dataframe contains the city boundary of Toronto: class(Toronto) ## [1] "sf" "data.frame" Try plotting the following: ggplot() + geom_sf(data = Toronto, color = "black", fill = NA, alpha = 1, size = .3) + geom_sf(data = Paez_Mart) + coord_sf() As discussed in the preceding chapter, the package spatstat offers a very rich collection of tools to do point pattern analysis. 
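Before converting anything for use with spatstat, it can help to see all three sets of events over the city boundary at once. This is a minimal sketch, assuming that Fast_Food and Gas_Stands are sf point objects just like Paez_Mart (the use of st_coordinates() further below suggests they are); the colors are chosen purely for illustration:
ggplot() +
  geom_sf(data = Toronto, color = "black", fill = NA, size = 0.3) +
  geom_sf(data = Fast_Food, color = "blue", size = 0.5) +
  geom_sf(data = Gas_Stands, color = "red", size = 0.5) +
  geom_sf(data = Paez_Mart, color = "darkgreen", size = 0.5) +
  coord_sf()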
To convert the three sets of events (i.e., the fast food establishments, gas stands, and Paez Mart) into ppp objects we first must define a region or window. To do this we take the sf object and convert it to an owin (a window object) for use with the package spatstat (this is done via a SpatialPolygons object, hence as(x, \"Spatial\")): # `as.owin()` will take a "foreign" object (foreign to `spatstat`) and convert it into an `owin` object. Here, there are two steps involved: first, we take the `sf` object with the boundaries of Toronto and convert it into a "Spatial" object, and then the "Spatial" object is passed on to `as.owin()` Toronto.owin <- as(Toronto, "Spatial") %>% as.owin() # Requires `maptools` package And then convert the dataframes to ppp objects (this necessitates that we extract the coordinates of the events by means of st_coordinates): Fast_Food.ppp <- as.ppp(st_coordinates(Fast_Food), W = Toronto.owin) Gas_Stands.ppp <- as.ppp(st_coordinates(Gas_Stands), W = Toronto.owin) Paez_Mart.ppp <- as.ppp(st_coordinates(Paez_Mart), W = Toronto.owin) These objects can now be used with the functions of the spatstat package. For instance, you can calculate the counts of events by quadrat by means of quadratcount(). The inputs must be a ppp object, and the number of quadrats in the horizontal (nx) and vertical (ny) directions (notice how I use the function table() to present the frequency of quadrats by number of events): q_count <- quadratcount(Fast_Food.ppp, nx = 3, ny = 3) table(q_count) ## q_count ## 0 6 44 48 60 64 85 144 163 ## 1 1 1 1 1 1 1 1 1 As you see from the table, there is one quadrat with zero events, one quadrat with six events, one quadrat with forty-four events, and so on. You can also plot the results of the quadratcount() function! plot(q_count) A useful function in the spatstat package is quadrat.test. This function implements a statistical test that compares the empirical distribution of events by quadrats to the distribution of events as expected under the hypothesis that the underlying process is random. This is implemented as follows: q_test <- quadrat.test(Fast_Food.ppp, nx = 3, ny = 3) ## Warning: Some expected counts are small; chi^2 approximation may be inaccurate q_test ## ## Chi-squared test of CSR using quadrat counts ## ## data: Fast_Food.ppp ## X2 = 213.74, df = 8, p-value < 2.2e-16 ## alternative hypothesis: two.sided ## ## Quadrats: 9 tiles (irregular windows) The quadrat test reports a \\(p\\)-value which can be used to make a decision. The \\(p\\)-value is the probability of obtaining a pattern at least as extreme as the one observed if the null hypothesis were true. To make a decision, you need to know what the null hypothesis is, and your own tolerance for rejecting it mistakenly. In the case above, the \\(p\\)-value is very, very small (2.2e-16 = 0.00000000000000022). Since the null hypothesis is spatial randomness, you can reject this hypothesis with considerable confidence: a pattern as extreme as this one would be exceedingly unlikely if the underlying process were random. Try plotting the results of quadrat.test: plot(q_test) Now that you have seen how to do some analysis using quadrats, you are ready for the next activity. 10.5 Activity NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to organize data, create a plot, and so on in support of analysis and interpretation. These tasks are indicated by a star (*). 1.* Use Fast_Food, Gas_Stands, Paez_Mart, and Toronto to create density maps for the three point patterns. Select a quadrat size that you think is appropriate. 
2.* Use Fast_Food.ppp, Gas_Stands.ppp, and Paez_Mart.ppp, and the function quadratcount to calculate the number of events per quadrat. Remember that you need to select the number of quadrats in the horizontal and vertical directions! 3.* Use the function table() to examine the frequency of events per quadrat for each of the point patterns. Show your density maps to a fellow student. Did they select the same quadrat size? If not, what was their rationale for their size? Again, use the function table() to examine the frequency of events per quadrat for each of the point patterns. What are the differences among these point patterns? What would you expect the frequency of events per quadrat to be in a null landscape? Use Fast_Food.ppp, Gas_Stands.ppp, and Paez_Mart.ppp, and the function quadrat.test to calculate the test of spatial independence for these point patterns. What is your decision in each case? Explain. "],
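As a starting template for the density maps in the first task of the activity above (this is only a sketch and one possible approach; the binwidth of 1000 assumes the Toronto data are in a projected coordinate system measured in metres, so adjust it to your data and your own judgment):
# Extract the coordinates of the events and bin them into quadrats to map density
ggplot() +
  geom_bin2d(data = data.frame(st_coordinates(Fast_Food)),
             aes(x = X, y = Y),
             binwidth = c(1000, 1000)) +
  coord_fixed()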
["point-pattern-analysis-ii.html", "Chapter 11 Point Pattern Analysis II 11.1 Learning Objectives 11.2 Suggested Readings 11.3 Preliminaries 11.4 A Quadrat-based Test for Spatial Independence 11.5 Limitations of Quadrat Analysis: Size and Number of Quadrats 11.6 Limitations of Quadrat Analysis: Relative Position of Events 11.7 Kernel Density", " Chapter 11 Point Pattern Analysis II NOTE: You can download the source files for this book from here. The source files are in the format of R Notebooks. Notebooks are pretty neat, because the allow you execute code within the notebook, so that you can work interactively with the notes. In the last practice/session your learning objectives included: A formal definition of point pattern. Processes and point patterns. The concepts of intensity and density. The concept of quadrats and how to create density maps. More ways to control the look of your plots, in particular faceting and adding lines. Please review the previous practices if you need a refresher on these concepts. If you wish to work interactively with this chapter you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. 11.1 Learning Objectives In this practice, you will learn: The intuition behind the quadrat-based test of independence. About the limitations of quadrat-based analysis. The concept of kernel density. More ways to manipulate objects to do point pattern analysis using spatstat. 11.2 Suggested Readings Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 3. Longman: Essex. Baddeley A, Rubak E, Turner R (2016) Spatial Point Pattern: Methodology and Applications with R, Chapter 6. CRC: Boca Raton. Bivand RS, Pebesma E, Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 7. Springer: New York. Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, 6.1 - 6.6. Sage: Los Angeles. O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey. 11.3 Preliminaries As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity: library(tidyverse) library(spatstat) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' Load the datasets that you will use for this practice: data("PointPatterns") data("pp0_df") PointPatterns is a data frame with four sets of spatial events, labeled as “Pattern 1”, “Pattern 2”, “Pattern 3”, and “Pattern 4”. Each set has \\(n=60\\) events. You can check the class of this object by means of the function class class(). class(PointPatterns) ## [1] "data.frame" The second data frame (i.e., pp0_df) includes the coordinates x and y of two sets of spatial events, labeled as “Pattern 1” and “Pattern 2”. The summary for PointPatterns shows that these point patterns are located in a square-unit window (check the max and min values of x and y): summary(PointPatterns) ## x y Pattern ## Min. :0.0169 Min. 
:0.005306 Pattern 1:60 ## 1st Qu.:0.2731 1st Qu.:0.289020 Pattern 2:60 ## Median :0.4854 Median :0.550000 Pattern 3:60 ## Mean :0.5074 Mean :0.538733 Pattern 4:60 ## 3rd Qu.:0.7616 3rd Qu.:0.797850 ## Max. :0.9990 Max. :0.999808 The same is true for pp0_df: summary(pp0_df) ## x y marks ## Min. :0.0456 Min. :0.03409 Pattern 1:36 ## 1st Qu.:0.2251 1st Qu.:0.22963 Pattern 2:36 ## Median :0.4282 Median :0.43363 ## Mean :0.4916 Mean :0.47952 ## 3rd Qu.:0.7812 3rd Qu.:0.77562 ## Max. :0.9564 Max. :0.94492 As seen in the previous practice and activity, the package spatstat employs a type of object called ppp (for planar point pattern). Fortunately, it is relatively simple to convert a data frame into a ppp object by means of as.ppp(). This function requires that you define a window for the point pattern, something we can do by means of the owin function: # "W" will appear in your environment as a defined window with boundaries of (1,1) W <- owin(xrange = c(0, 1), yrange = c(0, 1)) Then the data frames are converted using the as.ppp function: # Converts the data frame to planar point pattern using the defined window "W" pp0.ppp <- as.ppp(pp0_df, W = W) PointPatterns.ppp <- as.ppp(PointPatterns, W = W) You can verify that the new objects are indeed of ppp-class: #"class" is an excellent tool to use when verifying the execution of a previous line of code class(pp0.ppp) ## [1] "ppp" class(PointPatterns.ppp) ## [1] "ppp" 11.4 A Quadrat-based Test for Spatial Independence In the preceding activity, you used a quadrat-based spatial independence test to help you decide whether a pattern was random (the function was quadrat.test). We will now review the intuition of the test. Let’s begin by plotting the patterns. You can use split to do plots for each pattern separately, instead of putting all of them in a single plot (this approach is not as refined as ggplot2, where we have greater control of the aspect of the plots; on the other hand, it is quick): #The split functions separates without defining a window. This is a quicker option to get relative results plot(split(PointPatterns.ppp)) Recall that you can also plot individual patterns by using $ followed by the factor that identifies the desired pattern (this is a way of indexing different patterns in ppp-class objects): # Using "$" acts as a call sign to retrive information from a data frame. In this case, you are calling "Pattern 4" from "PointPatterns.ppp" plot(split(PointPatterns.ppp)$"Pattern 4") Now calculate the quadrat-based test of independence: # `quadrat.test()` generates a quadrat-based test of independence, in this case, for "Pattern 2" called from "PointPatterns.ppp", using 3 quadrats in the direction of the x-axis and 3 quadrats in the direction of the y-axis q_test <- quadrat.test(split(PointPatterns.ppp)$"Pattern 2", nx = 3, ny = 3) q_test ## ## Chi-squared test of CSR using quadrat counts ## ## data: split(PointPatterns.ppp)$"Pattern 2" ## X2 = 48, df = 8, p-value = 1.976e-07 ## alternative hypothesis: two.sided ## ## Quadrats: 3 by 3 grid of tiles Plot the results of the quadrat test: plot(q_test) As seen in the preceding chapter, the expected distribution of events on quadrats under the null landscape tends to be quite even. This is because each quadrat has equal probability of having the same number of events (depending on size, when the quadrats are not all the same size the number will be proportional to the size of the quadrat). 
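In the example above, where "Pattern 2" has 60 events and the window was divided into a 3 by 3 grid of equally sized quadrats, the expected count in every quadrat is simply the total number of events divided by the number of quadrats. A quick sketch of this check (plain arithmetic, not a spatstat function):
# Expected number of events per quadrat under the null landscape: n divided by the number of quadrats
n_events <- 60
n_quadrats <- 3 * 3
n_events / n_quadrats
# This gives approximately 6.67 expected events per quadrat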
If you check the plot of the quadrat test above, you will notice that the first number (top left corner) is the number of events in the quadrat. The second number (top right corner) is the expected number of events for a null landscape. The third number is a residual, based on the difference between the observed and expected number of events. More specifically, the residual is a Pearson residual, defined as follows: \\[ r_i=\\frac{O_i - E_i}{\\sqrt{E_i}}, \\] where \\(O_i\\) is the number of observed events in quadrat \\(i\\) and \\(E_i\\) is the number of expected events in quadrat \\(i\\). When the number of observed events is similar to the number of expected events, \\(r_i\\) will tend to be a small value. As their difference grows, the residual will also grow. The independence test is calculated from the residuals as: \\[ X^2=\\sum_{i=1}^{Q}r_i^2, \\] where \\(Q\\) is the number of quadrats. In other words, the test is based on the sum of the squared Pearson residuals. The smaller this number is, the more likely that the observed pattern of events is not different from a null landscape (i.e., a random process), and the larger it is, the more likely that it is different from a null landscape. This is reflected by the \\(p\\)-value of the test (technically, the \\(p\\)-value is obtained by comparing the test statistic to the \\(\\chi^2\\) distribution, pronounced “kai-square”). Consider for instance the first pattern in the examples: plot(quadrat.test(split(PointPatterns.ppp)$"Pattern 1", nx = 3, ny = 3)) You can see that the Pearson residual of the top left quadrat is indeed -0.6567673, the next to its right is -0.2704336, and so on. The value of the test statistic should then be: # The `paste()` function joins together several arguments as characters. Here, it builds a string with the value of X2, the sum of the squared residuals paste("X2 = ", (-0.65)^2 + (-0.26)^2 + (0.52)^2 + (-0.26)^2 + (0.9)^2 + (0.52)^2 + (-1)^2 + (0.13)^2 + (0.13)^2) ## [1] "X2 = 2.9423" Which you can confirm by examining the results of the test (the small difference is due to rounding errors): quadrat.test(split(PointPatterns.ppp)$"Pattern 1", nx = 3, ny = 3) ## ## Chi-squared test of CSR using quadrat counts ## ## data: split(PointPatterns.ppp)$"Pattern 1" ## X2 = 3, df = 8, p-value = 0.1313 ## alternative hypothesis: two.sided ## ## Quadrats: 3 by 3 grid of tiles Explore the remaining patterns. You will notice that the residuals and test statistic tend to grow as more events are concentrated in space. In this way, the test is a test of the density of the quadrats: is their density similar to what would be expected from a null landscape? 11.5 Limitations of Quadrat Analysis: Size and Number of Quadrats As hinted by the previous activity, one issue with quadrat analysis is the selection of the size of the quadrats. Changing the size of the quadrats has an impact on the counts, and in turn on the appearance of density plots and even the results of the test of independence. For example, the results of the test for “Pattern 2” in the dataset change when the number of quadrats is modified. 
For instance, with a small number of quadrats: quadrat.test(split(PointPatterns.ppp)$"Pattern 2", nx = 2, ny = 1) ## ## Chi-squared test of CSR using quadrat counts ## ## data: split(PointPatterns.ppp)$"Pattern 2" ## X2 = 1.6667, df = 1, p-value = 0.3934 ## alternative hypothesis: two.sided ## ## Quadrats: 2 by 1 grid of tiles Compare to four quadrats: quadrat.test(split(PointPatterns.ppp)$"Pattern 2", nx = 2, ny = 2) ## ## Chi-squared test of CSR using quadrat counts ## ## data: split(PointPatterns.ppp)$"Pattern 2" ## X2 = 6, df = 3, p-value = 0.2232 ## alternative hypothesis: two.sided ## ## Quadrats: 2 by 2 grid of tiles And: quadrat.test(split(PointPatterns.ppp)$"Pattern 2", nx = 3, ny = 2) ## ## Chi-squared test of CSR using quadrat counts ## ## data: split(PointPatterns.ppp)$"Pattern 2" ## X2 = 23.2, df = 5, p-value = 0.0006182 ## alternative hypothesis: two.sided ## ## Quadrats: 3 by 2 grid of tiles Why is the statistic generally smaller when there are fewer quadrats? A different issue emerges when the number of quadrats is large: quadrat.test(split(PointPatterns.ppp)$"Pattern 2", nx = 4, ny = 4) ## Warning: Some expected counts are small; chi^2 approximation may be inaccurate ## ## Chi-squared test of CSR using quadrat counts ## ## data: split(PointPatterns.ppp)$"Pattern 2" ## X2 = 47.2, df = 15, p-value = 6.84e-05 ## alternative hypothesis: two.sided ## ## Quadrats: 4 by 4 grid of tiles A warning now tells you that some expected counts are small: space has been divided so minutely that the expected number of events per quadrat has become too small; as a consequence, the \\(\\chi^2\\) approximation to the probability distribution may be inaccurate. While there are no hard rules to select the size/number of quadrats, the following rules of thumb are sometimes suggested: Each quadrat should contain a minimum of two events. The size of the quadrats can be selected based on the area (\\(A\\)) of the region and the number of events (\\(n\\)), so that each quadrat covers an area of approximately: \\[ \\frac{2A}{n} \\] which is equivalent to using roughly \\(n/2\\) quadrats. Caution should be exercised when interpreting the results of the analysis based on quadrats, due to the issue of size/number of quadrats. 11.6 Limitations of Quadrat Analysis: Relative Position of Events Another issue with quadrat analysis is that it is not sensitive to the relative position of the events within the quadrats. Consider for instance the following two patterns in pp0: plot(split(pp0.ppp)) These two patterns look quite different. And yet, when we count the events by quadrats: plot(quadratcount(split(pp0.ppp), nx = 3, ny = 3)) This example highlights how quadrats are relatively coarse measures of density, and fail to distinguish between fairly different event distributions, in particular because quadrat analysis does not take into account the relative position of the events with respect to each other. 11.7 Kernel Density In order to better take into account the relative position of the events with respect to each other, a different technique can be devised. Imagine that a quadrat is a kind of “window”. We use it to observe the landscape. When we count the number of events in a quadrat, we simply peek through that particular window: all events inside the “window” are simply counted, and all events outside the “window” are ignored. Then we visit another quadrat and do the same, until we have visited all quadrats. Imagine now that we define a window that, unlike the quadrats which are fixed, can move and visit different points in space. 
This window also has the property that, instead of simply counting the events that are in the window, it gives greater weight to events that are close to the center of the window, and less weight to events that are more distant from the center of the window. We can define such a window by selecting a function that declines with increasing distance. We will call this function a kernel. An example of a function that can work as a moving window is the following. # Here we create a data.frame to use for plotting; it includes a single column with a variable called `dist` for distance, that varies between -3 and 3; the function `stat_function()` is used in `ggplot2` to transform an input by means of a function, which in this case is `dnorm`, the normal density function! `ylim()` sets the limits of the plot on the y-axis ggplot(data = data.frame(dist = c(-3, 3)), aes(dist)) + stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = 1)) + ylim(c(0, 0.45)) As you can see, the value of the function declines with increasing distance from the center of the window (when dist == 0; note that the value never becomes zero!). Since we used the normal distribution, this is a Gaussian kernel. The shape of the Gaussian kernel depends on the standard deviation, which controls how “big” the window is, or alternatively, how quickly the function decays. We will call the standard deviation the kernel bandwidth of the function. Since the bandwidth controls how rapidly the weight assigned to distant events decays, if the argument changes, so will the shape of the kernel function. As an experiment, change the value of the argument sd in the chunk above. You will see that as it becomes smaller, the slope of the kernel becomes steeper (and distant observations are downweighted more rapidly). On the contrary, as it becomes larger, the slope becomes less steep (and distant events are weighted almost as highly as close events). Kernel density estimates are usually obtained by creating a fine grid that is superimposed on the region. The kernel function then visits each point on the grid and obtains an estimate of the density by summing the weights of all events as per the kernel function. Kernel density is implemented in spatstat and can be used as follows. The input is a ppp object, and optionally a sigma argument that corresponds to the bandwidth of the kernel: # The `density()` function computes kernel density estimates. Here we are creating a kernel density estimate for each pattern in "pp0.ppp", with the bandwidth defined by "sigma" kernel_density <- density(split(pp0.ppp), sigma = 0.1) plot(kernel_density) Compare to the distribution of events: plot(split(pp0.ppp)) It is important to note that the gradation of colors is different in the two kernel density plots. Whereas the smallest value in the plot on the left is less than 20 and the largest is greater than 100, on the other plot the range is only between 45 and approximately 50. Thus, the intensity of the process is much higher at some places in Pattern 1 than in Pattern 2. The plots above illustrate how the map of the kernel density is better able to capture the variations in density across the region. In fact, kernel density is a smooth estimate of the underlying intensity of the process, and the degree of smoothing is controlled by the bandwidth. References "],
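As a brief experiment with the bandwidth, the same pattern can be plotted with a smaller and a larger value of sigma; the values 0.05 and 0.25 below are arbitrary choices for illustration (spatstat also provides functions such as bw.diggle() that select a bandwidth from the data):
# A small bandwidth produces a spiky surface that follows individual events closely
plot(density(split(pp0.ppp)$"Pattern 1", sigma = 0.05))
# A large bandwidth produces a much smoother surface that approaches the overall density
plot(density(split(pp0.ppp)$"Pattern 1", sigma = 0.25))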
["activity-5-point-pattern-analysis-ii.html", "Chapter 12 Activity 5: Point Pattern Analysis II 12.1 Practice questions 12.2 Learning objectives 12.3 Suggested reading 12.4 Preliminaries 12.5 Activity", " Chapter 12 Activity 5: Point Pattern Analysis II Remember, you can download the source file for this activity from here. 12.1 Practice questions Answer the following questions: How does the quadrat-based test of independence respond to a small number of quadrats? How does the quadrat-based test of independence respond to a large number of quadrats? What are the limitations of quadrat analysis? What is a kernel function? How does the bandwidth affect a kernel function? 12.2 Learning objectives In this activity, you will: Explore a dataset using quadrats and kernel density. Experiment with different parameters (number/size of quadrats and kernel bandwidths). Discuss the impacts of selecting different parameters. Hypothesize about the underlying spatial process based on your analysis. 12.3 Suggested reading O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey. 12.4 Preliminaries For this activity you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently in the workspace. Load the libraries you will use in this activity. In addition to tidyverse, you will need spatstat, a package designed for the analysis of point patterns (you can learn about spatstat here and here): library(tidyverse) library(spatstat) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' In the practice that preceded this activity, you learned about the concepts of intensity and density, about quadrats, and also how to create density maps. Begin by loading the data that you will use in this activity: data("bear_df") This dataset was sourced from the Scandinavia Bear Project, a Swedish-Norwegian collaboration that aims to study the ecology of brown bears, to provide decision makers with evidence to support bear management, and to provide information regarding bears to the public. You can learn more about this project here. The project involves tagging bears with GPS units, so that their movements can be tracked. The dataset includes the coordinates of one bear’s movements over a period of several weeks in 2004. The dataset was originally taken from the adehabitatLT package but was somewhat simplified for this activity. Instead of full date and time information, the point pattern is marked more simply as “Day Time” and “Night Time”, to distinguish between diurnal and nocturnal activity of the bear. Summarize the contents of this dataframe: summary(bear_df) ## x y marks ## Min. :515743 Min. :6812138 Day Time :502 ## 1st Qu.:518995 1st Qu.:6813396 Night Time:498 ## Median :519526 Median :6816724 ## Mean :519321 Mean :6816474 ## 3rd Qu.:519983 3rd Qu.:6818111 ## Max. :522999 Max. :6821440 The Min. and Max. of x and y give us an idea of the region covered by this dataset. 
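A quick way to see the size of this extent directly (just a sketch; if the coordinates are in metres, the region is roughly 7 by 9 km):
# Width and height of the bounding box of the bear's locations
with(bear_df, c(x_range = diff(range(x)), y_range = diff(range(y))))
# This gives 7256 and 9302 units for x and y, respectively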
We can use these values to approximate a window for the region (as an experiment, you could try changing these values to create regions of different sizes): W <- owin(xrange = c(515000, 523500), yrange = c(6812000, 6822000)) Next, we can convert the dataframe into a ppp-class object suitable for analysis using the package spatstat: bear.ppp <- as.ppp(bear_df, W = W) You can check the contents of the ppp object by means of summary: summary(bear.ppp) ## Marked planar point pattern: 1000 points ## Average intensity 1.176471e-05 points per square unit ## ## Coordinates are given to 1 decimal place ## i.e. rounded to the nearest multiple of 0.1 units ## ## Multitype: ## frequency proportion intensity ## Day Time 502 0.502 5.905882e-06 ## Night Time 498 0.498 5.858824e-06 ## ## Window: rectangle = [515000, 523500] x [6812000, 6822000] units ## (8500 x 10000 units) ## Window area = 8.5e+07 square units Now that you have loaded the dataframe and converted to a ppp object, you are ready for the next activity. 12.5 Activity *1. Analyze the point pattern for the movements of the bear using quadrat and kernel density methods. Experiment with different quadrat sizes and kernel bandwidths. Explain your choice of parameters (quadrat sizes and kernel bandwidths) to a fellow student. Decide whether these patterns are random, and support your decision. Do you see differences in the activity patterns of the bear by time of day? What could explain those differences, if any? Discuss the limitations of your conclusions, and of quadrat/kernel (density-based) approaches more generally. "],
["point-pattern-analysis-iii.html", "Chapter 13 Point Pattern Analysis III 13.1 Learning Objectives 13.2 Suggested Readings 13.3 Preliminaries 13.4 Motivation 13.5 Nearest Neighbors 13.6 \\(G\\)-function", " Chapter 13 Point Pattern Analysis III NOTE: You can download the source files for this book from here. The source files are in the format of R Notebooks. Notebooks are pretty neat, because the allow you execute code within the notebook, so that you can work interactively with the notes. In the last practice/session your learning objectives included: The intuition behind the quadrat-based test of independence. The concept of kernel density. The limitations of density-based analysis More ways to work with ppp objects. If you wish to work interactively with this chapter you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. 13.1 Learning Objectives In this practice, you will learn: About clustered and dispersed (or regular) patterns. The concept of nearest neighbors. About distance-based methods for point pattern analysis. About the G-function for the analysis of event-to-event nearest neighbor distances. 13.2 Suggested Readings Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 3. Longman: Essex. Baddeley A, Rubak E, Turner R (2016) Spatial Point Pattern: Methodology and Applications with R, Chapter 8. CRC: Boca Raton. Bivand RS, Pebesma E, Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 7. Springer: New York. Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, 6.1 - 6.6. Sage: Los Angeles. O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey. 13.3 Preliminaries As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity: library(tidyverse) library(spatstat) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' Load the dataset that you will use for this practice: data("pp0_df") Examine the contents of the data frame you just loaded: summary(pp0_df) ## x y marks ## Min. :0.0456 Min. :0.03409 Pattern 1:36 ## 1st Qu.:0.2251 1st Qu.:0.22963 Pattern 2:36 ## Median :0.4282 Median :0.43363 ## Mean :0.4916 Mean :0.47952 ## 3rd Qu.:0.7812 3rd Qu.:0.77562 ## Max. :0.9564 Max. :0.94492 As you can see, this data frame includes a set of coordinates for two point patterns, labeled “Pattern 1” and “Pattern 2”, each of which consists of \\(n=36\\) events. The range of the coordinates (between 0 and 1) suggests a window as follows: # Remember, `owin()` is used to create a window to frame a point pattern in the package `spatstat` W <- owin(c(0,1), c(0,1)) This creates an owin object that defines a region in the unit square. 
Given window object W, it is possible to transform the dataframe into a ppp object: # Remember, `as.ppp()` will take a foreign object (foreign to `spatstat`) and convert it into a `ppp` object pp0.ppp <- as.ppp(pp0_df, W = W) If you need a refresher on how to create ppp objects see Chapter 9, Point Pattern Analysis. 13.4 Motivation Quadrats and kernel density are examples of density-based analysis. These techniques are useful to help you understand variations in the distribution of events at a relatively large scale, but as previously discussed, may sometimes be less informative by not taking into account small scale variations in the locations of the events. For this reason, the following two patterns, despite being very different, give identical counts of events per quadrat: # The `split()` function is used to divide the data into groups using a categorical variable; in this case, the `ppp` object includes only the coordinates and a variable that identifies the coordinates as belonging to "Pattern 1" or "Pattern 2". For this reason, the split is accomplished according to this variable plot(split(pp0.ppp)) # Arguments `nx` and `ny` indicate the number of quadrats in the x and y directions, respectively plot(quadratcount(split(pp0.ppp), nx = 3, ny = 3)) The two patterns above have similar density. However, “Pattern 1” displays clustering, a situation characterized by events generally being in close proximity to others. “Pattern 2”, on the other hand, displays dispersion or regularity, a situation where points tend to be located at similar distances from each other. With some fiddling of the parameters, quadrats can be coaxed to tease out the variations in density, for instance: plot(quadratcount(split(pp0.ppp), nx = 9, ny = 9)) As a visualization technique, this gives a better sense of the variations in density. However, as noted previously, the quality of the test of independence deteriorates when there are many quadrats with small counts. As an alternative, kernel density can be used to visualize the smoothed estimate of the density: plot(density(split(pp0.ppp), sigma = 0.075)) However, even when we can visualize the variations in density, we cannot, from the kernel estimate alone, tell if high/low values exceed those of a null landscape - in other words, we lack at the moment a way to test the hypothesis that the density is higher than what would be expected from a null landscape. In this practice you will learn about a family of techniques that, instead of measuring the density, explore patterns by means of distance distributions. 13.5 Nearest Neighbors Let us begin by introducing the concept of a nearest neighbor. The nearest neighbor of a location is the event that is closest to said location given some metric. This metric is usually Euclidean distance on the plane, that is, distance as measured using a straight line between the location and the event. In principle, the metric can be selected according to the characteristics of a dataset: this could be Euclidean distance, great circle distance, or, for events on networks, network distance (see Figure 13.1). Figure 13.1: Examples of distance metrics In this way, the nearest neighbor of \\(i\\) is the event \\(j\\) with the shortest distance \\(d\\) from location \\(i\\): \\[ \\text{Event }j\\text{ is the nearest neighbor of location }i\\text{ if: }d_{ij}\\le d_{ik} \\forall k \\] Ties are relatively rare in most realistic point patterns (even in regular patterns), and may not have a big impact on the analysis. 
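To make the definition concrete, here is a small hand-worked sketch with three hypothetical events (the coordinates are invented for illustration): the nearest neighbor of each event is the other event at the smallest pairwise distance.
# Three hypothetical events on the unit square
events <- data.frame(x = c(0.1, 0.2, 0.9), y = c(0.1, 0.1, 0.8))
# Matrix of pairwise Euclidean distances between the events
d <- as.matrix(dist(events))
# Set the diagonal to NA so that an event is not its own nearest neighbor
diag(d) <- NA
# Index of the nearest neighbor of each event, and the corresponding distance
apply(d, 1, which.min)
apply(d, 1, min, na.rm = TRUE)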
The package spatstat includes functions to calculate Euclidean distances. Three functions are relevant: pairdist(): returns the pairwise distance between all pairs of events i and j. nndist(): returns a vector of distances from events to their corresponding nearest neighbors; these distances are obtained by sorting the pairwise distances, and selecting the minimum value for each event. distmap(): returns a pixel image with the distance from each pixel to the nearest event; in effect this is a map of the distances between empty spaces and their corresponding nearest events. With these functions we can calculate, for instance, the following distances: # Function `nndist()` will calculate the distance of each event to its nearest neighbor pp0_nn1 <- nndist(split(pp0.ppp)$"Pattern 1") The value of nndist() is a vector with \\(n\\) distances, where \\(n\\) is the number of events in the pattern. The first distance in the vector is the distance from the first event in the series to its nearest neighbor, the second is the distance from the second event in the series to its nearest neighbor, and so on. Let us explore the distribution of these distances by means of a histogram: # Remember, `geom_histogram()` adds a histogram to a `ggplot2` object; the `binwidth` argument defines the size of each bin for the histogram ggplot(data = data.frame(dist = pp0_nn1), aes(dist)) + geom_histogram(binwidth = 0.03) Notice how most events (20 out of 36) have a nearest neighbor at a relatively short distance (<0.05). What does this mean? Compare to the distribution of distances in “Pattern 2” of pp0.ppp: # Calculate the distances to nearest neighbors in the second point pattern, i.e., "Pattern 2" pp0_nn2 <- nndist(split(pp0.ppp)$"Pattern 2") # Create a histogram to explore the distribution of values of distances to nearest neighbors ggplot(data = data.frame(dist = pp0_nn2), aes(dist)) + geom_histogram(binwidth = 0.03) In this case, most events (more than 30 out of 36) have a nearest neighbor at a distance of approximately 0.15. What does this mean? The two histograms above are interesting in that they reveal, for “Pattern 1”, that most events are only a short distance away from another event (indicative of clustering), whereas for “Pattern 2” the suggestion is that almost all events have a nearest neighbor at a roughly constant distance (indicative of regularity). However, the histograms do not convey any further spatial information. Another useful tool to explore the distribution of distances to nearest neighbors is a Stienen diagram. A Stienen diagram is essentially a proportional symbol plot of the events. The sizes of the symbols are proportional to the distance to their nearest neighbor. For example, for “Pattern 1” in pp0.ppp (notice the use of %mark% to add an attribute to the ppp object; the attribute is the distance to the nearest neighbor): # The function %mark% is used to add a variable (a "mark") to a `ppp` object. In this example, the variable we are adding to "Pattern 1" is the distance from the event to its nearest neighbor, as calculated above split(pp0.ppp)$"Pattern 1" %mark% (pp0_nn1) %>% plot(markscale = 1, main = "Stienen diagram") In this diagram, the largest circle is not very large: even events that are relatively isolated are not a long distance away from their nearest neighbor. This fits the definition of clustering as a situation where events tend to be relatively close to each other. 
Compare to the Stienen diagram of “Pattern 2”: split(pp0.ppp)$"Pattern 2" %mark% (pp0_nn2) %>% plot(markscale = 1, main = "Stienen diagram") Notice how all circles are very similar in size: this fits the definition of dispersion, where events are more or less equally distant from their nearest neighbors. What would these diagrams look like for a null landscape? We can use the function runifpoint from the spatstat package to generate a null landscape: # `runifpoint()` is a function to generate random coordinates based on the uniform random distribution function. The argument tells the function to create n = 36 random coordinates for our null landscape; this null landscape is contained in the window `W`, same as our previous point patterns rand_ppp <- runifpoint(n = 36, win = W) If we plot the Stienen diagram for this point pattern: # Calculate the distances to nearest neighbors for the null landscape rand_nn <- nndist(rand_ppp) # Add the distances as calculated above to the point pattern using %mark% and plot the Stienen diagram rand_ppp %mark% (rand_nn) %>% plot(markscale = 1, main = "Stienen diagram") In a null landscape, the distribution of the size of the symbols would tend to be random! The concept of nearest neighbors is useful to define a family of techniques that are based on the distribution of distances to nearest neighbors. Three such techniques are introduced here. 13.6 \\(G\\)-function As you have seen above, the distribution of distances to nearest neighbors presents distinctive characteristics for different types of patterns. What is needed is a convenient way to summarize the distribution of distances to nearest neighbors. A way to do so is by means of a plot of the cumulative distribution function. A cumulative distribution is simply the proportion of events that have a nearest neighbor at a distance less than some value \\(x\\). When the value of \\(x\\) is very small, no events have a nearest neighbor at \\(d_{ij}<x\\). When \\(x\\) is very large all events have a nearest neighbor at \\(d_{ij}<x\\). The cumulative distribution thus depends on the value of \\(x\\). Imagine for instance the following hypothetical distribution of distances of ten events to their nearest neighbors (the first event’s nearest neighbor is at a distance of 1, the second event’s nearest neighbor is at 2, the third’s at 0.5, and so on): nnd <- c(1, 2, 0.5, 2.5, 1.7, 4, 3.5, 1.2, 2.3, 2.8) When \\(x = 0\\), zero events have a nearest neighbor at that distance or less. Two events have nearest neighbors at distances \\(d_{ij} \\le 1\\). Five events have a nearest neighbor at distances \\(d_{ij} \\le 2\\). Eight events have a nearest neighbor at distances \\(d_{ij} \\le 3\\). And all events have a nearest neighbor at distances \\(d_{ij} \\le 4\\). We can plot these numbers of events as a proportion: # Create a data frame for plotting the proportion of events with a nearest neighbor at a distance $d_ij <= x$ df <- data.frame(x = c(0, 1, 2, 3, 4), proportion = c(0, 2/10, 5/10, 8/10, 10/10)) # `geom_line()` creates lines that connect the coordinates of the data inputs ggplot() + geom_line(data = df, aes(x = x, y = proportion)) The cumulative distribution function of distances from event to nearest neighbor is called a \\(G\\)-function. This function is defined as follows, with \\(d_{ik}\\) as the distance from the event at \\(i\\) to its nearest neighbor: \\[ \\hat{G}(x)=\\frac{(d_{ik}\\le x, \\forall i)}{n} \\] This function (with a hat, because it is estimated from the data) can be used to explore spatial point patterns. 
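As a quick check on the arithmetic above, the same cumulative proportions can be obtained directly from the hypothetical vector nnd by means of the base R function ecdf(); the name G_hat is only illustrative:
# `ecdf()` returns a function that gives the proportion of values less than or equal to its argument
G_hat <- ecdf(nnd)
# Evaluate the cumulative proportions at the distances used above
G_hat(c(0, 1, 2, 3, 4))
## [1] 0.0 0.2 0.5 0.8 1.0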
When doing so, it is useful to know that the theoretical value of \\(G\\) (assuming a null landscape generated by a Poisson process) is as follows: \\[ G_{pois}(x) = 1 - exp(-\\lambda \\pi x^2). \\] When the empirical \\(\\hat{G}(x)\\) is greater than the theoretical function, this suggests that the events tend to be closer than expected, compared to the null landscape. This would be indicative of a pattern of events that form clusters. On the contrary, when the empirical function is less than the theoretical function, this would suggest that the events tend to be further away from each other than expected, compared to the null landscape. This would be indicative of a dispersed or regular pattern. The \\(G\\)-function is implemented in spatstat as Gest (for \\(G\\) estimated): # Use split to calculate the G-function only for "Pattern 1" g_pattern1 <- Gest(split(pp0.ppp)$"Pattern 1", correction = "none") (For the moment ignore the argument “correction”; we will discuss corrections later on.) The plot() function can be used to visualize the estimated G (with r = x): plot(g_pattern1) In the plot above, the empirical function is the solid black line, and the theoretical is the dashed red line. If you examine the empirical function, you will see that about 50% of events have a nearest neighbor at a distance of less than approximately 0.04. In the null landscape (theoretical function), in contrast, only about 16% of events have a nearest neighbor at less than 0.04: plot(g_pattern1) lines(x = c(0.04, 0.04), y = c(-0.1, 0.5), lty = "dotted") lines(x = c(-0.1, 0.04), y = c(0.5, 0.5), lty = "dotted") lines(x = c(-0.1, 0.04), y = c(0.16, 0.16), lty = "dotted", col = "red") Notice that the empirical function is above the theoretical function. This suggests that in the actual landscape events tend to be much closer to other events than in the null landscape, and would therefore be suggestive of clustering. Compare to “Pattern 2”: g_pattern2 <- Gest(split(pp0.ppp)$"Pattern 2", correction = "none") plot(g_pattern2) Now the empirical function is below the one for the null landscape. Notice too that all events have a nearest neighbor in a limited range of distances, between 0.14 and 0.18. This is indicative of a dispersed, or regular, pattern. And the random pattern that you created before: g_pattern_rnd <- Gest(rand_ppp, correction = "none") plot(g_pattern_rnd) In this case, the empirical function more closely resembles the theoretical function for the null landscape. This suggests a random pattern. By considering the distribution of distances to nearest neighbors, you can generate additional information on a point pattern to complement the density-based analysis of the preceding chapters. References "],
["activity-6-point-pattern-analysis-iii.html", "Chapter 14 Activity 6: Point Pattern Analysis III 14.1 Practice questions 14.2 Learning objectives 14.3 Suggested reading 14.4 Preliminaries 14.5 Activity", " Chapter 14 Activity 6: Point Pattern Analysis III Remember, you can download the source file for this activity from here. 14.1 Practice questions Answer the following questions: List and explain two limitations of quadrat analysis. What is clustering? What could explain a clustering in a set of events? What is regularity? What could explain it? Describe the concept of nearest neighbors. What is a cumulative distribution function? 14.2 Learning objectives In this activity, you will: Explore a dataset using distance-based approaches. Compare the characteristics of different types of patterns. Discuss ways to evaluate how confident you are that a pattern is random. 14.3 Suggested reading O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey. 14.4 Preliminaries For this activity you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity. In addition to tidyverse, you will need spatstat, a package designed for the analysis of point patterns (you can learn about spatstat here and here): library(tidyverse) library(spatstat) library(maptools) # Needed to convert `SpatialPolygons` into `owin` object library(sf) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' In the practice that preceded this activity, you learned about the concepts of intensity and density, about quadrats, and also how to create density maps. For this practice, you will use the data that you first encountered in Activity 4, that is, the business locations in Toronto. Begin by reading the geospatial files, namely the city boundary of Toronto. You need the sf object, which will be converted into a spatstat window object: data("Toronto") Convert the sf object to an owin object (via SpatialPolygons, hence as(x, \"Spatial\"): Toronto.owin <- as.owin(as(Toronto, "Spatial")) # Requires `maptools` package Next the data that you will use in this activity needs to be loaded. Each dataframe is converted into a ppp object using the as.ppp function, again after extracting the coordinates of the events from the sf object: data("Fast_Food") Fast_Food.ppp <- as.ppp(st_coordinates(Fast_Food), W = Toronto.owin) # Add the classes of fast food to the ppp object: marks(Fast_Food.ppp) <- Fast_Food$Class data("Gas_Stands") Gas_Stands.ppp <- as.ppp(st_coordinates(Gas_Stands), W = Toronto.owin) data("Paez_Mart") Paez_Mart.ppp <- as.ppp(st_coordinates(Paez_Mart), W = Toronto.owin) If you inspect your workspace, you will see that the following ppp objects are there: Fast_Food.ppp Gas_Stands.ppp Paez_Mart.ppp These are locations of fast food restaurants and gas stands in Toronto (data are from 2008). 
Paez Mart on the other hand is a project to cover Toronto with convenience stores. The points are the planned locations of the stores. You can check the contents of ppp objects by means of summary: summary(Fast_Food.ppp) ## Marked planar point pattern: 614 points ## Average intensity 9.681378e-07 points per square unit ## ## Coordinates are given to 1 decimal place ## i.e. rounded to the nearest multiple of 0.1 units ## ## Multitype: ## frequency proportion intensity ## Chicken 82 0.1335505 1.292953e-07 ## Hamburger 209 0.3403909 3.295453e-07 ## Pizza 164 0.2671010 2.585906e-07 ## Sub 159 0.2589577 2.507067e-07 ## ## Window: polygonal boundary ## 10 separate polygons (no holes) ## vertices area relative.area ## polygon 1 4185 630935000.0 9.95e-01 ## polygon 2 600 2536260.0 4.00e-03 ## polygon 3 193 237206.0 3.74e-04 ## polygon 4 28 26539.7 4.18e-05 ## polygon 5 52 142793.0 2.25e-04 ## polygon 6 67 158439.0 2.50e-04 ## polygon 7 41 83470.2 1.32e-04 ## polygon 8 30 42934.1 6.77e-05 ## polygon 9 36 33866.6 5.34e-05 ## polygon 10 8 11069.2 1.75e-05 ## enclosing rectangle: [609550.5, 651611.8] x [4826375, 4857439] units ## (42060 x 31060 units) ## Window area = 634207000 square units ## Fraction of frame area: 0.485 Now that you have the data that you need in the right format, you are ready for the next activity. 14.5 Activity 1.* Calculate the event-to-event distances to nearest neighbors using the function nndist(). Do this for all fast food establishments (pooled) and then for each type of establishment (i.e, “Chicken”, “Hamburger”, “Pizza”, “Sub”). 2.* Create Stienen diagrams using the distance vectors obtained in Step 1. 3.* Plot the empirical G-function for all fast food establishments (pooled) and then for each type of establishment (i.e, “Chicken”, “Hamburger”, “Pizza”, “Sub”). Discuss the diagrams that you created in Question 2 with a fellow student. Is there evidence of clustering/regularity? How confident are you to make a decision whether the patterns are not random? What could you do to assess your confidence in making a decision whether the patterns are random? Explain. "],
["point-pattern-analysis-iv.html", "Chapter 15 Point Pattern Analysis IV 15.1 Learning Objectives 15.2 Suggested Readings 15.3 Preliminaries 15.4 Motivation 15.5 F-function 15.6 \\(\\hat{K}\\)-function", " Chapter 15 Point Pattern Analysis IV NOTE: You can download the source files for this book from here. The source files are in the format of R Notebooks. Notebooks are pretty neat, because the allow you execute code within the notebook, so that you can work interactively with the notes. In the last practice/session your learning objectives included: Learning about clustered and dispersed (or regular) patterns. Learning the concept of nearest neighbors. Learning about distance-based methods for point pattern analysis. Learning about the \\(G\\)-function for the analysis of event-to-event nearest neighbor distances. If you wish to work interactively with this chapter you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. 15.1 Learning Objectives In this chapter, you will: Learn about the \\(F\\)- or empty space function. Consider the issue of patterns at multiple scales. Learn about the \\(K\\)-function. Apply both of these techniques using a simple example. 15.2 Suggested Readings Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 3. Longman: Essex. Baddeley A, Rubak E, Turner R (2016) Spatial Point Pattern: Methodology and Applications with R, Chapters 7 - 8. CRC: Boca Raton. Bivand RS, Pebesma E, Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 7. Springer: New York. Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, 6.1 - 6.6. Sage: Los Angeles. O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey. 15.3 Preliminaries As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity: library(tidyverse) library(spatstat) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' Load the datasets that you will use for this practice: data("pp1_df") data("pp2_df") data("pp3_df") data("pp4_df") data("pp5_df") These five dataframes include the coordinates of events set in the space of a unit square. To convert these dataframes into ppp objects we first define a window: # We use "owin" to define a window of coordinates which is in the five dataframes. W <- owin(c(0, 1), c(0, 1)) And then use the function as.ppp to convert into ppp: # `as.ppp()` is a function that we use to convert dataframes into ppp objects pp1.ppp <- as.ppp(pp1_df, W = W) pp2.ppp <- as.ppp(pp2_df, W = W) pp3.ppp <- as.ppp(pp3_df, W = W) pp4.ppp <- as.ppp(pp4_df, W = W) pp5.ppp <- as.ppp(pp5_df, W = W) 15.4 Motivation Distance-based approaches like the \\(\\hat{G}\\)-function provide a useful complement to density-based approached. They can be implemented in more ways than we have seen so far. 
In this practice, you will learn about two more tools for conducting distance-based analysis, the \\(\\hat{F}\\)-function and the \\(\\hat{K}\\)-function. 15.5 F-function The \\(\\hat{G}\\)-function was defined as the cumulative distribution of the distances from events to their nearest neighboring event. The \\(\\hat{F}\\)-function is based on the same premise, but instead of using event-to-event distances, it uses point-to-event distances. Recall that a point is an arbitrary location on a map that is not necessarily the location of an event. It may well be (and typically is) empty space. For this reason, the \\(\\hat{F}\\)-function is sometimes called the empty space function: when there is more empty space in a region, the distance from a point to the nearest neighboring event is typically longer. More formally, this function is defined as follows, with \\(d_{ik}\\) as the distance from the point at \\(i\\) (not necessarily an event!) to its nearest neighboring event at location \\(k\\): \\[ \\hat{F}(x)=\\frac{(d_{ik}\\le x, \\forall i)}{n} \\] Again, we use the hat notation to indicate that the function is estimated from the data. The theoretical distribution of this function is known (based on a null landscape generated by a spatially random Poisson process: remember that a Poisson process is a type of random process that consists of points randomly located on a landscape). It is as follows: \\[ F_{pois}(x) = 1 - exp(-\\lambda \\pi x^2). \\] Notice that the distribution is in fact identical to that for \\(G\\). This makes sense: if the distribution of events is spatially random, the distribution of empty space in the region must be random as well! The interpretation of \\(\\hat{F}(x)\\) is the opposite of \\(\\hat{G}(x)\\): when the empirical \\(\\hat{F}(x)\\) is greater than the theoretical function, this suggests that empty spaces are closer to events than expected, compared to the null landscape, as in a dispersed pattern. On the contrary, when the empirical function is less than the theoretical function, this would suggest a clustered pattern, since the events tend to be far away from the points used to calculate the function. The \\(\\hat{F}\\)-function can be implemented in at least two ways: (1) by using a fine grid to measure the distance to events; or (2) by measuring the distance to events from randomly drawn coordinates. The implementation in spatstat is the first one, which results in a pixel-based image of empty space. We can illustrate this function with the point pattern pp1.ppp. First, we verify that pp1.ppp is already a ppp object: class(pp1.ppp) ## [1] "ppp" Begin by plotting the pattern: plot(pp1.ppp) An empty space map is obtained by means of the distmap() function: # The "distmap()" function computes the distance map of point pattern X and returns the distance map as a pixel image empty_space_map1 <- distmap(pp1.ppp) The plot of this is: plot(empty_space_map1) Similar to the Stienen diagrams that you used previously, this map shows the distance from any location on the map to the nearest event: the smaller the value, the closer the point is to an event. It is evident in this pixel image that the values are mostly smaller, illustrating that points are closer to events. Compare the map above to pp2.ppp: empty_space_map2 <- distmap(pp2.ppp) plot(empty_space_map2) In the second point pattern, there is more open space in the region. 
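As an aside, the second way of implementing the \\(\\hat{F}\\)-function mentioned above, measuring distances from randomly drawn points rather than from a fine grid, can be sketched as follows. This is only an illustration, not how Fest() works internally, and the number of sample points (here 1,000) is an arbitrary choice:
# Draw a large number of random points in the same window as the point pattern
sample_points <- runifpoint(n = 1000, win = pp2.ppp$window)
# `nncross()` returns, for each random point, the distance to the nearest event in `pp2.ppp`
point_to_event <- nncross(sample_points, pp2.ppp, what = "dist")
# The empirical cumulative distribution of these point-to-event distances approximates the F-function
plot(ecdf(point_to_event), main = "Approximate F-function for pp2.ppp")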
This is also apparent from the symbols map: plot(pp2.ppp) The \\(\\hat{F}\\)-function is implemented in spatstat as Fest() (for F-estimated), and it requires a ppp object as an input. Another possible input is whether a correction is to be used. This refers to boundary corrections. Since we have not yet discussed them, select “none”: # The "Fest()" function computes an estimate of the empty space function, and it also called the "point to nearest event" distribution. This function estimates the nearest neighbours of a point (in this example, for pp1) f_pattern1 <- Fest(pp1.ppp, correction = "none") This function can be plotted as follows: plot(f_pattern1) The black line is the empirical function, and we see that it is in general very similar to the theoretical function that corresponds to a null landscape. Compare to the second pattern: f_pattern2 <- Fest(pp2.ppp, correction = "none") plot(f_pattern2) lines(x = c(0, 0.097), y = c(0.4, 0.4), col = "blue", lty = "dotted") lines(x = c(0.045, 0.045), y = c(0.0, 0.4), col = "blue", lty = "dotted") lines(x = c(0.097, 0.097), y = c(0.0, 0.4), col = "blue", lty = "dotted") In the empirical (black) pattern, points on a grid tend to be more distant from events than what you would expect from the null landscape. For example, whereas under the theoretical function 40% of points have a nearest event that is at a distance of approximately 0.045 or less, under the empirical function, the events are generally more distant from the points, and for the same value of F (0.4 or 40%) the distance is closer to 0.1. See: # Repeat the plot of the F-function of `pp2.ppp` and use the function `lines()` to add lines to compare the distances for a given value of F, say 0.4 (or 40%) plot(f_pattern2) lines(x = c(0, 0.097), y = c(0.4, 0.4), col = "blue", lty = "dotted") lines(x = c(0.045, 0.045), y = c(0.0, 0.4), col = "blue", lty = "dotted") lines(x = c(0.097, 0.097), y = c(0.0, 0.4), col = "blue", lty = "dotted") This suggests that the points are clustered. Try plotting the \\(\\hat{G}\\)-functions for the patterns in this example, and compare. 15.6 \\(\\hat{K}\\)-function A limitation of the two techniques that we have seen so far is that they deal with a single scale: the distance to the first nearest neighbor (or, more generally, to the \\(k\\)-th nearest neighbor; these functions can be used for the 2nd, 3rd, and so on nearest neighbor!). Their single scale nature means that these functions can easily miss patterns when they are only evident at different scales. Consider for instance the following point pattern: plot(pp3.ppp) The events above initially appear to be clustered. However, at a different scale, a second pattern becomes evident. In fact, what we observe is a regular distribution of clusters. At a smaller scale, a single cluster may actually be a random distribution of events. In contrast, the following pattern appears to be a random distribution of regularly spaced events: plot(pp4.ppp) Whereas the last point pattern is of clusters of dispersed events that are themselves regularly spaced: plot(pp5.ppp) Both \\(\\hat{G}(x)\\) or \\(\\hat{F}(x)\\) when applied to any of these patterns will strongly hint at clustering at the scale of the first nearest neighbor. Regrettably, they fail to detect patterns that might exist at other scales. 
For instance: f_pattern3 <- Fest(pp3.ppp, correction = "none") plot(f_pattern3) g_pattern3 <- Gest(pp3.ppp, correction = "none") plot(g_pattern3) A different technique, called the \\(\\hat{K}\\)-function, is designed to detect patterns at multiple scales (see Ripley 1976; Haase 1995). The intuition behind the function is as follows. Imagine that you visit every one of the events in the point pattern in sequence. Each time you visit an event you do the following: first, you create a circle with radius “x” centered on the event, and then you count the number of events that are within the circle. Then you increase “x” by some distance, and repeat the process. Once you have created the last circle (which will be suitably large to capture patterns at that scale), you move and visit the next event in the pattern and repeat the exact same process. These counts of events at distances “x” are aggregated, averaged over the events, and normalized by the estimated intensity of the point pattern. More formally, this is (with \\(\\hat{\\lambda} = n/A\\) the estimated intensity, \\(n\\) the number of events, and \\(A\\) the area of the region): \\[ \\hat{K}(x)=\\frac{1}{\\hat{\\lambda}n}\\sum_{i}\\sum_{j\\neq i}(d_{ij}\\le x). \\] As before, the theoretical values for this function are known for the case of a null landscape generated by a Poisson process: \\[ K_{pois}(x)=\\pi x^2. \\] When the empirical function is greater than the theoretical function, this would suggest that events are typically surrounded by more events at that distance than what the null landscape would have. This is interpreted as evidence of clustering. In contrast, when the empirical function is less than the theoretical one, this would suggest that events are typically surrounded by fewer events at that distance than what would be expected from a null landscape. This is interpreted as dispersion. The \\(\\hat{K}\\)-function is implemented in the package spatstat as Kest(). To see how this function works, plot pp3.ppp once more: plot(pp3.ppp) Next, use Kest() to calculate and plot the \\(\\hat{K}\\)-function: # The `Kest()` function counts the events found within increasing distances of each event, and so captures patterns at multiple scales, not just the distance to the first nearest neighbour. Here, we are applying the K-function to `pp3.ppp`. As before, ignore the correction; we will discuss this later k_pattern3 <- Kest(pp3.ppp, correction = "none") plot(k_pattern3) As seen from the plot, the function is suggestive of clustering at smaller scales, but regularity at a larger scale. Try this now with the last pattern: plot(pp5.ppp) If you calculate and plot the \\(\\hat{K}\\)-function: k_pattern5 <- Kest(pp5.ppp, correction = "none") plot(k_pattern5) You will see that the plot correctly suggests dispersion at the very small scale, followed by clustering at an intermediate scale. There are indeed clusters of nine events surrounded by empty space, before other clusters of regular events are detected at the largest scale, following a regular pattern. Of the distance-based techniques that you have seen so far, \\(\\hat{G}(x)\\) and \\(\\hat{F}(x)\\) are often used as complements. The \\(\\hat{K}(x)\\) function is useful when exploring multi-scale patterns. This concludes the chapter, and our coverage of distance-based techniques. References "],
["activity-7-point-pattern-analysis-iv.html", "Chapter 16 Activity 7: Point Pattern Analysis IV 16.1 Practice questions 16.2 Learning objectives 16.3 Suggested reading 16.4 Preliminaries 16.5 Activity", " Chapter 16 Activity 7: Point Pattern Analysis IV Remember, you can download the source file for this activity from here. 16.1 Practice questions Answer the following questions: What does the \\(\\hat{G}\\)-function measure? What does the \\(\\hat{F}\\)-function measure? How do these two functions relate to one another? Describe the intution behind the \\(\\hat{K}\\)-function. How does the \\(\\hat{K}\\)-function capture patterns at multiple scales? 16.2 Learning objectives In this activity, you will: Explore a dataset using single scale distance-based techniques. Explore the characteristics of a point pattern at multiple scales. Discuss ways to evaluate how confident you are that a pattern is random. 16.3 Suggested reading O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey. 16.4 Preliminaries For this activity you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity. In addition to tidyverse, you will need spatstat, a package designed for the analysis of point patterns (you can learn about spatstat here and here): library(tidyverse) library(spatstat) library(maptools) # Needed to convert `SpatialPolygons` into `owin`-class object library(sf) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' For this activity, you will use the same datasets that you used in Activity 6, including the geospatial files for Toronto’s city boundary: data("Toronto") Convert the sf object to an owin object (via SpatialPolygons, hence as(x, \"Spatial\"): Toronto.owin <- as.owin(as(Toronto, "Spatial")) # Requires `maptools` package Next, load the data that you will use in this activity. Each dataframe is converted into a ppp object using the as.ppp function, again after extracting the coordinates of the events from the sf object: data("Fast_Food") Fast_Food.ppp <- as.ppp(st_coordinates(Fast_Food), W = Toronto.owin) # Add the classes of fast food to the ppp object: marks(Fast_Food.ppp) <- Fast_Food$Class data("Gas_Stands") Gas_Stands.ppp <- as.ppp(st_coordinates(Gas_Stands), W = Toronto.owin) data("Paez_Mart") Paez_Mart.ppp <- as.ppp(st_coordinates(Paez_Mart), W = Toronto.owin) Now that you have the datasets in the appropriate format, you are ready for the next activity. 16.5 Activity *Plot the empirical \\(\\hat{F}\\)-function for all fast food establishments (pooled) and then for each type of establishment separately (i.e, “Chicken”, “Hamburger”, “Pizza”, “Sub”). *Plot the empirical \\(\\hat{K}\\)-function for all fast food establishments (pooled) and then for each type of establishment (i.e, “Chicken”, “Hamburger”, “Pizza”, “Sub”). Discuss your results with a fellow student. 
Is there evidence of clustering/regularity? What can you say about patterns at multiple scales based on the graphs above? How confident are you in deciding whether the patterns are not random? What could you do to assess your confidence in deciding whether the patterns are random? Explain. "],
["point-pattern-analysis-v.html", "Chapter 17 Point Pattern Analysis V 17.1 Learning Objectives 17.2 Suggested Readings 17.3 Preliminaries 17.4 Motivation: Hypothesis Testing 17.5 Null Landscapes Revisited 17.6 Simulation Envelopes 17.7 Things to Keep in Mind!", " Chapter 17 Point Pattern Analysis V NOTE: You can download the source files for this book from here. The source files are in the format of R Notebooks. Notebooks are pretty neat, because the allow you execute code within the notebook, so that you can work interactively with the notes. In the last practice/session your learning objectives included: Learning about the \\(\\hat{F}\\)- or empty space function. Considering the issue of patterns at multiple scales. Learning about the \\(\\hat{K}\\)-function. Applying these techniques using a simple example. Please review the previous practices if you need a refresher on these concepts. If you wish to work interactively with this chapter you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. 17.1 Learning Objectives In this chapter, you will: Revisit the concept of hypothesis testing Revisit the concept of null landscapes. Learn about the use of simulation for hypothesis testing. Learn to implement simulation envelopes Consider some caveats when working with point patterns 17.2 Suggested Readings Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 3. Longman: Essex. Baddeley A, Rubak E, Turner R (2016) Spatial Point Pattern: Methodology and Applications with R, Chapter 10. CRC: Boca Raton. Bivand RS, Pebesma E, Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 7. Springer: New York. Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, 6.1 - 6.6. Sage: Los Angeles. O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey. 17.3 Preliminaries As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity: library(tidyverse) library(spatstat) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' Load the datasets that you will use for this practice: data("pp1_df") data("pp2_df") data("pp3_df") data("pp4_df") data("pp5_df") These five dataframes include the coordinates of events set in the space of a unit square. To convert these dataframes into ppp objects we first define a window: W <- owin(c(0, 1), c(0, 1)) And then use the function as.ppp to convert into ppp: pp1.ppp <- as.ppp(pp1_df, W = W) pp2.ppp <- as.ppp(pp2_df, W = W) pp3.ppp <- as.ppp(pp3_df, W = W) pp4.ppp <- as.ppp(pp4_df, W = W) pp5.ppp <- as.ppp(pp5_df, W = W) 17.4 Motivation: Hypothesis Testing In the previous sessions you learned about density- and distance-based techniques for the analysis of spatial point patterns. 
With the exception of the test of independence for quadrats, other techniques (including kernel density, the \\(\\hat{G}\\)- and \\(\\hat{F}\\)-functions, and the \\(\\hat{K}\\)-function) did not have a formal hypothesis testing framework. The question of “how confident are you when deciding whether a pattern is random” forms the basis of hypothesis testing. In other words, when making a decision whether to reject a null hypothesis, we would like to know the probability that we are making a mistake with the decision. Quantifying our uncertainty is a key feature of statistical analysis. In statistics, tests of hypothesis are developed following these general steps: Identify a null hypothesis of interest, and if possible alternative hypotheses as well (although the latter is not always possible). For instance, in point pattern analysis, a null hypothesis of interest is whether a pattern is random. If it is not, we would like to know in which way it is not random (i.e., is it clustered? Or on the contrary, is it regular?) Derive the expected value of the summary statistic of interest. In the case of the \\(\\hat{G}\\)-function, for instance, the expected value of the function under the null hypothesis of a spatially random Poisson process is: \\[ G_{pois}(x) = 1 - exp(-\\lambda \\pi x^2). \\] Similar expressions were presented for the \\(\\hat{F}\\)-function and \\(\\hat{K}\\)-function, but not for kernel density estimates. When the expected value of the function is known, the closer the empirical function is to its expected value, the more likely it is that the null hypothesis is true. For instance, the \\(\\hat{G}\\)-function of the pattern in pp1.ppp is shown below. It is quite close to the theoretical function, so the pattern is probably random. The question is, how probable is this? g_pp1 <- Gest(pp1.ppp, correction = "none") plot(g_pp1) To make a decision whether to reject the null hypothesis (or contrariwise, fail to reject it), we need to know how close is close to the expected value. This step depends on how much variability there is in the random process around its expected value. In other words, we need to know the variance of the expected value under the null hypothesis. Unfortunately, the variance of the theoretical random processes is not known in the case of many spatial point pattern techniques (the quadrat-based test of independence is an exception). For a long time, this meant that the techniques remained purely descriptive, and it was not possible to quantify uncertainty when trying to decide whether a pattern was random: the decision would remain purely subjective. Fortunately, with the growth in use of computers in statistical analysis, the lack of theoretical expressions for the variance can be circumvented by means of simulation. Simulation has many applications in statistics, and is certainly relevant in the analysis of point patterns, allowing us to generate null landscapes with ease. 17.5 Null Landscapes Revisited A null landscape is a landscape produced by a random process. In previous practices you saw various ways of generating null landscapes. A useful way of generating null landscapes for point patterns is by means of a Poisson process. The package spatstat implements this by means of the function rpoispp(). This function generates a null landscape given an intensity parameter and a window. 
Before creating a null landscape, we can check the characteristics of the patterns in the dataset: summary(pp1.ppp) ## Planar point pattern: 81 points ## Average intensity 81 points per square unit ## ## Coordinates are given to 8 decimal places ## ## Window: rectangle = [0, 1] x [0, 1] units ## Window area = 1 square unit You can verify that the intensity in every case is 81 points per square unit, and the window is a square unit. Let us copy the window from one of the patterns in the sample dataset: # We can use `$` to index an item in the object `pp1.ppp` W <- pp1.ppp$window It is possible to generate a null landscape as follows, by means of the function rpoispp(). The arguments of this function are a desired intensity (\\(\\lambda\\)) and a window: # The function `rpoispp()` is used to generate null landscapes based on a Poisson process sim1 <- rpoispp(lambda = 81, win = W) The value (i.e., output) of this function is a ppp object that can be analyzed in all the ways that you already know. For instance, you can plot it: plot(sim1) Importantly, you can apply any of the techniques that you have seen so far, for instance, the \\(\\hat{G}\\)-function: g_sim1 <- Gest(sim1, correction = "none") We can try plotting the empirical functions (notice that the result of Gest is a dataframe with the values of r, the distance variable, the raw or empirical function, and the theoretical function). To plot using ggplot2 you can stack the two dataframes as follows (after adding a factor to indicate if it is the empirical function or a simulation): # Use `transmute()` to keep only some columns from a data frame, possibly after calculating new columns; in this example we take `raw` and put it in a column called `G`, we take `r` and put it in a column called `x`, and create a new variable called `Type` to indicate that these values correspond to the empirical function of "Pattern 1". Then we use `rbind()` to bind the rows of this data frame, and the rows of a second data frame that keeps the same columns, but based on the simulated null landscape g_all <- transmute(g_pp1, G = raw, x = r, Type = "Pattern 1") g_all <- rbind(g_all, transmute(g_sim1, G = raw, x = r, Type = "Simulation")) We can use ggplot2 to create a plot of the two functions: # By assigning `Type` to the aesthetic of `color` in `ggplot()`, we plot lines of different types in different colors ggplot(data = g_all, aes(x= x, y = G, color = Type)) + geom_line() After seeing the plot above, we notice that the empirical function is very, very similar to the simulated null landscape. But is this purely a coincidence? After all, when we simulate a null landscape, there is the possibility, however improbable, that it will replicate some meaningful process purely by chance. To be sure, we can simulate and analyze a second null landscape: sim2 <- rpoispp(lambda = 81, win = W) g_sim2 <- Gest(sim2, correction = "none") g_all <- rbind(g_all, transmute(g_sim2, G = raw, x = r, Type = "Simulation")) Plot again: ggplot(data = g_all, aes(x= x, y = G, color = Type)) + geom_line() The empirical function continues to look very similar to the simulated null landscapes. We could simulate more null landscapes and increase our confidence that the empirical function indeed is similar to a null landscape (notice the use of a for loop to repeat the same instructions multiple times): # Flow control functions include `for()`; this function will repeat the statements that follow a set number of times. 
In this example, we had already simulated 2 null landscapes above, so we want to simulate null landscapes 3 through 99 for(i in 3:99){ g_sim <- Gest(rpoispp(lambda = 81, win = W), correction = "none") g_all <- rbind(g_all, transmute(g_sim, G = raw, x = r, Type = "Simulation")) } With this we have generated 99 distinct null landscapes. Try plotting the empirical function with the functions of all of these simulated landscapes: ggplot(data = g_all, aes(x= x, y = G, color = Type)) + geom_line() You can see in the plot above that the empirical function is actually not visible! It is obscured by the null landscapes, since it falls somewhere within the limits of the functions for all the simulated patterns. The interpretation of this is as follows: out of 100 patterns (the empirical pattern and 99 null landscapes), the empirical pattern is not noticeably different from the random ones. How confident would you be rejecting the null hypothesis, i.e., deciding that the empirical pattern is not random? We can follow the same process, but now comparing the second pattern pp2.ppp to the simulated null landscapes: # Compute the G-function for the point pattern in `pp2.ppp` and then extract the value of G, the distance, and label it as "Pattern 2" in a new data frame (by means of `transmute()`) g_pp2 <- Gest(pp2.ppp, correction = "none") g_pp2 <- transmute(g_pp2, G = raw, x = r, Type = "Pattern 2") # Bind the results of the G-function for `pp2.ppp` to the data frame with the simulations, and use `mutate()` to convert `Type` into a factor g_all <- rbind(g_all, g_pp2) g_all <- mutate(g_all, Type = factor(Type, levels = c("Pattern 1", "Pattern 2", "Simulation"))) # Use filter to remove all observations associated with "Pattern 1"; in this case, Type not equal (i.e., `!=`) to "Pattern 1". This way we can plot only the G-function of "Pattern 2" and the simulations ggplot(data = filter(g_all, Type != "Pattern 1"), aes(x= x, y = G, color = Type)) + geom_line() We can see that the empirical \\(\\hat{G}\\)-function of pp2.ppp is quite distinct from the 99 null landscapes that we generated! How confident would you be rejecting the null hypothesis now? 17.6 Simulation Envelopes Simulation, as seen above, can be quite powerful for hypothesis testing in situations where the theoretical parameters, for example the variance of a function, are not known. Essentially, the area covered by the \\(\\hat{G}\\)-functions of the simulated landscapes above is an estimate of the variance of the function. The set of functions estimated on the null landscapes is used to obtain what we call simulation envelopes. Since we lack a theoretical expression for the variance, we cannot obtain \\(p\\)-values to inform our decision to reject the null hypothesis. The simulation, however, provides a pseudo-\\(p\\)-value. If you generate 99 null landscapes, and the empirical pattern is still different, the probability that you are mistaken by rejecting the null hypothesis is at most 1% (since the next simulated landscape could expand the envelopes in such a way that it completely contains the empirical function). As you saw above, using simulation for hypothesis testing is, in general terms, a relatively straightforward process (assuming that the null process is properly defined, etc.). The package spatstat includes a function, called envelope(), that can be used to generate simulation envelopes for several statistics used in point pattern analysis. 
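Before turning to envelope(), note that the simulations described above already contain everything needed for a pseudo-\\(p\\)-value. Here is a minimal, self-contained sketch: it summarizes each pattern by the maximum absolute difference between its empirical \\(\\hat{G}\\)-function and the theoretical Poisson function (this choice of summary statistic is ours, purely for illustration, and the object names are hypothetical), and compares the value for the observed pattern to the values for 99 freshly simulated null landscapes:
# Summary statistic: largest absolute deviation of the empirical G-function from the theoretical Poisson G-function
summary_stat <- function(pp) {
  g <- Gest(pp, correction = "none")
  max(abs(g$raw - g$theo))
}
# Value of the statistic for the observed pattern and for 99 null landscapes
obs <- summary_stat(pp1.ppp)
sims <- replicate(99, summary_stat(rpoispp(lambda = 81, win = W)))
# Pseudo-p-value: proportion of patterns (observed plus simulated) at least as extreme as the observed one
(1 + sum(sims >= obs)) / (1 + 99)
The envelope() function, discussed next, packages this kind of comparison in a convenient way.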
For instance, for the \\(\\hat{G}\\)-function, with 99 simulated landscapes: # The function `envelope()` automates what we did above, simulating null landscapes; it takes as arguments a `ppp` object for the empirical pattern, a function that we desire to test, for example the function `Gest`, as well as the number of simulations that we wish to conduct. An additional argument `funargs = ` is used to pass other arguments to the function that is evaluated, i.e., in this example `Gest` env_pp1 <- envelope(pp1.ppp, Gest, nsim = 99, funargs = list(correction = "none")) ## Generating 99 simulations of CSR ... ## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, ## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, ## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99. ## ## Done. The envelopes can be plotted: plot(env_pp1) It is easy to see that in this case the empirical function falls within the simulation envelopes, and thus it is very unlikely to be different from the null landscapes. Also, the \\(\\hat{F}\\)-function: env_pp2 <- envelope(pp2.ppp, Fest, nsim = 99, funargs = list(correction = "none")) ## Generating 99 simulations of CSR ... ## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, ## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, ## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99. ## ## Done. plot(env_pp2) Now the empirical function lies well outside the simulation envelopes, which makes it very unlikely that it is similar to the null landscapes. And finally, the \\(\\hat{K}\\)-function: env_pp3 <- envelope(pp3.ppp, Kest, nsim = 99, funargs = list(correction = "none")) ## Generating 99 simulations of CSR ... ## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, ## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, ## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99. ## ## Done. plot(env_pp3) Again, the empirical function lies mostly outside of the simulation envelopes, meaning that it is very improbable that it represents a random process. Simulation envelopes are a powerful way to test the hypothesis of null landscapes in the case of spatial point patterns. 17.7 Things to Keep in Mind! Before concluding the topic of point pattern analysis, here are a few important caveats to keep in mind. 17.7.1 Definition of a Region When defining the region (or window) for the analysis, care must be taken that it is reasonable from the perspective of the process under analysis. Defining the region in an inappropriate way can easily lead to misleading results. Consider for instance the first pattern in the dataset. This pattern was defined for a unit-square window. We can apply the \\(\\hat{K}\\)-function to it: k_env_pp1 <- envelope(pp1.ppp, Kest, nsim = 99, funargs = list(correction = "none")) ## Generating 99 simulations of CSR ... 
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, ## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, ## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99. ## ## Done. plot(k_env_pp1) Based on this we would most likely conclude that the pattern is random. But if we replace the unit-square window by a much larger window, as follows: W2 <- owin(x = c(-2,4), y = c(-2, 4)) pp1_reg2 <- as.ppp(as.data.frame(pp1.ppp), W = W2) plot(pp1_reg2) In the context of the larger window, the point pattern now looks clustered! See how the definition of the window would change your conclusions regarding the pattern: k_env_pp1_reg2 <- envelope(pp1_reg2, Kest, nsim = 99, funargs = list(correction = "none")) ## Generating 99 simulations of CSR ... ## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, ## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, ## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99. ## ## Done. plot(k_env_pp1_reg2) Care must be taken when defining the window/region for analysis to avoid spurious results. 17.7.2 Edge Effects As discussed above, definition of the window (region) is critical. If at all possible, the region should be selected in such a way that it is consistent with the underlying process. This is not always possible, either because the underlying process is not known, or because of limitations in data collection capabilities. When this is the case, it is necessary to define a boundary that does not correspond necessarily with the extent of the process of interest. For example, analysis of business locations in Toronto may be limited to the city limits. This does not mean that establishments do not exist beyond those boundaries. When the extent of the process exceeds the window used in the analysis, the point pattern is observed only partially, and it is possible that the omitted information regarding the location of events beyond the boundary may introduce some bias. Consider the situation illustrated in Figure 17.1. Figure 17.1: Edge effects In the figure, the region is the rectangular window. Events are observed only inside the window, but events still exist beyond the edges of the window. It is straightforward to see how the empty space (\\(\\hat{F}\\)-) function would be biased, since locations near the edge would appear the be more distant from an event than they actually are. Several corrections are available in spatstat to deal with the possibility of edge effects. So far, we have used the argument correction = \"none\" when applying the functions. The following alternative corrections are implemented: “none”, “rs”, “km”, “cs” and “best”. Alternatively correction = \"all\" selects all options. These corrections are variations of weighting schemes. In other words, the statistic is weighted to give an unbiased estimator. See: plot(Gest(pp2.ppp, correction = "all")) The different corrections are plotted. It can be seen in this case that the corrections are relatively small, relative to the uncorrected empirical line; however, this is not always the case. 
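If you want to compare the corrected and uncorrected estimates numerically rather than visually, the object returned by Gest() can be coerced to a data frame; a minimal sketch (the exact set of columns depends on the corrections requested, but it includes the distance r and the theoretical value theo, plus one column per correction):
# Estimate the G-function of `pp2.ppp` with all available edge corrections
g_corrected <- Gest(pp2.ppp, correction = "all")
# Coerce to a data frame and inspect the first few rows
head(as.data.frame(g_corrected))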
17.7.3 Sampled Point Patterns Whereas edge effects can introduce bias by censoring the observations outside of the window/region, another issue emerges when not all events are observed inside the window. We have assumed so far that any point pattern under analysis consists of a census of events, or in other words, that all relevant events have been recorded. A sampled point pattern, on the other hand, is a pattern where not all events have been recorded (see Figure 17.2). Figure 17.2: Sampled point pattern The bias introduced by sampled point patterns can be extremely serious, because the findings depend heavily on the observations that were recorded as well as those that were not recorded! Clustered events could easily give the impression of a dispersed pattern, depending on what was observed. Imagine for instance that the events are nests of birds. If the birds tend to nest in the thickest parts of the forest that observers cannot easily access, the “observed” pattern will depend crucially on the trails and other routes of access that the researcher can use. There are no good solutions to the bias introduced by sampled point patterns, and it is not recommended to use the techniques discussed here with sampled point patterns. This concludes the topic of spatial point patterns. References "],
["activity-8-point-pattern-analysis-v.html", "Chapter 18 Activity 8: Point Pattern Analysis V 18.1 Practice questions 18.2 Learning objectives 18.3 Suggested reading 18.4 Preliminaries 18.5 Activity", " Chapter 18 Activity 8: Point Pattern Analysis V Remember, you can download the source file for this activity from here. 18.1 Practice questions Answer the following questions: Describe the process to use simulation for hypothesis testing Why is the selection of an appropriate region critical for the analysis of point patterns? Discuss the issues associated with the edges of a region. What is a sampled point pattern? 18.2 Learning objectives In this activity, you will: Explore a dataset using single scale distance-based techniques. Explore the characteristics of a point pattern at multiple scales. Discuss ways to evaluate how confident you are that a pattern is random. 18.3 Suggested reading O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey. 18.4 Preliminaries For this activity you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity. In addition to tidyverse, you will need spatstat, a package designed for the analysis of point patterns (you can learn about spatstat here and here): library(tidyverse) library(spatstat) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' Load a dataset of your choice. It could be one of the datasets that we have used before (Toronto Business Points, Bear GPS Locations), or one of the datasets included with the package spatstat. 
To see what datasets are available through the package, do the following: vcdExtra::datasets("spatstat.data") ## Item class dim ## 1 Kovesi list 41x13 ## 2 amacrine ppp 6 ## 3 anemones ppp 6 ## 4 ants ppp 6 ## 5 ants.extra list 7 ## 6 austates list 4 ## 7 bdspots list 3 ## 8 bei ppp 5 ## 9 bei.extra list 2 ## 10 betacells ppp 6 ## 11 bramblecanes ppp 6 ## 12 bronzefilter ppp 6 ## 13 cells ppp 5 ## 14 cetaceans list 9x4 ## 15 cetaceans.extra list 1 ## 16 chicago ppx 3 ## 17 chorley ppp 6 ## 18 chorley.extra list 2 ## 19 clmfires ppp 6 ## 20 clmfires.extra list 2 ## 21 copper list 7 ## 22 demohyper list 3x3 ## 23 demopat ppp 6 ## 24 dendrite ppx 3 ## 25 finpines ppp 6 ## 26 flu list 41x4 ## 27 ganglia ppp 6 ## 28 gordon ppp 5 ## 29 gorillas ppp 6 ## 30 gorillas.extra list 7 ## 31 hamster ppp 6 ## 32 heather list 3 ## 33 humberside ppp 6 ## 34 humberside.convex ppp 6 ## 35 hyytiala ppp 6 ## 36 japanesepines ppp 5 ## 37 lansing ppp 6 ## 38 letterR owin 5 ## 39 longleaf ppp 6 ## 40 mucosa ppp 6 ## 41 mucosa.subwin owin 4 ## 42 murchison list 3 ## 43 nbfires ppp 6 ## 44 nbfires.extra list 2 ## 45 nbw.rect owin 4 ## 46 nbw.seg list 5 ## 47 nztrees ppp 5 ## 48 osteo list 40x5 ## 49 paracou ppp 6 ## 50 ponderosa ppp 5 ## 51 ponderosa.extra list 2 ## 52 pyramidal list 31x2 ## 53 redwood ppp 5 ## 54 redwood3 ppp 5 ## 55 redwoodfull ppp 5 ## 56 redwoodfull.extra list 5 ## 57 residualspaper list 7 ## 58 shapley ppp 6 ## 59 shapley.extra list 3 ## 60 simba list 10x2 ## 61 simdat ppp 5 ## 62 simplenet list 10 ## 63 spiders ppx 3 ## 64 sporophores ppp 6 ## 65 spruces ppp 6 ## 66 swedishpines ppp 5 ## 67 urkiola ppp 6 ## 68 vesicles ppp 5 ## 69 vesicles.extra list 4 ## 70 waka ppp 6 ## 71 waterstriders list 3 ## Title ## 1 Colour Sequences with Uniform Perceptual Contrast ## 2 Hughes' Amacrine Cell Data ## 3 Beadlet Anemones Data ## 4 Harkness-Isham ants' nests data ## 5 Harkness-Isham ants' nests data ## 6 Australian States and Mainland Territories ## 7 Breakdown Spots in Microelectronic Materials ## 8 Tropical rain forest trees ## 9 Tropical rain forest trees ## 10 Beta Ganglion Cells in Cat Retina ## 11 Hutchings' Bramble Canes data ## 12 Bronze gradient filter data ## 13 Biological Cells Point Pattern ## 14 Point patterns of whale and dolphin sightings. ## 15 Point patterns of whale and dolphin sightings. ## 16 Chicago Crime Data ## 17 Chorley-Ribble Cancer Data ## 18 Chorley-Ribble Cancer Data ## 19 Castilla-La Mancha Forest Fires ## 20 Castilla-La Mancha Forest Fires ## 21 Berman-Huntington points and lines data ## 22 Demonstration Example of Hyperframe of Spatial Data ## 23 Artificial Data Point Pattern ## 24 Dendritic Spines Data ## 25 Pine saplings in Finland. 
## 26 Influenza Virus Proteins ## 27 Beta Ganglion Cells in Cat Retina, Old Version ## 28 People in Gordon Square ## 29 Gorilla Nesting Sites ## 30 Gorilla Nesting Sites ## 31 Aherne's hamster tumour data ## 32 Diggle's Heather Data ## 33 Humberside Data on Childhood Leukaemia and Lymphoma ## 34 Humberside Data on Childhood Leukaemia and Lymphoma ## 35 Scots pines and other trees at Hyytiala ## 36 Japanese Pines Point Pattern ## 37 Lansing Woods Point Pattern ## 38 Window in Shape of Letter R ## 39 Longleaf Pines Point Pattern ## 40 Cells in Gastric Mucosa ## 41 Cells in Gastric Mucosa ## 42 Murchison gold deposits ## 43 Point Patterns of New Brunswick Forest Fires ## 44 Point Patterns of New Brunswick Forest Fires ## 45 Point Patterns of New Brunswick Forest Fires ## 46 Point Patterns of New Brunswick Forest Fires ## 47 New Zealand Trees Point Pattern ## 48 Osteocyte Lacunae Data: Replicated Three-Dimensional Point Patterns ## 49 Kimboto trees at Paracou, French Guiana ## 50 Ponderosa Pine Tree Point Pattern ## 51 Ponderosa Pine Tree Point Pattern ## 52 Pyramidal Neurons in Cingulate Cortex ## 53 California Redwoods Point Pattern (Ripley's Subset) ## 54 California Redwoods Point Pattern (Ripley's Subset) ## 55 California Redwoods Point Pattern (Entire Dataset) ## 56 California Redwoods Point Pattern (Entire Dataset) ## 57 Data and Code From JRSS Discussion Paper on Residuals ## 58 Galaxies in the Shapley Supercluster ## 59 Galaxies in the Shapley Supercluster ## 60 Simulated data from a two-group experiment with replication within each group. ## 61 Simulated Point Pattern ## 62 Simple Example of Linear Network ## 63 Spider Webs on Mortar Lines of a Brick Wall ## 64 Sporophores Data ## 65 Spruces Point Pattern ## 66 Swedish Pines Point Pattern ## 67 Urkiola Woods Point Pattern ## 68 Vesicles Data ## 69 Vesicles Data ## 70 Trees in Waka national park ## 71 Waterstriders data. Three independent replications of a point pattern formed by insects. Load a dataset of your choice. You can do this by using the load() function if the dataset is in your drive (e.g., the GPS coordinates of the bear). On the other hand, if the dataset is included with the spatstat package you can do the following, for example to load the gorillas dataset: gorillas.ppp <- gorillas As usual, you can check the object by means of the summary function: summary(gorillas.ppp) ## Marked planar point pattern: 647 points ## Average intensity 3.255566e-05 points per square metre ## ## *Pattern contains duplicated points* ## ## Coordinates are given to 2 decimal places ## i.e. rounded to the nearest multiple of 0.01 metres ## ## Mark variables: group, season, date ## Summary: ## group season date ## major:350 dry :275 Min. :2006-01-06 ## minor:297 rainy:372 1st Qu.:2007-03-15 ## Median :2008-02-05 ## Mean :2007-12-14 ## 3rd Qu.:2008-09-23 ## Max. :2009-05-31 ## ## Window: polygonal boundary ## single connected closed polygon with 21 vertices ## enclosing rectangle: [580457.9, 585934] x [674172.8, 678739.2] metres ## (5476 x 4566 metres) ## Window area = 19873700 square metres ## Unit of length: 1 metre ## Fraction of frame area: 0.795 18.5 Activity Partner with a fellow student to analyze the chosen dataset. Discuss whether the pattern is random, and how confident you are in your decision. The analysis of the pattern is meant to provide insights about the underlying process. Create a hypothesis using the data generated and can you answer that hypothesis using the plots generated? 
Discuss the limitations of the analysis, for instance, choice of modeling parameters (size of region, kernel bandwidths, edge effects, etc.) "],
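As a rough sketch of how this activity could begin (assuming the gorillas point pattern loaded above, and using spatstat functions introduced in earlier chapters), a quadrat test and a simulation envelope for the K-function are two quick checks of the hypothesis of spatial randomness; the grid size and number of simulations below are arbitrary choices that are worth varying:

library(spatstat)

# Work with the unmarked pattern of gorilla nesting sites
gorillas.unmarked <- unmark(gorillas.ppp)

# Quadrat test of the null hypothesis of a homogeneous Poisson process
quadrat.test(gorillas.unmarked, nx = 4, ny = 4)

# Simulation envelopes for the K-function under complete spatial randomness
gorillas.env <- envelope(gorillas.unmarked, Kest, nsim = 99)
plot(gorillas.env)

Results of this kind speak to whether the pattern departs from randomness; the discussion of why (the underlying process) still depends on knowledge of the data.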
["area-data-i.html", "Chapter 19 Area Data I 19.1 Learning Objectives 19.2 Suggested Readings 19.3 Preliminaries 19.4 Area Data 19.5 Processes and Area Data 19.6 Visualizing Area Data: Choropleth Maps 19.7 Visualizing Area Data: Cartograms", " Chapter 19 Area Data I NOTE: You can download the source files for this book from here. The source files are in the format of R Notebooks. Notebooks are pretty neat, because the allow you execute code within the notebook, so that you can work interactively with the notes. If you wish to work interactively with this chapter you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. 19.1 Learning Objectives In last few practices/sessions, you learned about spatial point patterns. The next few sessions will concentrate on area data. In this practice, you will learn: A formal definition of area data. Processes and area data. Visualizing area data: Choropleth maps. Visualizing area data: Cartograms. 19.2 Suggested Readings Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 7. Longman: Essex. Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 9. Springer: New York. Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 7. Sage: Los Angeles. O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey. 19.3 Preliminaries As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity: library(tidyverse) library(sf) library(plotly) library(cartogram) library(gridExtra) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' Read the data used in this chapter. data("Hamilton_CT") The data are an object of class sf that includes the spatial information for the census tracts in the Hamilton Census Metropolitan Area in Canada and a series of demographic variables from the 2011 Census of Canada. You can quickly verify the contents of the dataframe by means of summary: summary(Hamilton_CT) ## ID AREA TRACT POPULATION ## Min. : 919807 Min. : 0.3154 Length:188 Min. : 5 ## 1st Qu.: 927964 1st Qu.: 0.8552 Class :character 1st Qu.: 2639 ## Median : 948130 Median : 1.4157 Mode :character Median : 3595 ## Mean : 948710 Mean : 7.4578 Mean : 3835 ## 3rd Qu.: 959722 3rd Qu.: 2.7775 3rd Qu.: 4692 ## Max. :1115750 Max. :138.4466 Max. :11675 ## POP_DENSITY AGE_LESS_20 AGE_20_TO_24 AGE_25_TO_29 ## Min. : 2.591 Min. : 0.0 Min. : 0.0 Min. : 0.0 ## 1st Qu.: 1438.007 1st Qu.: 528.8 1st Qu.:168.8 1st Qu.:135.0 ## Median : 2689.737 Median : 750.0 Median :225.0 Median :215.0 ## Mean : 2853.078 Mean : 899.3 Mean :253.9 Mean :232.8 ## 3rd Qu.: 3783.889 3rd Qu.:1110.0 3rd Qu.:311.2 3rd Qu.:296.2 ## Max. :14234.286 Max. :3285.0 Max. :835.0 Max. :915.0 ## AGE_30_TO_34 AGE_35_TO_39 AGE_40_TO_44 AGE_45_TO_49 ## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. 
: 0.0 ## 1st Qu.: 135.0 1st Qu.: 145.0 1st Qu.: 170.0 1st Qu.:203.8 ## Median : 195.0 Median : 200.0 Median : 230.0 Median :282.5 ## Mean : 228.2 Mean : 239.6 Mean : 268.7 Mean :310.6 ## 3rd Qu.: 281.2 3rd Qu.: 280.0 3rd Qu.: 325.0 3rd Qu.:385.0 ## Max. :1320.0 Max. :1200.0 Max. :1105.0 Max. :880.0 ## AGE_50_TO_54 AGE_55_TO_59 AGE_60_TO_64 AGE_65_TO_69 AGE_70_TO_74 ## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0.0 Min. : 0.0 ## 1st Qu.:203.8 1st Qu.:175.0 1st Qu.:140 1st Qu.:115.0 1st Qu.: 90.0 ## Median :280.0 Median :240.0 Median :220 Median :157.5 Median :130.0 ## Mean :300.3 Mean :257.7 Mean :229 Mean :174.2 Mean :139.7 ## 3rd Qu.:375.0 3rd Qu.:325.0 3rd Qu.:295 3rd Qu.:221.2 3rd Qu.:180.0 ## Max. :740.0 Max. :625.0 Max. :540 Max. :625.0 Max. :540.0 ## AGE_75_TO_79 AGE_80_TO_84 AGE_MORE_85 geometry ## Min. : 0.00 Min. : 0.00 Min. : 0.00 POLYGON :188 ## 1st Qu.: 68.75 1st Qu.: 50.00 1st Qu.: 35.00 epsg:26917 : 0 ## Median :100.00 Median : 77.50 Median : 70.00 +proj=utm ...: 0 ## Mean :118.32 Mean : 95.05 Mean : 87.71 ## 3rd Qu.:160.00 3rd Qu.:120.00 3rd Qu.:105.00 ## Max. :575.00 Max. :420.00 Max. :400.00 19.4 Area Data Every phenomena can be measured at a location (ask yourself, what exists outside of space?). In point pattern analysis, the unit of support is the point, and the source of randomness is the location itself. Many other forms of data are also collected at points. For instance, when the census collects information on population, at its most basic, the information can be georeferenced to an address, that is, a point. In numerous applications, however, data are not reported at their fundamental unit of support, but rather are aggregated to some other geometry, for instance an area. This is done for several reasons, including the privacy and confidentiality of the data. Instead of reporting individual-level information, the information is reported for zoning systems that often are devised without consideration to any underlying social, natural, or economic processes. Census data, for example, are reported at different levels of geography. In Canada, the smallest publicly available geography is called a Dissemination Area or DA. A DA in Canada contains a population between 400 and 700 persons. Thus, instead of reporting that one person (or more) are located at a point (i.e., an address), the census reports the population for the DA. Other data are aggregated in similar ways (income, residential status, etc.) At the highest level of aggregation, national level statistics are reported, such as Gross Domestic Product, or GDP. Economic production is not evenly distributed across space; however, the national GDP does not distinguish regional variations in this process. Ideally, a data analyst would work with data in its most fundamental support. This is not alway possible, and therefore many techniques have been developed to work with data that have been agregated to zones. When working with areas, it is less practical to identify the area with the coordinates (as we did with points). After all, areas will be composed of lines and reporting all the relevant coordinates is impractical. Sometimes the geometric centroids of the areas are used instead. More commonly, areas are assigned an index or unique identifier, so that a region will typically consist of a set of \\(n\\) areas as follows: \\[ R = A_1 \\cup A_2 \\cup A_3 \\cup ...\\cup A_n. \\] The above is read as “the Region R is the union of Areas 1 to n”. 
Regions can have a set of \\(k\\) attributes or variables associated with them, for instance: \\[ \\textbf{X}_i=[x_{i1}, x_{i2}, x_{i3},...,x_{ik}] \\] These attributes will typically be counts (e.g., number of people in a DA), or some summary measure of the underlying data (e.g., mean commute time). 19.5 Processes and Area Data Imagine that data on income by household were collected as follows: # Here, we are creating a dataframe with three columns, coordinates x and y in space to indicate the locations of households, and their income. df <- data.frame(x = c(0.3, 0.4, 0.5, 0.6, 0.7), y = c(0.1, 0.4, 0.2, 0.5, 0.3), Income = c(30000, 30000, 100000, 100000, 100000)) Households are geocoded as points with coordinates x and y, whereas income is in dollars. Plot the income as points (hover over the points to see the attributes): # The `ggplot()` function is used to create a plot. The function `geom_point()` adds points to the plot, using the values of coordinates x and y, and coloring by Income. Higher income households appear to be on the East regions of the area. p <- ggplot(data = df, aes(x = x, y = y, color = Income)) + geom_point(shape = 17, size = 5) + coord_fixed() ggplotly(p) The underlying process is one of income sorting, with lower incomes to the west, and higher incomes to the east. This could be due to a geographical feature of the landscape (for instance, an escarpment), or the distribution of the housing stock (with a neighborhood that has more expensive houses). These are examples of a variable that responds to a common environmental factor. As an alternative, people may display a preference towards being near others that are similar to them (this is called homophily). When this happens, the variable responds to itself in space. The quality of similarity or disimilarity between neighboring observations of the same variable in space is called spatial autocorrelation. You will learn more about this later on. Another reason why variables reported for areas could display similarities in space is as an consequence of the zoning system. Suppose for a moment that the data above can only be reported at the zonal level, perhaps because of privacy and confidentiality concerns. Thanks to the great talent of the designers of the zoning system (or a felicitous coincidence!), the zoning system is such that it is consistent with the underlying process of sorting. The zones, therefore, are as follows: # Here, we create a new dataframe with the coordinates necessary to define two zones. The zones are rectangles, so we need to define four corners for each. "Zone_ID" only has 2 values because there are only two zones in the analysis. zones1 <- data.frame(x1=c(0.2, 0.45), x2=c(0.45, 0.80), y1=c(0.0, 0.0), y2=c(0.6, 0.6), Zone_ID = c('1','2')) If you add these zones to the plot: # Similar to the plot above, but adding the zones with `geom_rect()` for plotting rectangles. p <- ggplot() + geom_rect(data = zones1, mapping = aes(xmin = x1, xmax = x2, ymin = y1, ymax = y2, fill = Zone_ID), alpha = 0.3) + geom_point(data = df, aes(x = x, y = y, color = Income), shape = 17, size = 5) + coord_fixed() ggplotly(p) What is the mean income in zone 1? What is the mean income in zone 2? Not only are the summary measures of income highly representative of the observations they describe, the two zones are also highly distinct. Imagine now that for whatever reason (lack of prior knowledge of the process, convenience for data collection, etc.) 
the zones instead are as follows: # Note how the values have changed for x1 and x2. This reveals that the zones have shifted and are no longer the same as the plot above. zones2 <- data.frame(x1=c(0.2, 0.55), x2=c(0.55, 0.80), y1=c(0.0, 0.0), y2=c(0.6, 0.6), Zone_ID = c('1','2')) If you plot these zones: p <- ggplot() + geom_rect(data = zones2, mapping = aes(xmin = x1, xmax = x2, ymin = y1, ymax = y2, fill = Zone_ID), alpha = 0.3) + geom_point(data = df, aes(x = x, y = y, color = Income), shape = 17, size = 5) + coord_fixed() ggplotly(p) What is now the mean income of zone 1? What is the mean income of zone 2? The observations have not changed, and the generating spatial process remains the same. However, as you can see, the summary measures for the two zones are more similar in this case than they were when the zones more closely captured the underlying process. 19.6 Visualizing Area Data: Choropleth Maps The very first step when working with spatial area data, perhaps, is to visualize the data. Commonly, area data are visualized by means of choropleth maps. A choropleth map is a map of the polygons that form the areas in the region, each colored in a way to represent the value of an underlying variable. Lets use ggplot2 to create a choropleth map of population in Hamilton. Notice that the fill color for the polygons is given by cutting the values of POPULATION in five equal segments. In other words, the colors represent zones in the bottom 20% of population, zones in the next 20%, and so on, so that the darkest zones are those with populations so large as to be in the top 20% of the population distribution: # Geographical information can also be plotted using `ggplot2` when it is in the form of simple features or `sf`. Here, we create a plot with function `ggplot()`. We also have available the census tracts for Hamilton in an `sf` dataframe. To plot the distribution of the population in five equal segments (or quintiles), we apply the function `cut_number()` to the variable `POPULATION` from the `Hamilton_CT` census tract dataframe. The aesthetic value for `fill` will color the zones according to the population quintiles. ggplot(Hamilton_CT) + geom_sf(aes(fill = cut_number(Hamilton_CT$POPULATION, 5)), color = NA, size = 0.1) + scale_fill_brewer(palette = "YlOrRd") + coord_sf() + labs(fill = "Population") Inspect the map above. Would you say that the distribution of population is random, or not random? If not random, what do you think might be an underlying process for the distribution of population? Often, creating a choropleth map using the absolute value of a variable can be somewhat misleading. As illustrated by the map of population by census tract in Hamilton, the zones with the largest population are also often large zones. Many process are confounded by the size of the zones: quite simply, in larger areas often there is more of, well, almost anything, compared with smaller areas. For this reason, it is often more informative when creating a choropleth map to use a variable that is a rate. Rates are quantities that are measured with respect to something. For instance population measured by area, or population density, is a rate: # Note how the `cut_number()` is applied to population density rather than population like the figure above. This gives a more different, and perhaps more informative, of the distribution of population, by measuring population against area. 
pop_den.map <- ggplot(Hamilton_CT) + geom_sf(aes(fill = cut_number(Hamilton_CT$POP_DENSITY, 5)), color = "white", size = 0.1) + scale_fill_brewer(palette = "YlOrRd") + labs(fill = "Pop Density") pop_den.map It can be seen now that the population density is higher in the more central parts of Hamilton, Burlington, Dundas, etc. Does the map look random? If not, what might be an underlying process that explains the variations in population density in a city like Hamilton? Other times, it is appropriate to standardize instead of by area, by what might be called the population at risk. For instance, imagine that we wanted to explore the distribution of the population of older adults (say, 65 and older). In this case, if instead of normalizing by area, we used the total population instead, would remove the “size” effect, giving a rate: #The "HAMILTON_CT" dataframe portions ages by category. For this choropleth map, we sum all age categories over 65, and then divide by total population. This measures the population of older adults against total population, to give a proportion (the rate out of a total). ggplot(Hamilton_CT) + geom_sf(aes(fill = cut_number((Hamilton_CT$AGE_65_TO_69 + Hamilton_CT$AGE_70_TO_74 + Hamilton_CT$AGE_75_TO_79 + Hamilton_CT$AGE_80_TO_84 + Hamilton_CT$AGE_MORE_85) / Hamilton_CT$POPULATION, 5)), color = NA, size = 0.1) + scale_fill_brewer(palette = "YlOrRd") + labs(fill = "Prop Age 65+") Do you notice a pattern in the distribution of seniors in the Hamilton, CMA? There are a few things to keep in mind when creating choropleth maps. First, what classification scheme to use, with how many classes, and what colors? The examples above were all created using a classification scheme based on the quintiles of the distribution. As noted above, these are obtained by dividing the sample into 5 equal parts to give bottom 20%, etc., of observations. The quintiles are a particular form of a statistical summary measure known as quantiles. Another example of a quantile is the median, which is the value obtained when the sample is divided in two equal sized parts. Other classification schemes may include the mean, standard deviations, and so on. Essentially, a classification scheme defines a way to divide the sample for representation in a choropleth map. In terms of how many classes to use, often there is little point in using more than six or seven classes, because the human eye cannot distinguish color differences at a much higher resolution. The colors are a matter of style and preference, but there are coloring schemes that are colorblind safe (see here). Also, for communication purposes, there are conventions that assign values or meanings to colors. Maps showing results of elections often use the colors of political parties: this is such a widespread convention that it would be thoroughly confusing if the colors were reversed, more so than if just the colors were exchanged for others. Red is often associated with heat, concentration, or sometimes bad, whereas green is associated with good. Here is an interesting discussion of use of colors in visualization. Secondly, when the zoning system is irregular (as opposed to, say, a raster, which is composed of pixels, regular tiles of consistent size), large zones can easily become dominant. In effect, much detail in the maps above is lost for small zones, whereas large zones, especially if similarly colored, may mislead the eye as to their relative frequency. 
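As a brief aside on the discussion of classification schemes, the following sketch contrasts two of them for the same variable: quintiles, obtained with cut_number(), and five equal-width intervals, obtained with cut_interval(); both functions are part of ggplot2, and grid.arrange() from the gridExtra package (loaded above) arranges the two maps together for comparison:

# Quintiles: each class contains roughly the same number of census tracts
map_quintiles <- ggplot(Hamilton_CT) + geom_sf(aes(fill = cut_number(POP_DENSITY, 5)), color = NA) + scale_fill_brewer(palette = "YlOrRd") + labs(fill = "Pop Density (quintiles)")

# Equal intervals: classes of equal width, with possibly very uneven counts
map_equal <- ggplot(Hamilton_CT) + geom_sf(aes(fill = cut_interval(POP_DENSITY, 5)), color = NA) + scale_fill_brewer(palette = "YlOrRd") + labs(fill = "Pop Density (equal intervals)")

grid.arrange(map_quintiles, map_equal, nrow = 2)

With a skewed variable such as population density, the equal-interval map places most tracts in the lowest class, which illustrates why the choice of scheme deserves attention.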
Another mapping technique, the cartogram, is meant to reduce the issues with small-large zones. 19.7 Visualizing Area Data: Cartograms A cartogram is a map where the size of the zones is adjusted so that instead of being the surface area, it is proportional to some other variable of interest. We will illustrate the idea behind the cartogram here. In the maps that we created above, the zones are faithful to their geographical properties (subject to distortions due to geographical projection). Unfortunately, this feature of the maps obscured the relevance of some of the smaller zones. A cartogram can be weighted by another variable, say for instance, the population. In this way, the size of the zones will depend on the total population. Cartograms are implemented in R in the package cartogram. # The function `cartogram_cont()` constructs a continuous area cartogram. Here, a cartogram is created for census tracts of the city of Hamilton, but the size of the zones will be weighted by the variable `POPULATION`. CT_pop_cartogram <- cartogram_cont(Hamilton_CT, weight = "POPULATION") ## Mean size error for iteration 1: 5.93989832705674 ## Mean size error for iteration 2: 4.5514055520835 ## Mean size error for iteration 3: 7.74856106866916 ## Mean size error for iteration 4: 7.49510294164283 ## Mean size error for iteration 5: 5.12121781701006 ## Mean size error for iteration 6: 3.45188989405368 ## Mean size error for iteration 7: 2.66683855570118 ## Mean size error for iteration 8: 2.23950467189881 ## Mean size error for iteration 9: 1.93816581350794 ## Mean size error for iteration 10: 1.78377894897916 ## Mean size error for iteration 11: 1.62985317085302 ## Mean size error for iteration 12: 1.50983288572639 ## Mean size error for iteration 13: 1.60808238152904 ## Mean size error for iteration 14: 6.67220825006972 ## Mean size error for iteration 15: 8.78821301683394 Plotting the cartogram: #We are using "ggplot" to create a cartogram for populations by census tact in Hamilton. Census tracts with a larger value are distorted to visually represent their population size. The number "5" after calling the population variable states that there will be 5 categories dividing population quantities. ggplot(CT_pop_cartogram) + geom_sf(aes(fill = cut_number(Hamilton_CT$POPULATION, 5)), color = "white", size = 0.1) + scale_fill_brewer(palette = "YlOrRd") + labs(fill = "Population") Notice how the size of the zones has been adjusted. The cartogram can be combined with coloring schemes, as in choropleth maps: CT_popden_cartogram <- cartogram(Hamilton_CT, weight = "POP_DENSITY") ## ## Please use cartogram_cont() instead of cartogram(). 
## Mean size error for iteration 1: 29.0384287070147 ## Mean size error for iteration 2: 26.6652279985395 ## Mean size error for iteration 3: 24.8111000080233 ## Mean size error for iteration 4: 23.2716548947531 ## Mean size error for iteration 5: 21.928598879704 ## Mean size error for iteration 6: 20.7113138849207 ## Mean size error for iteration 7: 19.576698518681 ## Mean size error for iteration 8: 18.4983401508171 ## Mean size error for iteration 9: 17.460238779898 ## Mean size error for iteration 10: 16.453534698246 ## Mean size error for iteration 11: 15.4732800316789 ## Mean size error for iteration 12: 14.5184813061204 ## Mean size error for iteration 13: 13.5901475440423 ## Mean size error for iteration 14: 12.6911089325245 ## Mean size error for iteration 15: 11.8246511070686 Plot the cartogram: pop_den.cartogram <- ggplot(CT_popden_cartogram) + geom_sf(aes(fill = cut_number(Hamilton_CT$POP_DENSITY, 5)),color = "white", size = 0.1) + scale_fill_brewer(palette = "YlOrRd") + labs(fill = "Pop Density") pop_den.cartogram By combining a cartogram with choropleth mapping, it becomes easier to appreciate the way high population density is concentrated in the central parts of Hamilton, Burlington, etc. grid.arrange(pop_den.map, pop_den.cartogram, nrow = 2) This concludes this chapter. References "],
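The cartogram package also includes other cartogram styles. For example, assuming the installed version of the package provides the function cartogram_ncont(), a non-contiguous cartogram rescales each zone in place instead of deforming the whole tessellation; the following is a sketch, not part of the original example:

# Non-contiguous cartogram: each census tract is rescaled in place according to its population
CT_pop_ncartogram <- cartogram_ncont(Hamilton_CT, weight = "POPULATION")

ggplot(CT_pop_ncartogram) + geom_sf(aes(fill = cut_number(POPULATION, 5)), color = "white", size = 0.1) + scale_fill_brewer(palette = "YlOrRd") + labs(fill = "Population")

Non-contiguous cartograms preserve the shape of the zones, which can make them easier to recognize, at the cost of leaving empty space between them.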
["activity-9-area-data-i.html", "Chapter 20 Activity 9: Area Data I 20.1 Practice questions 20.2 Learning objectives 20.3 Suggested reading 20.4 Preliminaries 20.5 Activity", " Chapter 20 Activity 9: Area Data I Remember, you can download the source file for this activity from here. 20.1 Practice questions Answer the following questions: What is a key difference between area data and point data? What is a choropleth map? What is a cartogram? What are the advantages and disadvantages of these mapping techniques? 20.2 Learning objectives In this activity, you will: Create choroplet maps using census data. Think about possible underlying process that could explain the pattern. Think about ways to decide whether a landscape is random when working with area data. 20.3 Suggested reading O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey. 20.4 Preliminaries For this activity you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity. In addition to tidyverse, you will need sf, a package that implements simple features in R (you can learn more about this package here): library(tidyverse) library(sf) library(cartogram) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' In the practice that preceded this activity, you learned about the area data and visualization techniques for area data. Begin by loading the data that you will use in this activity: data("Hamilton_CT") This is an sf object with census tracts and selected demographic variables for the Hamilton CMA in Canada. You can obtain new (calculated) variables as follows. For instance, to obtain the proportion of residents who are between 20 and 34 years old, and between 35 and 49: Hamilton_CT <- Hamilton_CT %>% mutate(Prop20to34 = (AGE_20_TO_24 + AGE_25_TO_29 + AGE_30_TO_34)/POPULATION, Prop35to49 = (AGE_35_TO_39 + AGE_40_TO_44 + AGE_45_TO_49)/POPULATION) You are ready for the next activity. 20.5 Activity NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to organize data, create a plot, and so on in support of analysis and interpretation. These tasks are indicated by a star (*). 1.* Create choropleth maps for the proportion of the population who are 20 to 34 years old, 35 to 49 years old, 50 to 65 years old, and 65 and older. 2.* Create cartograms for the proportion of the population who are 20 to 34 years old, 35 to 49 years old, 50 to 65 years old, and 65 and older. 3.* Change the scheme and colors of your maps to obtain maps with 2 classes/colors, 5 classes/colors, and 10 classes/colors. You can check different color palettes in the documentation of ggplot2. Which scheme is more informative? What colors looked better to you? Show your maps to a fellow student. What patterns do you notice in the distribution of population by age in Hamilton? 
Do you think the distribution of the population by age is random, or not random? Devise a rule to decide whether the pattern observed in a choropleth map is random. "],
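As a sketch of how items 1 to 3 might be approached (the new variable name Prop65plus is arbitrary), the proportion of older adults can be computed with mutate() and mapped with a chosen number of quantile classes:

# Proportion of the population aged 65 and older
Hamilton_CT <- mutate(Hamilton_CT, Prop65plus = (AGE_65_TO_69 + AGE_70_TO_74 + AGE_75_TO_79 + AGE_80_TO_84 + AGE_MORE_85) / POPULATION)

# Choropleth map with five quantile classes; change the 5 to 2 or 10 for item 3
ggplot(Hamilton_CT) + geom_sf(aes(fill = cut_number(Prop65plus, 5)), color = NA) + scale_fill_brewer(palette = "YlOrRd") + labs(fill = "Prop 65+")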
["area-data-ii.html", "Chapter 21 Area Data II 21.1 Learning Objectives 21.2 Suggested Readings 21.3 Preliminaries 21.4 Proximity in Area Data 21.5 Spatial Weights Matrices 21.6 Creating Spatial Weights Matrices in R 21.7 Spatial Moving Averages 21.8 Other Criteria for Coding Proximity", " Chapter 21 Area Data II NOTE: You can download the source files for this book from here. The source files are in the format of R Notebooks. Notebooks are pretty neat, because the allow you execute code within the notebook, so that you can work interactively with the notes. If you wish to work interactively with this chapter you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. 21.1 Learning Objectives In last chapter and activity, you learned about area data and practiced some visualization techniques for spatial data of this type, specifically choropleth maps and cartograms. You also thought about rules to decide whether a mapped variable displayed a spatially random distribution of values. In this practice, you will learn about: The concept of proximity for area data. How to formalize the concept of proximity: spatial weights matrices. How to create spatial weights matrices in R. The use of spatial moving averages. Other criteria for coding proximity. 21.2 Suggested Readings Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 7. Longman: Essex. Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 9. Springer: New York. Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 7. Sage: Los Angeles. O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey. 21.3 Preliminaries As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity: library(tidyverse) library(sf) library(plotly) library(spdep) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' Read the data to be used in this chapter. The data is an object of class sf (simple feature) with the census tracts of Hamilton CMA in Canada, and a selection of demographic variables: data(Hamilton_CT) You can quickly verify the contents of the dataframe by means of summary: summary(Hamilton_CT) ## ID AREA TRACT POPULATION ## Min. : 919807 Min. : 0.3154 Length:188 Min. : 5 ## 1st Qu.: 927964 1st Qu.: 0.8552 Class :character 1st Qu.: 2639 ## Median : 948130 Median : 1.4157 Mode :character Median : 3595 ## Mean : 948710 Mean : 7.4578 Mean : 3835 ## 3rd Qu.: 959722 3rd Qu.: 2.7775 3rd Qu.: 4692 ## Max. :1115750 Max. :138.4466 Max. :11675 ## POP_DENSITY AGE_LESS_20 AGE_20_TO_24 AGE_25_TO_29 ## Min. : 2.591 Min. : 0.0 Min. : 0.0 Min. : 0.0 ## 1st Qu.: 1438.007 1st Qu.: 528.8 1st Qu.:168.8 1st Qu.:135.0 ## Median : 2689.737 Median : 750.0 Median :225.0 Median :215.0 ## Mean : 2853.078 Mean : 899.3 Mean :253.9 Mean :232.8 ## 3rd Qu.: 3783.889 3rd Qu.:1110.0 3rd Qu.:311.2 3rd Qu.:296.2 ## Max. :14234.286 Max. :3285.0 Max. 
:835.0 Max. :915.0 ## AGE_30_TO_34 AGE_35_TO_39 AGE_40_TO_44 AGE_45_TO_49 ## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 ## 1st Qu.: 135.0 1st Qu.: 145.0 1st Qu.: 170.0 1st Qu.:203.8 ## Median : 195.0 Median : 200.0 Median : 230.0 Median :282.5 ## Mean : 228.2 Mean : 239.6 Mean : 268.7 Mean :310.6 ## 3rd Qu.: 281.2 3rd Qu.: 280.0 3rd Qu.: 325.0 3rd Qu.:385.0 ## Max. :1320.0 Max. :1200.0 Max. :1105.0 Max. :880.0 ## AGE_50_TO_54 AGE_55_TO_59 AGE_60_TO_64 AGE_65_TO_69 AGE_70_TO_74 ## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0.0 Min. : 0.0 ## 1st Qu.:203.8 1st Qu.:175.0 1st Qu.:140 1st Qu.:115.0 1st Qu.: 90.0 ## Median :280.0 Median :240.0 Median :220 Median :157.5 Median :130.0 ## Mean :300.3 Mean :257.7 Mean :229 Mean :174.2 Mean :139.7 ## 3rd Qu.:375.0 3rd Qu.:325.0 3rd Qu.:295 3rd Qu.:221.2 3rd Qu.:180.0 ## Max. :740.0 Max. :625.0 Max. :540 Max. :625.0 Max. :540.0 ## AGE_75_TO_79 AGE_80_TO_84 AGE_MORE_85 geometry ## Min. : 0.00 Min. : 0.00 Min. : 0.00 POLYGON :188 ## 1st Qu.: 68.75 1st Qu.: 50.00 1st Qu.: 35.00 epsg:26917 : 0 ## Median :100.00 Median : 77.50 Median : 70.00 +proj=utm ...: 0 ## Mean :118.32 Mean : 95.05 Mean : 87.71 ## 3rd Qu.:160.00 3rd Qu.:120.00 3rd Qu.:105.00 ## Max. :575.00 Max. :420.00 Max. :400.00 21.4 Proximity in Area Data In the earlier part of the text, when working with point data, the spatial relationships among events (their proximity) were more or less unambiguously given by their relative location, or more precisely by their distance. Hence, we had quadrat-based techniques (relative location with respect to a grid), kernel density (relative location with respect to the center of a kernel function), and distance-based techniques (event-to-event and point-to-event distances). In the case of area data, spatial proximity can be represented in more ways, given the characteristics of areas. In particular, an area contains an infinite number of points, and measuring distance between two areas leads to an infinite number of results, depending on which pairs of points within two zones are used to measure the distance. Consider the simple zonal system shown in Figure @ref{fig:simple-zoning-system}. Which of zones \\(A_2\\), \\(A_3\\), and \\(A_4\\) is closer (or more proximate) to \\(A_1\\)? Figure 21.1: Simple zoning system We can devise a way of establishing proximity between areas as follows: if points are selected in such a way that they are on the overlapping edges of two contiguous areas, the distance between these two areas clearly is zero, and they must be proximate. This criterion to define proximity is called adjacency. Adjacency means that two zones share a common edge. This is conventionally called the rook criterion, after chess, in which the piece called the rook can move only orthogonally (in the vertical and horizontal directions). The rook criterion, however, would dictate that zones \\(A_2\\) and \\(A_6\\) are not proximate, despite being closer than \\(A_2\\) and \\(A_3\\). When this criterion is expanded to allow contact at a single point between zones (say, the corner between \\(A_2\\) and \\(A_6\\)), the adjacency criterion is called queen, again, for the chess piece that moves both orthogonally and diagonally. If we accept adjacency as a reasonable way of expressing relationships of proximity between areas, what we need is a way of coding relationships of adjacency in a way that is convenient and amenable to manipulation for data analysis. One of the most widely used tools to code proximity in area data is the spatial weights matrix. 
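Before formalizing adjacency by means of spatial weights matrices in the next section, it can be explored directly with the sf package: the function st_touches() reports, for each polygon, the polygons that share at least one boundary point with it (that is, queen-style contiguity). This is only a quick exploratory check, shown here as a sketch:

# Which census tracts touch the first tract in the dataframe?
adjacent_tracts <- st_touches(Hamilton_CT)
Hamilton_CT$TRACT[adjacent_tracts[[1]]]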
21.5 Spatial Weights Matrices A spatial weights matrix is an arrangement of values (or weights) for all pairs of zones in a system. For instance, in a zoning system such as shown in Figure 1, with 6 zones, there will be \\(6 \\times 6\\) such weights. The weights are organized by rows, in such a way that each zone has a corresponding row of weights. For example, zone \\(A_1\\) in Figure 1 has the following weights, one for each zone in the system: \\[ w_{1\\cdot} = [w_{11}, w_{12}, w_{13}, w_{14}, w_{15}, w_{16}] \\] The values of the weights depend on the adjacency criterion adopted. The simplest coding scheme is when we assign a value of 1 to pairs of zones that are adjacent, and a value of 0 to pairs of zones that are not. Lets formalize the two criteria mentioned above: Rook criterion \\[ w_{ij}=\\bigg\\{\\begin{array}{l l} 1\\text{ if } A_i \\text{ and } A_j \\text{ share an edge}\\\\ 0\\text{ otherwise}\\\\ \\end{array} \\] If rook adjacency is used, the weights for zone \\(A_6\\) are as follows: \\[ w_{6\\cdot} = [0, 0, 0, 1, 1, 0]. \\] As you can see, the adjacent areas from the perspective of \\(A_6\\) are \\(A_4\\) and \\(A_5\\) by virtue of sharing an edge. These two areas receive weights of 1. On the other hand, \\(A_1\\), \\(a_2\\), and \\(A_3\\) are not adjacent, and therefore receive a weight of zero. Notice how the weight \\(w_{66}\\) is set to zero. By convention, an area is not its own neighbor! Queen criterion \\[ w_{ij}=\\bigg\\{\\begin{array}{l l} 1\\text{ if } A_i \\text{ and } A_j \\text{ share an edge or a vertex}\\\\ 0\\text{ otherwise}\\\\ \\end{array} \\] If queen adjacency is used, the weights for zone \\(A_6\\) are as follows: \\[ w_{6\\cdot} = [0, 1, 0, 1, 1, 0]. \\] As you can see, the adjacent areas from the perspective of \\(A_6\\) are \\(A_4\\) and \\(A_5\\) (by virtue of sharing an edge), and \\(A_2\\) (by virtue of sharing a vertex). These three areas receive weights of 1. On the other hand, \\(A_1\\) and \\(A_3\\) are not adjacent, and therefore receive a weight of zero. Again, weight \\(w_{66}\\) is set to zero. The set of weights above define the neighborhood of \\(A_6\\). The spatial weights matrix for the whole system in Figure 1 is as follows: \\[ \\textbf{W}=\\left (\\begin{array}{c c c c c c} 0 & 1 & 1 & 1 & 0 & 0\\\\ 1 & 0 & 0 & 1 & 1 & 1\\\\ 1 & 0 & 0 & 1 & 0 & 0\\\\ 1 & 1 & 1 & 0 & 1 & 1\\\\ 0 & 1 & 0 & 1 & 0 & 1\\\\ 0 & 1 & 0 & 1 & 1 & 0\\\\ \\end{array} \\right). \\] Compare the matrix to the zoning system. The spatial weights matrix has the following properties: The main diagonal elements of the matrix are all zeros (no area is its own neighbor). Each zone has a row of weights in the matrix: row number one corresponds to \\(A_1\\), row number two corresponds to \\(A_2\\), and so on. Likewise, each zone has a column of weights. The sum of all values in a row gives the total number of neighbors for a zone. That is: \\[ \\text{The total number of neighbors of } A_i \\text{ is given by: }\\sum_{j=1}^{n}{w_{ij}} \\] The spatial weights matrix is often processed to obtain a row-standardized spatial weights matrix. 
This procedure consists of dividing every weight by the sum of its corresponding row (i.e., by the total number of neighbors of the zone), as follows: \\[ w_{ij}^{st}=\\frac{w_{ij}}{\\sum_{j=1}^n{w_{ij}}} \\] The row-standardized weights matrix for the system in Figure 1 is: \\[ \\textbf{W}^{st}=\\left (\\begin{array}{c c c c c c} 0 & 1/3 & 1/3 & 1/3 & 0 & 0\\\\ 1/4 & 0 & 0 & 1/4 & 1/4 & 1/4\\\\ 1/2 & 0 & 0 & 1/2 & 0 & 0\\\\ 1/5 & 1/5 & 1/5 & 0 & 1/5 & 1/5\\\\ 0 & 1/3 & 0 & 1/3 & 0 & 1/3\\\\ 0 & 1/3 & 0 & 1/3 & 1/3 & 0\\\\ \\end{array} \\right). \\] The row-standardized spatial weights matrix has the following properties: Each weight now represents the proportion of a neighbor out of the total of neighbors. For instance, since the total of neighbors of \\(A_1\\) is 3, each neighbor contributes 1/3 to that total. The sum of all weights over a row equals 1, or 100% of all neighbors for that zone. 21.6 Creating Spatial Weights Matrices in R Coding spatial weights matrices by hand is a tedious and error-prone process. Fortunately, functions to generate them exist in R. The package spdep in particular has a number of useful utilities for working with spatial weights matrices. The first step to create a spatial weights matrix is to find the neighbors (i.e., areas adjacent to) for each area. The function poly2nb is used for this. The input argument is a SpatialPolygonDataFrame, a kind of object that spdep uses. Fortunately, it is straightforward to convert our sf object into a SpatialPolygonDataFrame by means of the function as(): # Function `as()` is used to convert between object classes Hamilton_CT.sp <- as(Hamilton_CT, "Spatial") The following finds the neighbors (note that the default adjacency criterion is queen): # The function `poly2nb()` takes an object of class "Spatial" with polygons, and finds the neighbors Hamilton_CT.nb <- poly2nb(pl = Hamilton_CT.sp, queen = TRUE) The value (output) of the function is an object of class nb: class(Hamilton_CT.nb) ## [1] "nb" The function summary() applied to an object of this class gives some useful information about the neighbors in the region, including the number of zones in this system (\\(188\\)), the total number of neighbors (\\(1,180\\)), and the percentage of neighbors out of all pairs of areas (3.34%; conversely, 96.66% of all possible zone pairs are not neighbors!) Other information includes the distribution of neighbors (3 zones have two neighbors, 8 zones have three neighbors, 22 zones have four neighbors, and so on): summary(Hamilton_CT.nb) ## Neighbour list object: ## Number of regions: 188 ## Number of nonzero links: 1180 ## Percentage nonzero weights: 3.338615 ## Average number of links: 6.276596 ## Link number distribution: ## ## 2 3 4 5 6 7 8 9 10 11 12 14 ## 3 8 22 32 35 45 30 6 1 1 4 1 ## 3 least connected regions: ## 174 175 188 with 2 links ## 1 most connected region: ## 33 with 14 links The nb object is a list that contains the neighbors for each zone. 
For instance, the neighbors of census tract 5370001.01 (the first tract in the dataframe) are the following tracts: # Here, the indexing works by making reference to the first element of `Hamilton_CT.nb` (the neighbors of the first zone) and then using those values to retrieve the census tract identifiers from our `Hamilton_CT` dataframe Hamilton_CT$TRACT[Hamilton_CT.nb[[1]]] ## [1] "5370120.02" "5370122.01" "5370122.02" "5370124.00" "5370142.01" ## [6] "5370133.01" "5370130.03" The list of neighbors can be converted into a list of entries in a spatial weights matrix \\(W\\) by means of the function nb2listw (for "neighbors to matrix W in list form"): Hamilton_CT.w <- nb2listw(Hamilton_CT.nb) We can visualize the neighboring (adjacent) areas: plot(Hamilton_CT.sp, border = "gray") plot(Hamilton_CT.nb, coordinates(Hamilton_CT.sp), col = "red", add = TRUE) 21.7 Spatial Moving Averages The spatial weights matrix \\(W\\), and in particular its row-standardized version \\(W^{st}\\), is useful to calculate a spatial statistic, the spatial moving average. The spatial moving average is a variation of the mean statistic: in fact, it is a weighted average, calculated using the spatial weights. Recall that the mean is calculated as the sum of all relevant values divided by the number of values summed. In the case of spatial data, the mean is what we would call a global statistic, since it is calculated using all data for a region: \\[ \\bar{x}=\\frac{1}{n}\\sum_{j=1}^{n}{x_j} \\] where \\(\\bar{x}\\) (read x-bar) is the mean of all values of \\(x\\). A spatial moving average is calculated in the same way, but for each area, and based only on the values of proximate areas: \\[ \\bar{x}_i=\\frac{1}{n_i}\\sum_{j\\in N(i)}{x_j} \\] where \\(n_i\\) is the number of neighbors of \\(A_i\\), and the sum is only over the \\(x_j\\) that are in the neighborhood of \\(i\\) (\\(j\\in N(i)\\) is read "j in the neighborhood of i"). We can illustrate the way spatial moving averages work by making reference again to Figure 1. Consider zone \\(A_1\\). The spatial weights matrix indicates that the neighborhood of \\(A_1\\) consists of three areas: \\(A_2\\), \\(A_3\\), and \\(A_4\\). Therefore \\(n_1=3\\), and \\(j\\in N(1)\\) are 2, 3, and 4. The spatial moving average of \\(A_1\\) for a variable \\(x\\) would then be calculated as: \\[ \\bar{x}_1=\\frac{x_2 + x_3 + x_4}{3} \\] Notice that there is another way of writing the spatial moving average expression, since membership in the neighborhood of \\(i\\) is implicit in the definition of \\(w_{ij}\\): because \\(w_{ij}\\) takes values of zero and one, the effect is to turn the values of \\(x\\) on and off depending on whether they correspond to areas adjacent to \\(i\\): \\[ \\bar{x}_i=\\frac{1}{n_i}\\sum_{j=1}^n{w_{ij}x_j} \\] This means that the spatial moving average of \\(A_1\\) for a variable \\(x\\) in this system can also be calculated using the spatial weights matrix as: \\[ \\bar{x}_1=\\frac{w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + w_{14}x_4 + w_{15}x_5 + w_{16}x_6}{3} \\] Substituting the spatial weights: \\[ \\bar{x}_1=\\frac{0x_1 + 1x_2 + 1x_3 + 1x_4 + 0x_5 + 0x_6}{3} = \\frac{x_2 + x_3 + x_4}{3} \\] In other words, the spatial weights can be used directly in the calculation of spatial moving averages. Further, notice that: \\[ n_i=\\sum_{j=1}^{n}w_{ij} \\] which is simply the total number of neighbors of \\(A_i\\), and the value we used to row-standardize the spatial weights.
Since the row-standardized weights have already been divided by the number of neighbors, we can use them to express the spatial moving average as follows: \\[ \\bar{x}_i=\\sum_{j=1}^{n}{w_{ij}^{st}x_j} \\] Continuing with this example, if we use the row-standardized weights, the spatial moving average at \\(A_1\\) is: \\[ \\bar{x}_1=0x_1 + \\frac{1}{3}x_2 + \\frac{1}{3}x_3 + \\frac{1}{3}x_4 + 0x_5 + 0x_6 \\] which is the same as: \\[ \\bar{x}_1=\\frac{x_2 + x_3 + x_4}{3} \\] Consider the following map of Hamilton's population density: # You have seen previously how to create a choropleth map using quintiles. The first part of this chunk is a choropleth map of population density map <- ggplot(data = Hamilton_CT) + geom_sf(aes(fill = cut_number(Hamilton_CT$POP_DENSITY, 5), POP_DENSITY = round(POP_DENSITY), TRACT = TRACT), color = "black") + # For the example, two census tracts will be identified more explicitly # Next, we use the function `filter()` to select census tract 5370142.02, and color the boundaries of this census tract red geom_sf(data = filter(Hamilton_CT, TRACT == "5370142.02"), aes(POP_DENSITY = round(POP_DENSITY), TRACT = TRACT), color = "red", weight = 3, fill = NA) + # We select census tract 5370144.01 in the same way, and color the boundaries of this census tract green geom_sf(data = subset(Hamilton_CT, TRACT == "5370144.01"), aes(POP_DENSITY = round(POP_DENSITY), TRACT = TRACT), color = "green", weight = 3, fill = NA) + # This selects a palette for the fill colors and changes the label for the legend scale_fill_brewer(palette = "YlOrRd") + labs(fill = "Pop Density") + coord_sf() # The function `ggplotly()` takes a `ggplot2` object and creates an interactive map ggplotly(map, tooltip = c("TRACT", "POP_DENSITY")) Manually calculate the spatial moving average for tract 5370142.02 (with the red boundary) and tract 5370144.01 (with the green boundary). Tip: hover over the tracts to see their population densities. (32 + 109 + 48)/3 ## [1] 63 (48 + 55 + 125)/3 ## [1] 76 Spatial moving averages can be calculated in a straightforward way by means of the lag.listw() function of the spdep package. This function uses a spatial weights matrix and automatically selects the row-standardized weights. Here, we calculate the spatial moving average of population density: POP_DENSITY.sma <- lag.listw(x = Hamilton_CT.w, Hamilton_CT$POP_DENSITY) And now we can plot the spatial moving average of population density. First we join this variable to our sf dataframe with the census tracts.
The key for joining the two dataframes is the unique tract identifier: Hamilton_CT <- left_join(Hamilton_CT, data.frame(TRACT = Hamilton_CT$TRACT, POP_DENSITY.sma), by = "TRACT") And plot: # In this chunk of code we create a choropleth map, but now of the spatial moving average of population density # First map the spatial moving average of population density using quintiles map.sma <- ggplot() + geom_sf(data = Hamilton_CT, aes(fill = cut_number(Hamilton_CT$POP_DENSITY.sma, 5), POP_DENSITY.sma = round(POP_DENSITY.sma), TRACT = TRACT), color = "black") + # Select and plot census tract 5370142.02 and color its boundaries in red geom_sf(data = filter(Hamilton_CT, TRACT == "5370142.02"), aes(POP_DENSITY.sma = round(POP_DENSITY.sma), TRACT = TRACT), color = "red", weight = 3, fill = NA) + # Select and plot census tract 5370144.01 and color its boundaries in green geom_sf(data = filter(Hamilton_CT, TRACT == "5370144.01"), aes(POP_DENSITY.sma = round(POP_DENSITY.sma), TRACT = TRACT), color = "green", weight = 3, fill = NA) + # Embellish the map with a color palette to your taste and labels scale_fill_brewer(palette = "YlOrRd") + labs(fill = "Pop Density SMA") + coord_sf() # Again, `ggplotly()` takes the `ggplot2` object and creates an interactive map ggplotly(map.sma, tooltip = c("TRACT", "POP_DENSITY.sma")) Verify that your manual calculations for the two tracts above are correct. What differences do you notice between the map of population density and the map of spatial moving averages of population density? 21.8 Other Criteria for Coding Proximity Adjacency is not the only criterion that can be used for coding proximity. Occasionally, the distance between areas is calculated by using the centroids of the areas as their representative points. A centroid is simply the mean of the coordinates of the vertices of an area's boundary, and in this way it represents the "centre of gravity" of the area. The inter-centroid distance allows us to define additional criteria for proximity, including neighbors within a certain distance threshold, and \\(k\\)-nearest neighbors. Distance-based criterion \\[ w_{ij}=\\bigg\\{\\begin{array}{l l} 1\\text{ if inter-centroid distance } d_{ij}\\leq \\delta\\\\ 0\\text{ otherwise}\\\\ \\end{array} \\] where \\(\\delta\\) is a distance threshold. Distance-based neighbors can be obtained in R by means of the function dnearneigh(). To implement this criterion we need to find the centroids of the polygons with st_centroid() and then extract the coordinates of the centroids with st_coordinates(): CT_centroids <- st_centroid(Hamilton_CT) %>% st_coordinates() ## Warning in st_centroid.sf(Hamilton_CT): st_centroid assumes attributes are ## constant over geometries of x We can create a neighbors object of class nb using two threshold distances, a minimum and a maximum distance value. In this example we will consider that the neighbors of zone \\(A_i\\) are all zones \\(A_j\\) whose centroids are within \\(0\\) and \\(5,000\\) meters of the centroid of \\(A_i\\): Hamilton_CT.dnb <- dnearneigh(CT_centroids, d1 = 0, d2 = 5000) We can visualize the neighboring areas: plot(Hamilton_CT.sp, border = "gray") plot(Hamilton_CT.dnb, CT_centroids, col = "red", add = TRUE) Try changing the distance threshold to see how different neighborhoods are defined. \\(k\\)-nearest neighbors A potential disadvantage of using a distance-based criterion is that for zoning systems with areas of vastly different sizes, small areas will end up having many neighbors, whereas large areas will have few or none.
The criterion of \\(k\\)-nearest neighbors allows for some adaptation to the size of the areas. Under this criterion, all zones have the exact same number of neighbors, but the geographical extent of the neighborhood may (and likely will) change. The criterion is defined as follows: \\[ w_{ij}=\\bigg\\{\\begin{array}{l l} 1\\text{ if } A_j \\text{ is one of } k\\text{-nearest neighbors of } A_i\\\\ 0\\text{ otherwise}\\\\ \\end{array} \\] In R, \\(k\\)-nearest neighbors can be obtained by means of the function knearneigh(), and the arguments include the value of \\(k\\): Hamilton_CT.knb <- knn2nb(knearneigh(CT_centroids, k = 3)) We can again visualize the neighbors (“adjacent”) areas: plot(Hamilton_CT.sp, border = "gray") plot(Hamilton_CT.knb, CT_centroids, col = "red", add = TRUE) Try changing the value of k to see how the neighborhoods change. This chapter has equipped you to define various forms of proximity for area data. You have also seen how spatial moving averages can be calculated using row-standardized spatial weights matrices. References "],
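A useful check of the formulas above is that the spatial moving average is nothing more than the product of the row-standardized spatial weights matrix and the vector of values of the variable. The sketch below converts the listw object to a full matrix with listw2mat() from spdep and compares the matrix product with the output of lag.listw():

# Convert the list of weights to a full n x n row-standardized matrix
W <- listw2mat(Hamilton_CT.w)

# The spatial moving average as a matrix product
sma_manual <- as.numeric(W %*% Hamilton_CT$POP_DENSITY)

# Should be TRUE: the matrix product matches lag.listw()
all.equal(sma_manual, as.numeric(lag.listw(Hamilton_CT.w, Hamilton_CT$POP_DENSITY)))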
["activity-10-area-data-ii.html", "Chapter 22 Activity 10: Area Data II 22.1 Practice questions 22.2 Learning objectives 22.3 Suggested reading 22.4 Preliminaries 22.5 Activity", " Chapter 22 Activity 10: Area Data II Remember, you can download the source file for this activity from here. 22.1 Practice questions Answer the following questions: List and describe two criteria to define proximity in area data analysis. What is a spatial weights matrix? Why do spatial weight matrices have zeros in the main diagonal? How is a spatials weights matrix row-standardized? Write the spatial weights matrices for the sample systems in Figures @ref{fig:simple-areal-system-i} and @ref{fig:simple-areal-system-ii}. Explain the criteria used to do so. Figure 22.1: Sample areal system 1 Figure 22.2: Sample areal system 2 22.2 Learning objectives In this activity, you will: Create spatial weights matrices. Calculate the spatial moving average of a variable. Create scatterplots of a variable and its spatial moving average. Think about ways to decide whether a landscape is random when working with area data. 22.3 Suggested reading O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey. 22.4 Preliminaries For this activity you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity. In addition to tidyverse, you will need sf, a package that implements simple features in R (you can learn about sf here) and spdep, a package that implements several spatial statistical methods (you can learn more about it here): library(tidyverse) library(spdep) library(sf) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' library(plotly) In the practice that preceded this activity, you learned about the area data and visualization techniques for area data. Begin by loading the data that you will use in this activity: data(Hamilton_CT) This is a sf object with census tracts and selected demographic variables for the Hamilton CMA in Canada. You can obtain new (calculated) variables as follows. For instance, to obtain the proportion of residents who are between 20 and 34 years old, and between 35 and 49: Hamilton_CT <- mutate(Hamilton_CT, Prop20to34 = (AGE_20_TO_24 + AGE_25_TO_29 + AGE_30_TO_34)/POPULATION, Prop35to49 = (AGE_35_TO_39 + AGE_40_TO_44 + AGE_45_TO_49)/POPULATION) You can also convert the sf object into a SpatialPolygonsDataFrame object for use with the spdedp package: Hamilton_CT.sp <- as(Hamilton_CT, "Spatial") You are now ready for the next activity. 22.5 Activity NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to organize data, create a plot, and so on in support of analysis and interpretation. These tasks are indicated by a star (*). 1.* Create a spatial weights matrix for the census tracts in the Hamilton CMA. Use adjacency as your criterion for proximity. 
2.* Calculate the spatial moving average for the following two variables: 1) proportion of the population who are 20 to 34 years old; and 2) proportion of the population who are 65 and older. 3.* Append the spatial moving averages to your dataframe. 4.* Choose one age group and create a scatterplot of the proportion of population in that group versus its spatial moving average. (Hint: if you create the scatterplot using ggplot2 you can add the 45 degree line by means of geom_abline(slope = 1, intercept = 0)). Show your scatterplot of the population versus its spatial moving average to a fellow student. What is the meaning of the 45 degree line in this plot? Create a null landscape by scrambling the values of your variable. For instance, you can use the variable Prop20to34 to generate a null landscape as follows: Hamilton_CT$Null_1 <- sample(Hamilton_CT$Prop20to34) Calculate the spatial moving average of your null landscape, and create a scatterplot just like you did for your variable. How is this scatterplot different? "],
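A possible starting point for items 1, 2, and 4 is sketched below (the names of the new variables are arbitrary); the scatterplot places the original proportion on the horizontal axis, its spatial moving average on the vertical axis, and adds the 45 degree line:

# Spatial weights matrix based on adjacency
Hamilton_CT.nb <- poly2nb(Hamilton_CT.sp)
Hamilton_CT.w <- nb2listw(Hamilton_CT.nb)

# Spatial moving average of the proportion of the population aged 20 to 34
Hamilton_CT$Prop20to34.sma <- lag.listw(Hamilton_CT.w, Hamilton_CT$Prop20to34)

# Scatterplot of the variable versus its spatial moving average
ggplot(Hamilton_CT, aes(x = Prop20to34, y = Prop20to34.sma)) + geom_point() + geom_abline(slope = 1, intercept = 0)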
["area-data-iii.html", "Chapter 23 Area Data III 23.1 Learning Objectives 23.2 Suggested Readings 23.3 Preliminaries 23.4 Spatial Moving Averages and Simulation 23.5 The Spatial Moving Average as a Smoother 23.6 Spatial Moving Average Scatterplots 23.7 Spatial Autocorrelation and Moran’s \\(I\\) coefficient 23.8 Moran’s \\(I\\) and Moran’s Scatterplot 23.9 Hypothesis Testing for Spatial Autocorrelation", " Chapter 23 Area Data III NOTE: You can download the source files for this book from here. The source files are in the format of R Notebooks. Notebooks are pretty neat, because the allow you execute code within the notebook, so that you can work interactively with the notes. If you wish to work interactively with this chapter you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. 23.1 Learning Objectives In the previous chapter and its corresponding activity, you learned about different ways to define proximity for area data, about spatial weights matrices, and how spatial weights matrices could be used to calculate spatial moving averages. In this practice, you will learn about: Spatial moving averages and simulation. The concept of spatial autocorrelation. Moran’s \\(I\\) coefficient and Moran’s scatterplot. Hypothesis testing for spatial autocorrelation. 23.2 Suggested Readings Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 7. Longman: Essex. Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 9. Springer: New York. Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 7. Sage: Los Angeles. O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey. 23.3 Preliminaries As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity: library(tidyverse) library(spdep) library(sf) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' library(gridExtra) Read the data used in this chapter. This is an object of class sf (simple feature) with the census tracts of Hamilton CMA and some selected population variables from the 2011 Census of Canada: data(Hamilton_CT) You can quickly verify the contents of the dataframe by means of summary: summary(Hamilton_CT) ## ID AREA TRACT POPULATION ## Min. : 919807 Min. : 0.3154 Length:188 Min. : 5 ## 1st Qu.: 927964 1st Qu.: 0.8552 Class :character 1st Qu.: 2639 ## Median : 948130 Median : 1.4157 Mode :character Median : 3595 ## Mean : 948710 Mean : 7.4578 Mean : 3835 ## 3rd Qu.: 959722 3rd Qu.: 2.7775 3rd Qu.: 4692 ## Max. :1115750 Max. :138.4466 Max. :11675 ## POP_DENSITY AGE_LESS_20 AGE_20_TO_24 AGE_25_TO_29 ## Min. : 2.591 Min. : 0.0 Min. : 0.0 Min. : 0.0 ## 1st Qu.: 1438.007 1st Qu.: 528.8 1st Qu.:168.8 1st Qu.:135.0 ## Median : 2689.737 Median : 750.0 Median :225.0 Median :215.0 ## Mean : 2853.078 Mean : 899.3 Mean :253.9 Mean :232.8 ## 3rd Qu.: 3783.889 3rd Qu.:1110.0 3rd Qu.:311.2 3rd Qu.:296.2 ## Max. 
:14234.286 Max. :3285.0 Max. :835.0 Max. :915.0 ## AGE_30_TO_34 AGE_35_TO_39 AGE_40_TO_44 AGE_45_TO_49 ## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 ## 1st Qu.: 135.0 1st Qu.: 145.0 1st Qu.: 170.0 1st Qu.:203.8 ## Median : 195.0 Median : 200.0 Median : 230.0 Median :282.5 ## Mean : 228.2 Mean : 239.6 Mean : 268.7 Mean :310.6 ## 3rd Qu.: 281.2 3rd Qu.: 280.0 3rd Qu.: 325.0 3rd Qu.:385.0 ## Max. :1320.0 Max. :1200.0 Max. :1105.0 Max. :880.0 ## AGE_50_TO_54 AGE_55_TO_59 AGE_60_TO_64 AGE_65_TO_69 AGE_70_TO_74 ## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0.0 Min. : 0.0 ## 1st Qu.:203.8 1st Qu.:175.0 1st Qu.:140 1st Qu.:115.0 1st Qu.: 90.0 ## Median :280.0 Median :240.0 Median :220 Median :157.5 Median :130.0 ## Mean :300.3 Mean :257.7 Mean :229 Mean :174.2 Mean :139.7 ## 3rd Qu.:375.0 3rd Qu.:325.0 3rd Qu.:295 3rd Qu.:221.2 3rd Qu.:180.0 ## Max. :740.0 Max. :625.0 Max. :540 Max. :625.0 Max. :540.0 ## AGE_75_TO_79 AGE_80_TO_84 AGE_MORE_85 geometry ## Min. : 0.00 Min. : 0.00 Min. : 0.00 POLYGON :188 ## 1st Qu.: 68.75 1st Qu.: 50.00 1st Qu.: 35.00 epsg:26917 : 0 ## Median :100.00 Median : 77.50 Median : 70.00 +proj=utm ...: 0 ## Mean :118.32 Mean : 95.05 Mean : 87.71 ## 3rd Qu.:160.00 3rd Qu.:120.00 3rd Qu.:105.00 ## Max. :575.00 Max. :420.00 Max. :400.00 The sf object can be converted into a SpatialPolygonsDataFrame object for use with the spdep package: Hamilton_CT.sp <- as(Hamilton_CT, "Spatial") 23.4 Spatial Moving Averages and Simulation In the preceding chapter and activity you learned about different criteria to define proximity for the analysis of area data, and how spatial weights matrices can be used to code patterns of proximity among zones in a spatial system. Furthermore, you also saw how spatial weights matrices can be used to calculate spatial moving averages, which in turn can be used to explore spatial patterns in area data. We will begin this chapter by briefly revisiting some of these notions. In the following chunk, we create a spatial weights matrix for Hamilton CMA census tracts based on the adjacency criterion: # Function `poly2nb()` builds a list of neighbors based on contiguous boundaries. The argument for this function is an object of class "Spatial", which was obtained from the `sf` object previously. `Hamilton_CT.sp` is an object containing multi-polygon objects. # Function `nb2listw()` takes a list of neighbours and creates a matrix of spatial weights in the form of a list. Together, these two functions create a spatial weights matrix for the Census Tracts in Hamilton. Hamilton_CT.nb <- poly2nb(pl = Hamilton_CT.sp) Hamilton_CT.w <- nb2listw(Hamilton_CT.nb) Once you have a matrix of spatial weights, it can be used to calculate the spatial moving average.
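Before doing so, it can be useful to inspect the neighbor structure just created; this is an optional sketch, assuming the Hamilton_CT.nb object defined above:
summary(Hamilton_CT.nb) # Reports the number of regions, the number of links, and the distribution of the number of neighbors
table(card(Hamilton_CT.nb)) # `card()` returns the number of neighbors of each census tract; tabulate to see the distribution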
In this example, we calculate the spatial moving average of the variable for population density, i.e., POP_DENSITY which is found in the sf dataframe: # The function `lag.listw()` takes as argument the population density by census tracts in Hamilton, and calculates the moving average, with the "moving" part given by the local neighborhoods around each zone as defined by `Hamilton_CT.w` POP_DENSITY.sma <- lag.listw(Hamilton_CT.w, Hamilton_CT$POP_DENSITY) After calculating the spatial moving average of population density, we can join this new variable to both the sf and SpatialPolygonsDataFrame objects: Hamilton_CT$POP_DENSITY.sma <- POP_DENSITY.sma Hamilton_CT.sp$POP_DENSITY.sma <- POP_DENSITY.sma As you saw in your last activity, the spatial moving average can be used in two ways to explore the spatial pattern of an area variable: as a smoother and by means of a scatterplot, combined with the original variable. 23.5 The Spatial Moving Average as a Smoother The spatial moving average, when mapped, is essentially a smoothing technique. What do we mean by smoothing? By reporting the average of the neighbors instead of the actually observed value of the variable, we reduce the amount of variability that is communicated. This often can make it easier to distinguish the overall pattern, at the cost of some information loss (think of how when mapping quadrats we lost some information/detail by calculating the intensity for areas). We can illustrate the use of the spatial moving average as a smoother with the help of a little simulation. To simulate a random spatial variable, we can randomize the observations that we already have, reassigning them at random to areas in the system. This is accomplished as follows: # By sampling at random and without replacement from the original variable, we create a null landscape. We will call this `POP_DENSITY_s1`, where the "s1" part is to indicate that this is our first simulated random landscape. We will actually repeat this process below. POP_DENSITY_s1 <- sample(Hamilton_CT$POP_DENSITY) Calculate the spatial moving average for this randomized variable (i.e., null landscape): # We use the function `lag.listw()` to calculate the spatial moving average, but now for the null landscape we just simulated. POP_DENSITY_s1.sma <- lag.listw(Hamilton_CT.w, POP_DENSITY_s1) Once that you have seen how to randomize the variable, repeat the process to simulate a total of eight new variables/null landscapes, and calculate their spatial moving averages: # Note that we are creating 8 null landscapes based on our original population density variable, and that we are calculating the spatial moving average for each of them. Each simulation has a new name: s2, s3, s4,..., s8. 
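As an aside, if you prefer not to copy and paste the same instructions several times, the remaining null landscapes and their spatial moving averages could be generated in a loop; this is only a sketch, and because it writes the columns directly into Hamilton_CT it would replace both the simulation chunk and the later chunk that adds the s2 to s8 variables to the dataframe:
# for (i in 2:8) {
#   null_i <- sample(Hamilton_CT$POP_DENSITY) # Randomize the original variable to create null landscape i
#   Hamilton_CT[[paste0("POP_DENSITY_s", i)]] <- null_i # Add the null landscape to the dataframe
#   Hamilton_CT[[paste0("POP_DENSITY_s", i, ".sma")]] <- lag.listw(Hamilton_CT.w, null_i) # Add its spatial moving average
# }
The explicit, step-by-step version used in these notes follows.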
# Null landscape/simulation #2 POP_DENSITY_s2 <- sample(Hamilton_CT$POP_DENSITY) POP_DENSITY_s2.sma <- lag.listw(Hamilton_CT.w, POP_DENSITY_s2) # Null landscape/simulation #3 POP_DENSITY_s3 <- sample(Hamilton_CT$POP_DENSITY) POP_DENSITY_s3.sma <- lag.listw(Hamilton_CT.w, POP_DENSITY_s3) # Null landscape/simulation #4 POP_DENSITY_s4 <- sample(Hamilton_CT$POP_DENSITY) POP_DENSITY_s4.sma <- lag.listw(Hamilton_CT.w, POP_DENSITY_s4) # Null landscape/simulation #5 POP_DENSITY_s5 <- sample(Hamilton_CT$POP_DENSITY) POP_DENSITY_s5.sma <- lag.listw(Hamilton_CT.w, POP_DENSITY_s5) # Null landscape/simulation #6 POP_DENSITY_s6 <- sample(Hamilton_CT$POP_DENSITY) POP_DENSITY_s6.sma <- lag.listw(Hamilton_CT.w, POP_DENSITY_s6) # Null landscape/simulation #7 POP_DENSITY_s7 <- sample(Hamilton_CT$POP_DENSITY) POP_DENSITY_s7.sma <- lag.listw(Hamilton_CT.w, POP_DENSITY_s7) # Null landscape/simulation #8 POP_DENSITY_s8 <- sample(Hamilton_CT$POP_DENSITY) POP_DENSITY_s8.sma <- lag.listw(Hamilton_CT.w, POP_DENSITY_s8) Next, we will add all the null landscapes that you just simulated to the dataframes, as well as their spatial moving averages. This is useful for mapping and plotting purposes: # Here we add the simulated landscapes to the `sf` dataframe. Hamilton_CT$POP_DENSITY_s1 <- POP_DENSITY_s1 Hamilton_CT$POP_DENSITY_s2 <- POP_DENSITY_s2 Hamilton_CT$POP_DENSITY_s3 <- POP_DENSITY_s3 Hamilton_CT$POP_DENSITY_s4 <- POP_DENSITY_s4 Hamilton_CT$POP_DENSITY_s5 <- POP_DENSITY_s5 Hamilton_CT$POP_DENSITY_s6 <- POP_DENSITY_s6 Hamilton_CT$POP_DENSITY_s7 <- POP_DENSITY_s7 Hamilton_CT$POP_DENSITY_s8 <- POP_DENSITY_s8 # Here we add the spatial moving averages of the simulated landscapes to the `sf` dataframe. Hamilton_CT$POP_DENSITY_s1.sma <- POP_DENSITY_s1.sma Hamilton_CT$POP_DENSITY_s2.sma <- POP_DENSITY_s2.sma Hamilton_CT$POP_DENSITY_s3.sma <- POP_DENSITY_s3.sma Hamilton_CT$POP_DENSITY_s4.sma <- POP_DENSITY_s4.sma Hamilton_CT$POP_DENSITY_s5.sma <- POP_DENSITY_s5.sma Hamilton_CT$POP_DENSITY_s6.sma <- POP_DENSITY_s6.sma Hamilton_CT$POP_DENSITY_s7.sma <- POP_DENSITY_s7.sma Hamilton_CT$POP_DENSITY_s8.sma <- POP_DENSITY_s8.sma It would be useful to compare the original landscape of population density to the null landscapes that you created before. To create a single figure with choropleth maps of the empirical variable and the eight simulated variables using the facet_wrap() function of ggplot2, we must first reorganize the data so that all the population density variables are in a single column, and all spatial moving average variables are also in a single column. Further, we need a new column that identifies which variable the values correspond to. We will solve this little data management problem by copying only the data we are interested in into a new dataframe (by means of select()), and then gathering the spatial moving averages into a single column: # Hamilton_CT2 is a new dataframe. Here, the pipe operators (%>%) are used to pass the original dataframe to the select() function, and then the output of that is passed on to the `gather()` function. Notice that we are selecting the empirical spatial moving average and the spatial moving averages of the 8 simulated variables.
Hamilton_CT2 <- Hamilton_CT %>% # This pipe operator passes the dataframe to `select()` # `select()` keeps only the spatial moving averages and geometry select(POP_DENSITY.sma, POP_DENSITY_s1.sma, POP_DENSITY_s2.sma, POP_DENSITY_s3.sma, POP_DENSITY_s4.sma, POP_DENSITY_s5.sma, POP_DENSITY_s6.sma, POP_DENSITY_s7.sma, POP_DENSITY_s8.sma, geometry) %>% # This pipe operator passes the dataframe with only the spatial moving average variables and the geometry to `gather()` # `gather()` places all variables with the exception of `geometry` in a single column named `DENSITY_SMA` and creates a new variable called `VAR` with the names of the original columns (i.e., POP_DENSITY.sma, POP_DENSITY_s1.sma, etc.) gather(VAR, DENSITY_SMA, -geometry) Now the new dataframe with all spatial moving averages in a single column can be used to create choropleth maps. The function facet_wrap() is used to create facet plots so that we can place all maps in a single figure: ggplot() + geom_sf(data = Hamilton_CT2, aes(fill = DENSITY_SMA), color = NA) + facet_wrap(~VAR, ncol = 3) + # We are creating multiple plots for single data frame by means of the "facet_wrap" function. scale_fill_distiller(palette = "YlOrRd", direction = 1) + # Select palette for colors labs(fill = "Pop Den SMA") + # Change the label of the legend theme(axis.text.x = element_blank(), axis.text.y = element_blank()) # Remove the axis labels to avoid cluttering the plots The empirical variable is the map in the upper left corner (labelled POP_DENSITY.sma). The remaining 8 maps are simulated variables. Would you say the map of the empirical variable is fairly different from the map of the simulated variables? What are the key differences? An additional advantage of the spatial moving average is its use in the development of scatterplots. The information below provides further examples of exploring spatial moving averages with scatterplots. 23.6 Spatial Moving Average Scatterplots Let us explore the use of spatial moving average scatterplots. First, we will extract the density information from the original sf object, reorganize, and bind to Hamilton_CT2 so that we can plot using faceting: Hamilton_CT2 <- Hamilton_CT2 %>% # Pass `Hamilton_CT2` as the first argument of `data.frame()` data.frame(Hamilton_CT %>% # Pass `Hamilton_CT` to `st_drop_geometry()` st_drop_geometry() %>% # Drop the geometry because it is already available in `Hamilton_CT2`. # Select from `Hamilton_CT` the original population density and the 8 null landscapes simulated from it. select(POP_DENSITY, POP_DENSITY_s1, POP_DENSITY_s2, POP_DENSITY_s3, POP_DENSITY_s4, POP_DENSITY_s5, POP_DENSITY_s6, POP_DENSITY_s7, POP_DENSITY_s8) %>% # Pass the result to `gather()` gather(VAR, DENSITY) %>% # Copy all density variables to a single column, and create a new variable called `VAR` with the names of the original columns (i.e., POP_DENSITY, POP_DENSITY_s1, etc.) select(DENSITY)) # Drop VAR from the the dataframe After reorganizing the data we can create the scatterplot of the empirical population density and its spatial moving average, as well as the scatterplots of the simulated variables and their spatial moving averages for comparison (the plots include the 45 degree line). 
Again, the use of facet_wrap() allows us to put all plots in a single figure: # We are adding points and a 45 degree line (slope = 1) ggplot(data = Hamilton_CT2, aes(x = DENSITY, y = DENSITY_SMA, color = VAR)) + geom_point() + geom_abline(slope = 1, intercept = 0) + coord_equal() + facet_wrap(~ VAR, ncol = 3) What difference do you see between the empirical and simulated variables in these scatterplots? It is possible to fit a line to the scatterplots (i.e., adding a regression line). This makes it easier to appreciate the difference between the empirical and simulated variables. This line would take the following form, with \\(\\beta\\) as the slope of the line, and \\(\\alpha\\) the intercept: \\[ \\overline{x_i} = \\alpha + \\beta x_i \\] Recreate the previous figure, but now add fitted lines to the scatterplots by means of the function geom_smooth(). The method “lm” means linear model, so the fitted line is a straight line: ggplot(data = Hamilton_CT2, aes(x = DENSITY, y = DENSITY_SMA, color = VAR)) + geom_point(alpha = 0.1) + geom_abline(slope = 1, intercept = 0, linetype = "dashed") + # Add a fitted line to the plots geom_smooth(method = "lm") + coord_equal() + facet_wrap(~ VAR, ncol = 3) You will notice that the slope of the line tends to be flat in the simulated variables; this is to be expected, since these variables are spatially random: the values of the variable at \\(i\\) are independent of the values of their local means! In other words, the probability that the map is random is pretty high (in fact, since these 8 maps are null landscapes, we know for a fact that they are random). The empirical variable, on the other hand, has a slope that is much closer to the 45 degree line. This indicates that the values of the variable at \\(i\\) are not independent of their local means: in other words, \\(x_i\\) is correlated with \\(\\overline{x_i}\\), and the probability of a non-random pattern is high. This phenomenon is called spatial autocorrelation, and it is a fundamental way to describe spatial data. We will discuss this more extensively next. 23.7 Spatial Autocorrelation and Moran’s \\(I\\) coefficient As seen above, the spatial moving average can provide evidence of the phenomenon of spatial autocorrelation, that is, when a variable displays spatial patterns whereby the values of a variable at zone \\(i\\) are not independent of the values of the variable in the neighborhood of zone \\(i\\). A convenient modification to the concept of the spatial moving average is as follows. Instead of using the variable \\(x\\) for the calculation of the spatial moving average, we first center it on the global mean: \\[ z_i = x_i - \\bar{x} \\] In this way, the values of \\(z_i\\) are given in deviations from the mean. By forcing the variable to be centered on the mean, the fitted line is forced to pass through the origin. Calculate the mean-centered version of POP_DENSITY, and then its spatial moving average: df_mean_center_scatterplot <- transmute(Hamilton_CT, # Modify values in dataframe Density_z = POP_DENSITY - mean(POP_DENSITY), # Subtract the mean, so that the variable is now in deviations from the mean SMA_z = lag.listw(Hamilton_CT.w, Density_z)) # Calculate the spatial moving average of the newly created variable `Density_z` Compare the following two plots. You will see that they are identical, but in the mean-centered one the origin of the axes coincides with the means of \\(x\\) and the spatial moving average of \\(x\\).
In other words, we have the same data, but we have displaced the origin of the plot: # Create a scatterplot of population density and its spatial moving average sc1 <- ggplot(data = filter(Hamilton_CT2, VAR == "POP_DENSITY.sma"), aes(x = DENSITY, y = DENSITY_SMA)) + geom_point(alpha = 0.1) + geom_abline(slope = 1, intercept = 0, linetype = "dashed") + geom_smooth(method = "lm") + ggtitle("Population Density") + coord_equal() # Create a scatterplot of the mean-centered population density, and its spatial moving average sc2 <- ggplot(data = df_mean_center_scatterplot, aes(x = Density_z, y = SMA_z)) + geom_point(alpha = 0.1) + geom_abline(slope = 1, intercept = 0, linetype = "dashed") + geom_smooth(method = "lm", formula = y ~ x-1) + ggtitle("Mean-Centered Population Density") + coord_equal() # Use `grid.arrange()` to place the two plots in a single figure grid.arrange(sc1, sc2, ncol = 1) How is it useful to displace the origin of the axes to the mean values of \\(x\\) and its spatial moving average? To explain this, notice that the values on the top scatterplot are all positive. The values on the bottom scatterplot are positive or negative, depending if they are above or below the mean. This sign is interesting. Notice what happens when the variable \\(z_i\\) multiplies its spatial moving average: \\[ z_i\\bar{z}_i = z_i\\sum_{j=1}^n{w_{ij}^{st}z_j} \\] When \\(z_i\\) is above its mean, it is a positive value. When it is below the mean, it is a negative value. Likewise, when \\(\\bar{z}_i\\) is above its mean, it is a positive value, and negative otherwise. The mean is a useful benchmark to see if values are relatively high, or relatively low. There are four posibilities with respect to the combinations of (relatively) high and low values. Quadrant 1 (the value of \\(z_i\\) is high & the value of \\(\\bar{z}_i\\) is also high): If \\(z_i\\) is above the mean, it is a relatively high value in the distribution (signed positive). If its neighbors are also relatively high values, the spatial moving average will be above the mean, and also signed positive. Their product will be positive (positive times positive equals positive). Quadrant 2 (the value of \\(z_i\\) is low & the value of \\(\\bar{z}_i\\) is high): If \\(z_i\\) is below the mean, it is a relatively low value in the distribution (signed negative). If its neighbors in contrast are relatively high values, the spatial moving average will be above the mean, and signed positive. Their product will be negative (negative times positive equals negative). Quadrant 3 (the value of \\(z_i\\) is low & the value of \\(\\bar{z}_i\\) is also low): If \\(z_i\\) is below the mean, it is a relatively low value in the distribution (signed negative). If its neighbors are also relatively low values, the spatial moving average will be below the mean, and also signed negative. Their product will be positive (negative times negative equals positive). Quadrant 4 (the value of \\(z_i\\) is high & the value of \\(\\bar{z}_i\\) is low): If \\(z_i\\) is above the mean, it is a relatively high value in the distribution (signed positive). If its neighbors are relatively low values, the spatial moving average will be below the mean, and signed negative. Their product will be negative (positive times negative equals negative). 
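Before plotting the quadrants, it can be instructive to count how many census tracts fall in each of these four combinations; this is only a sketch, using the mean-centered dataframe created above (the quadrant labels are purely illustrative):
df_mean_center_scatterplot %>%
  st_drop_geometry() %>% # Drop the geometry column; only the two numeric variables are needed for the count
  mutate(Quadrant = case_when(Density_z > 0 & SMA_z > 0 ~ "Q1: High & High",
                              Density_z < 0 & SMA_z > 0 ~ "Q2: Low & High",
                              Density_z < 0 & SMA_z < 0 ~ "Q3: Low & Low",
                              TRUE ~ "Q4: High & Low")) %>%
  count(Quadrant) # Tally the number of census tracts in each quadrant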
These four quadrants are shown in the following plot: ggplot(data = df_mean_center_scatterplot, aes(x = Density_z, y = SMA_z)) + geom_point(color = "gray") + geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + # You can also add annotations to plots by using `annotate()`. The inputs are the kind of annotation; in this case "text", but it could be circles, arrows, rectangles, labels, and other things. For text, you need a label, and coordinates for the annotation. annotate("text", label = "Q1: Positive", x= 2000, y = 2500) + annotate("text", label = "Q4: Negative", x= 2000, y = -2500) + annotate("text", label = "Q2: Negative", x= -2000, y = 2500) + annotate("text", label = "Q3: Positive", x= -2000, y = -2500) + coord_equal() We can take the products of \\(z_i\\) by \\(\\bar{z}_i\\) for all \\(i\\) and add them: \\[ \\sum_{i=1}^n{z_i\\overline{z_i}} = \\sum_{i=1}^n{z_i\\sum_{j=1}^n{w_{ij}^{st}z_j}} \\] If many dots are in Quadrants 1 and 3 in the scatterplot, the sum of the products will tend to be a large positive number. On the other hand, if many dots are in Quadrants 2 and 4, the sum of the products will tend to be a large number, but negative. Either case would be indicative of a pattern: If the sum is positive, this would suggest that high & high values tend to be together, while low & low values also tend to be together. In contrast, if the sum is negative, this would suggest that high values tend to be surrounded by low values, and vice versa. Finally, if the dots are scattered over the four quadrants, some products will be positive and some will be negative, and they will tend to cancel each other when summed. In this way, the sum of the products will tend to be closer to zero. 23.8 Moran’s \\(I\\) and Moran’s Scatterplot Based on the discussion above, let us define the following coefficient, called Moran’s I: \\[ I = \\frac{\\sum_{i=1}^n{z_i\\sum_{j=1}^n{w_{ij}^{st}z_j}}}{\\sum_{i=1}^{n}{z_i^2}} \\] The numerator in this expression is the sum of the products described above. The denominator is the sum of the squared deviations of variable \\(x_i\\) from its mean (a quantity proportional to its variance), and is used here to scale Moran’s \\(I\\) so that it is contained roughly in the interval \\((-1, 1)\\) (the exact bounds depend on the characteristics of the zoning system). Moran’s \\(I\\) is a coefficient of spatial autocorrelation. We can calculate Moran’s \\(I\\) as follows, using as an example the mean-centered population density (notice how it is the sum of the products of \\(z_i\\) by their spatial moving averages \\(\\bar{z}_i\\), divided by the sum of squared deviations): # Try to decipher the formula. You should be able to see that we are calculating the sum of the products of the mean-centered variable by its spatial moving average, divided by the sum of its squared deviations sum(df_mean_center_scatterplot$Density_z * df_mean_center_scatterplot$SMA_z) / sum(df_mean_center_scatterplot$Density_z^2) ## [1] 0.5179736 Since the value is positive, and relatively high, this would suggest a non-random spatial pattern of similar values (i.e., high & high and low & low). Moran’s \\(I\\) is implemented in R in the spdep package, which makes its calculation easy, since you do not have to go manually through the process of calculating the spatial moving averages, etc. The function moran() requires as input arguments a variable, a set of spatial weights, the number of zones (\\(n\\)), and the total sum of all weights (termed \\(S_0\\)) - which in the case of row-standardized spatial weights is equal to the number of zones.
Therefore: mc <- moran(Hamilton_CT$POP_DENSITY, Hamilton_CT.w, n = 188, S0 = 188) mc$I ## [1] 0.5179736 You can verify that this matches the value calculated above. The kind of scatterplots that we used previously are called Moran’s scatterplots, and they can also be created easily by means of the moran.plot() function of the spdep package: # Confirming the results from the Moran coefficient above. We use `moran.plot()` to illustrate the SMA of population density by census tract in Hamilton. mp <- moran.plot(Hamilton_CT$POP_DENSITY, Hamilton_CT.w) 23.9 Hypothesis Testing for Spatial Autocorrelation The tools described so far are useful to suggest whether a pattern is random; however, while inspection of the scatterplot is suggestive, we would like a more formal criterion to make that decision. Fortunately, Moran’s \\(I\\) can be used to develop a test of hypothesis. The expected value of Moran’s \\(I\\) under the null hypothesis of spatial randomness (or independence), as well as its variance, have been derived. A test for autocorrelation based on Moran’s \\(I\\) is implemented in the spdep package: # `moran.test()` tests for spatial autocorrelation of population density in Hamilton census tracts moran.test(Hamilton_CT$POP_DENSITY, Hamilton_CT.w) ## ## Moran I test under randomisation ## ## data: Hamilton_CT$POP_DENSITY ## weights: Hamilton_CT.w ## ## Moran I statistic standard deviate = 12.722, p-value < 2.2e-16 ## alternative hypothesis: greater ## sample estimates: ## Moran I statistic Expectation Variance ## 0.517973553 -0.005347594 0.001691977 Since the null hypothesis is of spatial independence, the \\(p\\)-value is the probability of obtaining a value of Moran’s \\(I\\) at least as large as the one observed if the null hypothesis were true. In the present case, the \\(p\\)-value is such a small number that we can reject the null hypothesis with a high degree of confidence. Moran’s \\(I\\) and Moran’s scatterplots are amongst the most widely used tools in the analysis of spatial area data. References "],
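A note complementing Section 23.9: besides the analytical test shown above, spdep also implements a permutation (Monte Carlo) test of Moran's \\(I\\), moran.mc(), which compares the observed coefficient to coefficients computed on many random reshufflings of the values over the zones, the same idea as the null landscapes simulated earlier in the chapter. A minimal sketch, assuming the objects created above:
moran.mc(Hamilton_CT$POP_DENSITY, Hamilton_CT.w, nsim = 999) # Compares the observed I to its distribution over 999 random permutations of the values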
["activity-11-area-data-iii.html", "Chapter 24 Activity 11: Area Data III 24.1 Practice questions 24.2 Learning objectives 24.3 Suggested reading 24.4 Preliminaries 24.5 Activity", " Chapter 24 Activity 11: Area Data III Remember, you can download the source file for this activity from here. 24.1 Practice questions Answer the following questions: What does the 45 degree line in the scatterplot of spatial moving averages indicate? What is the effect of centering a variable around the mean? In your own words, describe the phenomenon of spatial autocorrelation. What is the null hypothesis in the test of autocorrelation based on Moran’s I? 24.2 Learning objectives In this activity, you will: Calculate Moran’s I coefficient of autocorrelation for area data. Create Moran’s scatterplots. Examine the results of the tests/scatterplots for further insights. Think about ways to decide whether a landscape is random when working with area data. 24.3 Suggested reading O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey. 24.4 Preliminaries For this activity you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity. In addition to tidyverse, you will need sf, a package that implements simple features in R (you can learn about sf here) and spdep, a package that implements several spatial statistical methods (you can learn more about it here): library(tidyverse) library(sf) library(spdep) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' Begin by loading the data that you will use in this activity: data(Hamilton_CT) This is a sf object with census tracts and selected demographic variables for the Hamilton CMA in Canada. You can obtain new (calculated) variables as follows. For instance, to obtain the proportion of residents who are between 20 and 34 years old, and between 35 and 49: Hamilton_CT <- mutate(Hamilton_CT, Prop20to34 = (AGE_20_TO_24 + AGE_25_TO_29 + AGE_30_TO_34)/POPULATION, Prop35to49 = (AGE_35_TO_39 + AGE_40_TO_44 + AGE_45_TO_49)/POPULATION) You can also convert the sf object into a SpatialPolygonsDataFrame object for use with the spdedp package: Hamilton_CT.sp <- as(Hamilton_CT, "Spatial") You are now ready for the next activity. 24.5 Activity NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to organize data, create a plot, and so on in support of analysis and interpretation. These tasks are indicated by a star (*). 1.* Create a spatial weights matrix for the census tracts in the Hamilton CMA. 2.* Use moran.test to test the following variables for spatial autocorrelation: proportion of the population who are 20 to 34 years old, 35 to 49 years old, 50 to 65 years old, and 65 and older. 3.* Use moran.plot() to create Moran’s scatterplots to complement your tests of spatial autocorrelation. 
How confident are you in deciding whether the variables under analysis are not spatially random? What can you say regarding the relative strength of the spatial pattern of these variables? Show a fellow student the Moran’s scatterplots you created in point 3. What can you tell about the spatial pattern based on these scatterplots? Create choropleth maps for the variables. If the spatial pattern is not random, what kind of process might have led to the patterns you observe? The scatterplots created using moran.plot include some observations that are labeled with their id and a different symbol. Why do you think these observations are highlighted in such a way? "],
["area-data-iv.html", "Chapter 25 Area Data IV 25.1 Learning objectives 25.2 Suggested readings 25.3 Preliminaries 25.4 Decomposing Moran’s \\(I\\) 25.5 Local Moran’s \\(I\\) and Mapping 25.6 A Quick Note on Functions 25.7 A Concentration approach for Local Analysis of Spatial Association 25.8 A Short Note on Hypothesis Testing 25.9 Detection of Hot and Cold Spots 25.10 Other Resources", " Chapter 25 Area Data IV NOTE: You can download the source files for this book from here. The source files are in the format of R Notebooks. Notebooks are pretty neat, because the allow you execute code within the notebook, so that you can work interactively with the notes. If you wish to work interactively with this chapter you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. 25.1 Learning objectives In the previous practice/session, you learned about the concept of spatial autocorrelation, and how it can be used to evaluate statistical maps when searching for patterns. We also introduced Moran’s \\(I\\) coefficient, one of the most widely used tools to measure spatial autocorrelation. In this practice, you will learn about: Decomposing Moran’s \\(I\\). Local Moran’s \\(I\\) and mapping. A concentration approach for local analysis of spatial association. A short note on hypothesis testing. Detection of hot and cold spots. 25.2 Suggested readings Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 7. Longman: Essex. Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 9. Springer: New York. Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 7. Sage: Los Angeles. O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey. 25.3 Preliminaries As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity: library(tidyverse) library(sf) library(plotly) library(spdep) library(crosstalk) library(geog4ga3) Load the datasets, first the .RData and then the shape file. data("df1_simulated") data("df2_simulated") These two dataframes are simulated landscapes, one completely random and another stochastic with a strong systematic pattern. Note that the descriptive statistics of both variables are identical.: summary(df1_simulated) ## x y z ## Min. : 1.00 Min. : 1.00 Min. :24.40 ## 1st Qu.:27.00 1st Qu.:19.00 1st Qu.:27.89 ## Median :46.50 Median :33.00 Median :30.33 ## Mean :45.61 Mean :31.63 Mean :34.38 ## 3rd Qu.:66.00 3rd Qu.:45.00 3rd Qu.:38.25 ## Max. :87.00 Max. :61.00 Max. :69.59 summary(df2_simulated) ## x y z ## Min. : 1.00 Min. : 1.00 Min. :24.40 ## 1st Qu.:27.00 1st Qu.:19.00 1st Qu.:27.89 ## Median :46.50 Median :33.00 Median :30.33 ## Mean :45.61 Mean :31.63 Mean :34.38 ## 3rd Qu.:66.00 3rd Qu.:45.00 3rd Qu.:38.25 ## Max. :87.00 Max. :61.00 Max. 
:69.59 The third dataset is an object of class sf (simple feature) with the census tracts of Hamilton CMA and some selected population variables from the 2011 Census of Canada: data(Hamilton_CT) The sf object can be converted into a SpatialPolygonsDataFrame object for use with the spdep package: Hamilton_CT.sp <- as(Hamilton_CT, "Spatial") 25.4 Decomposing Moran’s \\(I\\) Here we will revisit Moran’s \\(I\\) coefficient to see how its utility for the exploration of spatial patterns can be extended. Recall from the preceding reading and activity that this coefficient of spatial autocorrelation was derived based on the idea of aggregating the products of a (mean-centered) variable by its spatial moving average, and then dividing by the sum of squared deviations: \\[ I = \\frac{\\sum_{i=1}^n{z_i\\sum_{j=1}^n{w_{ij}^{st}z_j}}}{\\sum_{i=1}^{n}{z_i^2}} \\] Also, remember that when plotting Moran’s scatterplot using moran.plot() some observations were highlighted. To see this, we will recreate the plot, for which we need a set of spatial weights: Hamilton_CT.w <- nb2listw(poly2nb(pl = Hamilton_CT.sp)) And here is the scatterplot of population density again: # We can use the arguments xlab and ylab in `moran.plot()` to change the labels for the two axes of the plot mp <- moran.plot(Hamilton_CT$POP_DENSITY, Hamilton_CT.w, xlab = "Population Density", ylab = "Lagged Population Density") The reason some observations are highlighted is because they have been identified as “influential”, meaning that they make a particularly large contribution to the calculation of \\(I\\). It turns out that the relative contribution of each observation to the calculation of Moran’s \\(I\\) is informative in and of itself, and its analysis can provide more focused information about the spatial pattern. To explore this, we will recreate the scatterplot manually to have better control of its aspect. To do this, we first create a dataframe with the variable centered on the mean and scaled by its variance, \\(z_i=(x_i-\\overline{x})/\\mathrm{var}(x)\\), and its spatial moving average. We will also create a factor variable (call it Type) to identify the type of spatial relationship (Low & Low, if both \\(z_i\\) and its spatial moving average are negative, High & High, if both \\(z_i\\) and its spatial moving average are positive, and Low & High/High & Low otherwise). This information is useful for mapping the results: Hamilton_CT <- Hamilton_CT %>% # Use the pipe operator to pass the dataframe as an argument to `mutate()`, which is used to create new variables. mutate(Z = (POP_DENSITY - mean(POP_DENSITY)) / var(POP_DENSITY), # Create a mean-centered variable that is standardized by the variance. SMA = lag.listw(Hamilton_CT.w, Z), # Calculate the spatial moving average of variable `Z`. # The function `case_when()` is used to evaluate several logical conditions and respond to them. Type = case_when(Z < 0 & SMA < 0 ~ "LL", Z > 0 & SMA > 0 ~ "HH", TRUE ~ "HL/LH")) Next, we will create the scatterplot and a choropleth map of the population density. The package plotly is used to create interactive plots. Read more about how to visualize geospatial information with plotly here. The package crosstalk allows us to link two plots for brushing (brushing is a visualization technique that links several plots in a dynamic way to highlight some elements of interest). To create an interactive plot for linking and brushing, we first create a SharedData object to link the two plots: # Create a shared data object for brushing.
df_msc.sd <- SharedData$new(Hamilton_CT) The function bscols() (for bootstrap columns) is used to array two plotly objects; the first of these is a scatterplot, and the second is a choropleth map of population density. bscols( # The first plot is Moran's scatterplot plot_ly(df_msc.sd) %>% # Create a `plotly` object using the dataframe as an input. The pipe operator passes this object to the function `add_markers()`; this function is similar to the `geom_point()` function in `ggplot2` and it draws objects on the blank plot created by `plot_ly()` add_markers(x = ~Z, y = ~SMA, color = ~POP_DENSITY, size = ~(Z * SMA), colors = "YlOrRd") %>% hide_colorbar() %>% # Remove the colorbar from the plot. highlight("plotly_selected", persistent = TRUE), # Highlight observations when selected. # The second plot is a choropleth map plot_ly(df_msc.sd) %>% # Create a `plotly` object using the dataframe as an input. The pipe operator passes this object to the function `add_sf()`; this function is similar to the `geom_sf()` functions in `ggplot2` and it draws a simple features object on the blank plot created by `plot_ly()` add_sf(split = ~TRACT, color = ~POP_DENSITY, colors = "YlOrRd", showlegend = FALSE) %>% hide_colorbar() %>% # Remove colorbar from the plot. highlight(dynamic = TRUE, persistent = TRUE) # Highlight observations when selected. ) In the choropleth map, the darker colors correspond to zones with higher population densities, and the size of the dots in the scatterplot indicates the contribution of each zone to Moran’s \\(I\\). Since the plots are linked for brushing, it is possible to select groups of dots in the scatterplot (double click to clear a selection). Change the color for brushing to select a different group of dots. Can you identify in the map the zones that most contribute to Moran’s \\(I\\)? The direct relationship between the dots in the scatterplot and the values of the variable in the map suggests the following decomposition of Moran’s \\(I\\). 25.5 Local Moran’s \\(I\\) and Mapping A possible decomposition of Moran’s \\(I\\) into local components is as follows (see Anselin 1995) (Available here): \\[ I_i = \\frac{z_i}{m_2}\\sum_{j=1}^n{w_{ij}^{st}z_j} \\] where \\(z_i\\) is a mean-centered variable, and: \\[ m_2 = \\sum_{i=1}^n{z_i^2} \\] is the sum of its squared deviations, the same scaling quantity used in the global coefficient. \\(I_i\\) is called local Moran’s \\(I\\). It is straightforward to see that: \\[ I = \\sum_{i=1}^n{I_i} \\] In other words, the coefficients \\(I_i\\) when summed equal \\(I\\). To distinguish between these, we will call our Moran’s \\(I\\) coefficient a global statistic: there is one value for a map and it describes overall autocorrelation. \\(I_i\\), in turn, we will call a local statistic: it can be calculated locally for a location of interest, and describes autocorrelation for that location, as well as the contribution of that location to the global statistic. An advantage of the local decomposition described here is that it allows an analyst to map the statistic to better understand the spatial pattern. The local version of Moran’s \\(I\\) is implemented in spdep as localmoran(), and can be called with a variable and a set of spatial weights as arguments: POP_DENSITY.lm <- localmoran(Hamilton_CT$POP_DENSITY, Hamilton_CT.w) The value (output) of the function is a matrix with local Moran’s \\(I\\) coefficients (i.e., \\(I_i\\)), and their corresponding expected values and variances (used for hypothesis testing; more on this next).
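As a quick check of the decomposition, the local coefficients returned by localmoran() should add up (to numerical precision) to the global coefficient obtained in the previous chapter; a minimal sketch (the column of local coefficients is named Ii in the version of spdep used here, although column names can vary between versions):
sum(POP_DENSITY.lm[, "Ii"]) # Should be approximately 0.518, the global Moran's I of population density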
You can check the summary to verify the contents: summary(POP_DENSITY.lm) ## Ii E.Ii Var.Ii Z.Ii ## Min. :-0.62144 Min. :-0.005348 Min. :0.06421 Min. :-1.67885 ## 1st Qu.: 0.00478 1st Qu.:-0.005348 1st Qu.:0.13340 1st Qu.: 0.02345 ## Median : 0.12523 Median :-0.005348 Median :0.15647 Median : 0.33935 ## Mean : 0.51797 Mean :-0.005348 Mean :0.16681 Mean : 1.32117 ## 3rd Qu.: 0.59384 3rd Qu.:-0.005348 3rd Qu.:0.18876 3rd Qu.: 1.45104 ## Max. : 8.30454 Max. :-0.005348 Max. :0.47938 Max. :19.12671 ## Pr(z > 0) ## Min. :0.00000 ## 1st Qu.:0.07352 ## Median :0.36718 ## Mean :0.31317 ## 3rd Qu.:0.49065 ## Max. :0.95341 Similar to the global version of Moran’s \\(I\\), hypothesis testing can be conducted by comparing the empirical statistic to its distribution under the null hypothesis of spatial independence. The function localmoran reports p-values to this end. For further exploration, join the local statistics to the dataframe: Hamilton_CT <- Hamilton_CT %>% left_join(data.frame(TRACT = Hamilton_CT$TRACT, POP_DENSITY.lm), by = "TRACT") %>% # Join the results of `localmoran()` to the dataframe. rename(p.val = Pr.z...0.) # Use `rename()` to change the name of variables in a dataframe ## Warning: Column `TRACT` joining character vector and factor, coercing into ## character vector Now it is possible to map the local statistics. Since we added the \\(p\\)-value of the local statistics, we can distinguish between those with small (say, less than 0.05) and large \\(p\\)-values: # The function `add_sf()` draws a simple features object, similar to `geom_sf()` in `ggplot2`. We "split" observations based on their p-values: if the p-value is less than 0.05, the condition is "TRUE" and otherwise it is "FALSE". Finally, we color the zones based on their `Type`: that is, whether they are High & High according to the local statistic, or Low & Low, etc. plot_ly(Hamilton_CT) %>% add_sf(split = ~(p.val < 0.05), color = ~Type, colors = c("red", "khaki1", "dodgerblue", "dodgerblue4")) The map above shows whether population density in a zone is high, surrounded by other zones with high population densities (HH), or low, surrounded by zones that also have low population density (LL). Other zones have either low population densities and are surrounded by zones with high population density, or vice versa (HL/LH). Click on the legend to filter by category of TRUE-FALSE and HH-LL-HL/LH. This map allows you to identify what we could call the downtown core (from the perspective of population density), and the most suburban-rural census tracts in the Hamilton CMA. While mapping \\(I_i\\) or their corresponding \\(p\\)-values is straightforward, I personally find it more useful to map whether the zones are of type HH, LL, or HL/LH. Since such maps are not (to the best of my knowledge) the output of an existing function in an R package, we will create one here. # A function is a way of packaging a set of standard instructions.
Here, we package all the steps we used above to create the map of the local Moran coefficients in a new function called `localmoran.map()` localmoran.map <- function(p = p, listw = listw, VAR = VAR, by = by){ # p is a simple features object require(tidyverse) require(spdep) require(plotly) df_msc <- transmute(p, key = p[[by]], Z = (p[[VAR]] - mean(p[[VAR]])) / var(p[[VAR]]), SMA = lag.listw(listw, Z), Type = case_when(Z < 0 & SMA < 0 ~ "LL", Z > 0 & SMA > 0 ~ "HH", TRUE ~ "HL/LH")) local_I <- localmoran(p[[VAR]], listw) df_msc <- left_join(df_msc, data.frame(key = p[[by]], local_I)) df_msc <- rename(df_msc, p.val = Pr.z...0.) plot_ly(df_msc) %>% add_sf(split = ~(p.val < 0.05), color = ~Type, colors = c("red", "khaki1", "dodgerblue", "dodgerblue4")) } Notice how this function simply replicates the steps that we followed earlier to create the map with the results of the local Moran coefficients. To use this function you need as inputs an object of class sf, a listw object with spatial weights, and to define the variable of interest and a unique identifier for the areas (such as their tract identifiers). For example: localmoran.map(Hamilton_CT, Hamilton_CT.w, "POP_DENSITY", by = "TRACT") There, the function creates the map as desired. 25.6 A Quick Note on Functions Once that you know the steps needed to complete a task, if the task needs to be repeated many times possibly using different inputs, a function is a way of packing those instructions in a convenient way. That is all. 25.7 A Concentration approach for Local Analysis of Spatial Association The local version of Moran’s \\(I\\) is one of the most widely used tools of a family of measures called Local Statistics of Spatial Association or LISA. It is not the only one, however. In this section, we will see an alternative way of exploring spatial patterns locally, by means of a concentration approach. To introduce this new approach, imagine a landscape with a variable that can be measured in a ratio scale with a true zero point (say, population, income, a contaminant, or property values, variables that do not take negative values and the value of zero indicates complete absence). Imagine that you stand at a given location on that landscape and survey your surroundings. If your surroundings look very similar to the location where you stand (i.e., if their elevation is similar, relative to the rest of the landscape), you would take that as evidence of a spatial pattern, at least locally. This is the fundamental idea behind spatial autocorrelation analysis. As an alternative, imagine for instance that the variable of interest is, say, personal income. You might ask “how much of the regional wealth can be found in my neighborhood?” (or, if you prefer, imagine that the variable is a contaminant, and your question is, how much of it is around here?) Imagine now that personal income is spatially random. What would you expect the share of the wealth to be in your neighborhood? Would that share change if you moved to any other location? Lets elaborate this thought experiment. Take the df1 dataframe. The total sum of this variable in the region is 12,034.34. See: sum(df1_simulated$z) ## [1] 12034.34 The following is an interactive plot of variable z in the sample dataframe df1. This variable is spatially random: # Define how variables in the table are represented in the plot: for instance, the variable `x` corresponds to the x axis. Next, define the properties of the markers, or geometric objects in the plot. 
For example, their color will be proportional to variable `z` plot_ly(df1_simulated, x = ~x, y = ~y, z = ~z, marker = list(color = ~z, colorscale = c('#FFE1A1', '#683531'), showscale = TRUE)) %>% add_markers() Imagine that you stand at coordinates x = 53 and y = 34 (we will call this location the focal point), and you survey the landscape within a radius \\(r\\) of 10 (units of distance) of this location. How much wealth is concentrated in the neighborhood of the focal point? Let's see: # Define the focal point xy0 <- c(53, 34) # Select a radius r <- 10 # Extract observations that are within a radius of `r` from focal point `xy0` (note that sqrt((x - xy0[1])^2 + (y - xy0[2])^2) is the Pythagorean formula for the distance between two points; if this distance is less than `r`, the point is kept) df1_simulated %>% subset(sqrt((x - xy0[1])^2 + (y - xy0[2])^2) < r) %>% select(z) %>% sum() ## [1] 832.0156 Here, we calculated how much of the variable is present locally around the focal point. Recall that the total of the variable for the region is 12,034.34. If you change the radius r to a very large number, the concentration of the variable will simply become the total sum of the variable for the region. Essentially, the whole region is the “neighborhood” of the focal point. Try it. Now, for a fixed radius, change the focal point, and see how much the concentration of the variable changes for its neighborhood. How does the concentration of the variable vary by focal point? We will now repeat the thought experiment, but with the landscape shown in the following figure: plot_ly(df2_simulated, x = ~x, y = ~y, z = ~z, marker = list(color = ~z, colorscale = c('#FFE1A1', '#683531'), showscale = TRUE)) %>% add_markers() Imagine that you stand at the focal point with coordinates x = 53 and y = 34. Can you identify the point in the plot? If you surveyed the neighborhood, what would be the concentration of wealth there? How would that change as you visited different focal points? Let's see (again, recall that the total of the variable for the whole region is 12,034.34): xy0 <- c(53, 34) # Select a radius r <- 10 # Extract observations that are within a radius of `r` from focal point `xy0` (note that sqrt((x - xy0[1])^2 + (y - xy0[2])^2) is the Pythagorean formula for the distance between two points; if this distance is less than `r`, the point is kept) df2_simulated %>% subset(sqrt((x - xy0[1])^2 + (y - xy0[2])^2) < r) %>% select(z) %>% sum() ## [1] 1316.884 Change the focal point. How does the concentration of the variable change? We are now ready to define the following measure of local concentration (see Getis and Ord, 1992): \\[ G_i^*(d)=\\frac{\\sum_{j=1}^n{w_{ij}x_j}}{\\sum_{i=1}^{n}x_{i}} \\] Notice that the spatial weights are not row-standardized, and in fact must be a binary variable as follows: \\[ w_{ij}=\\bigg\\{\\begin{array}{l l} 1\\text{ if } d_{ij}\\leq d\\\\ 0\\text{ otherwise}\\\\ \\end{array} \\] This is because in this measure of concentration, we do not calculate the spatial moving average for the neighborhood, but the total of the variable in the neighborhood. A variant of this statistic removes from the sum the value of the variable at \\(i\\): \\[ G_i(d)=\\frac{\\sum_{j\\neq i}^n{w_{ij}x_j}}{\\sum_{i=1}^{n}x_{i}} \\] I do not find this definition to be particularly useful. I suspect it was defined to resemble Moran’s \\(I\\) where an area is not its own neighbor - which makes sense in an autocorrelation sense (an area is perfectly autocorrelated with itself).
In a concentration approach, not using the value at \\(i\\) is less appealing. As with the local version of Moran’s \\(I\\), it is possible to map the statistic to better understand the spatial pattern. The \\(G_i^*(d)\\) and \\(G_i(d)\\) statistics are implemented in spdep as localG, and can be called with a variable and a set of spatial weights as arguments. We will calculate this statistic for the two datasets in the example above. This requires that we create binary spatial weights. Begin by creating neighbors by distance: # Create a matrix of coordinates. xy_coord <- cbind(df1_simulated$x, df1_simulated$y) # Find all neighbors that are between 0 and 10 units of distance away from every observation. dn10 <- dnearneigh(xy_coord, 0, 10) There are two differences from the procedure that we used before to create spatial weights. First, when we created spatial weights for Moran’s \\(I\\) coefficient, we stated that an observation is not its own neighbor. For the concentration approach, we might prefer to say that an observation is in the neighborhood of interest (being at its center). For this reason, we might opt to include the observation at \\(i\\) (therefore include.self()). Secondly, the style of the matrix is now "B" (for binary): # Convert the nearest neighbors `nb` object to spatial weights wb10 <- nb2listw(include.self(dn10), style = "B") The local statistics can be obtained as follows: # The arguments of this function are a spatial variable and a list of spatial weights df1.lg <- localG(df1_simulated$z, wb10) The value (output) of the function is a vector of class localG with normalized local statistics. Normalized means that the mean under the null hypothesis has been subtracted and the result has been divided by the standard deviation (the square root of the variance) under the null. Normalized statistics can be compared to the standard normal distribution for hypothesis testing. You can check the summary to verify the contents: summary(df1.lg) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -1.6345 -0.5085 0.1401 0.0657 0.5911 2.6638 The function localG() does not report the \\(p\\)-values, but they are relatively easy to calculate: df1.lg <- as.numeric(df1.lg) df1.lg <- data.frame(Gstar = df1.lg, p.val = 2 * pnorm(abs(df1.lg), lower.tail = FALSE)) How many of the \\(p\\)-values are less than the conventional decision cutoff of 0.05? Now the second example: df2.lg <- localG(df2_simulated$z, wb10) summary(df2.lg) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -4.2400 -2.6791 -1.3999 0.1503 2.3938 12.2401 Adding \\(p\\)-values: df2.lg <- as.numeric(df2.lg) df2.lg <- data.frame(Gstar = df2.lg, p.val = 2 * pnorm(abs(df2.lg), lower.tail = FALSE)) If we bind the results of the \\(G_i^*(d)\\) analysis to the dataframe, we can plot the results for further exploration. We will classify the results by their type, in this case high and low concentrations: df2 <- cbind(df2_simulated[,1:3],df2.lg) df2 <- df2 %>% mutate(Type = case_when(Gstar < 0 & p.val <= 0.05 ~ "Low Concentration", Gstar > 0 & p.val <= 0.05 ~ "High Concentration", TRUE ~ "Not Significant")) And then the plot, but now color the points depending on whether they are high or low concentrations, and whether their \\(p\\)-values are lower than 0.05: plot_ly(df2, x = ~x, y = ~y, z = ~z, color = ~Type, colors = c("red", "blue", "gray"), marker = list()) %>% add_markers() What kind of pattern do you observe? 25.8 A Short Note on Hypothesis Testing Local tests as introduced above are affected by an issue called multiple testing.
Typically, when attempting to assess the significance of a statistic, a level of significance is adopted (conventionally 0.05). When working with local statistics, we typically conduct many tests of hypothesis simultaneously (in the example above, one for each observation). A risk when conducting a large number of tests is that some of them might appear significant purely by chance! The more tests we conduct, the more likely that at least a few of them will appear to be significant by chance. For instance, in the preceding example the variable in df1 was spatially random, and yet a few observations had p-values smaller than 0.05. What this suggests is that some correction to the level of significance used is needed. A crude rule to make this adjustment is called a Bonferroni correction. This correction is as follows: \\[ \\alpha_B = \\frac{\\alpha_{nominal}}{m} \\] where \\(\\alpha_{nominal}\\) is the nominal level of significance, \\(\\alpha_B\\) is the adjusted level of significance, and \\(m\\) is the number of simultaneous tests. This correction requires that each test be evaluated at a lower level of significance \\(\\alpha_B\\) in order to achieve a nominal level of significance of 0.05. If we apply this correction to the analysis above, we see that instead of 0.05, the p-value needed for significance is much lower: alpha_B <- 0.05/nrow(df1_simulated) alpha_B ## [1] 0.0001428571 You can verify now that no observations in df1 show up as significant: sum(df1.lg$p.val <= alpha_B) ## [1] 0 If we examine the variable in df2: df2 <- mutate(df2, Type = factor(ifelse(Gstar < 0 & p.val <= alpha_B, "Low Concentration", ifelse(Gstar > 0 & p.val <= alpha_B, "High Concentration", "Not Significant")))) plot_ly(df2, x = ~x, y = ~y, z = ~z, color = ~Type, colors = c("red", "blue", "gray"), marker = list()) %>% add_markers() You will see that fewer observations are significant, but it is still possible to detect two regions of high concentration, and two of low concentration. The Bonferroni correction is known to be overly strict, and sharper approaches exist to correct for multiple testing. Between the nominal level of significance (no correction) and the level of significance with Bonferroni correction, it is still possible to assess the gravity of the issue of multiple comparisons. Observations that are flagged as significant with the Bonferroni correction will also be significant under more refined corrections, so it provides the most conservative decision rule. 25.9 Detection of Hot and Cold Spots As the examples above illustrate, local statistics can be very useful in detecting what might be termed “hot” and “cold” spots. A hot spot is a group of observations that are significantly high, whereas a cold spot is a group of observations that are significantly low. There are many different applications where hot/cold spot detection is important. For instance, in many studies of urban form, it is important to identify centers and subcenters - by population, by property values, by incidence of trips, and so on. In spatial criminology, detecting hot spots of crime can help with prevention and law enforcement efforts. In environmental studies, remediation efforts can be greatly assisted by identification of hot areas. In spatial epidemiology hot spots can indicate locations where a large number of cases of a disease have been observed. There are countless applications of this. 25.10 Other Resources Check a cool app that illustrates the \\(G_i^*\\) statistic here References "],
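A note complementing Section 25.8: base R provides p.adjust(), which implements the Bonferroni correction along with some of the sharper alternatives alluded to above (for example, Holm's method). A minimal sketch, assuming the df2.lg dataframe computed in the chapter:
p_holm <- p.adjust(df2.lg$p.val, method = "holm") # Holm-adjusted p-values; Holm's method is never more conservative than Bonferroni
sum(p_holm <= 0.05) # Number of observations still flagged as significant after the adjustment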
["activity-12-area-data-iv.html", "Chapter 26 Activity 12: Area Data IV 26.1 Practice questions 26.2 Learning objectives 26.3 Suggested reading 26.4 Preliminaries 26.5 Activity", " Chapter 26 Activity 12: Area Data IV 26.1 Practice questions Answer the following questions: How are row-standardized and binary spatial weights interpreted? What is the reason for using a Bonferroni correction for multiple tests? What types of spatial patterns can the local version of Moran’s I detect? What types of spatial patterns can the \\(G_i(d)\\) statistic detect? What is the utility of detecting hot and cold spatial spots? 26.2 Learning objectives In this activity, you will: Calculate Moran’s I coefficient of autocorrelation for area data. Create Moran’s scatterplots. Examine the results of the tests/scatterplots for further insights. Think about ways to decide whether a landscape is random when working with area data. 26.3 Suggested reading O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey. 26.4 Preliminaries For this activity you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity. In addition to tidyverse, you will need sf, a package that implements simple features in R (you can learn about sf here) and spdep, a package that implements several spatial statistical methods (you can learn more about it here): library(tidyverse) library(sf) library(spdep) library(geog4ga3) ## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when ## loading 'geog4ga3' ## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading ## 'geog4ga3' Begin by loading the data that you will use in this activity: data(Hamilton_CT) This is a sf object with census tracts and selected demographic variables for the Hamilton CMA in Canada. You can obtain new (calculated) variables as follows. For instance, to obtain the proportion of residents who are between 20 and 34 years old, and between 35 and 49: Hamilton_CT <- mutate(Hamilton_CT, Prop20to34 = (AGE_20_TO_24 + AGE_25_TO_29 + AGE_30_TO_34)/POPULATION, Prop35to49 = (AGE_35_TO_39 + AGE_40_TO_44 + AGE_45_TO_49)/POPULATION) You can also convert the sf object into a SpatialPolygonsDataFrame object for use with the spdedp package: Hamilton_CT.sp <- as(Hamilton_CT, "Spatial") This function is used to create local Moran maps: localmoran.map <- function(p = p, listw = listw, VAR = VAR, by = by){ require(tidyverse) require(spdep) require(plotly) df_msc <- transmute(p, key = p[[by]], Z = (p[[VAR]] - mean(p[[VAR]])) / var(p[[VAR]]), SMA = lag.listw(listw, Z), Type = case_when(Z < 0 & SMA < 0 ~ "LL", Z > 0 & SMA > 0 ~ "HH", TRUE ~ "HL/LH")) local_I <- localmoran(p[[VAR]], listw) df_msc <- left_join(df_msc, data.frame(key = p[[by]], local_I)) df_msc <- rename(df_msc, p.val = Pr.z...0.) 
plot_ly(df_msc) %>% add_sf(split = ~(p.val < 0.05), color = ~Type, colors = c("red", "khaki1", "dodgerblue", "dodgerblue4")) } This function is used to create \\(G_i^*\\) maps: gistar.map <- function(p = p, listw = listw, VAR = VAR, by = by){ require(tidyverse) require(spdep) require(sf) require(plotly) p <- mutate(p, key = p[[by]]) df.lg <- localG(p[[VAR]], listw) df.lg <- as.numeric(df.lg) df.lg <- data.frame(Gstar = df.lg, p.val = 2 * pnorm(abs(df.lg), lower.tail = FALSE)) df.lg <- mutate(df.lg, Type = case_when(Gstar < 0 & p.val <= 0.05 ~ "Low Concentration", Gstar > 0 & p.val <= 0.05 ~ "High Concentration", TRUE ~ "Not Significant")) p <- left_join(p, data.frame(key = p[[by]], df.lg)) plot_ly(p) %>% add_sf(split = ~(p.val < 0.05), color = ~Type, colors = c("red", "dodgerblue", "gray")) } Create spatial weights. By contiguity: Hamilton_CT.w <- nb2listw(poly2nb(pl = Hamilton_CT.sp)) Binary, by distance (3 km threshold), including self: Hamilton_CT.3knb <- Hamilton_CT.sp %>% coordinates() %>% dnearneigh(d1 = 0, d2 = 3) Hamilton_CT.3kw <- nb2listw(include.self(Hamilton_CT.3knb), style = "B") You are now ready for the next activity. 26.5 Activity 1.* Create local Moran maps for the population in age group 20-34 and the proportion of population in age group 20-34. 2.* Use the \\(G_i^*\\) statistic to analyze the population and proportion of population in the age group 20-34. 3.* Now create local Moran maps for the population and population density in the age group 20-34. Concerning the analysis in point 3: What is the difference between using population (absolute) and population density (rate)? Concerning the analysis in point 2: What is the difference between using population (absolute) and proportion of population (rate)? Is there a reason to prefer either variable in analysis? Discuss. More generally, what do you think should guide the decision of whether to analyze variables as absolute values or rates? "],
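For orientation, a usage sketch (not part of the original activity instructions) showing how the helper functions defined above can be called with the objects created in these preliminaries; Prop20to34 and TRACT are columns of Hamilton_CT:

```r
# Usage sketch for the helper functions defined above.
# Local Moran map of the proportion of residents aged 20 to 34,
# using the row-standardized contiguity weights:
localmoran.map(Hamilton_CT, Hamilton_CT.w, "Prop20to34", by = "TRACT")

# G_i* map of the same variable, using the binary distance-based
# weights that include each observation as its own neighbor:
gistar.map(Hamilton_CT, Hamilton_CT.3kw, "Prop20to34", by = "TRACT")
```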
["area-data-v.html", "Chapter 27 Area Data V 27.1 Learning Objectives 27.2 Suggested Readings 27.3 Preliminaries 27.4 Regression Analysis in R 27.5 Autocorrelation as a Model Diagnostic 27.6 Variable Transformations 27.7 A Note about Spatial Autocorrelation in Regression Analysis", " Chapter 27 Area Data V NOTE: You can download the source files for this book from here. The source files are in the format of R Notebooks. Notebooks are pretty neat, because the allow you execute code within the notebook, so that you can work interactively with the notes. If you wish to work interactively with this chapter you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. 27.1 Learning Objectives In the previous chapter, you learned how to decompose Moran’s \\(I\\) coefficient into local versions of an autocorrelation statistic. You also learned about a concentration statistics, and saw how these local spatial statistics can be used for exploratory spatial data analysis, for example to search for “hot” and “cold” spots. In this practice, you will: Practice how to estimate regression models in R. Learn about autocorrelation as a model diagnostic. Learn about variable transformations. Use autocorrelation analysis to improve regression models. 27.2 Suggested Readings Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 7. Longman: Essex. Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 9. Springer: New York. Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 7. Sage: Los Angeles. O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey. 27.3 Preliminaries As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity: library(tidyverse) library(ggmap) library(geosphere) library(sf) library(plotly) library(spdep) library(geog4ga3) Next, read an object of class sf (simple feature) with the census tracts of Hamilton CMA and some selected population variables from the 2011 Census of Canada. This dataset will be used for examples in this chapter: data(Hamilton_CT) The sf object can be converted into a SpatialPolygonsDataFrame object for use with the spdedp package: Hamilton_CT.sp <- as(Hamilton_CT, "Spatial") 27.4 Regression Analysis in R Regression analysis is one of the most powerful techniques in the repertoire of data analysis. There are many different forms of regression, and they usually take the following form: \\[ y_i = f(x_{ij}) + \\epsilon_i \\] This is a model for a stochastic process. The outcome is \\(y_i\\), which could be the observed values of a variable \\(y\\) at locations \\(i\\). We will think of these locations as areas, but they could as well be points, nodes on a network, links on a network, etc. The model consists of two components: a systematic/deterministic part, that is \\(f(x_{ij})\\), which is a function of a collection of variables \\(x_{i1}, x_{i2}, \\cdots, x_{ij}, \\cdots, x{ik}\\); and a random part, captured by the term \\(\\epsilon_i\\). 
In this chapter we will deal with one specific form of regression, namely linear regression. A linear regression model posits (as the name implies) linear relationships between an outcome, called a dependent variable, and one or more covariates, called independent variables. It is important to note that regression models capture statistical relationships, not causal relationships. Even so, causality is often implied by the choice of independent variables. In a way, regression analysis is a tool to infer process from pattern: it is a formula that aims to retrieve the elements of the process based on our observations of the outcome. This is the form of a linear regression model: \\[ y_i = f(x_{ij}) + \\epsilon_i = \\beta_0 + \\sum_{j=1}^k{\\beta_jx_{ij}} + \\epsilon_i \\] where \\(y_i\\) is the dependent variable and \\(x_ij\\) (\\(j=1,...,k\\)) are the independent variables. The coefficients \\(\\beta\\) are not known, but can be estimated from the data. And \\(\\epsilon_i\\) is the random term, which in regression analysis is often called a residual (or error), because it is the difference between the systematic term of the model and the value of \\(y_i\\): \\[ \\epsilon_i = y_i - \\bigg(\\beta_0 + \\sum_{j=1}^k{\\beta_jx_{ij}}\\bigg) \\] Estimation of a linear regression model is the procedure used to obtain values for the coefficients. This typically involves defining a loss function that needs to be minimized. In the case of linear regression, a widely used estimation procedure is least squares. This procedure allows a modeller to find the coefficients that minimize the sum of squared residuals, which become the loss function for the procedure. In very simple terms, the protocol is as follows: \\[ \\text{Find the values of }\\beta\\text{ that minimize }\\sum_{i=1}^n{\\epsilon_i^2} \\] For this procedure to be valid, there are a few assumptions that need to be satisfied, including: The functional form of the model is correct. The independent variables are not collinear; this is often diagnosed by calculating the correlations among the independent variables, with values greater than 0.8 often being problematic. The residuals have a mean of zero: \\[ E[\\epsilon_i|X]=0 \\] The residuals have constant variance: \\[ Var[\\epsilon_i|X] = \\sigma^2 \\text{ }\\forall i \\] The residuals are independent, that is, they are not correlated among them: \\[ E[\\epsilon_i\\epsilon_j|X] = 0 \\text{ }\\forall i\\neq j \\] The last three assumptions ensure that the residuals are random. Violation of these assumptions is often a consequence of a failure in the first two (i.e., the model was not properly specified, and/or the residuals are not exogenous). When all these assumptions are met, the coefficiens are said to be BLUE: Best Linear Unbiased Estimates - a desirable property because we wish to be able to quantify the relationships between covariates without bias. This section provides a refresher on linear regression, before reviewing the estimation of regression models in R. The basic command for multivariate linear regression in R is lm(), for “linear model”. This is the help file of this function: # Remember that we can search the definition of a function by using a question mark in front of the function itself. ?lm ## starting httpd help server ... done We will see now how to estimate a model using this function. The example we will use is of urban population density gradients. Population density gradients are representations of the variation of population density in cities. 
These gradients are of interest because they are related to land rent, urban form, and commuting patterns, among other things (see the accompanying reading for more information). Urban economic theory suggests that population density declines with distance from the central business district of a city, or its CBD. This leads to the following model, where the population density at location \\(i\\) is a function of the distance of \\(i\\) to the CBD. Since this is likely a stochastic process, we allow for some randomness by means of the residuals: \\[ P_i = f(D_i) + \\epsilon_i \\] To implement this model, we need to add distance to the CBD as a covariate in our dataframe. We will use Jackson Square, a central shopping mall in Hamilton, as the CBD of the city: # The function `c()` concatenates its arguments, that is, places them in a vector. Here, we create a vector with the coordinates of Jackson Square. xy_cbd <- c(-79.8708, 43.2584) To calculate the distance from the census tracts to the CBD, we retrieve the coordinates of the centroids: # We need to retrieve the spatial coordinates of Hamilton_CT.sp by using 'coordinates()' # We use 'spTransform' to transform the coordinates to latitude and longitude with the WGS84 datum. Remember, you can always learn more about the properties of a function by inserting a question mark in front of the function (i.e., ?coordinates) xy_ct <- coordinates(spTransform(Hamilton_CT.sp, CRSobj = "+proj=longlat +datum=WGS84 +no_defs")) Given these coordinates, the function geosphere::distGeo can be used to calculate the great circle distance between the centroids of the census tracts and Hamilton’s CBD. Call this dist.sl, i.e., the straight-line distance to the CBD: # Function `distGeo()` is used to calculate the geodesic distance between two points. Here, we use it to calculate the distance from the centroids of the census tracts to the coordinates of the CBD. We will call this variable `dist.sl`, for "straight line", to remind us what kind of distance this is. dist.sl <- distGeo(xy_ct, xy_cbd) Next, we add our new variable, the distance to the CBD, to our dataframe Hamilton_CT for analysis: Hamilton_CT$dist.sl <- dist.sl Regression analysis is implemented in R by means of the lm function. The arguments of the model include an object of type “formula” and a dataframe. Other arguments include conditions for subsetting the data, using sampling weights, and so on. A formula is written in the form y ~ x1 + x2, and more complex expressions are possible too, as we will see below. For the time being, the formula is simply POP_DENSITY ~ dist.sl: # The function `lm()` implements regression analysis in `R`. Recall that 'dist.sl' is the distance from the CBD (Jackson Square) model1 <- lm(formula = POP_DENSITY ~ dist.sl, data = Hamilton_CT) summary(model1) ## ## Call: ## lm(formula = POP_DENSITY ~ dist.sl, data = Hamilton_CT) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3841.2 -1338.3 -177.1 950.9 10009.1 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4405.15414 250.16355 17.609 < 2e-16 *** ## dist.sl -0.17989 0.02418 -7.439 3.63e-12 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ## ## Residual standard error: 1892 on 186 degrees of freedom ## Multiple R-squared: 0.2293, Adjusted R-squared: 0.2251 ## F-statistic: 55.33 on 1 and 186 DF, p-value: 3.631e-12 The value of the function is an object of class lm that contains the results of the estimation, including the coefficients with their diagnostics, and the coefficient of multiple determination, among other items. Notice how the coefficient for distance is negative (and significant). This indicates that population density declines with increasing distance: \\[ P_i = f(D_i) + \\epsilon_i = 4405.15414 - 0.17989D_i + \\epsilon_i \\] 27.5 Autocorrelation as a Model Diagnostic We can quickly explore the fit of the model. Since our model contains only one independent variable, we can use a scatterplot to see how it relates to population density. The points in the scatterplot are the actual population density and the distance to CBD. We also use the function geom_abline() to add the regression line to the plot, in blue: ggplot(data = Hamilton_CT, aes(x = dist.sl, y = POP_DENSITY)) + geom_point() + geom_abline(slope = model1$coefficients[2], # Recall that `geom_abline()` draws a line with intercept and slope as defined. Here the line is drawn using the coefficients of the regression model we estimated above. intercept = model1$coefficients[1], color = "blue", size = 1) + geom_vline(xintercept = 0) + # We also add the y axis... geom_hline(yintercept = 0) # ...and the x axis. Clearly, there remains a fair amount of noise after this model (the scatter of the dots around the regression line). In this case, the regression line captures the general trend of the data, but seems to underestimate most of the high population density areas closer to the CBD, and it also overestimates many of the low population areas. If the pattern of under- and over-estimation is random (i.e., the residuals are random), that would indicate that the model successfully retrieved all the systematic pattern. If the pattern is not random, there is a violation of assumption of independence. To explore this issue, we will add the residuals of the model to the dataframe: # Here we add the residuals from 'model1' to the dataframe, with the name `model1.e` Hamilton_CT$model1.e <- model1$residuals Since we are interested in statistical maps, we will create a map of the residuals. In this map, we will use red to indicate negative residuals (values of the dependent variable that the model overestimates), and blue for positive residuals (values of the dependent variable that the model underestimates): # Recall that 'plot_ly()' is a function used to create interactive plots plot_ly(Hamilton_CT) %>% # Recal that `add_sf()` is similar to `geom_sf()` and it draws a simple features object on a `plotly` plot. This example adds colours to represent positive (blue) and negative residuals (red). add_sf(color = ~(model1.e > 0), colors = c("red", "dodgerblue4")) In the legend of the plot, “TRUE” means that the residual is positive, and “FALSE” that it is negative. Does the spatial distribution of residuals look random? In this case, visual inspection is very suggestive. In addition, we have the tools to help us with this question, in particular how to make a decision while quantifying our levels of confidence: the \\(p\\)-values of Moran’s \\(I\\) coefficient, for instance. We will create a set of spatial weights: # Here, we use use `poly2nb()` to create a list of neighbors, based on the criterion of adjacency. 
Next, we pass that list of neighbors to `nb2listw()` to create a set of spatial weights. Hamilton_CT.w <- Hamilton_CT.sp %>% poly2nb() %>% nb2listw() Once that we have a set of spatial weights, we can calculate Moran’s \\(I\\): moran.test(Hamilton_CT$model1.e, Hamilton_CT.w) ## ## Moran I test under randomisation ## ## data: Hamilton_CT$model1.e ## weights: Hamilton_CT.w ## ## Moran I statistic standard deviate = 9.7447, p-value < 2.2e-16 ## alternative hypothesis: greater ## sample estimates: ## Moran I statistic Expectation Variance ## 0.395382878 -0.005347594 0.001691092 The results of Moran coefficient support our visual inspection of the map. Notice how we can reject the null hypothesis (spatial randomness) at a very high level of confidence (see the extremely small value of \\(p\\)). Spatial autocorrelation, as mentioned above, is a violation of a key assumption of linear regression, and likely the consequence of a model that was not correctly specified, either because the functional form was incorrect (e.g., the relationship was not linear), or there are missing covariates. We will explore the first of these possibilities by means of variable transformations. 27.6 Variable Transformations The term linear regression refers to the linearity in the coefficients. Variable transformations allow you to consider non-linear relationships between covariates, while still preserving the linearity of the coefficients. For instance, a possible transformation of the variable distance could be its inverse: \\[ f(D_i) = \\beta_0 + \\beta_1\\frac{1}{D_i} \\] Here, we will create a new covariate that is the inverse distance: # Recall that the function `mutate()` adds new variables to an exist dataframe, while preserving those that already exist. Here, we use our variable with the distance to the CBD to create a new variable that is the inverse distance. Hamilton_CT <- mutate(Hamilton_CT, invdist.sl = 1/dist.sl) Once we have the inverse distance, we can estimate a second model using it as the covariate: # Notice how the new 'model2' uses the inverse distance from the CBD rather than the original distance. model2 <- lm(formula = POP_DENSITY ~ invdist.sl, data = Hamilton_CT) summary(model2) ## ## Call: ## lm(formula = POP_DENSITY ~ invdist.sl, data = Hamilton_CT) ## ## Residuals: ## Min 1Q Median 3Q Max ## -6763 -1375 -52 1108 9675 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 2299.6 164.4 13.988 < 2e-16 *** ## invdist.sl 2260521.5 342039.7 6.609 3.97e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1940 on 186 degrees of freedom ## Multiple R-squared: 0.1902, Adjusted R-squared: 0.1858 ## F-statistic: 43.68 on 1 and 186 DF, p-value: 3.965e-10 As the scatterplot below shows (as before, the blue line is the regression line), we can capture a non-linear relationship. This model does a somewhat better job of describing the high density of tracts close to the CBD. 
Unfortunately, it is a poor description of density almost everywhere else: ggplot(data = Hamilton_CT, aes(x = dist.sl, y = POP_DENSITY)) + geom_point() + stat_function(fun=function(x)2299.6 + 2260521.5/x, geom="line", color = "blue", size = 1) + geom_vline(xintercept = 0) + geom_hline(yintercept = 0) We will add the residuals of this model to the dataframe for further examination, in particular testing for spatial autocorrelation: Hamilton_CT$model2.e <- model2$residuals If we calculate Moran’s \\(I\\), we notice that the coefficient is lower than for the previous model but the \\(p\\)-value is still very low, which means that we can confidently reject the hypothesis that the residuals are random. But we would actually prefer not to reject this hypothesis, since we would like the residuals to be random! moran.test(Hamilton_CT$model2.e, Hamilton_CT.w) ## ## Moran I test under randomisation ## ## data: Hamilton_CT$model2.e ## weights: Hamilton_CT.w ## ## Moran I statistic standard deviate = 8.8236, p-value < 2.2e-16 ## alternative hypothesis: greater ## sample estimates: ## Moran I statistic Expectation Variance ## 0.358132814 -0.005347594 0.001696970 The results of the test suggest that the model still fails at capturing the systematic aspects of the population density gradient, so we need to investigate this further. The literature on population density gradients suggests other non-linear transformations, including: \\[ f(D_i) = exp(\\beta_0)exp(\\beta_1x_i) \\] This function is no longer linear in the coefficients (since the coefficients \\(\\beta_0\\) and \\(\\beta_1\\) are transformed by the exponential). Fortunately, there is a simple way of changing this to a linear expression, by taking the logarithm on both sides of the equation: \\[ ln(P_i) = \\beta_0 + \\beta_1x_i \\] By transforming the dependent variable we obtain a function that is linear in the parameters. To implement this model, we need to create a new variable that is the logarithm of population density: # Here we use `mutate()` to create a new variable that is the natural logarithm of population density. This log-transformation of the dependent variable is what makes the model linear in the coefficients. Hamilton_CT <- mutate(Hamilton_CT, lnPOP_DEN = log(POP_DENSITY)) This allows us to estimate a third model, as follows: model3 <- lm(formula = lnPOP_DEN ~ dist.sl, data = Hamilton_CT) summary(model3) ## ## Call: ## lm(formula = lnPOP_DEN ~ dist.sl, data = Hamilton_CT) ## ## Residuals: ## Min 1Q Median 3Q Max ## -5.5857 -0.3395 0.2970 0.6897 2.4224 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 8.465e+00 1.588e-01 53.294 < 2e-16 *** ## dist.sl -1.161e-04 1.536e-05 -7.561 1.77e-12 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.202 on 186 degrees of freedom ## Multiple R-squared: 0.2351, Adjusted R-squared: 0.231 ## F-statistic: 57.18 on 1 and 186 DF, p-value: 1.77e-12 We can recreate the scatterplot and add the regression line. 
Notice that to create the line, we revert the coefficients to the exponential form of the model: ggplot(data = Hamilton_CT, aes(x = dist.sl, y = POP_DENSITY)) + geom_point() + stat_function(fun=function(x)exp(8.465 - 0.0001161 * x), geom="line", color = "blue", size = 1) + geom_vline(xintercept = 0) + geom_hline(yintercept = 0) As before, we can add the residuals of the model to the dataframe for further examination: Hamilton_CT$model3.e <- model3$residuals While this latest model provides a somewhat better fit, there is still systematic under- and over-prediction, as seen in the map below (red are negative residuals and blue are positive): plot_ly(Hamilton_CT) %>% add_sf(color = ~(model3.e > 0), colors = c("red", "dodgerblue4")) Moran’s \\(I\\) as well strongly suggests that the residuals are still not random/independent: moran.test(Hamilton_CT$model3.e, Hamilton_CT.w) ## ## Moran I test under randomisation ## ## data: Hamilton_CT$model3.e ## weights: Hamilton_CT.w ## ## Moran I statistic standard deviate = 8.0156, p-value = 5.478e-16 ## alternative hypothesis: greater ## sample estimates: ## Moran I statistic Expectation Variance ## 0.325928113 -0.005347594 0.001708056 27.7 A Note about Spatial Autocorrelation in Regression Analysis Spatial autocorrelation was originally seen as a problem in regression analysis. It is not difficult to see why, after testing three models in this chapter. My preference is to view spatial autocorrelation as an opportunity for discovery. For instance, the models above all seem to struggle to capture the large variations in population density between the central parts of the city and the suburbs of Hamilton. Perhaps this could be due to a regime change, or in other words, the presence of an underlying process that operates somewhat differently in different parts parts of the city. The latest model we estimated (model3), for instance, suggests that the close proximity of Burlington might have an effect. The analysis that follows is somewhat more advanced, but serves to illustrate the idea of spatial autocorrelation as a tool for discovery. We will begin by creating local Moran maps to identify potential “hot” and “cold” spots of population density. We can envision these as representing different spatial regimes: localmoran.map(Hamilton_CT, Hamilton_CT.w, "POP_DENSITY", by = "TRACT") Examination of the map above, suggests that there are possibly three regimes: a CBD (“HH” and significant tracts), Suburbs (“LL” and significant tracts), and Other (not significant tracts). Based on this, we will create two indicator variables, one for census tracts in the CBD and another for census tracts in the Suburbs. An indicator variable takes values of 1 or zero, depending on whether a condition is true. For instance, all census tracts in the CBD will take the value of 1 in the CBD indicator variable, and all others will be zero. Begin by computing the local statistics: POP_DEN.lm <- localmoran(Hamilton_CT$POP_DENSITY, listw = Hamilton_CT.w) Next, we will identify the type of tract based on the spatial relationships according to the local statistics (i.e., “HH”, “LL”, or “HL/LH”). df_msc <- transmute(Hamilton_CT, TRACT = TRACT, Z = (POP_DENSITY - mean(POP_DENSITY)) / var(POP_DENSITY), SMA = lag.listw(Hamilton_CT.w, Z), Type = case_when(Z < 0 & SMA < 0 ~ "LL", Z > 0 & SMA > 0 ~ "HH", TRUE ~ "HL/LH")) After that, identify as CBD all tracts for which Type is “HH” and the p-value is less than or equal to 0.05. 
Likewise, identify as Suburb all tracts for which Type is “LL” and the \\(p\\)-value is also less than or equal to 0.05: df_msc <- cbind(df_msc, POP_DEN.lm) CBD <- ifelse(df_msc$Type == "HH" & df_msc$`Pr.z...0` < 0.05, 1, 0) Suburb <- ifelse(df_msc$Type == "LL" & df_msc$`Pr.z...0` < 0.05, 1, 0) We then add the indicator variables to the dataframe: Hamilton_CT$CBD <- CBD Hamilton_CT$Suburb <- Suburb The model that I propose to estimate is a variation of the last non-linear specification, but with regime breaks: \\[ ln(P_i) = \\beta_0 + \\beta_1x_i + \\beta_2CBD_i + \\beta_3Suburb_i + \\beta_4CBD_ix_i + \\beta_5Suburb_ix_i + \\epsilon_i \\] Since the indicator variables for CBD and Suburb take values of zero and one, effectively we have the following: \\[ ln(P_i)=\\Bigg\\{\\begin{array}{l l} (\\beta_0 + \\beta_2) + (\\beta_1 + \\beta_4)x_i + \\epsilon_i \\text{ if census tract } i \\text{ is in the CBD}\\\\ (\\beta_0 + \\beta_3) + (\\beta_1 + \\beta_5)x_i + \\epsilon_i \\text{ if census tract } i \\text{ is in the Suburbs}\\\\ \\beta_0 + \\beta_1x_i + \\epsilon_i \\text{ otherwise}\\\\ \\end{array} \\] Notice that the model now allows for different slopes and intercepts for observations in different parts of the city. Estimate the model: model4 <- lm(formula = lnPOP_DEN ~ dist.sl + CBD * dist.sl + Suburb * dist.sl, data = Hamilton_CT) summary(model4) ## ## Call: ## lm(formula = lnPOP_DEN ~ dist.sl + CBD * dist.sl + Suburb * dist.sl, ## data = Hamilton_CT) ## ## Residuals: ## Min 1Q Median 3Q Max ## -6.4523 -0.2128 0.1806 0.4907 2.3234 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 7.943e+00 1.571e-01 50.577 < 2e-16 *** ## dist.sl -3.248e-05 1.561e-05 -2.081 0.03880 * ## CBD 9.536e-01 3.788e-01 2.518 0.01267 * ## Suburb -9.535e-01 5.089e-01 -1.874 0.06260 . ## dist.sl:CBD -2.730e-05 1.410e-04 -0.194 0.84669 ## dist.sl:Suburb -1.006e-04 3.556e-05 -2.828 0.00521 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.9384 on 182 degrees of freedom ## Multiple R-squared: 0.5435, Adjusted R-squared: 0.5309 ## F-statistic: 43.33 on 5 and 182 DF, p-value: < 2.2e-16 This model provides a much better fit than the preceding models (see the coefficient of multiple determination). We can visually examine the spatial distribution of the residuals by means of the following map: Hamilton_CT$model4.e <- model4$residuals plot_ly(Hamilton_CT) %>% add_sf(color = ~(model4.e > 0), colors = c("red", "dodgerblue4")) It is not clear from the visual inspection that the residuals are independent, but this can be tested as usual by means of Moran’s \\(I\\) coefficient: moran.test(Hamilton_CT$model4.e, Hamilton_CT.w) ## ## Moran I test under randomisation ## ## data: Hamilton_CT$model4.e ## weights: Hamilton_CT.w ## ## Moran I statistic standard deviate = 0.74498, p-value = 0.2281 ## alternative hypothesis: greater ## sample estimates: ## Moran I statistic Expectation Variance ## 0.024432414 -0.005347594 0.001597957 Based on the results, we fail to reject the null hypothesis, and can be fairly confident that the residuals are random, at last. 
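As an aside (not part of the original text), the spdep package also provides lm.morantest(), which tests the residuals of a fitted lm object for spatial autocorrelation while adjusting the moments of Moran's I for the fact that the residuals come from an estimated model. A sketch, assuming model4 and Hamilton_CT.w from above are available:

```r
# Sketch: Moran's I test adapted to regression residuals.
# Assumes model4 and Hamilton_CT.w exist, as created earlier in this chapter.
lm.morantest(model4, Hamilton_CT.w, alternative = "greater")
```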
The following figure illustrates the final model: # We will create three functions to represent each of the three regimes in `model4` fun.1 <- function(x)exp(7.943 + 0.9536 - (0.00003248 + 0.0000273) * x) #CBD fun.2 <- function(x)exp(7.943 - 0.9535 - (0.00003248 + 0.0001006) * x) #Suburb fun.3 <- function(x)exp(7.943 - 0.00003248 * x) #Other ggplot(data = Hamilton_CT, aes(x = dist.sl, y = POP_DENSITY)) + geom_point() + geom_point(data = filter(Hamilton_CT, CBD == 1), color = "Red") + geom_point(data = filter(Hamilton_CT, Suburb == 1), color = "Blue") + # `stat_function()` draws custom functions on a `ggplot2` plot. stat_function(fun = fun.1, geom="line", size = 1, aes(color = "CBD")) + stat_function(fun = fun.2, geom="line", size = 1, aes(color = "Suburb")) + stat_function(fun = fun.3, geom="line", size = 1, aes(color = "Other")) + # Set the colors of the regression lines scale_color_manual(values = c("red", "black", "blue"), labels = c("CBD", "Other", "Suburb")) + geom_vline(xintercept = 0) + # Add the y axis... geom_hline(yintercept = 0) # ...and the x axis. This example illustrates how exploratory spatial analysis can provide valuable insights to improve our models, and in turn hopefully help us develop a better understanding of the underlying process. References "],
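To recap the chapter's progression, the following sketch (assuming model1 through model4 and Hamilton_CT.w are still in the workspace) collects Moran's I of the residuals of each model; the statistic should shrink toward its expectation of about -0.005 as the specification improves:

```r
# Recap sketch: Moran's I of the residuals of the four models in this chapter.
models <- list(model1 = model1, model2 = model2,
               model3 = model3, model4 = model4)
sapply(models, function(m) {
  moran.test(residuals(m), Hamilton_CT.w)$estimate["Moran I statistic"]
})
```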
["activity-13-area-data-v.html", "Chapter 28 Activity 13: Area Data V 28.1 Practice questions 28.2 Learning objectives 28.3 Suggested reading 28.4 Preliminaries 28.5 Activity", " Chapter 28 Activity 13: Area Data V 28.1 Practice questions Answer the following questions: Explain the main assumptions for linear regression models. How is Moran’s \\(I\\) used as a diagnostic in regression analysis? Residual spatial autocorrelation is symptomatic of what issues in regression analysis? What does it mean for a model to be linear in the coefficients? What is the purpose of transforming variables for regression analysis? 28.2 Learning objectives In this activity, you will: Explore a spatial dataset. Conduct linear regression analysis. Conduct diagnostics for residual spatial autocorrelation. Propose ways to improve your analysis. 28.3 Suggested reading O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey. 28.4 Preliminaries For this activity you will need the following: An R markdown notebook version of this document (the source file). A package called geog4ga3. It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following: rm(list = ls()) Note that ls() lists all objects currently on the worspace. Load the libraries you will use in this activity. In addition to tidyverse, you will need sf and geog4ga3: library(tidyverse) library(sf) library(spdep) library(geog4ga3) Begin by loading the data files you will use in this activity: data("HamiltonDAs") data("trips_by_mode") data("travel_time_car") HamiltonDAs are the Dissemination Areas for Hamilton CMA, which coincide with the Traffic Analysis Zones (TAZ) of the Transportation Tomorrow Survey of 2011. The dataframe trips_by_mode includes the number of trips by mode of transportation by TAZ (equivalently DA), as well as other useful information from the 2011 census for Hamilton CMA. Finally, the dataframe travel_time_car includes the travel distance/time from TAZ/DA centroids to Jackson Square in downtown Hamilton. The data for this activity were retrieved from the 2011 Transportation Tomorrow Survey TTS, the periodic travel survey of the Greater Toronto and Hamilton Area, as well as data from the 2011 Canadian Census Census Program. Before beginning the activity, join the information on trips and travel time to the sf object. Note that to complete the join, the identifier (in this case GTA06) must be in the same format in both data frames: travel_time_car$GTA06 <- factor(travel_time_car$GTA06) # Travel time HamiltonDAs <- left_join(HamiltonDAs, travel_time_car, by = "GTA06") # Trips by mode HamiltonDAs <- left_join(HamiltonDAs, trips_by_mode, by = "GTA06") ## Warning: Column `GTA06` joining factor and character vector, coercing into ## character vector The analysis will be based on travel by car in the Hamilton CMA. Calculate the proportion of trips by car by TAZ: HamiltonDAs <- mutate(HamiltonDAs, Auto_driver.prop = Auto_driver / (Auto_driver + Cycle + Walk)) Note that the proportion of people who travelled by car as passengers are not included in the denominator of the proportion! This is because every trip as a passenger is already included in trips with one driver. 28.5 Activity 1.* Examine your dataframe. What variables are included? 
Are there any missing values? 2.* Map the variable Auto_driver.prop, and use Moran’s I to test for spatial autocorrelation. 3.* Estimate a regression model for Auto_driver.prop using the variables Pop_Density and travel time in minutes as covariates. What does the analysis of autocorrelation in point 2 tell you about Auto_driver.prop? Would you say that autocorrelation in this variable is a sign that autocorrelation will be an issue in regression analysis? Why or why not? Discuss the model you estimated in point 3. Next, examine its residuals. Would you say that they are spatially random/independent? Propose ways to improve your model. "]
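A hedged starter sketch for this activity (the column names Pop_Density and travel_time are assumptions; check names(HamiltonDAs) for the actual variables, and deal with any missing values before testing):

```r
# Starter sketch only; adjust variable names to match the dataframe.
# Contiguity-based spatial weights for the TAZ/DA polygons:
HamiltonDAs.w <- HamiltonDAs %>%
  as("Spatial") %>%
  poly2nb() %>%
  nb2listw()

# Spatial autocorrelation of the proportion of trips as auto driver:
moran.test(HamiltonDAs$Auto_driver.prop, HamiltonDAs.w)

# A first regression model and a check of its residuals:
model_taz <- lm(Auto_driver.prop ~ Pop_Density + travel_time, data = HamiltonDAs)
moran.test(residuals(model_taz), HamiltonDAs.w)
```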
]