From aeb34f18aa1f2586c14ffa2f2673f48f56e4de87 Mon Sep 17 00:00:00 2001
From: Pedro Amaral
Date: Wed, 9 Oct 2024 18:54:34 -0300
Subject: [PATCH] Adding series of notebooks, parts 1 to 4 of 17.

This series of notebooks shows step-by-step instructions for the estimation and interpretation of spatial regression models using PySAL/spreg.
---
 notebooks/1_sample_data.ipynb | 385 ++++++++
 notebooks/2_data_input_output.ipynb | 795 +++++++++++++++
 notebooks/3_basic_mapping.ipynb | 674 +++++++++++++
 notebooks/4_spatial_weights.ipynb | 1424 +++++++++++++++++++++++++++
 4 files changed, 3278 insertions(+)
 create mode 100644 notebooks/1_sample_data.ipynb
 create mode 100644 notebooks/2_data_input_output.ipynb
 create mode 100644 notebooks/3_basic_mapping.ipynb
 create mode 100644 notebooks/4_spatial_weights.ipynb

diff --git a/notebooks/1_sample_data.ipynb b/notebooks/1_sample_data.ipynb
new file mode 100644
index 00000000..0358ca90
--- /dev/null
+++ b/notebooks/1_sample_data.ipynb
@@ -0,0 +1,385 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "7b8975c4",
+ "metadata": {},
+ "source": [
+ "# PySAL Sample Data Sets\n",
+ "\n",
+ "### Luc Anselin\n",
+ "\n",
+ "### (revised 09/06/2024)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4cfd0985",
+ "metadata": {},
+ "source": [
+ "## Preliminaries\n",
+ "\n",
+ "In this notebook, the installation and input of PySAL sample data sets are reviewed.\n",
+ "\n",
+ "A video recording is available from the GeoDa Center YouTube channel playlist *Applied Spatial Regression - Notebooks*, at https://www.youtube.com/watch?v=qwnLkUFiSzY&list=PLzREt6r1NenmhNy-FCUwiXL17Vyty5VL6."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a443690f",
+ "metadata": {},
+ "source": [
+ "### Prerequisites\n",
+ "\n",
+ "Very little is assumed in terms of prerequisites. Sample data files are examined and loaded with *libpysal*, and *geopandas* is used to read the data. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6494b68c",
+ "metadata": {},
+ "source": [
+ "### Modules Needed\n",
+ "\n",
+ "The three modules needed to work with sample data are *libpysal*, *pandas* and *geopandas*. \n",
+ "\n",
+ "Some additional imports are included to avoid excessive warning messages. With later versions of PySAL, these may not be needed."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "e398e42f",
+ "metadata": {
+ "collapsed": true,
+ "jupyter": {
+ "outputs_hidden": true
+ },
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "import warnings\n",
+ "warnings.filterwarnings(\"ignore\")\n",
+ "import os\n",
+ "os.environ['USE_PYGEOS'] = '0'\n",
+ "\n",
+ "import pandas as pd\n",
+ "import geopandas as gpd\n",
+ "import libpysal"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4deb9fda",
+ "metadata": {},
+ "source": [
+ "In order to have some more flexibility when listing the contents of data frames, the `display.max_rows` option is set to 100 (this step can easily be skipped, but then the listing of example data sets below will be incomplete)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dda117c5",
+ "metadata": {
+ "collapsed": true,
+ "jupyter": {
+ "outputs_hidden": true
+ }
+ },
+ "outputs": [],
+ "source": [
+ "pd.options.display.max_rows = 100\n",
+ "pd.options.display.max_rows"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1ac85fb3",
+ "metadata": {},
+ "source": [
+ "### Functionality Used\n",
+ "\n",
+ "- from pandas/geopandas:\n",
+ " - read_file\n",
+ " \n",
+ "- from libpysal:\n",
+ " - examples.available\n",
+ " - examples.explain\n",
+ " - examples.load_example\n",
+ " - examples.get_path"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b9b0c168",
+ "metadata": {},
+ "source": [
+ "### Input Files"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "74ab0075",
+ "metadata": {},
+ "source": [
+ "All notebooks used for this course are organized such that the relevant filenames and variable names are listed at the top, so that they can be easily adjusted for use with your own data sets and variables. In this notebook, the use of PySAL sample data sets is illustrated. For other data sets, the general approach is the same, except that either the files must be present in the current working directory, or the full pathname must be specified. In later notebooks, only sample data sets will be used.\n",
+ "\n",
+ "Here, the **Chi-SDOH** sample shape file is illustrated. The specific file names are:\n",
+ "\n",
+ "- **Chi-SDOH.shp,shx,dbf,prj**: a shape file (four files!) with socio-economic determinants of health for 2014 in 791 Chicago tracts\n",
+ "\n",
+ "In the other *spreg* notebooks, it is assumed that you will have installed the relevant example data sets using functionality from the *libpysal.examples* module. This is illustrated in detail here, but will not be repeated in the other notebooks. If the files are not loaded using the `libpysal.examples` functionality, they can be downloaded as individual files from https://github.com/lanselin/spreg_sample_data/ or https://geodacenter.github.io/data-and-lab/. You must then pass the full path in **infileshp** as the argument to the corresponding `geopandas.read_file` command.\n",
+ "\n",
+ "The input file is specified generically as **infileshp** (for the shape file). "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "4d4335bb",
+ "metadata": {
+ "collapsed": true,
+ "jupyter": {
+ "outputs_hidden": true
+ }
+ },
+ "outputs": [],
+ "source": [
+ "infileshp = \"Chi-SDOH.shp\" # input shape file with data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "45549ced",
+ "metadata": {},
+ "source": [
+ "## Accessing a PySAL Remote Sample Data Set "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bd0db16e-fe3c-45c6-85d9-178d42d016c5",
+ "metadata": {},
+ "source": [
+ "### Installing a remote sample data set"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "69b1e985",
+ "metadata": {},
+ "source": [
+ "All the needed files associated with a remote data set must be installed locally. The list of available remote data sets is shown by means of `libpysal.examples.available()`. When a data set is installed, the matching item in the **Installed** column will be given as **True**. \n",
+ "\n",
+ "If the sample data set has not yet been installed, **Installed** is initially set to **False**. For example, if the **chicagoSDOH** data set is not installed, item **79** in the list (**chicagoSDOH**) is given as **False**. 
Once the example data set is loaded, this will be changed to **True**.\n",
+ "\n",
+ "The example data set only needs to be loaded once. After that, it will be available for all future use in *PySAL* (not just in the current notebook), using the standard `get_path` functionality of `libpysal.examples`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "474197b2",
+ "metadata": {
+ "collapsed": true,
+ "jupyter": {
+ "outputs_hidden": true
+ }
+ },
+ "outputs": [],
+ "source": [
+ "libpysal.examples.available()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bee29a1d",
+ "metadata": {},
+ "source": [
+ "The contents of any `PySAL` example data set can be shown by means of `libpysal.examples.explain`. Note that this does **not** load the data set, but accesses the contents remotely (you will need an internet connection). As listed, the data set is for 791 census tracts in Chicago and it contains 65 variables."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "02d1d8a4",
+ "metadata": {
+ "collapsed": true,
+ "jupyter": {
+ "outputs_hidden": true
+ }
+ },
+ "outputs": [],
+ "source": [
+ "libpysal.examples.explain(\"chicagoSDOH\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f2748a31",
+ "metadata": {},
+ "source": [
+ "The example data set is installed locally by means of `libpysal.examples.load_example`, passing the name of the remote example. Note the specific path to which the data sets are downloaded; you will need that if you ever want to remove the data set."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f5f24ea0",
+ "metadata": {
+ "collapsed": true,
+ "jupyter": {
+ "outputs_hidden": true
+ }
+ },
+ "outputs": [],
+ "source": [
+ "libpysal.examples.load_example(\"chicagoSDOH\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9c737fc3",
+ "metadata": {},
+ "source": [
+ "At this point, when checking `available`, the data set is listed as **True** under **Installed**. As mentioned, the installation only needs to be carried out once."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a772500e",
+ "metadata": {
+ "collapsed": true,
+ "jupyter": {
+ "outputs_hidden": true
+ }
+ },
+ "outputs": [],
+ "source": [
+ "libpysal.examples.available()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "35566ba3",
+ "metadata": {},
+ "source": [
+ "### Reading Input Files from the Example Data Set"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "73afed82",
+ "metadata": {},
+ "source": [
+ "The actual path to the files contained in the local copy of the remote data set is found by means of `libpysal.examples.get_path`. This is then passed to the *geopandas* `read_file` function in the usual way. Here, this is a bit cumbersome, but the command can be simplified by specific statements in the module import, such as `from libpysal.examples import get_path`. The latter approach will be used in later notebooks, but here the full command is used. 
\n",
+ "\n",
+ "For example, the path to the input shape file is (this may differ somewhat depending on how and where PySAL is installed):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cc1d2a01",
+ "metadata": {
+ "collapsed": true,
+ "jupyter": {
+ "outputs_hidden": true
+ }
+ },
+ "outputs": [],
+ "source": [
+ "libpysal.examples.get_path(infileshp)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d11c88fd",
+ "metadata": {},
+ "source": [
+ "As mentioned earlier, if the example data are not installed locally by means of `libpysal.examples`, the `get_path` command must be replaced by an explicit reference to the correct file path name. This is easiest if the files are in the current working directory, in which case just specifying the file names in **infileshp** etc. is sufficient.\n",
+ "\n",
+ "The shape file is read by means of the *geopandas* `read_file` command, to which the full file pathname obtained from `libpysal.examples.get_path(infileshp)` is passed. To check if all is right, the shape of the data set (number of observations, number of variables) is printed (using the standard `print( )` command), as well as the list of variable names (columns in *pandas* speak). Details on dealing with *pandas* and *geopandas* data frames are covered in a later notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b8d93a92",
+ "metadata": {
+ "collapsed": true,
+ "jupyter": {
+ "outputs_hidden": true
+ }
+ },
+ "outputs": [],
+ "source": [
+ "inpath = libpysal.examples.get_path(infileshp)\n",
+ "dfs = gpd.read_file(inpath)\n",
+ "print(dfs.shape)\n",
+ "print(dfs.columns)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "604e8ab2",
+ "metadata": {},
+ "source": [
+ "### Removing an Installed Remote Sample Data Set"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6974e1b6",
+ "metadata": {},
+ "source": [
+ "If, for some reason, the installed remote **chicagoSDOH** data set is no longer needed, it can be removed by means of standard Linux commands (or equivalent, for other operating systems). For example, on a Mac or Linux-based system, one first moves to the directory where the files were copied to. This is the same path that was shown when `load_example` was executed. In the example for a Mac OS operating system, this was shown in **Downloading chicagoSDOH to /Users/luc/Library/Application Support/pysal/chicagoSDOH**.\n",
+ "\n",
+ "So, in a terminal window, one first moves to /Users/your_user_name/Library/'Application Support'/pysal (don't forget the quotes) on a Mac system (and equivalent for other operating systems). There, the **chicagoSDOH** directory will be present. It is removed by means of:\n",
+ " \n",
+ "`rm -r chicagoSDOH`\n",
+ " \n",
+ "Of course, once removed, it will have to be reinstalled if needed in the future."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "94d9a818-6ce8-4658-b51a-54ad7178c795",
+ "metadata": {},
+ "source": [
+ "## Practice"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b29d53d4-69c7-411f-887d-efb6bc6b426f",
+ "metadata": {},
+ "source": [
+ "If you want to use other PySAL data sets to practice the spatial regression functionality in *spreg*, make sure to install them using the instructions given in this notebook. For example, load the **Police** data set (item 52 in the list), which will be used as an example in later notebooks."
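+ ,
+ "\n",
+ "\n",
+ "As a minimal sketch of the steps (it reuses only commands shown above; an internet connection is needed for the download, and the item number may shift between *libpysal* releases):\n",
+ "\n",
+ "```python\n",
+ "import libpysal\n",
+ "\n",
+ "libpysal.examples.explain(\"Police\")       # inspect the contents remotely\n",
+ "libpysal.examples.load_example(\"Police\")  # one-time local install\n",
+ "libpysal.examples.get_path(\"police.shp\")  # full path to pass to read_file\n",
+ "```"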
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/2_data_input_output.ipynb b/notebooks/2_data_input_output.ipynb new file mode 100644 index 00000000..20ad78ff --- /dev/null +++ b/notebooks/2_data_input_output.ipynb @@ -0,0 +1,795 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6c2d40bf", + "metadata": {}, + "source": [ + "# Data Input/Output\n", + "\n", + "### Luc Anselin\n", + "\n", + "### 09/06/2024" + ] + }, + { + "cell_type": "markdown", + "id": "f411eb3f", + "metadata": {}, + "source": [ + "## Preliminaries" + ] + }, + { + "cell_type": "markdown", + "id": "3c764aed", + "metadata": {}, + "source": [ + "In this notebook, some elementary functionality is covered to carry out data input and output from and to different types of files. The key concept is a so-called *DataFrame*, a tabular representation of the data with observations as rows and variables as columns.\n", + "\n", + "This is implemented by means of *pandas* for generic text files (as well as many other formats) and *geopandas* for spatial data files (shape files or geojson files). The functionality will be illustrated with the **Police** sample data set that contains police expenditure data for Mississippi counties. It is assumed that this data has been installed using `libpysal.examples.load_example(\"Police\")`.\n", + "\n", + "A video recording is available from the GeoDa Center YouTube channel playlist *Applied Spatial Regression - Notebooks*, at https://www.youtube.com/watch?v=7yWOgPEBQmE&list=PLzREt6r1NenmhNy-FCUwiXL17Vyty5VL6&index=2." + ] + }, + { + "cell_type": "markdown", + "id": "936a0938", + "metadata": {}, + "source": [ + "### Modules Needed\n", + "\n", + "The work horse for spatial analysis in Python is the *PySAL* library. However, before addressing specific spatial functionality, the use of *pandas* and *geopandas* will be illustrated to load data into so-called data frames. In addition, *libpysal* is needed to access the sample data sets. All of these rely on *numpy* as a dependency.\n", + "\n", + "The full set of imports is shown below. Also, in this notebook, the `get_path` functionality of `libpysal.examples` is imported separately, without the rest of *libpysal*." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "db02c49b", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import os\n", + "os.environ['USE_PYGEOS'] = '0'\n", + "import geopandas as gpd\n", + "from libpysal.examples import get_path" + ] + }, + { + "cell_type": "markdown", + "id": "203e4930", + "metadata": {}, + "source": [ + "### Functions Used\n", + "\n", + "- from numpy:\n", + " - array\n", + " - shape\n", + " - tolist\n", + " - reshape\n", + "\n", + "- from pandas:\n", + " - read_csv\n", + " - head\n", + " - info\n", + " - list\n", + " - columns\n", + " - describe\n", + " - corr\n", + " - DataFrame\n", + " - concat\n", + " - to_csv\n", + " - drop\n", + " \n", + "- from geopandas:\n", + " - read_file\n", + " - to_file\n", + "\n", + "- from libpysal:\n", + " - get_path" + ] + }, + { + "cell_type": "markdown", + "id": "2c3416ff", + "metadata": {}, + "source": [ + "### Files\n", + "\n", + "Data input and output will be illustrated with the **Police** sample data set. This data set contains the same information in several different formats, such as csv, dbf, shp and geojson, which will be illustrated in turn. The following files will be used:\n", + "\n", + "- **police.shp,shx,dbf,prj**: shape file (four files) for 82 counties\n", + "- **police.csv**: the same data in csv text format\n", + "- **police.geojson**: the spatial layer in geojson format\n", + "\n", + "All the files are defined here, and referred to generically afterwards, so that it will be easy to re-run the commands for a separate application. The only changes needed would be the file names and/or variable names (if needed)." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "d82e2963", + "metadata": {}, + "outputs": [], + "source": [ + "infilecsv = \"police.csv\" # input csv file\n", + "outfilecsv = \"test1.csv\" # output csv file\n", + "infiledbf = \"police.dbf\" # input dbf file\n", + "outfiledbf = \"test2.csv\" # output dbf file\n", + "infileshp = \"police.shp\" # input shape file\n", + "outfileshp = \"test3.shp\" # output shape file\n", + "infilegeo = \"police.geojson\" # input geojson file\n", + "outfilegeo = \"test4.geojson\" # output geojson file" + ] + }, + { + "cell_type": "markdown", + "id": "cbbc97ea", + "metadata": {}, + "source": [ + "## Text Files\n", + "\n", + "### Input\n", + "\n", + "The input file for csv formatted data is **infilecsv**. In the example, this is the csv file **police.csv**. The path to the installed sample data set is found with `get_path` (note the form of the `import` statement, which means that the full prefix `libpysal.examples` is not needed).\n", + "\n", + "The pandas command `read_csv` creates a data frame, essentially a data table. One of its attributes is `shape`, the dimension of the table as number of rows (observations) and number of columns (variables). `df.head( )` lists the first few rows of the actual table." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d6dd6bae", + "metadata": {}, + "outputs": [], + "source": [ + "inpath = get_path(infilecsv)\n", + "df = pd.read_csv(inpath)\n", + "print(df.shape)\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "e56ab718", + "metadata": {}, + "source": [ + "### Contents" + ] + }, + { + "cell_type": "markdown", + "id": "e89780e4", + "metadata": {}, + "source": [ + "A technical way to see the contents of a *pandas* data frame is to use the `info` command. 
This gives the class, range of the index (used internally to refer to rows) and the data type of the variables (columns)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4e12746d", + "metadata": {}, + "outputs": [], + "source": [ + "df.info()" + ] + }, + { + "cell_type": "markdown", + "id": "7f9d50a2", + "metadata": {}, + "source": [ + "An arguably more intuitive sense of the contents of the data frame is to just list the names of all the variables. This can be accomplished several different ways, illustrating the flexibility of *pandas*. However, it is important to know what type of object the result of each operation yields. Depending on the approach, this could be a list, a pandas index object or a numpy array. Assuming the wrong type for the result can cause trouble.\n", + "\n", + "The following four approaches will each extract the column headers, but yield the result as a different type of object. This will determine how it can be further manipulated:\n", + "\n", + "- `list(df)`: creates a simple list with the variable names\n", + "\n", + "- `df.columns`: yields the columns as a pandas index object\n", + "\n", + "- `df.columns.values`: yields the columns as a numpy array\n", + "\n", + "- `df.columns.values.tolist( )`: yields the columns as a list, same as `list(df)`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c4ece4ae", + "metadata": {}, + "outputs": [], + "source": [ + "varlist1 = list(df)\n", + "print(varlist1)\n", + "type(varlist1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dce1b796", + "metadata": {}, + "outputs": [], + "source": [ + "varlist2 = df.columns\n", + "print(varlist2)\n", + "type(varlist2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "33196a5d", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "varlist3 = df.columns.values\n", + "print(varlist3)\n", + "type(varlist3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e08f9aa3", + "metadata": {}, + "outputs": [], + "source": [ + "varlist4 = df.columns.values.tolist()\n", + "print(varlist4)\n", + "type(varlist4)" + ] + }, + { + "cell_type": "markdown", + "id": "31293509", + "metadata": {}, + "source": [ + "### Descriptive Statistics\n", + "\n", + "A quick summary of the data set is provided by the `describe` command." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ba2bfbb0", + "metadata": {}, + "outputs": [], + "source": [ + "df.describe()" + ] + }, + { + "cell_type": "markdown", + "id": "5936652f", + "metadata": {}, + "source": [ + "### Extracting Variables" + ] + }, + { + "cell_type": "markdown", + "id": "d167db2c", + "metadata": {}, + "source": [ + "Variables (columns) can easily be extracted from a dataframe by listing their names in a list and subsetting the data frame (there are other ways as well, but they will not be considered here). It is important to keep in mind that the result is a different view of the same data frame, which may not be what is expected. In fact, in many applications in the context of *spreg*, the result should be a numpy array. This requires an extra step to cast the data frame to an array object.\n", + "\n", + "Also, in many contexts, an additional variable may need to be added to the data frame. For example, this will be needed for regression residuals and predicted values in a later notebook. 
To illustrate some of the steps involved, the variable **COLLEGE** will be turned into its complement (i.e., percent population without a college degree) and subsequently added to the data frame. To illustrate some descriptive statistics, **POLICE** will be extracted as well.\n",
+ "\n",
+ "First, the variable names are put in a list to subset the data frame and check the type. Make sure to use double brackets: the argument to the subset operator [ ] is a list, so the form is [[list of variable names in quotes, separated by commas]]. The result is a *pandas* data frame or series (one variable).\n",
+ "\n",
+ "Note: if you want to do this for your own data set, possibly using different variables and different expressions, you will need to adjust the code below accordingly. Typically, this is avoided in these notebooks, but here there is no option to make things totally generic."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7ef84f45",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df1 = df[['POLICE','COLLEGE']]\n",
+ "type(df1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "964daaa4",
+ "metadata": {},
+ "source": [
+ "A more elegant approach, and one that will make it much easier to reuse the code for different data sets and variables, is to enter the variable names in a list first, and then pass that to subset the data frame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ab77cecf",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "varnames = ['POLICE','COLLEGE']\n",
+ "df2 = df[varnames]\n",
+ "type(df2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "db80a141",
+ "metadata": {},
+ "source": [
+ "At this point, it is much more meaningful to get the descriptive statistics using `describe`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6cddcd1a",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "df2.describe()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f67f53ea",
+ "metadata": {},
+ "source": [
+ "A correlation coefficient is obtained by means of the `corr` method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "83f15c59",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df2.corr()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8e99466c",
+ "metadata": {},
+ "source": [
+ "### Extracting Variables to a Numpy Array"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "10df9613",
+ "metadata": {},
+ "source": [
+ "As mentioned, when using variables in the context of **spreg** routines, they will often need to be numpy arrays, not a data frame. This is accomplished by means of the `numpy.array` function (`np.array` in the notation used here). The `shape` attribute is a check to make sure that the resulting matrices have the correct format. In the example:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1eb1d639",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "x1 = np.array(df[varnames])\n",
+ "print(x1.shape)\n",
+ "type(x1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cdab7c0e",
+ "metadata": {},
+ "source": [
+ "### Computations"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0255cfee",
+ "metadata": {},
+ "source": [
+ "New variables (columns) can be added to an existing data frame by means of straightforward element-by-element computations. However, to do this within the data frame structure is a bit cumbersome, since the data frame name needs to be included for each variable. 
On the other hand, the result is immediately attached to the data frame. \n",
+ "\n",
+ "Alternatively, the computations can be carried out using the numpy array and subsequently attached to the data frame. However, the result of a one-dimensional computation is a one-dimensional numpy array, not a row or a column vector. To obtain the latter, the `reshape` command needs to be used.\n",
+ "\n",
+ "For example, to compute the complement of the percentage with a college degree (in column 1 of array **x1**), the second column of the array is subtracted from 100. The element-by-element computation gives the desired result, but not the correct shape."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4ee6a4eb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "noncollege = 100.0 - x1[:,1]\n",
+ "noncollege"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "87b3dc37",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "noncollege.shape"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "041ced4c",
+ "metadata": {},
+ "source": [
+ "The correct dimension is obtained by means of `reshape(-1,1)`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2beb3219",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "noncollege = noncollege.reshape(-1,1)\n",
+ "print(noncollege.shape)\n",
+ "noncollege[0:5,:]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4af49803",
+ "metadata": {},
+ "source": [
+ "Note the extra brackets in the (82,1) column vector compared to the (82, ) numpy array above."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "16331996",
+ "metadata": {},
+ "source": [
+ "### Concatenating Data Frames"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "47a9ceb6",
+ "metadata": {},
+ "source": [
+ "In order to add the result of the matrix calculation to the data frame, two steps are involved. First, the numpy array is turned into a data frame using `pandas.DataFrame`, making sure to give meaningful names to the columns by means of the `columns` argument. Then the `pandas.concat` function is applied to join the two data frames together. One can of course combine the two operations into one line, but here they are kept separate for clarity. **NONCOLLEGE** is added as the last variable in the data frame.\n",
+ "\n",
+ "Note that `axis=1` is set as an argument to the `concat` function to make sure a column is added (`axis=0` is to add a row)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7b415460",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dd = pd.DataFrame(noncollege,columns=['NONCOLLEGE'])\n",
+ "df = pd.concat([df,dd],axis=1)\n",
+ "print(df.columns)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "48bae79a",
+ "metadata": {},
+ "source": [
+ "### Output"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8183fbf0",
+ "metadata": {},
+ "source": [
+ "If desired, the new data frame can be written to a csv file using the `to_csv` command. The only required argument is the filename. For example, with the generic file name **outfilecsv** as defined at the top of the notebook, the file will be written to the current working directory. Its contents can be examined with any text editor or by loading it into a spreadsheet program. \n",
+ "\n",
+ "To avoid writing the index numbers as a first unnamed column (i.e., the default row names), add the extra argument `index = False`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "a39e0475",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df.to_csv(outfilecsv,index=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d5bf3069",
+ "metadata": {},
+ "source": [
+ "### dBase Files (dbf)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "65673215",
+ "metadata": {},
+ "source": [
+ "A common (but old) format for tabular databases is the dBase format, with file extension dbf. Even though it is old (and, arguably, out of date), this format is still quite useful because it is used to store the data (attributes) in one of the common spatial data formats, the shape file popularized by ESRI (see below).\n",
+ "\n",
+ "*pandas* is currently not able to read data from a dbf file directly into a data frame. Specialized packages exist that implement this functionality (like *simpledbf*). However, *geopandas*, considered in more detail below, also reads dbf files by means of its `read_file` command. No special arguments are needed, since the file format is derived from the file extension.\n",
+ "\n",
+ "For example, to read the data from **police.dbf** (the same as in **police.csv**), the path to the sample data file **infiledbf** is found with `get_path` and passed to the `geopandas.read_file` command. The result is a **GeoDataFrame**, not a regular **DataFrame**. This is an artifact of the dbf file being in the same directory as the shape file. The same command applied to the dbf file in isolation will yield a **DataFrame**."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e8ca9603",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "inpath = get_path(infiledbf)\n",
+ "dfdb = gpd.read_file(inpath)\n",
+ "print(dfdb.shape)\n",
+ "print(type(dfdb))\n",
+ "print(dfdb.columns)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "56c081ac",
+ "metadata": {},
+ "source": [
+ "A close look at the dimensions and the columns reveals an additional column (22 compared to 21) with column name `geometry`. This can be removed by means of the `drop(columns = \"geometry\")` command."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "70540bdd",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dfdb = dfdb.drop(columns = 'geometry')\n",
+ "print(dfdb.shape)\n",
+ "print(type(dfdb))\n",
+ "print(dfdb.columns)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "53dc84ac",
+ "metadata": {},
+ "source": [
+ "Now, the dimension is the same as for the csv file and the `geometry` column has disappeared. Also, the `type` of the result is a regular **DataFrame**.\n",
+ "\n",
+ "As mentioned, if the dbf file is in a directory without the presence of a spatial layer, the `geometry` column will not be present. In that case, the result is a regular **DataFrame**, NOT a **GeoDataFrame**.\n",
+ "\n",
+ "It is important to keep this in mind, since *pandas* currently has no support for writing dbf files, whereas *geopandas* only has support for writing dbf files that contain a `geometry` column. However, a *pandas* data frame can be written to a csv file as seen before, using `to_csv`. The input dbf file can thus be converted to a csv file, but any changes cannot be saved to another dbf file.\n",
+ "\n",
+ "In general, working with dbf files in isolation is to be discouraged."
+ ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "4f1e648e", + "metadata": {}, + "outputs": [], + "source": [ + "dfdb.to_csv(outfiledbf,index=False)" + ] + }, + { + "cell_type": "markdown", + "id": "c5e63b68", + "metadata": {}, + "source": [ + "## Spatial Data Files" + ] + }, + { + "cell_type": "markdown", + "id": "2ae3e939", + "metadata": {}, + "source": [ + "### Spatial Data\n", + "\n", + "Spatial data are characterized by the combination of locational information (the precise\n", + "definition of points, lines or areas) and so-called attribute information (variables). \n", + "\n", + "There are many formats to store spatial information, in files as well as in relational databases. To keep\n", + "things simple, first the so-called *shape file* format is considered, a standard supported by ESRI, one of the \n", + "major commercial GIS vendors. In addition, *geojson* will be covered as well, since it is an increasingly common open source format." + ] + }, + { + "cell_type": "markdown", + "id": "5e70392d", + "metadata": {}, + "source": [ + "### Reading a shape file\n", + "\n", + "The terminology is a bit confusing, since there is no such thing as *one* shape file, but there is instead\n", + "a collection of three (or four) files. One file has the extension **.shp**, one **.shx**, one **.dbf**, and\n", + "one **.prj** (with the projection information). The first three are required, the fourth one is optional,\n", + "but highly recommended. The files should all be in the same directory and have the same main file name.\n", + "\n", + "In Python, the easiest way to read shape files is to use *geopandas*. The command is `read_file`, followed by the file pathname in parentheses. The program is smart enough to figure out the file format from the file extension *.shp*. As we saw before for the dbf format, the result is a geopandas data frame, a so-called **GeoDataFrame**, say **dfs**, which is a *pandas* **DataFrame** with an additional column for the geometry.\n", + "\n", + "All the standard pandas commands also apply to a geopandas data frame.\n", + "\n", + "The example uses the **police.shp** sample file as the input file, as specified in `infileshp` at the top of the notebook. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e85db1bb", + "metadata": {}, + "outputs": [], + "source": [ + "inpath = get_path(infileshp)\n", + "dfs = gpd.read_file(inpath)\n", + "print(dfs.shape)" + ] + }, + { + "cell_type": "markdown", + "id": "ed98200b", + "metadata": {}, + "source": [ + "Note how the data frame has one more column than the one created from the csv file. This is the same as in the dbf example above. The last column is **geometry**." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "20d61cd6", + "metadata": {}, + "outputs": [], + "source": [ + "print(dfs.columns)" + ] + }, + { + "cell_type": "markdown", + "id": "237a031d", + "metadata": {}, + "source": [ + "### Creating New Variables" + ] + }, + { + "cell_type": "markdown", + "id": "81350492", + "metadata": {}, + "source": [ + "Just as for a standard pandas data frame, variables can be transformed, new variables created and data frames merged. The commands are the same as before and will not be repeated here." 
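+ ,
+ "\n",
+ "\n",
+ "As a one-line reminder (a minimal sketch, reusing the **COLLEGE** percentage available in this data set), a new column can be attached directly to the **GeoDataFrame**:\n",
+ "\n",
+ "```python\n",
+ "# element-by-element computation; the result is stored as a new column\n",
+ "dfs[\"NONCOLLEGE\"] = 100.0 - dfs[\"COLLEGE\"]\n",
+ "```"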
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e1d8cdd2",
+ "metadata": {},
+ "source": [
+ "### Reading a GeoJSON File"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "28bc5379",
+ "metadata": {},
+ "source": [
+ "Reading any of the supported spatial formats is implemented by the same `read_file` command. As mentioned, *geopandas* figures out the right format from the file extension. The result is identical to the one for the shape file."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fbf89311",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "inpath = get_path(infilegeo)\n",
+ "dfg = gpd.read_file(inpath)\n",
+ "print(dfg.shape)\n",
+ "print(type(dfg))\n",
+ "print(dfg.columns)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e0598e91",
+ "metadata": {},
+ "source": [
+ "### Writing a GeoDataFrame"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c7a7c818",
+ "metadata": {},
+ "source": [
+ "The output is accomplished by the `to_file` command. This supports many different output formats, but the default is the ESRI shape file, so we do not have to specify any arguments other than the filename. Here, we use the output file name specified in `outfileshp`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "ff7d08e7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dfs.to_file(outfileshp)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "15ff186d",
+ "metadata": {},
+ "source": [
+ "Writing a geojson file works in exactly the same way, for example, using the output file specified in **outfilegeo**."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "fef50cad",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dfg.to_file(outfilegeo)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1ad019ff",
+ "metadata": {},
+ "source": [
+ "## Practice"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "72ca1664",
+ "metadata": {},
+ "source": [
+ "Use your own data set or one of the PySAL sample data sets to load a spatial data frame, create some new variables, optionally get descriptive statistics and write out an updated data set. This type of operation will be used frequently in the course of the regression analysis, for example, to add predicted values and/or residuals to a spatial layer."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.4"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/notebooks/3_basic_mapping.ipynb b/notebooks/3_basic_mapping.ipynb
new file mode 100644
index 00000000..4361d733
--- /dev/null
+++ b/notebooks/3_basic_mapping.ipynb
@@ -0,0 +1,674 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "6c2d40bf",
+ "metadata": {},
+ "source": [
+ "# Basic Mapping\n",
+ "\n",
+ "### Luc Anselin\n",
+ "\n",
+ "### 09/06/2024"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f411eb3f",
+ "metadata": {},
+ "source": [
+ "## Preliminaries"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3c764aed",
+ "metadata": {},
+ "source": [
+ "There are many ways to create beautiful maps in Python using packages such as *folium* or *plotly*. In this notebook, the `plot` functionality of *geopandas* is illustrated, which is sufficient for most of our purposes. 
The functionality will be illustrated with the **Police** sample data set that contains police expenditure data for Mississippi counties. It is assumed that this data has been installed using `libpysal.examples.load_example(\"Police\")`.\n",
+ "\n",
+ "A video recording is available from the GeoDa Center YouTube channel playlist *Applied Spatial Regression - Notebooks*, at https://www.youtube.com/watch?v=rZ1Mw-hZcMY&list=PLzREt6r1NenmhNy-FCUwiXL17Vyty5VL6&index=3."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "936a0938",
+ "metadata": {},
+ "source": [
+ "### Modules Needed\n",
+ "\n",
+ "As before, the main modules are *geopandas* and *libpysal*. Specifically, *libpysal.examples* is used to get the path to the sample data. In addition, to save the maps to a file, *matplotlib.pyplot* is needed.\n",
+ "\n",
+ "The full set of imports is shown below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "id": "db02c49b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "os.environ['USE_PYGEOS'] = '0'\n",
+ "import geopandas as gpd\n",
+ "import matplotlib.pyplot as plt\n",
+ "from libpysal.examples import get_path"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "203e4930",
+ "metadata": {},
+ "source": [
+ "### Functions Used\n",
+ "\n",
+ "- from geopandas:\n",
+ " - read_file\n",
+ " - plot\n",
+ "\n",
+ "- from libpysal:\n",
+ " - get_path\n",
+ "\n",
+ "- from matplotlib.pyplot:\n",
+ " - savefig"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2c3416ff",
+ "metadata": {},
+ "source": [
+ "### Files\n",
+ "\n",
+ "The mapping functionality will be illustrated with the same **Police** sample data set as used in the previous notebook. The following files will be used:\n",
+ "\n",
+ "- **police.shp,shx,dbf,prj**: shape file (four files) for 82 counties\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "d82e2963",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "infileshp = \"police.shp\" # input shape file\n",
+ "inpath = get_path(infileshp)\n",
+ "dfs = gpd.read_file(inpath)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2ac636ba",
+ "metadata": {},
+ "source": [
+ "## Getting Started"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7c30a057",
+ "metadata": {},
+ "source": [
+ "### Default Map"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "52cf67ae",
+ "metadata": {},
+ "source": [
+ "Before delving into customization, the default choropleth map created by the `plot` function applied to a **GeoDataFrame** is illustrated. A bare-bones implementation only requires the variable (column) to be mapped and the argument `legend = True`. Without the latter, there will still be a map, but it will not have a legend, so there will be no guide as to what the colors mean."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b2f412ce",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dfs.plot('POLICE',legend=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "01ce3b94",
+ "metadata": {},
+ "source": [
+ "Not exactly the prettiest thing in the world. A continuous color ramp as seen here is not recommended by cartographers. Also, the classification is such that too many observations have seemingly the same color. Finally, there is also this strange mention of **<Axes: >**.\n",
+ "\n",
+ "There are two important types of modifications that can be considered. One pertains to the fundamental characteristics of a choropleth map; the other to the way *matplotlib* constructs visualizations under the hood. 
The *geopandas* library relies on *matplotlib*, so there is no need to `import` the latter explicitly, except when one wants to save the maps to a file. In any case, it helps to understand the *matplotlib* logic. This is considered first."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c12c0184",
+ "metadata": {},
+ "source": [
+ "### Matplotlib Logic"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "78e5979e",
+ "metadata": {},
+ "source": [
+ "The *matplotlib* library is extremely powerful and allows just about any type of customized visualization. It starts by setting up the basic parameters and then builds a graphic representation layer by layer. The terminology may seem a bit strange at first, but after a while, it becomes more familiar.\n",
+ "\n",
+ "A plot is initialized by assigning some parameters to the tuple `fig, ax`. It is important to realize that `fig` is about the figure makeup and `ax` is about the actual plots. For example, `fig` is used to specify how many subplots there need to be, how they are arranged and what their size is. Since the examples used here and in later notebooks will only produce a single plot, the `fig` aspect can be ignored and only `ax` is needed. In fact, for simple plots such as the maps in our applications, the specification of `ax` as such is not needed and the `plot` function can be applied directly to the GeoDataFrame. However, it remains important that the plot object is referred to as `ax` in many operations."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "579ab0f0",
+ "metadata": {},
+ "source": [
+ "An alternative way to set up the default map just shown is to explicitly assign it to an object `ax`, as `ax = dfs.plot( )` with the same arguments as before. To remove the x-y coordinates and box around the map, the method `set_axis_off()` is applied to the `ax` object. Using this setup also removes the **<Axes: >** listing. Otherwise, everything is still the same as before."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d50f07df",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ax = dfs.plot('POLICE',legend = True)\n",
+ "ax.set_axis_off()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0d881ef0",
+ "metadata": {},
+ "source": [
+ "Note that the same result can be obtained without the explicit assignment to `ax` by simply applying the method to the `plot` object, as in the example below. Typically, the more explicit assignment is considered to be more readable, but it is mostly a matter of preference."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "09d54b24",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dfs.plot('POLICE',legend=True).set_axis_off()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "392346c2",
+ "metadata": {},
+ "source": [
+ "## Map Design Characteristics"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a505e9bf",
+ "metadata": {},
+ "source": [
+ "The purpose of a choropleth or thematic map is to visualize the spatial distribution of a variable over areal units. Choropleth comes from the Greek *choros*, which stands for region, so it is a map for regions. 
For our purposes, the proper design of a map has three important characteristics, each of which translates into arguments to the `plot` function:\n",
+ "\n",
+ "- classification\n",
+ "\n",
+ "- color\n",
+ "\n",
+ "- legend"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2c3619b7",
+ "metadata": {},
+ "source": [
+ "### Classification"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a3db6d84",
+ "metadata": {},
+ "source": [
+ "Arguably the most important characteristic is the classification used, i.e., how the continuous distribution of a given variable gets translated into a small number of discrete categories, or bins. This is exactly the same issue encountered in the design of histogram bins.\n",
+ "\n",
+ "The assignment of observations to distinct bins is done by the *mapclassify* library, which is part of the *PySAL* family. However, this is done under the hood by *geopandas* so that no separate `import` statement is needed.\n",
+ "\n",
+ "The classification is set by means of the `scheme` argument. Common options are `Quantiles` (for a quantile map), `EqualInterval` (for an equal intervals map), `NaturalBreaks` (for a natural breaks map), `StdMean` (for a standard deviational map), and `BoxPlot` (for a box map). All but the last two classifications require an additional argument for the number of bins, `k`. This is not needed for the standard deviational map and the box map, for which the breakpoints are derived from the data, respectively the standard deviation and the quartiles/hinge.\n",
+ "\n",
+ "The default hinge for the box map is 1.5 times the interquartile range. Other values for the hinge can be specified by setting a different value for the argument `hinge`, but this is typically not necessary. However, to pass this to the *geopandas* `plot` function, it cannot just be set as `hinge = 3.0` as in *mapclassify*. In *geopandas* it is necessary to pass this in a `classification_kwds` dictionary, where the relevant parameters are set. For example, this would be `classification_kwds = {\"hinge\": 3.0}` for a hinge of 3 times the interquartile range.\n",
+ "\n",
+ "The default for the standard deviational map is to show all observations within one standard deviation below and above the mean as one category. Separating observations below and above the mean can be accomplished by setting the argument `anchor` to `True`. Again, this is done by means of the `classification_kwds` dictionary.\n",
+ "\n",
+ "Full details on all the classifications available through *mapclassify* and their use in *geopandas* can be found at https://geopandas.org/en/stable/docs/user_guide/mapping.html# and https://pysal.org/mapclassify/api.html.\n",
+ "\n",
+ "Each of the five cases is illustrated in turn. Note that the `column` argument is used to designate the variable to be mapped.\n",
+ "\n",
+ "The placement of the legend is managed by means of the `legend_kwds` argument (similar to `classification_kwds`). This is a dictionary that specifies aspects such as the location of the legend and how it is positioned relative to its anchor point. It also makes it possible to set a `title` for the legend, e.g., to set it to the variable that is being mapped.\n",
+ "\n",
+ "In the examples, the following arguments are used: `legend_kwds={\"loc\":\"center left\",\"bbox_to_anchor\":(1,0.5), \"title\": \"\"}`. This is not totally intuitive, but it works. 
See https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.legend.html#matplotlib.axes.Axes.legend for details about the various legend customizations.\n",
+ "\n",
+ "Also note that the map uses the default color map. More appropriate color maps will be considered below."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bcae4d0a",
+ "metadata": {},
+ "source": [
+ "#### Quantile Map\n",
+ "\n",
+ "A simple six-category quantile map is illustrated by setting `scheme = \"Quantiles\"` and `k=6`. The `legend` arguments now also include a `title`. In addition, two `ax` methods are used for a minor customization: `ax.set_title` to give the map a title and, as before, `ax.set_axis_off` to get rid of the box with x-y coordinates."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a7724f4c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ax = dfs.plot(\n",
+ " column = 'POLICE',\n",
+ " scheme = 'Quantiles',\n",
+ " k = 6,\n",
+ " legend = True,\n",
+ " legend_kwds={\"loc\":\"center left\",\"bbox_to_anchor\":(1,0.5), \"title\" : \"Police\"}\n",
+ ")\n",
+ "ax.set_title(\"Quantiles\")\n",
+ "ax.set_axis_off()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "734148b7",
+ "metadata": {},
+ "source": [
+ "#### Maps with Set Number of Bins"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a96e9a4b",
+ "metadata": {},
+ "source": [
+ "Rather than repeating the single command for each type of map that needs the argument `k`, a small loop is constructed that creates each in turn. This is accomplished by putting the name for the respective `scheme` in a list and using that same name as the map title. The three types are `Quantiles`, `EqualInterval` and `NaturalBreaks`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ca52a103",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "schemek = [\"Quantiles\",\"EqualInterval\",\"NaturalBreaks\"]\n",
+ "for i in schemek:\n",
+ " ax = dfs.plot(\n",
+ " column = 'POLICE',\n",
+ " scheme = i,\n",
+ " k = 6,\n",
+ " legend = True,\n",
+ " legend_kwds={\"loc\":\"center left\",\"bbox_to_anchor\":(1,0.5), \"title\": \"Police\"}\n",
+ " )\n",
+ " ax.set_title(i)\n",
+ " ax.set_axis_off()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6fbc2acd",
+ "metadata": {},
+ "source": [
+ "Note the contrast in the visualization of the spatial distribution between the different classifications. It is important to keep in mind that each has pros and cons. For example, the quantile map yields an equal number of observations in each category, but the range of the categories can vary substantially, resulting in the grouping of very disparate observations. In the example, this is the case for the top category, which ranges from 1,275 to 10,972.\n",
+ "\n",
+ "On the other hand, the range in an equal intervals map is the same for all categories, but as a result some bins may have very few or very many observations, as is the case here for the lowest bin.\n",
+ "\n",
+ "Finally, a natural breaks map uses an optimization criterion (essentially equivalent to k-means on one variable) to determine the grouping of observations. Both the number of observations in each bin and the range of the bins are variable."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "62839ace",
+ "metadata": {},
+ "source": [
+ "#### Maps with a Predetermined Number of Bins\n",
+ "\n",
+ "The standard deviational map and box map have a pre-set number of bins, depending on, respectively, standard deviational units and quantiles/interquartile range. 
Again, they are illustrated using a small loop."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3b56d7d8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "schemenok = [\"StdMean\",\"BoxPlot\"]\n",
+ "for i in schemenok:\n",
+ " ax = dfs.plot(\n",
+ " column = 'POLICE',\n",
+ " scheme = i,\n",
+ " legend = True,\n",
+ " legend_kwds={\"loc\":\"center left\",\"bbox_to_anchor\":(1,0.5), \"title\": 'Police'}\n",
+ " )\n",
+ " ax.set_title(i)\n",
+ " ax.set_axis_off()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c2b0cd96",
+ "metadata": {},
+ "source": [
+ "Both types of maps are designed to highlight outliers. In the standard deviational map, these are observations more than two standard deviations away from the mean; in the box map, the outliers are outside the hinge (1.5 times the interquartile range from the median). This can be customized by setting a different value for the hinge through the `classification_kwds` argument. For example, selecting only the most extreme observations is achieved by setting `classification_kwds = {\"hinge\": 3.0}`, as illustrated below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5de45eaa",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ax = dfs.plot(\n",
+ " column = 'POLICE',\n",
+ " scheme = 'BoxPlot',\n",
+ " k = 6,\n",
+ " classification_kwds = {'hinge': 3.0},\n",
+ " legend = True,\n",
+ " legend_kwds={\"loc\":\"center left\",\"bbox_to_anchor\":(1,0.5), \"title\": \"Police\"}\n",
+ ")\n",
+ "ax.set_title(\"Box Map\")\n",
+ "ax.set_axis_off()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "df595929",
+ "metadata": {},
+ "source": [
+ "A standard deviational map with the categories below and above the mean shown is implemented with `classification_kwds = {\"anchor\" : True}`, as shown below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7e17b901",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ax = dfs.plot(\n",
+ " column = 'POLICE',\n",
+ " scheme = 'StdMean',\n",
+ " k = 6,\n",
+ " classification_kwds = {'anchor': True},\n",
+ " legend = True,\n",
+ " legend_kwds = {\"loc\":\"center left\",\"bbox_to_anchor\":(1,0.5), \"title\": 'Police'}\n",
+ ")\n",
+ "ax.set_title(\"Standard Deviational Map\")\n",
+ "ax.set_axis_off()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0a4e8ede",
+ "metadata": {},
+ "source": [
+ "Whereas the first three types of classifications have a color scheme that suggests a progression from low to high values, a so-called *sequential* legend, the standard deviational map and box map focus on differences from a central value. This requires a color map that highlights the move away from the center, a so-called *diverging* legend. In the examples shown so far, the categories were shown with the default sequential color map, which is not appropriate. The needed customizations are considered next."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "934e2cab",
+ "metadata": {},
+ "source": [
+ "### Color Map"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4ee73f79",
+ "metadata": {},
+ "source": [
+ "The color scheme for the map is set by means of the `cmap` argument. This refers to a *matplotlib* color map, i.e., a pre-determined range of colors optimized for a particular purpose. For example, this allows for a different color map to represent a sequential vs. 
a diverging legend.\n",
+ "\n",
+ "The full range of color maps can be found at https://matplotlib.org/stable/users/explain/colors/colormaps.html.\n",
+ "\n",
+ "For our purposes, a good sequential color map uses a gradation that goes from light to dark, either in the same color, such as `cmap=\"Blues\"`, or moving between colors, such as `cmap=\"YlOrRd\"`. For a diverging legend, going from one extreme color to another is preferred, e.g., dark blue to light blue and then to light red and dark red, as in `cmap=\"bwr\"`, or even more extreme, as in `cmap=\"seismic\"`.\n",
+ "\n",
+ "Some examples are shown below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d4eaf9c1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ax = dfs.plot(\n",
+ " column = 'POLICE',\n",
+ " scheme = 'Quantiles',\n",
+ " k = 6,\n",
+ " cmap = 'Blues',\n",
+ " legend = True,\n",
+ " legend_kwds={\"loc\":\"center left\",\"bbox_to_anchor\":(1,0.5), \"title\": 'Police'}\n",
+ ")\n",
+ "ax.set_title(\"Quantiles\")\n",
+ "ax.set_axis_off()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e523a1f3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ax = dfs.plot(\n",
+ " column = 'POLICE',\n",
+ " scheme = 'Quantiles',\n",
+ " k = 6,\n",
+ " cmap = 'YlOrRd',\n",
+ " legend = True,\n",
+ " legend_kwds={\"loc\":\"center left\",\"bbox_to_anchor\":(1,0.5), \"title\": 'Police'}\n",
+ ")\n",
+ "ax.set_title(\"Quantiles\")\n",
+ "ax.set_axis_off()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bfd213e8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ax = dfs.plot(\n",
+ " column = 'POLICE',\n",
+ " scheme = 'BoxPlot',\n",
+ " cmap = 'seismic',\n",
+ " legend = True,\n",
+ " legend_kwds={\"loc\":\"center left\",\"bbox_to_anchor\":(1,0.5), \"title\": 'Police'}\n",
+ ")\n",
+ "ax.set_title(\"Box Map\")\n",
+ "ax.set_axis_off()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bc43ce36",
+ "metadata": {},
+ "source": [
+ "But notice what happens when this is applied to the standard deviational map with `cmap = \"bwr\"`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c658bd19",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ax = dfs.plot(\n",
+ " column = 'POLICE',\n",
+ " scheme = 'StdMean',\n",
+ " cmap = 'bwr',\n",
+ " legend = True,\n",
+ " legend_kwds={\"loc\":\"center left\",\"bbox_to_anchor\":(1,0.5), \"title\": 'Police'}\n",
+ ")\n",
+ "ax.set_title(\"Standard Deviational Map\")\n",
+ "ax.set_axis_off()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "75ec5a5a",
+ "metadata": {},
+ "source": [
+ "What happened? Many of the counties are invisible. The reason is that there is no borderline specified for the map. This final customization is considered next."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bb66c005",
+ "metadata": {},
+ "source": [
+ "### Final Customization"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "da719cbc",
+ "metadata": {},
+ "source": [
+ "As mentioned, the full range of *matplotlib* customizations is available to manipulate legends, colors and placement. For our purposes, one more map-specific element is of interest. As seen in the previous examples, the border between polygons is not clear or even non-existent. \n",
+ "\n",
+ "This can be fixed by setting the `edgecolor` and associated `linewidth` attributes. For example, with `edgecolor = \"Black\"`, the standard deviational map becomes more meaningful."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a6392b64", + "metadata": {}, + "outputs": [], + "source": [ + "ax = dfs.plot(\n", + " column = 'POLICE',\n", + " scheme = 'StdMean',\n", + " cmap = 'bwr',\n", + " edgecolor = \"Black\",\n", + " legend = True,\n", + " legend_kwds={\"loc\":\"center left\",\"bbox_to_anchor\":(1,0.5), \"title\": 'Police'}\n", + ")\n", + "ax.set_title(\"Standard Deviational Map\")\n", + "ax.set_axis_off()" + ] + }, + { + "cell_type": "markdown", + "id": "0fd86ac4", + "metadata": {}, + "source": [ + "#### Saving the Map to a File" + ] + }, + { + "cell_type": "markdown", + "id": "2f317018", + "metadata": {}, + "source": [ + "So far, the maps are generated in the notebook, but are not separately available. To save a specific map to a file, the `matplotlib.pyplot.savefig` command is used. For example, to save the standard deviational map (or any other map) to a png format file, only the filename needs to be specified as an argument to `plt.savefig`. Optionally, to get higher quality figures, the number of dots per inch can be set by means of `dpi`. \n", + "\n", + "This is illustrated for the standard deviational map where a more subtle border line is obtained by setting the thickness with `linewidth = 0.2`. The quality is set to `dpi = 600`.\n", + "\n", + "The file will be in the current working directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "61991782", + "metadata": {}, + "outputs": [], + "source": [ + "ax = dfs.plot(\n", + " column = 'POLICE',\n", + " scheme = 'StdMean',\n", + " cmap = 'bwr',\n", + " edgecolor = \"Black\",\n", + " linewidth = 0.2,\n", + " legend = True,\n", + " legend_kwds={\"loc\":\"center left\",\"bbox_to_anchor\":(1,0.5), \"title\": 'Police'}\n", + ")\n", + "ax.set_title(\"Standard Deviational Map\")\n", + "ax.set_axis_off()\n", + "plt.savefig(\"police_stdmean.png\",dpi=600)" + ] + }, + { + "cell_type": "markdown", + "id": "27a262bc", + "metadata": {}, + "source": [ + "Finally, a map with just the county borders is obtained with the `boundary.plot` command, where the color of the border line is controlled by `edgecolor` and the line thickness by `linewidth`, as before." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "98d3c839", + "metadata": {}, + "outputs": [], + "source": [ + "ax = dfs.boundary.plot(\n", + " edgecolor = \"Black\",\n", + " linewidth = 0.2,\n", + ")\n", + "ax.set_title(\"Map of County Boundaries\")\n", + "ax.set_axis_off()" + ] + }, + { + "cell_type": "markdown", + "id": "1ad019ff", + "metadata": {}, + "source": [ + "## Practice" + ] + }, + { + "cell_type": "markdown", + "id": "72ca1664", + "metadata": {}, + "source": [ + "Use your own data set or one of the PySAL sample data sets to load a spatial data frame and experiment with various map types, color schemes and other customizations. Save each map to a file for inclusion in papers, reports, etc." 
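+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f2c7d9e1",
+ "metadata": {},
+ "source": [
+ "As a starting point for this exercise, a minimal sketch is included below. It loops over a few of the map types and color maps covered in this notebook and saves each result to a png file. The particular combinations of scheme and color map, as well as the output file names, are only suggestions and should be adjusted to your own data set and variables."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8e64b1a7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "# illustrative combinations of classification scheme and color map\n",
+ "mapspecs = [('Quantiles','Blues'),('BoxPlot','bwr'),('StdMean','bwr')]\n",
+ "for scheme, cm in mapspecs:\n",
+ "    ax = dfs.plot(\n",
+ "        column = 'POLICE',\n",
+ "        scheme = scheme,\n",
+ "        cmap = cm,\n",
+ "        edgecolor = \"Black\",\n",
+ "        linewidth = 0.2,\n",
+ "        legend = True,\n",
+ "        legend_kwds={\"loc\":\"center left\",\"bbox_to_anchor\":(1,0.5), \"title\": 'Police'}\n",
+ "    )\n",
+ "    ax.set_title(scheme)\n",
+ "    ax.set_axis_off()\n",
+ "    # save each map to its own file in the working directory\n",
+ "    plt.savefig(\"police_\" + scheme.lower() + \".png\",dpi=600)"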
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/4_spatial_weights.ipynb b/notebooks/4_spatial_weights.ipynb new file mode 100644 index 00000000..051034bc --- /dev/null +++ b/notebooks/4_spatial_weights.ipynb @@ -0,0 +1,1424 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "7b8975c4", + "metadata": {}, + "source": [ + "# Spatial Weights\n", + "\n", + "### Luc Anselin\n", + "\n", + "### 09/06/2024\n" + ] + }, + { + "cell_type": "markdown", + "id": "4cfd0985", + "metadata": {}, + "source": [ + "## Preliminaries\n", + "\n", + "In this notebook, basic operations pertaining to spatial weights are reviewed. Two major cases are considered: reading weights files constructed by other software, such as *GeoDa*, and creating weights from GeoDataFrames or spatial layers using the functionality in *libpysal.weights*. In addition, some special operations are covered, such as creating spatial weights for regular grids and turning a *PySAL* weights object into a full matrix. The computation of a spatially lagged variable is illustrated as well.\n", + "\n", + "A video recording is available from the GeoDa Center YouTube channel playlist *Applied Spatial Regression - Notebooks*, at https://www.youtube.com/watch?v=IbmTItot0q8&list=PLzREt6r1NenmhNy-FCUwiXL17Vyty5VL6&index=4." + ] + }, + { + "cell_type": "markdown", + "id": "6494b68c", + "metadata": {}, + "source": [ + "### Modules Needed\n", + "\n", + "The main functionality is provided by the utilities in *libpysal* for spatial weights, and the functionality in *geopandas* for data input and output. All of these rely on *numpy* as a dependency.\n", + "\n", + "To simplify notation, the `libpysal.weights` module is imported as `weights`, and `get_path` and `open` are imported from respectively `libpysal.examples` and `libpysal.io`.\n", + "\n", + "The `warnings` module filters some warnings about future changes. 
To avoid some arguably obnoxious new formatting behavior introduced in *numpy* 2.0, it is necessary to include the `set_printoptions` command if you are using a Python 3.12 environment with numpy 2.0 or greater.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "e398e42f",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "import warnings\n",
+ "warnings.filterwarnings(\"ignore\")\n",
+ "import numpy as np\n",
+ "import os\n",
+ "os.environ['USE_PYGEOS'] = '0'\n",
+ "import geopandas as gpd\n",
+ "from libpysal.examples import get_path\n",
+ "from libpysal.io import open\n",
+ "import libpysal.weights as weights\n",
+ "np.set_printoptions(legacy=\"1.25\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1ac85fb3",
+ "metadata": {},
+ "source": [
+ "### Functions Used\n",
+ "\n",
+ "- from numpy:\n",
+ " - array\n",
+ " - mean\n",
+ " - std\n",
+ " - flatten\n",
+ " - @\n",
+ "\n",
+ "- from geopandas:\n",
+ " - read_file\n",
+ " - astype\n",
+ " \n",
+ "- from libpysal.examples:\n",
+ " - get_path\n",
+ "\n",
+ "- from libpysal.io:\n",
+ " - open\n",
+ "\n",
+ "- from libpysal.weights:\n",
+ " - neighbors\n",
+ " - weights\n",
+ " - n\n",
+ " - min_neighbors, max_neighbors, mean_neighbors\n",
+ " - pct_nonzero\n",
+ " - asymmetry, asymmetries\n",
+ " - Kernel.from_file\n",
+ " - Queen.from_dataframe\n",
+ " - transform\n",
+ " - Queen.from_file\n",
+ " - KNN.from_dataframe\n",
+ " - symmetrize\n",
+ " - Kernel\n",
+ " - Kernel.from_shapefile\n",
+ " - lat2W\n",
+ " - full\n",
+ " - lag_spatial"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "67da216d",
+ "metadata": {},
+ "source": [
+ "### Files and Variables\n",
+ "\n",
+ "This notebook uses data on socio-economic correlates of health outcomes contained in the **chicagoSDOH** sample shape files and associated spatial weights. It is assumed that all sample files have been installed.\n",
+ "\n",
+ "- **Chi-SDOH.shp,shx,dbf,prj**: socio-economic indicators of health for 2014 in 791 Chicago tracts\n",
+ "- **Chi-SDOH_q.gal**: queen contiguity spatial weights from `GeoDa`\n",
+ "- **Chi-SDOH_k6s.gal**: k-nearest neighbor weights for k=6, made symmetric in `GeoDa`\n",
+ "- **Chi-SDOH_k10tri.kwt**: triangular kernel weights based on a variable bandwidth with 10 nearest neighbors from `GeoDa`\n",
+ "\n",
+ "As before, file names and variable names are specified at the top of the notebook so that this is the only part that needs to be changed for other data sets and variables."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "12a910c4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "infileshp = \"Chi-SDOH.shp\" # input shape file\n",
+ "infileq = \"Chi-SDOH_q.gal\" # queen contiguity from GeoDa\n",
+ "infileknn = \"Chi-SDOH_k6s.gal\" # symmetric k-nearest neighbor weights from GeoDa\n",
+ "infilekwt = \"Chi-SDOH_k10tri.kwt\" # triangular kernel weights for a variable knn bandwidth from GeoDa\n",
+ "outfileq = \"test_q.gal\" # output file for queen weights computed with libpysal\n",
+ "outfilek = \"test_k.kwt\" # output file for kernel weights computed with libpysal\n",
+ "y_name = [\"YPLL_rate\"] # variable to compute spatial lag"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6051ceb7",
+ "metadata": {},
+ "source": [
+ "## Spatial Weights from a File (GeoDa)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "51160fd3",
+ "metadata": {},
+ "source": [
+ "Spatial weights are an essential part of any spatial autocorrelation analysis and spatial regression. Functionality to create and analyze spatial weights is contained in the `libpysal.weights` library.\n",
+ "The full range of functions is far beyond the current scope and can be found at https://pysal.org/libpysal/api.html.\n",
+ "\n",
+ "Only the essentials are covered here, sufficient to proceed\n",
+ "with the spatial regression analysis. Also, only the original `Weights` class is considered. A newer alternative is provided by the `Graph` class, but it is not further discussed here; a minimal conversion example is shown below for reference only. Full details can be found at https://pysal.org/libpysal/user-guide/graph/w_g_migration.html.\n",
+ "\n",
+ "Arguably the easiest way to create spatial weights is to use the *GeoDa* software (https://geodacenter.github.io/download.html), which\n",
+ "provides functionality to construct a wide range of contiguity as well as distance-based\n",
+ "weights through a graphical user interface. The weights information is stored as **gal**, **gwt** or **kwt** files. Importing these weights into *PySAL* is considered first.\n"
+ ]
+ },
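+ {
+ "cell_type": "markdown",
+ "id": "2a9c51e0",
+ "metadata": {},
+ "source": [
+ "As an aside, the sketch below illustrates the correspondence between the two classes by converting a weights object read from a sample file to a `Graph` and back. This assumes a recent version of *libpysal* (4.9 or newer) that includes the `graph` module; the object **wq_g** is for illustration only and is not used in the remainder of the notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7e4a2d1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# illustration only - not used in the rest of the notebook\n",
+ "from libpysal import graph\n",
+ "wq_g = open(get_path(infileq)).read()   # a weights object, as in the next section\n",
+ "g = graph.Graph.from_W(wq_g)            # from Weights to Graph\n",
+ "print(type(g))\n",
+ "print(type(g.to_W()))                   # and back"
+ ]
+ },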
+ {
+ "cell_type": "markdown",
+ "id": "85f6ef20",
+ "metadata": {},
+ "source": [
+ "### Queen Contiguity Weights\n",
+ "\n",
+ "Contiguity weights can be read into PySAL spatial weights objects using the `read` function, after opening the file with `libpysal.io.open` (here, just `open`). This is applied to the queen contiguity weights created by `GeoDa`, contained in the file **infileq**, after obtaining its path using `get_path`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ce630240",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "inpath = get_path(infileq)\n",
+ "wq = open(inpath).read()\n",
+ "wq"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8519f624",
+ "metadata": {},
+ "source": [
+ "The result is a PySAL spatial weights object of the class `libpysal.weights.weights.W`. This object contains the `neighbors` and `weights` information as well as many other attributes and methods. \n",
+ "\n",
+ "It is useful to remember that `neighbors` and `weights` are dictionaries that use an ID variable or simple sequence number as the key. A quick view of the relevant keys is obtained by converting them to a `list` and printing out the first few elements."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "72590f7d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(list(wq.neighbors.keys())[0:5])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e69b779d",
+ "metadata": {},
+ "source": [
+ "This reveals that the keys are simple strings, starting at **'1'** and not at **0** as in the usual Python indexing. The IDs of the neighbors for a given observation can be listed by specifying the key. For example, for observation with ID='1', this yields:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ca74d6fc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wq.neighbors['1']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8772b742",
+ "metadata": {},
+ "source": [
+ "When an inappropriate key is used, an error is generated (recall that dictionaries are accessed by key, not by position, so there are no sequence numbers). For example, here `1` is entered as an integer, but it should have been a string, as above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6b8f34cf",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wq.neighbors[1]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d40d3b5c",
+ "metadata": {},
+ "source": [
+ "The weights associated with each observation key are found using `weights`. 
For example, for observation with ID='1' this yields:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1342d4e9", + "metadata": {}, + "outputs": [], + "source": [ + "wq.weights['1']" + ] + }, + { + "cell_type": "markdown", + "id": "4f645036", + "metadata": {}, + "source": [ + "At this point, all the weights are simply binary. Row-standardization is considered below." + ] + }, + { + "cell_type": "markdown", + "id": "f90c1984", + "metadata": {}, + "source": [ + "#### Weights Characteristics" + ] + }, + { + "cell_type": "markdown", + "id": "50ee4c2d", + "metadata": {}, + "source": [ + "A quick check on the number of observations, i.e., the number of rows in the weights matrix." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4a60409f", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "wq.n" + ] + }, + { + "cell_type": "markdown", + "id": "1751a0a5", + "metadata": {}, + "source": [ + "Minimum, maximum and average number of neighbors and percent non-zero (an indication of sparsity)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c5de46f3", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "wq.min_neighbors,wq.max_neighbors,wq.mean_neighbors,wq.pct_nonzero" + ] + }, + { + "cell_type": "markdown", + "id": "c406929f", + "metadata": {}, + "source": [ + "There is no explicit check for symmetry as such, but instead the lack of symmetry can be assessed by means of the `asymmetry` method, or the list of id pairs with asymmetric weights is obtained by means of the `asymmetries` attribute." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "430a2f59", + "metadata": {}, + "outputs": [], + "source": [ + "print(wq.asymmetry())\n", + "print(wq.asymmetries)" + ] + }, + { + "cell_type": "markdown", + "id": "545d5999", + "metadata": {}, + "source": [ + "Since contiguity weights are symmetric by construction, the presence of an asymmetry would indicate some kind of error. This is not the case here." + ] + }, + { + "cell_type": "markdown", + "id": "aa2ee8a3", + "metadata": {}, + "source": [ + "### K-Nearest Neighbors Weights" + ] + }, + { + "cell_type": "markdown", + "id": "e856073b", + "metadata": {}, + "source": [ + "Similarly, the symmetric knn weights (k=6) created by `GeoDa` can be read from the file **infileknn**:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a86f85ce", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "inpath = get_path(infileknn)\n", + "wk6s = open(inpath).read()\n", + "wk6s" + ] + }, + { + "cell_type": "markdown", + "id": "d4ea8c6c", + "metadata": {}, + "source": [ + "Some characteristics:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "74c91c22", + "metadata": {}, + "outputs": [], + "source": [ + "wk6s.n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f55fc305", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "print(wk6s.min_neighbors,wk6s.max_neighbors,wk6s.mean_neighbors,wk6s.pct_nonzero)" + ] + }, + { + "cell_type": "markdown", + "id": "212cd898", + "metadata": {}, + "source": [ + "Note how the operation to make the initially asymmetric k-nearest neighbor weights symmetric has resulted in many observations having more than 6 neighbors (`max_neighbors` is larger than 6). That is the price to pay to end up with symmetric weights, which is required for some of the estimation methods. 
We can list neighbors and weights in the usual way. As it turns out, the observation with key `1` is not adjusted, but the observation with key `3` now has eight neighbors (up from the original six).\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8fb35473",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wk6s.neighbors['1']"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3bf5d207",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wk6s.neighbors['3']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "69dc85a3",
+ "metadata": {},
+ "source": [
+ "### Kernel Weights"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f523e10a",
+ "metadata": {},
+ "source": [
+ "Triangular kernel weights based on a variable bandwidth with 10 nearest neighbors created by `GeoDa` are contained in the file **infilekwt**. The properties of kernel weights are considered in more detail in a later notebook.\n",
+ "\n",
+ "The weights can be read in the usual fashion, by means of `libpysal.io.open`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "cc957469",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "inpath = get_path(infilekwt)\n",
+ "kwtri = open(inpath).read()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f7f9a159",
+ "metadata": {},
+ "source": [
+ "However, this does not give the desired result. The object is not recognized as kernel weights, but\n",
+ "as a standard spatial weights object, revealed by checking the `type`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8b8aa4de",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(type(kwtri))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6b69738a",
+ "metadata": {},
+ "source": [
+ "The kernel weights can be checked with the usual `weights` attribute. However, the values for the keys in this example are not characters, but simple integers. This is revealed by a quick check of the keys."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2b738d26",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(list(kwtri.neighbors.keys())[0:5])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "70e0fafb",
+ "metadata": {},
+ "source": [
+ "Now, with the integer 1 as the key, the contents of the weights can be listed. Note the presence of the weights 1.0 (for the diagonal). All is fine, except that *PySAL* does not recognize the weights as kernel weights."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "996a5c2b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(kwtri.weights[1])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c4584ce6",
+ "metadata": {},
+ "source": [
+ "The alternative, using the `weights.Kernel.from_file` method from `libpysal`, has the same problem."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dbf96029",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "kwtri10f = weights.Kernel.from_file(inpath)\n",
+ "print(type(kwtri10f))\n",
+ "print(kwtri10f.weights[1])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3f9b8309",
+ "metadata": {},
+ "source": [
+ "#### Changing the class of weights"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d28c0700",
+ "metadata": {},
+ "source": [
+ "In this particular case, a hack is to force the class of the weights object to be a kernel weight. This is generally not recommended, but since the object in question has all the characteristics of kernel weights, it is safe to do so.\n",
+ "\n",
+ "It is accomplished by setting the attribute `__class__` of the weights object to `libpysal.weights.distance.Kernel`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0a323c47",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "kwtri10f.__class__ = weights.distance.Kernel\n",
+ "print(type(kwtri10f))"
+ ]
+ },
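+ {
+ "cell_type": "markdown",
+ "id": "4e91ab37",
+ "metadata": {},
+ "source": [
+ "Since this workaround is needed whenever kernel weights are read from a file (the same issue resurfaces below for kernel weights files written by *PySAL* itself), it can be wrapped in a small convenience function. This is just a sketch; `read_kernel` is a hypothetical helper, not part of *PySAL*."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6c0d82f5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def read_kernel(path):\n",
+ "    # read a kernel weights file and force the Kernel class - hypothetical helper\n",
+ "    kw = weights.Kernel.from_file(path)\n",
+ "    kw.__class__ = weights.distance.Kernel\n",
+ "    return kw\n",
+ "\n",
+ "print(type(read_kernel(inpath)))"
+ ]
+ },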
+ {
+ "cell_type": "markdown",
+ "id": "2efc8b5b",
+ "metadata": {},
+ "source": [
+ "## Creating Weights from a GeoDataFrame"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c193a5dd",
+ "metadata": {},
+ "source": [
+ "### Queen Contiguity Weights"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "713be4b9",
+ "metadata": {},
+ "source": [
+ "In *PySAL*, the spatial weights construction is handled by `libpysal.weights`. The generic pattern is `weights.<weights type>.from_dataframe`, here `weights.Queen.from_dataframe`, with as arguments the geodataframe and optionally the `ids` (recommended). For the Chicago data, the ID variable is **OBJECTID**. To make sure the latter is an integer (it is not in the original data frame), its type is changed by means of the `astype` method. \n",
+ "\n",
+ "The same operation can also create a contiguity weights object from a shape file, using `weights.Queen.from_shapefile`, but this is left as an exercise."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "015c694f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "inpath = get_path(infileshp)\n",
+ "dfs = gpd.read_file(inpath)\n",
+ "dfs = dfs.astype({'OBJECTID':'int'})\n",
+ "wq1 = weights.Queen.from_dataframe(dfs,ids='OBJECTID')\n",
+ "wq1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9029a443",
+ "metadata": {},
+ "source": [
+ "A quick check on the keys reveals these are integers."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9e404c2c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(list(wq1.neighbors.keys())[0:5])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b9263cc3",
+ "metadata": {},
+ "source": [
+ "Again, some characteristics:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9d0f697e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wq1.n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7cbad937",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(wq1.min_neighbors,wq1.max_neighbors,wq1.mean_neighbors,wq1.pct_nonzero)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "baf427f5",
+ "metadata": {},
+ "source": [
+ "The structure of the weights is identical to that from the file read from `GeoDa`. For example, the first set of neighbors and weights are:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e7321592",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(wq1.neighbors[1])\n",
+ "print(wq1.weights[1])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7dfc46a3",
+ "metadata": {},
+ "source": [
+ "### Row-standardization"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0756cb5e",
+ "metadata": {},
+ "source": [
+ "As created, the weights are binary, with a weight of 1.0 for each neighbor. 
To turn the weights into row-standardized form, a *transformation* is needed, `wq1.transform = 'r'`:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "931dda95",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wq1.transform = 'r'\n",
+ "wq1.weights[1]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b659247c",
+ "metadata": {},
+ "source": [
+ "### Writing a Weights File"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "05e387f1",
+ "metadata": {},
+ "source": [
+ "To write out the weights object to a GAL file, `libpysal.io.open` is used with the `write` method. The argument to the `open` command is the filename and `mode='w'` (for writing a file). The weights object itself is the argument to the `write` method.\n",
+ "\n",
+ "Note that even though the weights are row-standardized, this information is lost in the output file."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "id": "d0547d6d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "open(outfileq,mode='w').write(wq1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1c928df1",
+ "metadata": {},
+ "source": [
+ "A quick check is carried out using the `weights.Queen.from_file` operation on the just created weights file."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "681ef494",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wq1a = weights.Queen.from_file(outfileq)\n",
+ "print(wq1a.n)\n",
+ "print(list(wq1a.neighbors.keys())[0:5])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3f26a5b6",
+ "metadata": {},
+ "source": [
+ "Note how the type of the key has changed from integer above to character after reading from the outside file. This again stresses the importance of checking the keys before any further operations.\n",
+ "\n",
+ "The weights are back to their original binary form, so the row-standardization is lost after writing the output file."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bc1e60e9",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "wq1a.weights['1']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b4e53bca",
+ "metadata": {},
+ "source": [
+ "### KNN Weights\n",
+ "\n",
+ "The corresponding functionality for k-nearest neighbor weights is `weights.KNN.from_dataframe`. An important argument is `k`, the number of neighbors, with the default set to `2`, which is typically not that useful. Again, it is useful to include OBJECTID as the ID variable. Initially, the weights are in binary form; row-standardization is applied below, after the weights are made symmetric.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2128e659",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wk6 = weights.KNN.from_dataframe(dfs,k=6,ids='OBJECTID')\n",
+ "print(wk6.n)\n",
+ "print(list(wk6.neighbors.keys())[0:5])\n",
+ "wk6"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1e5173eb",
+ "metadata": {},
+ "source": [
+ "To compare the just created weights to the symmetric form read into **wk6s**, the list of neighbors for observation 3 is informative. It consists of a subset of six from the list of eight in the symmetric knn weights above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "599541a1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(wk6.neighbors[3])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d7b4f435",
+ "metadata": {},
+ "source": [
+ "The k-nearest neighbor weights are intrinsically asymmetric. 
Rather than listing all the pairs that contain such asymmetries, the length of this list can be checked using the `asymmetry` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1b112430", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "print(len(wk6.asymmetry()))" + ] + }, + { + "cell_type": "markdown", + "id": "02aec84c", + "metadata": {}, + "source": [ + "KNN weights have a built-in method to make them symmetric: `symmetrize`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "239b43ae", + "metadata": {}, + "outputs": [], + "source": [ + "wk6s2 = wk6.symmetrize()\n", + "print(len(wk6.asymmetry()))\n", + "print(len(wk6s2.asymmetry()))" + ] + }, + { + "cell_type": "markdown", + "id": "cd55eca5", + "metadata": {}, + "source": [ + "The entries are now the same as for the symmetric knn GAL file that was read in from `GeoDa`. For example, the neighbors of observation with key `3` are:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea5eb99d", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "print(wk6s2.neighbors[3])" + ] + }, + { + "cell_type": "markdown", + "id": "9c828bbe", + "metadata": {}, + "source": [ + "Finally, to make them row-standardized, the same transformation is used." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e98d4fde", + "metadata": {}, + "outputs": [], + "source": [ + "wk6s2.transform = 'r'\n", + "wk6s2.weights[3]" + ] + }, + { + "cell_type": "markdown", + "id": "5a380252", + "metadata": {}, + "source": [ + "## Kernel Weights" + ] + }, + { + "cell_type": "markdown", + "id": "d805bada", + "metadata": {}, + "source": [ + "There are several ways to create the kernel weights that are used later in the course, for example to compute HAC standard errors in ordinary least squares regression. One is to create the weights in `GeoDa` and save them as a weights file with a **kwt** extension. However, currently, there is a bug in libpysal so that the proper class needs to be set explicitly.\n", + "\n", + "The alternative is to compute the weights directly with `PySAL`. This can be implemented in a number of ways. One is to create the weights using the `libpysal.weights.Kernel` function, with a matrix of x-y coordinates passed. Another is to compute the weights directly from the information in a shape file, using `libpysal.weights.Kernel.from_shapefile`.\n", + "\n", + "Each is considered in turn." + ] + }, + { + "cell_type": "markdown", + "id": "8b452302", + "metadata": {}, + "source": [ + "### Kernel Weights Computation" + ] + }, + { + "cell_type": "markdown", + "id": "b3734b6f", + "metadata": {}, + "source": [ + "Direct computation of kernel weights takes as input an array of coordinates. Typically these are the coordinates of the locations, but it is a perfectly general approach and can take any number of variables to compute *general* distances (or economic distances). In the example, the X and Y coordinates contained in the geodataframe **dfs** are used as `COORD_X` and `COORD_Y`. \n", + "\n", + "First, the respective columns from the data frame are turned into a numpy array.\n", + "\n", + "The command to create the kernel weights is `libpysal.weights.Kernel`. It takes the array as the first argument, followed by a number of options. To have a variable bandwidth that follows the 10 nearest neighbors, \n", + "`fixed = False` (the default is a fixed bandwidth) and `k=10`. 
The kernel function is selected as `function=\"triangular\"` (this is also the default, but it is included here for clarity). Finally, the use of kernel weights in the HAC calculations requires the diagonals to be set to the value of one, achieved by means\n",
+ "of `diagonal=True`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b3a98748",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "coords = np.array(dfs[['COORD_X','COORD_Y']])\n",
+ "kwtri10 = weights.Kernel(coords,fixed=False,k=10,\n",
+ " function=\"triangular\",diagonal=True)\n",
+ "print(type(kwtri10))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "80d86e62",
+ "metadata": {},
+ "source": [
+ "The result is an object of class `libpysal.weights.distance.Kernel`. This contains several attributes, such as the kernel function used."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "82abf5af",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "kwtri10.function"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b3a6d3c3",
+ "metadata": {},
+ "source": [
+ "A check on the keys:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ff74853e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(list(kwtri10.neighbors.keys())[0:5])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "19215410",
+ "metadata": {},
+ "source": [
+ "Note that the index starts at 0 and the keys are integers. The neighbors for the first observation:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f47c9c44",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "kwtri10.neighbors[0]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5e6ea4dc",
+ "metadata": {},
+ "source": [
+ "The kernel weights for the first observation:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "380606c6",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "kwtri10.weights[0]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "75e81a3c",
+ "metadata": {},
+ "source": [
+ "These are the same values as we obtained above from reading the kwt file, but now they are recognized as a proper kernel weights object."
+ ]
+ },
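+ {
+ "cell_type": "markdown",
+ "id": "8f2a6b91",
+ "metadata": {},
+ "source": [
+ "As an additional check on the variable bandwidth, since `fixed=False` was specified, each observation has its own bandwidth, based on its 10 nearest neighbors. These values are stored in the `bandwidth` attribute of the kernel weights object, shown here for the first five observations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1d74c3e8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# variable bandwidths - one per observation\n",
+ "print(kwtri10.bandwidth[0:5])"
+ ]
+ },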
+ {
+ "cell_type": "markdown",
+ "id": "bc1b40f5",
+ "metadata": {},
+ "source": [
+ "### Kernel Weights from a Shape File"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "df575721",
+ "metadata": {},
+ "source": [
+ "Contiguity weights, distance weights and kernel weights can also be constructed directly from a shape file, using the relevant `from_shapefile` methods. For kernel weights, this can be based on either point coordinates or on the coordinates of polygon centroids to compute the distances needed. The relevant function is `libpysal.weights.Kernel.from_shapefile` with as its main argument the file (path) name of the \n",
+ "shape file involved. The other arguments are the same options as before. The shape file in **infileshp** is used as the input file."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7cc1d8ae",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "inpath = get_path(infileshp)\n",
+ "kwtri10s = weights.Kernel.from_shapefile(inpath,\n",
+ " fixed=False,k=10,\n",
+ " function=\"triangular\",diagonal=True)\n",
+ "print(type(kwtri10s))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "380bf1a0",
+ "metadata": {},
+ "source": [
+ "The result is of the proper type and contains the same structure as before, with matching function, neighbors and weights."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d776d966",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(kwtri10s.function)\n",
+ "print(list(kwtri10s.neighbors.keys())[0:5])\n",
+ "print(kwtri10s.neighbors[0])\n",
+ "print(kwtri10s.weights[0])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "948f1fb0",
+ "metadata": {},
+ "source": [
+ "### Writing the Kernel Weights"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "03f68cfb",
+ "metadata": {},
+ "source": [
+ "We use the same method as for the queen weights to write the just constructed kernel weights to an outside kwt file. The output file is `outfilek`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 55,
+ "id": "9ec284cd",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "open(outfilek,mode='w').write(kwtri10s)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d05c15d0",
+ "metadata": {},
+ "source": [
+ "Quick check:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f7ab1227",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "kk = weights.Kernel.from_file(outfilek)\n",
+ "print(type(kk))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "021da548",
+ "metadata": {},
+ "source": [
+ "So, the same problem as mentioned above persists for weights files written by *PySAL*, and the proper class needs to be set explicitly."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "183515cd",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "kk.__class__ = weights.distance.Kernel\n",
+ "print(type(kk))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e0eeb182",
+ "metadata": {},
+ "source": [
+ "## Special Weights Operations"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4c7a1d94",
+ "metadata": {},
+ "source": [
+ "A few special weights operations will come in handy later on. One is to create spatial weights for a regular grid setup, which is very useful for simulation designs. The other is to turn a spatial weights object into a standard numpy array, which can be used in all kinds of matrix operations."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3507af4d",
+ "metadata": {},
+ "source": [
+ "### Weights for Regular Grids"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1686fa14",
+ "metadata": {},
+ "source": [
+ "The `weights.lat2W` operation creates rook contiguity spatial weights (the default; queen contiguity is obtained with `rook = False`) for a regular rectangular grid with the number of rows and the number of columns as the arguments. The result is a simple binary weights object, so row-standardization is typically needed as well.\n",
+ "\n",
+ "For a square grid, with **gridside=20** as the number of rows/columns, the result has dimension 400."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "32e033b7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "gridside = 20\n",
+ "wgrid = weights.lat2W(gridside,gridside,rook=True)\n",
+ "wgrid.n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "80e3a3ae",
+ "metadata": {},
+ "source": [
+ "Quick check on the neighbor keys."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d19a637f",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "print(list(wgrid.neighbors.keys())[0:5])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6979a434",
+ "metadata": {},
+ "source": [
+ "Since this is a square grid, the first observation, in the upper left corner, has only two neighbors, one\n",
+ "to the right (1) and one below (20 - since the first row goes from 0 to 19)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3d97e2ec",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wgrid.neighbors[0]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bfd62652",
+ "metadata": {},
+ "source": [
+ "Row-standardization yields the actual weights."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "75cc9d20",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wgrid.transform = 'r'\n",
+ "wgrid.weights[0]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b930a20a",
+ "metadata": {},
+ "source": [
+ "Any non-border cell has four neighbors: one to the left, right, up and down."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f4fe7013",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wgrid.weights[21]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "634ddd29",
+ "metadata": {},
+ "source": [
+ "### Weights as Matrices"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3f79a8b0",
+ "metadata": {},
+ "source": [
+ "The `weights.full` operation turns a spatial weights object into a standard numpy array. The function returns a tuple, of which the first element is the actual matrix and the second consists of a list of keys. For actual matrix operations, the latter is not that useful.\n",
+ "\n",
+ "It is important to remember to always extract the first element of the tuple as the matrix of interest. Otherwise, one quickly runs into trouble with array operations.\n",
+ "\n",
+ "This is illustrated for the row-standardized queen weights **wq1** created earlier."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a8a19df1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wq1full, wqfkeys = weights.full(wq1)\n",
+ "print(type(wq1full),type(wqfkeys))\n",
+ "wq1full.shape"
+ ]
+ },
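+ {
+ "cell_type": "markdown",
+ "id": "3b8e54d2",
+ "metadata": {},
+ "source": [
+ "A quick sanity check on the conversion: since **wq1** was row-standardized above, every row of the full matrix should sum to one."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c5f0a9b4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# row sums of a row-standardized weights matrix all equal one\n",
+ "print(wq1full.sum(axis=1)[0:5])\n",
+ "print(np.allclose(wq1full.sum(axis=1),1.0))"
+ ]
+ },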
+ {
+ "cell_type": "markdown",
+ "id": "6b47feef",
+ "metadata": {},
+ "source": [
+ "## Spatially Lagged Variables"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "355c9d91",
+ "metadata": {},
+ "source": [
+ "Spatially lagged variables are essential in the specification of spatial regression models. They are the product of a spatial weight matrix with a vector of observations and yield new values as (weighted) averages of the values observed at neighboring locations (with the neighbors defined by the spatial weights).\n",
+ "\n",
+ "This is illustrated for the variable **y_name** extracted from the data frame. Its mean and standard deviation are listed using the standard `numpy` methods."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f357ab21",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "y = np.array(dfs[y_name])\n",
+ "print(y.shape)\n",
+ "print(y.mean())\n",
+ "print(y.std())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6f369e77",
+ "metadata": {},
+ "source": [
+ "The new spatially lagged variable is created with the `weights.lag_spatial` command, passing the weights object **wq1** and the vector of interest, **y**. It is important to make sure that the dimensions match. In particular, if the vector in question is not an actual column vector, but a one-dimensional array, the result will not be a vector, but an array. This may cause trouble in some applications."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fc5f2f0c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wy = weights.lag_spatial(wq1,y)\n",
+ "print(wy.shape)\n",
+ "print(wy.mean())\n",
+ "print(wy.std())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0d706e58",
+ "metadata": {},
+ "source": [
+ "The result is a column vector. The mean roughly corresponds to that of the original variable, but the spatially lagged variable has a smaller standard deviation. This illustrates the *smoothing* implied by the spatial lag operation.\n",
+ "\n",
+ "To illustrate the problem with numpy arrays rather than vectors, the original vector is flattened and then the `lag_spatial` operation is applied to it. Everything works fine, except that the result is an array, and not a column vector."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b45d600a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "yy = y.flatten()\n",
+ "print(yy.shape)\n",
+ "wyy = weights.lag_spatial(wq1,yy)\n",
+ "print(wyy.shape)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "682469c5",
+ "metadata": {},
+ "source": [
+ "The same result can also be obtained using an explicit matrix-vector multiplication with the full matrix **wq1full** just created."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "622eb958",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wy1 = wq1full @ y\n",
+ "print(wy1.shape)\n",
+ "print(wy1.mean())\n",
+ "print(wy1.std())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "65aadbff",
+ "metadata": {},
+ "source": [
+ "## Practice\n",
+ "\n",
+ "Experiment with various spatial weights for your own data set or for one of the PySAL sample data sets. Create a spatially lagged variable for each of the weights and compare their properties, such as the mean, standard deviation, correlation between the original variable and the spatial lag, etc.\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.4"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}