Merge pull request #163 from pedrovma/main

Spreg version 1.8
pysal · Nov 5, 2024 · 2b72839 · 2b72839
2 parents 9eabac4 + 550c999
commit 2b72839
Show file tree

Hide file tree

Showing 26 changed files with 11,459 additions and 781 deletions.
diff --git a/notebooks/10_specification_tests_properties.ipynb b/notebooks/10_specification_tests_properties.ipynb
diff --git a/notebooks/11_distance_decay.ipynb b/notebooks/11_distance_decay.ipynb
diff --git a/notebooks/12_estimating_slx.ipynb b/notebooks/12_estimating_slx.ipynb
diff --git a/notebooks/13_ML_estimation_spatial_lag.ipynb b/notebooks/13_ML_estimation_spatial_lag.ipynb
diff --git a/notebooks/14_IV_estimation_spatial_lag.ipynb b/notebooks/14_IV_estimation_spatial_lag.ipynb
diff --git a/notebooks/15_ML_estimation_spatial_error.ipynb b/notebooks/15_ML_estimation_spatial_error.ipynb
diff --git a/notebooks/16_GMM_estimation_spatial_error.ipynb b/notebooks/16_GMM_estimation_spatial_error.ipynb
diff --git a/notebooks/17_GMM_higher_order.ipynb b/notebooks/17_GMM_higher_order.ipynb
@@ -0,0 +1,354 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "89145993",
+   "metadata": {},
+   "source": [
+    "# GMM Estimation - Higher Order Models\n",
+    "\n",
+    "### Luc Anselin\n",
+    "\n",
+    "### (revised 09/26/2024)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a85427f9",
+   "metadata": {},
+   "source": [
+    "## Preliminaries\n",
+    "\n",
+    "This module covers the estimation of higher order spatial models, such as SAR-SAR (spatial lag with spatial autoregressive errors) and the generalized nested specification (GNS), i.e., a Spatial Durbin model with spatial autoregressive errors. These specifications are estimated as special cases of `spreg.GMM_Error`, by including the argument `add_wy = True`, without or with `slx_lags = 1`.\n",
+    "\n",
+    "In general, these specifications should be avoided, but they are included here for the sake of completeness. As shown by Koley and Bera (2024), the full set of parameters in the GNS model is not identified, and ML cannot be applied. The SAR-SAR models suffers from similar problems, and its ML estimation typically has a very hard time to convert, switching back and forth between the estimates for $\\rho$ and $\\lambda$ (this is referred to by Bivand and Piras in the `R-spatialreg` package as the \"banana\" problem). As a result, ML estimation of these models is not included in *spreg*. However, it remains possible to estimate them by means of IV/GMM methods, although the results need to be interpreted with caution. Also, as it turns out, in practice, the results often do not make sense and are difficult to interpret."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f9a2c9eb",
+   "metadata": {},
+   "source": [
+    "### Modules Needed\n",
+    "\n",
+    "The same modules are needed as for the GMM estimation of the error model: `GMM_Error` imported from `spreg`, utilities in *libpysal* (to open spatial weights and access the sample data set), *pandas* and *geopandas*."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "5ac490b0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import warnings\n",
+    "warnings.filterwarnings(\"ignore\")\n",
+    "import os\n",
+    "os.environ['USE_PYGEOS'] = '0'\n",
+    "\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "import geopandas as gpd\n",
+    "from libpysal.io import open\n",
+    "from libpysal.examples import get_path\n",
+    "from libpysal.weights import lag_spatial\n",
+    "\n",
+    "from spreg import GMM_Error"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0ee18820",
+   "metadata": {},
+   "source": [
+    "### Functions Used\n",
+    "\n",
+    "- from pandas/geopandas:\n",
+    "  - read_file\n",
+    "  \n",
+    "- from libpysal:\n",
+    "  - io.open\n",
+    "  - examples.get_path\n",
+    "\n",
+    "- from spreg:\n",
+    "  - spreg.GMM_Error"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "47934d1b-7587-4ddf-be38-83904eede8e8",
+   "metadata": {},
+   "source": [
+    "### Variable definition and data input"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bfb04af4-4521-4ac0-8e55-d3b92f54a403",
+   "metadata": {},
+   "source": [
+    "The data set and spatial weights are again from the **chicagoSDOH** sample data set. They are the same as for the GMM Error estimation:\n",
+    "\n",
+    "- **Chi-SDOH.shp,shx,dbf,prj**: socio-economic indicators of health for 2014 in 791 Chicago tracts\n",
+    "- **Chi-SDOH_q.gal**: queen contiguity weights\n",
+    "\n",
+    "To illustrate the methods, the same descriptive model is used as in the ML notebook. It relates the rate of uninsured households in a tract(for health insurance, **EP_UNINSUR**) to the lack of high school education (**EP_NOHSDP**), the economic deprivation index (**HIS_ct**), limited command of English (**EP_LIMENG**) and the lack of access to a vehicle (**EP_NOVEH**). This is purely illustrative of a spatial error specification and does not have a particular theoretical or policy motivation.\n",
+    "\n",
+    "In an alternative specification, **HIS_ct** is considered to be endogenous, with, as before, **COORD_X** and **COORD_Y** as instruments.\n",
+    "\n",
+    "The file names and variable names are set in the usual manner. Any customization for different data sets/weights and different variables should be specified in this top cell."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "14a1e98e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "infileshp = get_path(\"Chi-SDOH.shp\")     # input shape file with data\n",
+    "infileq = get_path(\"Chi-SDOH_q.gal\")     # queen contiguity weights created with GeoDa\n",
+    "\n",
+    "y_name = 'EP_UNINSUR'\n",
+    "x_names = ['EP_NOHSDP','HIS_ct','EP_LIMENG','EP_NOVEH']\n",
+    "xe_names = ['EP_NOHSDP','EP_LIMENG','EP_NOVEH']\n",
+    "yend_names = ['HIS_ct']\n",
+    "q_names = ['COORD_X','COORD_Y']\n",
+    "ds_name = 'Chi-SDOH'\n",
+    "w_name = 'Chi-SDOH_q'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d9514984-a22e-4d5b-a993-f40771a290a0",
+   "metadata": {},
+   "source": [
+    "The `read_file` and `open` functions are used to access the sample data set and contiguity weights. The weights are row-standardized and the data frames for the dependent and explanatory variables are constructed. As before, this functionality is agnostic to the actual data sets and variables used, since it relies on the specification given in the initial block above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "3248d23d-2247-40bf-a3a5-3ba21b477aa8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dfs = gpd.read_file(infileshp)\n",
+    "wq =  open(infileq).read()    \n",
+    "wq.transform = 'r'    # row-transform the weights\n",
+    "y = dfs[y_name]\n",
+    "x = dfs[x_names]\n",
+    "yend = dfs[yend_names]\n",
+    "xe = dfs[xe_names]\n",
+    "q = dfs[q_names]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a8a0c186-dd30-4f64-9a5b-d95bb562ebde",
+   "metadata": {},
+   "source": [
+    "## The SAR-SAR Model"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4cc74112-a2eb-4efa-9116-d1e5197d20ed",
+   "metadata": {},
+   "source": [
+    "### Exogenous variables only"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4eec12f7",
+   "metadata": {},
+   "source": [
+    "GMM estimation of the SAR-SAR model is implemented in `spreg.GMM_Error`. This requires the standard regression arguments (i.e., at a minimum, `y`, `x` and `w`, as well as `yend` and `q` for the endogenous case), as well as `add_wy = True`. The same three methods are implemented as for generic `GMM_Error`, but here only the default `estimator = \"het\"` will be considered. Also, none of the special options are included (see the GMM Error notebook for details)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "799e1559",
+   "metadata": {},
+   "source": [
+    "GMM-heteroskedastic is the default, so the `estimator` argument does not need to be specified. The first illustration is for all default settings with only exogenous regressors. As usual, there is the option to use higher order lags for the instruments, but this is not pursued here.\n",
+    "\n",
+    "Also, note that in constrast to the standard lag (and spatial Durbin) specifications, there is no `spat_impacts` option."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1975d6b2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sarsar1 = GMM_Error(y,x,w=wq,\n",
+    "                 add_wy=True,\n",
+    "                     name_w=w_name,name_ds=ds_name)\n",
+    "print(sarsar1.summary)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "debc6278",
+   "metadata": {},
+   "source": [
+    "In this example, there was no evidence to include a spatial error term suggested by the AK test in the spatial lag model (see the notebook on IV estimation of the spatial lag model). As a result, it is not a surprise to find the coefficient $\\lambda$ not to be significant. Typical in the SAR-SAR model, the signs of $\\rho$ and $\\lambda$ tend to be opposite, which is difficult to interpret and may point to an identification problem."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e185d6b5",
+   "metadata": {},
+   "source": [
+    "### Exogenous and endogenous variables"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c3d0b559",
+   "metadata": {},
+   "source": [
+    "An extension to include additional endogenous variables is carried out in the standard way. The endogenous variables and associated instruments are listed below the table with estimates. As usual, there is an option to include the spatial lags of the instruments (`True` by default)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "afa70961",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sarsar2 = GMM_Error(y,xe,w=wq,yend=yend,q=q,\n",
+    "                     add_wy = True,\n",
+    "                     name_w=w_name,name_ds=ds_name)\n",
+    "print(sarsar2.summary)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b0f21d50",
+   "metadata": {},
+   "source": [
+    "The consideration of endogeneity makes **HIS_ct** insignificant (as well as **EP_NOHSDP**). The $\\lambda$ coefficient remains non-significant, but its sign changes."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "418df371",
+   "metadata": {},
+   "source": [
+    "## The GNS Model"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "05445788",
+   "metadata": {},
+   "source": [
+    "The GNS model is treated as an SLX-Error model with an additional spatially lagged dependent variable as a regressor. This is accomplished by setting both `slx_lags = 1` (or higher) and `add_wy = True` in the `GMM_Error` call. As in the SAR-SAR case, there is no `spat_impacts` option."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1aabde7b",
+   "metadata": {},
+   "source": [
+    "### Exogenous variables only"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a0a22c12",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gns1 = GMM_Error(y,x,w=wq,\n",
+    "                     slx_lags = 1, add_wy = True,\n",
+    "                     name_w=w_name,name_ds=ds_name)\n",
+    "print(gns1.summary)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "343e01de",
+   "metadata": {},
+   "source": [
+    "The results are only provided as an illustration of the functionality. Again, the typical pattern emerges of opposite signs for $\\rho$ and $\\lambda$, but now only $\\lambda$ is significant. None of the SLX terms is significant at p=0.01, and of the original regressors, only **EP_LIMENG** remains significant."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "01449367",
+   "metadata": {},
+   "source": [
+    "### Exogenous and endogenous variables"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f0f04f20",
+   "metadata": {},
+   "source": [
+    "Endogenous variables are included by specifying `yend` and `q` (the `x` argument is set to `xe`, for exogenous variables only)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "30acd3e5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gns2 = GMM_Error(y,xe,w=wq,yend=yend,q=q,\n",
+    "                     add_wy = True,slx_lags = 1,\n",
+    "                     name_w=w_name,name_ds=ds_name)\n",
+    "print(gns2.summary)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9a89c73d",
+   "metadata": {},
+   "source": [
+    "In this example, the estimate of $\\rho$ is 0.9, which is highly suspicious. The estimate for $\\lambda$ is now marginally significant and positive. Of the other variables in the model, only **EP_LIMENG** remains as significant.\n",
+    "\n",
+    "Again, this highlights the caution that is needed when implementing this model. In general, it should be avoided and the model under consideration should be respecified in a different way."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3b086a22",
+   "metadata": {},
+   "source": [
+    "## Practice\n",
+    "\n",
+    "Since these models should be avoided, there is not much point in practicing them, other than to gain insight into the often conflicting (and confusing) indications provided by the parameter estimates. It is not because a model is not identified that no estimates can be obtained. However, those results are not necessarily (and usually not) meaningful."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}