
# RCBD Example: SAS {#rcbd-sas}

```{r include=FALSE}
require(SASmarkdown)
saspath <- "C:/Program Files/SASHome/SASFoundation/9.4/sas.exe"
sasopts <- "-nosplash -ls 200"
```

## Load and Explore Data

These data are from a winter wheat variety trial in Alliance, Nebraska [@stroup1994] and are commonly used to demonstrate spatial adjustment in linear modeling. The data represent yields of 56 varieties originally laid out in a randomized complete block (RCB) design. The local topography combined with winter kill, however, induced spatial variability that did not correspond to the RCB design [@stroup2013], which produced biased estimates in a standard unadjusted analysis. <br><br>

The code below reads the data from a CSV file located in the GitHub repo for this book. The data file can be downloaded from [here](https://raw.githubusercontent.com/IdahoAgStats/guide-to-field-trial-spatial-analysis/master/data/stroup_nin_wheat.csv). In this file, missing values for yield are denoted as 'NA'. These values are not actually missing, but represent blank plots separating blocks in the original design of the study. The **PROC FORMAT** section converts these to numeric missing values in SAS (defined as a single period). They are then removed from the data before proceeding to analyses. Also note that row and column indices are multiplied by constants to convert them to the meter dimensions of the plots. <br><br>

The first six observations are displayed. The data set contains variables for variety (entry), replication (rep), yield, and column and row identifiers.<br><br> Last, a map of the 4 replications (blocks) from the original experimental design is shown.

```{r SAS6_1, engine="sashtml", collectcode=TRUE, engine.path=saspath, include=TRUE, engine.opts=sasopts, comment=""}
proc format;
  invalue has_NA
  'NA' = .;
run;

filename NIN url "https://raw.githubusercontent.com/IdahoAgStats/guide-to-field-trial-spatial-analysis/master/data/stroup_nin_wheat.csv";

data alliance;
    infile NIN firstobs=2 delimiter=',';
    informat yield has_NA.;
    input entry $   rep $   yield   col row;
    Row  = 4.3*Row;
    Col = 1.2*Col;
    if yield=. then delete;
run;

proc print data=alliance(obs=6);
title1 ' Alliance Nebraska Wheat Variety Data';

run;
ODS html GPATH = ".\img\" ;

proc sgplot data=alliance;
	styleattrs datacolors=(cx28D400 cx00D4C8 cx0055D4 cxB400D4) datalinepatterns=(solid dash);
	HEATMAPPARM y=row x=col COLORgroup=rep/ outline; 
title1 'Layout of Blocks';
run;
```
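
As a quick check on the import (a sketch added for illustration, not part of the original analysis), the plots per replicate and the number of distinct varieties can be tabulated. With 56 varieties in 4 complete blocks, 224 yield observations would be expected once the blank plots are dropped. The data set and variable names are those created in the step above.

```sas
/* Sketch: sanity checks on the imported data (illustrative only) */
proc freq data=alliance;
   tables rep / nocum;   /* a complete design would show 56 plots per replicate */
run;

proc sql;
   select count(*)              as n_obs,
          count(distinct entry) as n_entries,
          min(yield)            as min_yield,
          max(yield)            as max_yield
   from alliance;
quit;
```

If the counts differ from what the design implies, the informat and the `delete` condition in the data step are the first places to look.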

### Plots of Field Trends

A first step in assessing spatial variability is to plot the data to visually assess any trends or patterns. This code uses the **SGPLOT** procedure to examine spatial patterns through a heat map of yields, as well as potential trends across replications, columns, and rows using box plots.

```{r SAS6_2, engine="sashtml",  engine.path=saspath, engine.opts=sasopts, include=TRUE, comment=""}
ODS html GPATH = ".\img\" ;

proc sgplot data=alliance;
	HEATMAPPARM y=Row x=Col COLORRESPONSE=yield/ colormodel=(blue yellow green); 
run;

proc sgplot data=alliance;
	vbox yield/category=rep FILLATTRS=(color=red) LINEATTRS=(color=black) WHISKERATTRS=(color=black);
run;

proc sgplot data=alliance;
	vbox yield/category=Col FILLATTRS=(color=yellow) LINEATTRS=(color=black) WHISKERATTRS=(color=black);
run;

proc sgplot data=alliance;
	vbox yield/category=Row FILLATTRS=(color=blue) LINEATTRS=(color=black) WHISKERATTRS=(color=black);
run;
```

In the heat map, there is a notable region where yield dips in the northwest corner of the study area. This area runs across the two topmost blocks and is positioned towards their ends, making those blocks non-homogeneous. As Stroup [@stroup2013] notes, this is due to a hilly area with low snow cover and high exposure to low winter temperatures. The box plots demonstrate this pattern across blocks, columns, and rows as well. From these initial graphics, it is clear there are discernible spatial patterns and trends present.<br><br>

## Estimating and Testing Spatial Correlation

### Examine the Number of Distance Pairs and Maximum Lags between Residuals

The process of modeling spatial variability begins by obtaining the residuals from the original RCB analysis. Here **PROC MIXED** is used to fit the RCB model and output the model residuals to the SAS data set "residuals".<br><br>

```{r SAS6_3, engine="sashtml", collectcode=TRUE, engine.path=saspath, include=TRUE, engine.opts=sasopts, comment=""}

ods html close;
proc mixed data=alliance;
   class Rep Entry;
   model Yield = Entry / outp=residuals;
   random Rep;
run;
ods html;
```

Examining and estimating spatial variability typically proceeds in several steps. In SAS, these can all be accomplished using **PROC VARIOGRAM**. In this first step, we summarize the potential distances (lags) between row/column positions. The **nhclasses** option sets a value for the number of lag classes or bins to try. Some trial and error may be necessary to find a good setting, but this value should be big enough to cover the range of possible distances between data row and column coordinates. Setting this initially to 40 was found to be a reasonable choice here. The **novariogram** option tells SAS to only look at these distances and not compute the empirical semivariance values yet.<br><br>

```{r SAS6_4, engine="sashtml",  engine.path=saspath, engine.opts=sasopts, include=TRUE, comment=""}

/* Examine lag distance & maxlag */
ODS html GPATH = ".\img\" ;

proc variogram data=residuals;
   compute novariogram nhclasses=40;
   coordinates xc=row yc=col;
   var resid;
run;
```

This output indicates that, with 40 bins, the minimum lag distance is approximately 1.2 m and that the maximum lag distances vary from 25 to 43 m in rows and columns, respectively. The histogram graphically displays the number of pairs in each lag distance bin. Ideally, we want bins that have at least 30 pairs to improve accuracy in estimation of the empirical semivariance. There are sufficient data here that setting the maximum number of lags to 30 results in bins with many more than 30 pairs, which will provide an accurate estimate of semivariance. <br><br>
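
The distance summary can also be cross-checked by hand. The sketch below (an illustration, assuming each row/column position occurs only once in `residuals`) computes the Euclidean distance for every unordered pair of plots; `pairs` and `dist` are names introduced here.

```sas
/* Sketch: pairwise distances between plot positions, each pair counted once */
proc sql;
   create table pairs as
   select sqrt((a.row - b.row)**2 + (a.col - b.col)**2) as dist
   from residuals as a, residuals as b
   where a.row < b.row
      or (a.row = b.row and a.col < b.col);
quit;

/* n should equal n_plots*(n_plots - 1)/2 */
proc means data=pairs n min median max maxdec=1;
   var dist;
run;
```

With a lag distance of 1.2 m and 30 lags, the variogram covers distances up to 1.2 × 30 = 36 m, which can be compared against the maximum distance reported here.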

### Compute Moran's I and Geary's C

Two common metrics of spatial correlation are Moran's I and Geary's C. Both these measures are essentially weighted correlations between pairs of observations and measure global and local variability, respectively. If no spatial correlation is present, Moran's I has an expected value close to 0.0 while Geary's C will be close to 1.0. The estimates for I and C can be computed and tested against these null values with the **PROC VARIOGRAM** code below. <br><br>
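
For reference, the usual textbook forms of the two statistics are sketched here for residuals $z_1, \dots, z_n$ with mean $\bar{z}$ and spatial weights $w_{ij}$. The notation is introduced only for illustration; the particular weights **PROC VARIOGRAM** applies depend on its options and are not reproduced.

$$
I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(z_i - \bar{z})(z_j - \bar{z})}{\sum_{i} (z_i - \bar{z})^2},
\qquad
C = \frac{n - 1}{2\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(z_i - z_j)^2}{\sum_{i} (z_i - \bar{z})^2}
$$

Under no spatial correlation, $E[I] = -1/(n-1)$, which is close to 0 for large $n$, and $E[C] = 1$. Values of $I$ above its expectation and values of $C$ below 1 both point to positive spatial correlation.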

```{r SAS6_5, engine="sashtml",  engine.path=saspath, engine.opts=sasopts, include=TRUE, comment=""}
ODS html GPATH = ".\img\" ;

proc variogram data=residuals plots(only)=moran ;
   compute lagd=1.2 maxlag=30 novariogram autocorr(assum=nor) ;
   coordinates xc=row yc=col;
   var resid;
run;
```
Both Moran's I and Geary's C are significant here, indicating the presence of spatial correlation. The accompanying Moran scatter plot, which plots each standardized residual against the weighted average of its neighbors, also shows a positive relationship: plots with large residuals tend to be surrounded by plots with large residuals.<br><br>

## Estimation and Modeling of Semivariance 

### Estimating Empirical Semivariance

Using the lag distance and maximum lag values obtained in the previous steps, we can now use the **compute** statement to estimate an empirical variogram as described in Section \@ref(intro).

```{r SAS6_6, engine="sashtml", engine.path=saspath, include=TRUE, engine.opts=sasopts, comment=""}
ODS html GPATH = ".\img\" ;

proc variogram data=residuals plots(only)=(semivar);
   coordinates xc=Col yc=Row;
   compute lagd=1.2 maxlags=30;
  var resid;
run;
```
In the plot, the semivariance increases as the distance between points increases, up to approximately 20 m, where it begins to level off. This type of pattern is very common for spatial relationships.<br><br>

### Fitting an Empirical Variogram Model

In Section \@ref(background), several theoretical variogram models were described. We can use **PROC VARIOGRAM** to fit and compare any number of these models. In the code below, the Gaussian, Exponential, Power, and Spherical models are fit using the **model** statement. By default when several models are listed, SAS will carry out a more sophisticated spatial modeling approach that uses combinations of these models. We, however, just want to focus on single model scenarios. Hence, the **nest=1** option is given which restricts the variogram modeling to single models only.<br><br>

```{r SAS6_7, engine="sashtml", engine.path=saspath, include=TRUE, engine.opts=sasopts, comment=""}
ODS html GPATH = ".\img\" ;

proc variogram data=residuals plots(only)=(fitplot);
   coordinates xc=Col yc=Row;
   compute lagd=1.2 maxlags=30;
   model form=auto(mlist=(gau, exp, pow, sph) nest=1);
  var resid;
run;
```

The table **Fit Summary** provides statistics on each variogram model fit. The AIC or SSE statistics can be used as a guide for model selection, where smaller values indicate a better fit. For this data, the Gaussian model is rated as "best", although the Spherical model is relatively similar. Given the numerically lower AIC and SSE values, however, **PROC VARIOGRAM** selects the Gaussian model and reports those semivariogram model parameter estimates in the final table. These parameter estimates will be used in the next step. The plot shown at the end provides a visual comparison of each variogram model fit, with the selected Gaussian model highlighted in bold.<br><br>
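
To make the three reported parameters concrete, a common parameterization of the Gaussian semivariogram is sketched below. The symbols are introduced here for illustration and may differ from the notation in Section \@ref(background): $c_n$ is the nugget, $c_0$ the partial sill, $a_0$ the range parameter, and $h$ the distance between two points.

$$
\gamma(h) = c_n + c_0\left[1 - \exp\left(-\frac{h^2}{a_0^2}\right)\right]
$$

In this form the curve approaches the sill $c_n + c_0$ asymptotically. At $h = \sqrt{3}\,a_0$ the bracketed term equals $1 - e^{-3} \approx 0.95$, so that distance is often quoted as the effective range.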

## Using the Estimated Variogram in an Adjusted Analysis
The previous steps have:<br>
1. Diagnosed the presence and pattern of spatial variability.<br>
2. Identified and estimated a model to describe the variability.<br>

The last step is to utilize this information in a statistical analysis as outlined at the beginning of Section \@ref(background). To do this, the linear mixed model procedure **PROC MIXED** will be used to incorporate the spatial variability into the linear model variance-covariance matrix, assuming the variogram model identified above.<br>

As an initial step, for comparison purposes, the unadjusted RCB model is run first and the means saved in data set "NIN_RCBD_means".<br><br>

### Unadjusted RCBD Model

```{r SAS6_8, engine="sashtml", collectcode=TRUE, engine.path=saspath, include=TRUE, engine.opts=sasopts, comment=""}

/** Fit unadjusted RCBD model **/
ODS html GPATH = ".\img\" ;

proc mixed data=alliance ;
	class entry rep;
	model yield = entry ;
	random rep;
	lsmeans entry/cl;
	ods output LSMeans=NIN_RCBD_means;
	title1 'NIN data: RCBD';
run;
```

The F-statistic and accompanying large p-value for the variety effect ('entry') indicate no evidence of differences between cultivars. This is surprising for a variety trial. The fit statistics (e.g. AIC) are helpful for model comparison, although on their own, these metrics do not have much meaning. The AIC value for this standard RCB analysis is 1221.7; this number will be compared to those of the spatially-adjusted models (in the case of AIC, smaller numbers indicate a better fitting model). Another point of comparison is the standard error of the means: 3.8557.<br>

The next step is to try a spatially-adjusted model. This is done by adding a **repeated** statement specifying the spatial model form (Gaussian, **type=sp(gau)**) and the variables related to spatial positions in the study (row, col). The **local** option tells SAS to use a nugget parameter in the Gaussian variogram model. The second additional statement is **parms**. This gives SAS a starting place for estimating the range, sill, and nugget parameters. The values given here are taken from the **PROC VARIOGRAM** output above. While this step can be omitted, it is recommended not to do so because, on its own, **PROC MIXED** can often have trouble narrowing in on reasonable parameter estimates. One reason for the preceding variogram steps was to obtain these parameter values so **PROC MIXED** has a good starting place for estimation. As before, the estimated means (adjusted now) are saved, this time in data set "NIN_Spatial_means".<br>

### RCB Model with Spatial Covariance

```{r SAS6_9, engine="sashtml", collectcode=TRUE, engine.path=saspath, include=TRUE, engine.opts=sasopts, comment=""}

/** Fit Gaussian adjusted model **/
/** Parms statement order: Range, Sill, Nugget **/
/** Model option "local" forces nugget into model **/
ODS html GPATH = ".\img\" ;

proc mixed data=alliance maxiter=150;
	class entry;
	model yield = entry /ddfm=kr;
	repeated/subject=intercept type=sp(gau) (Row Col) local;
	parms (11) (22) (19);
	lsmeans entry/cl;
	ods output LSMeans=NIN_Spatial_means;
	title1 'NIN data: Gaussian Spatial Adjustment';
run;

```

From this output, we see the AIC value is now 1073.1. This is substantially lower than the AIC=1221.7 from the unadjusted model, indicating a better model fit. Additionally, the entry effect in the model now has a low p-value (p=0.0024), giving evidence for differences among cultivars where we did not see them previously. Lastly, the standard errors for the means are now lower than those of the unadjusted model, indicating an increased precision in mean estimation.<br>
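
The fit statistics can also be collected into one table rather than read off two separate outputs. The sketch below refits both models and captures the `FitStatistics` ODS table from each; the data set names `fit_rcbd`, `fit_spatial`, and `fit_compare` and the variable `fit_model` are introduced here for illustration. Because both models share the same fixed effects, their REML-based criteria are comparable.

```sas
/* Sketch: side-by-side fit statistics for the two models (illustrative) */
proc mixed data=alliance;
   class entry rep;
   model yield = entry;
   random rep;
   ods output FitStatistics=fit_rcbd;
run;

proc mixed data=alliance maxiter=150;
   class entry;
   model yield = entry / ddfm=kr;
   repeated / subject=intercept type=sp(gau) (Row Col) local;
   parms (11) (22) (19);   /* same starting values as in the model above */
   ods output FitStatistics=fit_spatial;
run;

data fit_compare;
   length fit_model $ 20;
   set fit_rcbd(in=a) fit_spatial(in=b);
   if a then fit_model = 'RCBD';
   else fit_model = 'Gaussian spatial';
run;

/* Rows whose description mentions AIC (this also shows AICC) */
proc print data=fit_compare noobs;
   where index(Descr, 'AIC') > 0;
   var fit_model Descr Value;
run;
```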

### Other Spatial Adjustments
#### RCB Model with Row and Column Covariates

Spatial variability might also be addressed by including covariates in the model for row and/or column trends. The following script fits this model where rows and columns enter the model as continuous effects (not in the class statement).

```{r SAS6_10, engine="sashtml", collectcode=TRUE, engine.path=saspath, include=TRUE, engine.opts=sasopts, comment=""}

/** Row column model **/
ods html close;

proc mixed data=alliance ;
	class entry rep;
	model yield = entry  row col/ddfm=kr;
	random rep;
	lsmeans entry/cl;
	ods output LSMeans=NIN_row_col_means;
	title1 'NIN data: RCBD with Row and Column Covariates';
run;
ods html;

```

#### RCB with Spline Adjustment

Polynomial splines are an additional method for spatial adjustment and represent a more non-parametric method that does not rely on estimation or modeling of variograms. Instead, it uses the raw data and residuals to fit a surface to the spatial data and adjust the variance-covariance matrix accordingly. **PROC MIXED**, however, does not have an option for this, so the code below moves to **PROC GLIMMIX**, the generalized linear mixed model procedure in SAS. The syntax for the two procedures is very similar, so, in this example adapted from [@stroup2013], only a few changes are required. First, a random statement is added for rows and columns with a "type=rsmooth" option. This fits a spline to the residuals over the row and column axes of the study area. Second, an "effect" statement is given defining a new covariate, sp_r, which is a splined surface for raw yields over the study area. When entered into the model as a continuous covariate, this centers the data such that the resulting model residuals will have a mean of zero.

```{r SAS6_11, engine="sashtml", collectcode=TRUE, engine.path=saspath, include=TRUE, engine.opts=sasopts, comment=""}

ods html close;
/** Fit  RCBD model with Spline **/

proc glimmix data=alliance ;
	class entry rep;
	effect sp_r = spline(row col);
	model yield = entry  sp_r/ddfm=kr;
	random row col/type=rsmooth;
	lsmeans entry/cl;
	ods output LSMeans=NIN_smooth_means;
	title1 'NIN data: RCBD with Spline Adjustment';
run;

ods html;

```
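
The means saved from these two alternative adjustments can be lined up by variety for inspection. This is a minimal sketch, assuming the `LSMeans` tables were saved as shown in the two steps above; the output data sets (`rc_means`, `sm_means`, `other_means`) and the renamed variables are introduced here for illustration.

```sas
/* Sketch: line up row/column and spline-adjusted means by entry */
proc sort data=NIN_row_col_means
          out=rc_means(keep=entry estimate stderr
                       rename=(estimate=RC_est stderr=RC_se));
   by entry;
run;

proc sort data=NIN_smooth_means
          out=sm_means(keep=entry estimate stderr
                       rename=(estimate=Spl_est stderr=Spl_se));
   by entry;
run;

data other_means;
   merge rc_means sm_means;
   by entry;
run;

proc print data=other_means(obs=6);
run;
```

Writing to new data sets with `out=` leaves the original saved means untouched for any later steps.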


## Compare Estimated Means

Presence of spatial variability can also bias the mean estimates. In this data, this would result in the means ranking incorrectly relative to one another under the unadjusted model. The steps below combine the saved means from the unadjusted and spatially adjusted models above so that we can see how the ranking changes.<br><br>

```{r SAS6_12, engine="sashtml", engine.path=saspath, include=TRUE, engine.opts=sasopts, comment=""}

data NIN_RCBD_means (drop=tvalue probt alpha estimate stderr lower upper df);
	set NIN_RCBD_means;
	RCB_est = estimate;
	RCB_se = stderr;
run;

data NIN_Spatial_means (drop=tvalue probt alpha estimate stderr lower upper df);
	set NIN_Spatial_means;
	Sp_est = estimate;
	Sp_se = stderr;
run;

proc sort data=NIN_RCBD_means;
	by entry;
run;

proc sort data=NIN_Spatial_means;
	by entry;
run;

data compare;
	merge NIN_RCBD_means NIN_Spatial_means;
	by entry;
run;

proc rank data=compare out=compare descending;
	var RCB_est Sp_est;
	ranks RCB_Rank Sp_Rank;
run;

proc sort data=compare;
	by  Sp_rank;
run;

proc print data=compare(obs=15);
	var entry rcb_est Sp_est rcb_se sp_se rcb_rank sp_rank;
run;
```

This comparison of the first 15 means shows some clear changes in rank. In his discussion of this data, Stroup [@stroup2013] writes:<br>

> ... Buckskin was a variety known to be a high-yielding benchmark; its mediocre mean yield in the RCB analysis despite being observed in the field outperforming all varieties in the vicinity was one symptom that the RCB analysis was giving nonsense results.
>
> --- Stroup (2013)

This high expectation for Buckskin is realized in the adjusted analysis. Other varieties changed markedly as well: NE87612, which ranked near the bottom in the RCB analysis, ranked fourth in the adjusted analysis. Of the top 10 varieties identified in the adjusted analysis, only 6 are found in the top 10 of the unadjusted RCB analysis. These changes in rank illustrate why consideration of spatial variability in agricultural trials is important.
<br><br>
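
The degree of agreement between the two rankings can also be summarized numerically. The sketch below assumes the `compare` data set built in the comparison step; it reports the Spearman rank correlation between the two sets of estimated means and counts the varieties that fall in the top 10 under both analyses. The column name `n_shared_top10` is introduced here for illustration.

```sas
/* Sketch: summarize agreement between unadjusted and adjusted rankings */
proc corr data=compare spearman nosimple;
   var RCB_est Sp_est;
run;

proc sql;
   select count(*) as n_shared_top10
   from compare
   where RCB_Rank <= 10 and Sp_Rank <= 10;
quit;
```

A rank correlation well below 1 would be consistent with the reshuffling described above.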