Introduction to GDAL

The Geospatial Data Abstraction Library (GDAL, www.gdal.org) is an open-source library for translating geospatial data between formats. It is the primary intermediary by which open-source data analysis and GIS software interacts with ArcGIS datasets. For example, one can translate ArcGIS binary grids into GRASS rasters or import them as a SpatialGridDataFrame object in R. GDAL proper supports raster data types, but it has been effectively merged with another library, OGR, which supports vector data conversion.

There are some file formats that are not directly translatable by GDAL, notably ESRI proprietary Smart Data Compression (SDC) files. The GDAL website provides a list of supported vector formats and raster formats. ESRI binary grids, coverages, and personal geodatabases can be read but not written.

Since GDAL is a library, it is meant for developers who write code referencing the library, rather than for end users who actually translate datasets. The library API is in C, but there are bindings for R, Perl, Python, VB6, Ruby, Java, and C#. The GDAL package does come with a few command-line utilities for working with geospatial data: the GDAL Utilities. Among those I commonly use:

  • gdalinfo — report information about a raster
  • gdal_translate — convert rasters between formats
  • gdal_merge.py — merge tiled rasters into one
  • gdal2tiles.py — make a Google Maps- or Google Earth-compatible set of rasters
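Typical invocations of those utilities look like the following (the file names here are only placeholders):

```shell
# Report a raster's format, projection, extent, and band metadata
gdalinfo elevation.img

# Convert an ArcGIS binary grid to GeoTIFF
gdal_translate -of GTiff elevation elevation.tif

# Mosaic a set of tiles into a single raster
gdal_merge.py -o mosaic.tif tile_1.tif tile_2.tif tile_3.tif

# Build a Google Maps/Earth-compatible tile pyramid
gdal2tiles.py mosaic.tif
```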

Binary distributions of GDAL for most platforms are available here, and can be installed without using the command line.

In the next post, we'll look at the R bindings for GDAL in the RGDAL package built by Tim Keitt at the University of Texas, and I will show some examples of working with raster datasets in R.



Converting R contingency tables to data frames

A contingency table presents the joint density of one or more categorical variables. Each entry in a contingency table is a count of the number of times a particular set of factor levels occurs in the dataset. For example, consider a list of plant species where each species is assigned a relative seed size (small, medium, or large) and a growth form (tree, shrub, or herb).

seed.sizes <- c("small", "medium", "large")
growth.forms <- c("tree", "shrub", "herb")
species.traits <- data.frame(
  seed.size = seed.sizes[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3)],
  growth.form = growth.forms[c(3, 3, 2, 2, 1, 2, 2, 3, 1, 1, 1, 1)]
)

A contingency table will tell us how many times each combination of seed.size and growth.form occurs.

tbl <- table(species.traits)

The output contingency table is of class table. The behaviour of these objects is not quite like a data frame; in fact, trying to convert one to a data frame gives a non-intuitive result.


Coercion of the table into a data frame puts each factor of the contingency table into its own column along with the frequency, rather than keeping the structure of the original table object. If we want to turn the table into a data frame that keeps the original structure, we use as.data.frame.matrix. This function is not well documented in R, and this is probably the only situation in which it would be used. But it works.
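To see the difference concretely, here is the example run end to end; as.data.frame gives the long format, while as.data.frame.matrix preserves the rows-by-columns layout of the table:

```r
seed.sizes <- c("small", "medium", "large")
growth.forms <- c("tree", "shrub", "herb")
species.traits <- data.frame(
  seed.size = seed.sizes[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3)],
  growth.form = growth.forms[c(3, 3, 2, 2, 1, 2, 2, 3, 1, 1, 1, 1)]
)
tbl <- table(species.traits)

## Long format: one row per combination of factor levels, plus a Freq column
long <- as.data.frame(tbl)
dim(long)  # 9 rows, 3 columns (seed.size, growth.form, Freq)

## Wide format: keeps the rows-by-columns layout of the table itself
wide <- as.data.frame.matrix(tbl)
dim(wide)  # 3 rows (seed sizes) by 3 columns (growth forms)
```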




Changing the font in Carbon Emacs

My current monospaced font of choice is DejaVu Sans Mono. It is pretty easy to change the default font in Carbon Emacs. It is not so easy to change the default font in Windows- or Linux-based Emacs 22, which is the current release version; it should become relatively straightforward in Emacs 23 via Xft font support. But, for you Mac folks out there…

Installing a Font on Mac OS X

  1. Download the font (Mac will accept almost any format).
  2. Drag the font file into /Library/Fonts (or ~/Library/Fonts to install it for your user only)

Figure out the long name of your font face in Carbon Emacs

  • M-x mac-font-panel-mode. Then pick your font
  • M-x describe-font and copy the long name.

Set the default font

There are a couple of ways to do this.


  • M-x customize-face RET default RET
  • Fill in the fields of the font with the long name.

  • Or set it directly in your .emacs file:

(set-default-font "-apple-inconsolata-medium-r-normal--11-110-72-72-m-110-iso10646-1")

Monospaced fonts for the uninitiated

Programmers tend to use different fonts for code than what we see in books and on the web. Word processors typically use proportional fonts, where the horizontal space allowed for a narrow letter like "l" is less than the space allowed for a wide letter like "W". In code, we want the columns of each line to be aligned, so programmers tend to use monospaced fonts, where each letter takes up the same amount of horizontal space. The most commonly known fixed-width font is Courier. Another common fixed-width font for Windows and Microsoft products is Consolas. The Code Project has an exhaustive list of monospaced fonts.

One of the key attributes to look for in a monospaced font is the distinction between the number one and the lowercase letter "l", and between the number zero and the uppercase letter "O". When you are trying to read variable names that mix numbers and letters, you will be thankful for a font that distinguishes them.



Extracting Raster Values from Points in R and GRASS

A common task in GIS analysis is to extract the value of a remotely sensed environmental variable at a point location. For instance, we may wish to extract the elevation of a field plot from a digital elevation model. The elevation data are a raster (i.e., grid), and the plots are a point shapefile (or a simple text file of X,Y locations). The command for doing this in ArcGIS is ExtractValuesToPoints, available in the Spatial Analyst extension. Situations arise where ArcGIS is not the most efficient way of extracting these values, so here I provide a brief overview of how to extract raster values to points in R and GRASS.

Extract Values to Points in R

This is strikingly easy in R. My work usually requires more statistical sophistication than is available in ArcGIS, so I have completely switched to doing the extraction in R. I know I am going to end up in R eventually, and it is easier to automate the extraction there than to write a long Python script in ArcGIS.

Data required

For the purposes of this exercise, all the data must have the same spatial projection. You need:

  • an ESRI ASCII grid. This could also be an ArcGIS binary grid if you know how to use RGDAL; that, perhaps, will be another post.
  • a point shapefile.

You also need the maptools and sp packages.

The Code
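A minimal sketch of the extraction, assuming the grid is in elev.asc and the plots in plots.shp (placeholder names), might look like:

```r
library(sp)
library(maptools)

## Read the ASCII grid and the point shapefile (placeholder file names)
elev <- readAsciiGrid("elev.asc")      # a SpatialGridDataFrame
plots <- readShapePoints("plots.shp")  # a SpatialPointsDataFrame

## overlay() pulls the grid value under each point;
## the column is named after the source file
plots$elev <- overlay(elev, plots)$elev.asc

## Write the points back out with the new column attached
write.csv(as.data.frame(plots), "plots_with_elev.csv", row.names = FALSE)
```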

That is it. Fast, and easy.

Extracting Values in GRASS

Extracting raster values in GRASS is somewhat faster than in R, but it takes a little more planning, in that you have to explicitly create the column that the raster values will go into.

Data Required

  • gr : A GRASS grid
  • pt : A GRASS point dataset

The Code

The basic flow is that you create an empty column in the point dataset with the right data type (varchar(10) for strings of length 10, double precision for floating-point numbers, int for integers), then fill the column with the raster values.
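With the names above (a raster gr and a point map pt, plus a hypothetical column named elev), the GRASS session might look like:

```shell
# Create an empty column with the right type for the raster values
v.db.addcol map=pt columns="elev double precision"

# Fill the column where the raster intersects the points
v.what.rast vector=pt raster=gr column=elev

# Check the result
v.db.select map=pt
```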



Power Analysis for mixed-effect models in R

The power of a statistical test is the probability that the null hypothesis will be rejected when the alternative hypothesis is true. In lay terms, power is your ability to detect an effect that is actually present in the data you collect. The most frequent motivation for estimating the power of a study is to figure out what sample size will be needed to observe a treatment effect. Given a set of pilot data or some other estimate of the variation in a sample, we can use power analysis to inform how much additional data we should collect.

I recently did a power analysis on a set of pilot data for a long-term monitoring study of the US National Park Service. I thought I would share some of the things I learned and a bit of R code for others that might need to do something like this. If you aren't into power analysis, the code below may still be useful as examples of how to use the error handling functions in R (withCallingHandlers, withRestarts), parallel programming using the snow package, and linear mixed effect regression using nlme. If you have any suggestions for improvement or if I got something wrong on the analysis, I'd love to hear from you.

1 The Study

The study system was cobblebars along the Cumberland River in Big South Fork National Park (Kentucky and Tennessee, United States). Cobblebars are typically dominated by grassy vegetation that includes disjunct tall-grass prairie species. It is hypothesized that woody species will encroach onto cobblebars if they are not seasonally scoured by floods. The purpose of the NPS sampling was to observe changes in woody cover through time. The study design consisted of two stages of clustering: the first being cobblebars, and the second being transects within cobblebars. The response variable was the percentage of the transect that was woody vegetation. Because of the clustered design, the inferential model for this study has mixed effects. I used a simple varying-intercept model:
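One plausible way to write this varying-intercept model, consistent with the variance components estimated from the pilot data below (cobblebar, transect within cobblebar, and residual), is:

```latex
\begin{align*}
y_i &\sim N(\alpha_{jk[i]} + \beta\,\mathrm{year}_i,\ \sigma^2_y) \\
\alpha_{jk} &\sim N(\mu_\alpha + \gamma_j,\ \sigma^2_\alpha) \\
\gamma_j &\sim N(0,\ \sigma^2_\gamma)
\end{align*}
```

Here the transect-level intercepts α_jk vary around a cobblebar effect γ_j, which in turn varies around the grand mean μ_α.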

where y is the percent of each transect in woody vegetation sampled n times within J cobblebars, each with K transects. The parameter of inference for the purpose of monitoring change in woody vegetation through time is β, the rate at which cover changes as a function of time. α, γ, σ2γ, and σ2y are hyper-parameters that describe the hierarchical variance structure inherent in the clustered sampling design.

Below is the function code I used to regress the pilot data. Note that with this function you can log- or logit-transform the response variable (percentage of transect that is woody). I put this in because the responses are proportions on (0,1) and the errors should technically follow a beta distribution; log and logit transforms with Gaussian errors can approximate this. I ran all the models with transformed and untransformed responses, and the results did not vary at all, so I stuck with untransformed responses:

library(nlme) # for lme and lmeControl

Model <- function(x = cobblebars,
  type = c("normal","log","logit")){
  ## Transforms
  if (type[1] == "log")
    x$prop.woody <- log(x$prop.woody)
  else if (type[1] == "logit")
    x$prop.woody <- log(x$prop.woody / (1 - x$prop.woody))

  mod <- lme(prop.woody ~ year,
             data = x,
             random = ~ 1 | cobblebar/transect,
             na.action = na.omit,
             control = lmeControl(opt = "optim",
               maxIter = 800, msMaxIter = 800))
  mod$type <- type[1]
  return(mod)
}


Here are the results from this regression of the pilot data:

Linear mixed-effects model fit by REML
 Data: x 
        AIC       BIC   logLik
  -134.4319 -124.1297 72.21595

Random effects:
 Formula: ~1 | cobblebar
        (Intercept)
StdDev:  0.03668416

 Formula: ~1 | transect %in% cobblebar
        (Intercept)   Residual
StdDev:  0.02625062 0.05663784

Fixed effects: prop.woody ~ year 
                  Value  Std.Error DF   t-value p-value
(Intercept)  0.12966667 0.01881983 29  6.889896  0.0000
year        -0.00704598 0.01462383 29 -0.481815  0.6336
 Correlation: 
     (Intr)
year -0.389

Number of Observations: 60
Number of Groups: 
              cobblebar transect %in% cobblebar 
                      6                      30 

2 We don't learn about power analysis for complex models

When I decided upon the inferential model, the first thing that occurred to me was that I had never learned, in any statistics course I had taken, how to do a power analysis on a multi-level model. I've taken more statistics courses than I'd like to count and taught my own statistics courses for undergrads and graduate students, and the only exposure to power analysis I had was in the context of simple t-tests or ANOVA. You learn about it in your first two statistics courses, and then it rarely, if ever, comes up again until you actually need it.

I was, however, able to find a great resource on power analysis from a Bayesian perspective in the excellent book "Data Analysis Using Regression and Multilevel/Hierarchical Models" by Andrew Gelman and Jennifer Hill. Andrew Gelman has thought and debated about power analysis, and you can get more from his blog. The approach in the book is simulation-based, and I have adopted it for this analysis.

3 Analysis Procedure

For the current analysis we needed to know three things: effect size, sample size, and estimates of population variance. We set effect size beforehand; in this context, the parameter of interest is the rate of change in woody cover through time, β, and the effect size is simply how large or small a value of β you want to be able to distinguish with a regression. Sample size is also set a priori: in the analysis we want to vary sample size by varying the number of cobblebars, the number of transects per cobblebar, or the number of years the study is conducted.

The population variance cannot be known precisely, and this is where the pilot data come in. By regressing the pilot data with the model above, we can obtain estimates of all the different components of the variance (cobblebars, transects within cobblebars, and the residual variance). Below is the R function that returns all the hyperparameters (and β) from the regression:

GetHyperparam <- function(x, b = NULL){
  ## Get the hyperparameters from the mixed-effect model
  fe <- fixef(x)
  if (is.null(b))
    b <- fe[2] # use the data effect size if not supplied

  mu.a <- fe[1]

  vc <- VarCorr(x)
  sigma.y <- as.numeric(vc[5, 2]) # Residual StdDev
  sigma.a <- as.numeric(vc[2, 2]) # Cobblebar StdDev
  sigma.g <- as.numeric(vc[4, 2]) # Cobblebar:transect StdDev

  hp <- c(b, mu.a, sigma.y, sigma.a, sigma.g)
  names(hp) <- c("b", "mu.a", "sigma.y", "sigma.a", "sigma.g")
  return(hp)
}

To calculate power we need to regress the simulated data in the same way we did the pilot data, and check for a significant β. Since optimization is done using numerical methods, there is always a chance that the optimization will fail; so, we make sure the regression on the fake data catches and recovers from all errors. The recovery strategy is simply to try the regression on a new set of fake data. This function is a pretty good example of using the R error-handling functions withCallingHandlers and withRestarts.

fakeModWithRestarts <- function(m.o, n = 100, ...){
  ## Refit the model to fake data, retrying on errors up to n times
  withCallingHandlers({
    i <- 0
    mod <- NULL
    while (i < n & is.null(mod)){
      mod <- withRestarts({
        f <- fake(m.orig = m.o, transform = F, ...)
        return(update(m.o, data = f))
      },
      rs = function(){
        i <<- i + 1
        return(NULL)
      })
    }
    if (is.null(mod)) warning("ExceededIterations")
    return(mod)
  },
  error = function(e){
    invokeRestart("rs")
  },
  warning = function(w){
    if (w$message == "ExceededIterations")
      cat("\n", w$message, "\n")
    else
      invokeRestart("rs")
  })
}

To calculate the power of a particular design we run fakeModWithRestarts 1000 times and look at the proportion of significant β values:

dt.power <- function (m, n.sims = 1000, alpha = 0.05, ...){
  ## Calculate power for a particular sampling design
  signif <- rep(NA, n.sims)
  for (i in 1:n.sims){
    lme.power <- fakeModWithRestarts(m.o = m, ...)
    if (!is.null(lme.power))
      signif[i] <- summary(lme.power)$tTable[2, 5] < alpha
  }
  power <- mean(signif, na.rm = T)
  return(power)
}

Finally, we want to perform this analysis on many different sampling designs. In my case I did all combinations of a set of effect sizes, numbers of cobblebars, transects per cobblebar, and years. So, I generated the appropriate designs:

factoredDesign <- function(Elevs = 0.2/c(1,5,10,20),
                           Nlevs = seq(2, 10, by = 2),
                           Jlevs = seq(4, 10, by = 2),
                           Klevs = c(3, 5, 7), ...){
  ## Generates factored series of sampling designs for simulation
  ## of data that follow a particular model.
  ## Inputs:
  ##   Elevs - vector of effect sizes for the slope parameter.
  ##   Nlevs - vector of number of years to sample.
  ##   Jlevs - vector of number of cobblebars to sample.
  ##   Klevs - vector of number of transects to sample.
  ## Results:
  ##   Data frame with where columns are the factors and
  ##   rows are the designs.

  # Level lengths
  lE <- length(Elevs)
  lN <- length(Nlevs)
  lJ <- length(Jlevs)
  lK <- length(Klevs)

  # Generate repeated vectors for each factor
  E <- rep(Elevs, each = lN*lJ*lK)
  N <- rep(rep(Nlevs, each = lJ*lK), times = lE)
  J <- rep(rep(Jlevs, each = lK), times = lE*lN)
  K <- rep(Klevs, times = lE*lN*lJ)
  return(data.frame(E, N, J, K))
}

Once we know our effect sizes, the different sample sizes we want, and the estimates of population variance, we can generate simulated datasets that are similar to the pilot data. To calculate power we simply simulate a large number of datasets and calculate the proportion of slopes β that are significantly different from zero (p-value < 0.05). This procedure is repeated for all the effect sizes and sample sizes of interest. Here is the code for generating a simulated dataset; it also does the work of back-transforming the response variable if necessary.

fake <- function(N = 2, J = 6, K = 5, b = NULL, m.orig = mod,
                 transform = TRUE, ...){
  ## Simulated data for power analysis
  ## N = Number of years
  ## J = Number of cobblebars
  ## K = Number of transects within cobblebars
  year <- rep(0:(N-1), each = J*K)
  cobblebar <- factor(rep(rep(1:J, each = K), times = N))
  transect <- factor(rep(1:K, times = N*J))

  ## Simulated parameters, drawn around the pilot-data estimates
  hp <- GetHyperparam(x = m.orig)
  if (is.null(b))
    b <- hp['b'] # use the data effect size if not supplied
  g <- rnorm(J*K, 0, hp['sigma.g'])
  a <- rnorm(J*K, hp['mu.a'] + g, hp['sigma.a'])

  ## Simulated responses (back-transformed if necessary)
  eta <- rnorm(J*K*N, a + b * year, hp['sigma.y'])
  if (transform){
    if (m.orig$type == "normal"){
      y <- eta
      y[y > 1] <- 1 # Fix any boundary problems.
      y[y < 0] <- 0
    }
    else if (m.orig$type == "log"){
      y <- exp(eta)
      y[y > 1] <- 1
    }
    else if (m.orig$type == "logit")
      y <- exp(eta) / (1 + exp(eta))
  }
  else
    y <- eta
  return(data.frame(prop.woody = y, year, transect, cobblebar))
}

Then I performed the power calculations on each of these designs. This could take a long time, so I set this procedure up to use parallel processing if needed. Note that I had to re-source the file with all the necessary functions on each processor.

powerAnalysis <- function(parallel = T, ...){
  ## Full power analysis over all designs
  ## Parallel setup (requires the snow package)
  if (parallel){
    cl <- makeCluster(7, type = "SOCK")
    clusterEvalQ(cl, source("cobblebars2.r"))
  }

  ## The simulations
  dat <- factoredDesign(...)

  if (parallel){
    dat$power <- parRapply(cl, dat, function(x, ...){
      dt.power(N = x[2], J = x[3], K = x[4], b = x[1], ...)
    }, ...)
    stopCluster(cl)
  } else {
    dat$power <- apply(dat, 1, function(x, ...){
      dt.power(N = x[2], J = x[3], K = x[4], b = x[1], ...)
    }, ...)
  }
  return(dat)
}

The output of the powerAnalysis function is a data frame with columns for the power and all the sample design settings. So, I wrote a custom plotting function for this data frame:

plotPower <- function(dt){
  ## requires the lattice package
  xyplot(power ~ N | J * K, data = dt, groups = E,
         panel = function(...){
           panel.xyplot(...)
           panel.abline(h = 0.8, lty = 2)
         },
         type = c("p", "l"),
         xlab = "sampling years",
         ylab = "power",
         strip = strip.custom(var.name = c("C", "T"),
           strip.levels = c(T, T)),
         auto.key = T)
}
Below is the figure for the cobblebar power analysis. I won't go into detail on what the results mean, since I am concerned here with illustrating the technique and the R code. Obviously, as the number of cobblebars, transects, and sampling years increases, so does power. And as the effect size increases, observing it with a test becomes easier.

Author: Todd Jobe <toddjobe@unc.edu>

Date: 2009-09-18 Fri

HTML generated by org-mode 6.30trans in emacs 22



Ecology Journals LaTeX Class

LaTeX is a document markup language used to create manuscripts with nice typesetting. Professional-looking documents can be created from simple text files. It is particularly useful for creating manuscripts that have math equations in them. Writing documents in LaTeX should allow authors to focus on the content of the manuscript rather than the formatting. All LaTeX documents are associated with a class file, which sets the formatting for the document; users simply associate a document with a class and the text is formatted appropriately.

Unfortunately, many academic journals have very specific formatting requirements not built into LaTeX. The journals of the Ecological Society of America do not have LaTeX class files appropriate for their major journals: Ecology, Ecological Applications, and Ecological Monographs (even though they will accept documents in LaTeX format). One user-created LaTeX class file for ecology is available here; however, that class formats a LaTeX file into a proof-style document, not a submission-style document.

I have recently been writing a submission for Ecological Applications in LaTeX, and I had to write my own class file for formatting the document according to the ESA guidelines (http://esapubs.org/esapubs/preparation.htm). I am sure that others have run into the same problem, so I am making my class file publicly available. Please consider this class to be in alpha. I have not yet written support for multiple author addresses, simply because I didn't need it for the current manuscript. Feel free to improve the class; just let me know about it, so I can use it as well.

To use this class in a LaTeX document, you can either install the class file into your LaTeX distribution or simply place the file in the same folder as your LaTeX document. Bibliography formatting with BibTeX requires that the natbib package be installed, along with the ecology.bst bibliography style file.
I have packaged the ecology.cls class along with an example latex file, the proof pdf, an example figure, and an example bibligraphy into a tarball available below. Download: ecologyLatexClass.tar.gz
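A skeleton document using the class might look like this (the title, author, and bibliography names are only placeholders):

```latex
\documentclass{ecology}
\usepackage{natbib}

\title{A Hypothetical Manuscript}
\author{R. Todd Jobe}

\begin{document}
\maketitle

Text with citations such as \citep{gelman2007}.

\bibliographystyle{ecology}
\bibliography{manuscript-refs}
\end{document}
```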



Add a column to a text file from a raster

Text files are great

I have mixed feelings about software interoperability. Sure, GRASS interfaces with R, and R can interface with Excel and Access if you have the ODBC drivers. But often users do not have the energy to learn the best techniques for moving data back and forth between programs. Further, there always seems to be some quirk with data that prevents the transfer from going as smoothly as the user's-manual examples do. Then, I remember the lowly text file. Whether you use comma-separated values (csv) files or tab-delimited files, every software platform on the planet can read text files of records (lines) and fields (values on a line separated by commas, tabs, or some other character). If you're trying to send data to someone else, you can be confident that your recipient can read them. It may not be the flashiest or fastest way to move data between people and systems, but it always works. Finally, when you use plain text files for data, you have many more options for working with the files: tools such as bash, awk, sed, Perl, and any other programming language you can think of can be brought to bear on data analysis problems.

Attaching raster data to points in GIS

Switch gears for a moment from a common file format to a common analysis task. For researchers who work with spatial point patterns (like plots), a regular step in data analysis is to pull data from raster layers where they intersect the point data and add it as columns to the point data. Informally, these are known as "drill-down" operations: you have a stack of maps and you want to drill down through them vertically at point locations, grab the data, and add it to the points. This is a relatively simple task in ArcGIS if you use Hawth's Tools. It's also pretty easy in GRASS, though it involves a couple of steps. The function v.what.rast will put raster values into a column of a point dataset. The column must already exist in the dataset, which can be arranged with the command v.db.addcol; v.what.rast then fills the column with values where the raster intersects the points.

Adding raster values to text files

Now, back to text files. I recently worked with a graduate student who wanted elevation values from a lidar-derived digital elevation model (DEM) for 4 samples totalling about 1100 plots. She is not a GRASS user, so she sent me the locations of all the plots, along with an identifier, in csv format. I wanted to send her back the same csv files with elevation simply added as a field. I wrote a bash script to accomplish this quickly and repeatably for her files. In this script, all the GRASS functions are hidden from the user: you simply input a csv file and the raster whose values you wish to add, and a csv file is output with that column added. The script requires the awk script that I posted previously, which recognizes the types of variables present in the input csv. I have included a portion of the full script below. GRASS has a very particular code style that it uses for running scripts natively (see g.parser for details), which makes the script very long, so I've cut out the not-so-important parts. If you would like the full file you can get it here.


# MODULE:       v.ascii.addrastcol
# AUTHOR(S):    R. Todd Jobe 
# PURPOSE:      Add a raster value field to a csv of points
# COPYRIGHT:    (C) 2009 by R. Todd Jobe
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  GNU General Public License for more details.
# . . . Grass header information removed
# . . . The cut code creates variables GIS_OPT_* and GIS_FLAG_*
# . . . Get the full code at:
# . . . http://www.unc.edu/~toddjobe/src/v.ascii.addrastcol
# Need space-delimited arguments for looping

# test for input raster map
eval `g.findfile element=cell file="$GIS_OPT_RAST"`
if [ ! "$file" ] ; then
   g.message -e "Raster map <$GIS_OPT_RAST> not found in mapset search path"
   exit 1
fi

for i in $GIS_OPT_INPUT; do
   # Check that input files exist
   if [ ! -f "$i" ] ; then
 echo "file $i does not exist" >&2
 exit 1
   fi

   # Create header line if necessary and get variable types
   if [ $GIS_FLAG_H -eq 1 ]; then
 skip=1
   else
 skip=0
   fi

   # Intermediate variables
   map=`basename $i .csv`
   tmp=`tempfile -s ".csv"`
   of="${map}_isect.csv"
   col=`typeVars.awk -v h=$GIS_FLAG_H -F, $i`

   # Get the raster type
   if `r.info -t $GIS_OPT_RAST | grep -q "FCELL"`; then
 rtype="double precision"
   else
 rtype="int"
   fi
   # Grass-specific calls
   v.in.ascii --overwrite input="$i" output="$map" fs="$GIS_OPT_FS" \
 skip=$skip column="$col" x="$GIS_OPT_X" y="$GIS_OPT_Y"
   v.db.addcol map="$map" \
 columns="$GIS_OPT_RAST $rtype"
   v.what.rast vector="$map" raster="$GIS_OPT_RAST" \
 column="$GIS_OPT_RAST"
   v.out.ascii input="$map" fs="," \
 columns="$GIS_OPT_RAST" | \
   awk -F, -v meters="$GIS_FLAG_M" -v head="$GIS_FLAG_H" \
 -v rast="$GIS_OPT_RAST" '
   BEGIN{ if(head == 1) print rast;}
   {
        # Convert Ft. elevation to meters if necessary
        if($4!="" && meters==1) $4=$4 * 0.30480060
        print $4
   }' >"$tmp"
   # Join raster values to input and write to output
   paste -d$GIS_OPT_FS "$i" "$tmp" >"$of"

   # Cleanup
   rm "$tmp"
   if [ $GIS_FLAG_R -eq 1 ] ; then
 g.remove -f --quiet vect=$map
   fi
done

To see the usage for the full script, you can simply start GRASS and from the command line type v.ascii.addrastcol --help. Here's a short example for a random set of 25 points on Ft. Bragg, North Carolina.
GRASS 6.4.0RC3 (ftBragg):~ > head locs.csv
GRASS 6.4.0RC3 (ftBragg):~ > v.ascii.addrastcol --quiet -hr \
input=locs.csv rast=iLandcover x=2 y=3
GRASS 6.4.0RC3 (ftBragg):~ > head locs_isect.csv
