Computational Ecology: Todd Jobe's thoughts about computers and ecology. R code, ArcGIS and GRASS code, Bash scripts, and Matlab code.

<h2>Introduction to GDAL (2010-08-31)</h2>
<p>The Geospatial Data Abstraction Library (GDAL, <a href="http://www.gdal.org">www.gdal.org</a>) is an
open-source library for translating geospatial data between
different formats. It is the primary intermediary through which
open-source data analysis and GIS software interacts with
ArcGIS datasets. For example, one can translate ArcGIS binary grids
into GRASS rasters or import them as <a href="http://www.dpi.inpe.br/gilberto/tutorials/software/R-contrib/sp/html/SpatialGridDataFrame-class.html">SpatialGridDataFrame</a> objects
in R. GDAL proper supports raster data; it has
been effectively merged with a companion library, OGR, which handles
vector data conversion.
</p>
<p>
There are some file formats that GDAL cannot translate directly,
notably ESRI's proprietary Smart Data Compression (SDC) format.
The GDAL <a href="http://www.gdal.org">website</a> provides a list of supported <a href="http://www.gdal.org/ogr/ogr_formats.html">vector formats</a> and
<a href="http://www.gdal.org/formats_list.html">raster formats</a>. ESRI binary grids, coverages, and personal geodatabases
can be read but not written.
</p>
<p>
Since GDAL is a library, it is aimed at developers who write code
against it, rather than at end users who actually
translate datasets. The library API is in C, but there are bindings
for R, Perl, Python, VB6, Ruby, Java, and C#. The GDAL package does
come with a few command-line utilities for working with geospatial
data: <a href="http://www.gdal.org/gdal_utilities.html">GDAL Utilities</a>. Among those I commonly use:
</p><dl>
<dt>gdalinfo</dt><dd>
Raster information
</dd>
<dt>gdal_translate</dt><dd>
Convert rasters between formats
</dd>
<dt>gdal_merge.py</dt><dd>
Merge tiled rasters into one
</dd>
<dt>gdal2tiles.py</dt><dd>
Makes a Google Maps- or Google Earth-compatible
set of rasters
</dd>
</dl>
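<p>These utilities can also be scripted, for example from R via <code>system2()</code>. Here is a minimal sketch; the file names <code>elevation.asc</code> and <code>elevation.tif</code> are made-up examples, and the calls only run if GDAL is on your path and the input file exists.</p>

```r
## Call the GDAL command-line utilities from R. File names are
## illustrative; the calls are skipped if GDAL or the input is missing.
src <- "elevation.asc"   # an ESRI ASCII grid (hypothetical)
dst <- "elevation.tif"   # GeoTIFF output (hypothetical)

if (nzchar(Sys.which("gdal_translate")) && file.exists(src)) {
  system2("gdalinfo", src)                                # projection, extent, bands
  system2("gdal_translate", c("-of", "GTiff", src, dst))  # convert the format
}
```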
<p>Binary distributions of GDAL for most platforms are
available <a href="http://trac.osgeo.org/gdal/wiki/DownloadingGdalBinaries">here</a>, and they can be installed without using the command line.
</p>
<p>
In the next post, we'll look at the R bindings for GDAL in the <code>rgdal</code>
package, built by Tim Keitt at the University of Texas, and I will
show some examples of working with raster datasets in R.
</p>

<h2>Converting R contingency tables to data frames (2010-08-11)</h2>
<p>A contingency table presents the joint distribution of one or more
categorical variables. Each entry in a contingency table is a count
of the number of times a particular combination of factor levels occurs in
the dataset. For example, consider a list of plant species where
each species is assigned a relative seed size (small, medium, or
large) and a growth form (tree, shrub, or herb).
</p>
<pre class="src src-R">seed.sizes <span style="color: #5f9ea0;"><-</span> c(<span style="color: #bc8f8f;">"small"</span>, <span style="color: #bc8f8f;">"medium"</span>, <span style="color: #bc8f8f;">"large"</span>)
growth.forms <span style="color: #5f9ea0;"><-</span> c(<span style="color: #bc8f8f;">"tree"</span>, <span style="color: #bc8f8f;">"shrub"</span>, <span style="color: #bc8f8f;">"herb"</span>)
species.traits <span style="color: #5f9ea0;"><-</span> data.frame(
seed.size = seed.sizes[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3)],
growth.form = growth.forms[c(3, 3, 2, 2, 1, 2, 2, 3, 1, 1, 1, 1)]
)
</pre>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption></caption>
<colgroup><col align="left" /><col align="left" />
</colgroup>
<thead>
<tr><th scope="col">seed.size</th><th scope="col">growth.form</th></tr>
</thead>
<tbody>
<tr><td>small</td><td>herb</td></tr>
<tr><td>small</td><td>herb</td></tr>
<tr><td>small</td><td>shrub</td></tr>
<tr><td>small</td><td>shrub</td></tr>
<tr><td>small</td><td>tree</td></tr>
<tr><td>medium</td><td>shrub</td></tr>
<tr><td>medium</td><td>shrub</td></tr>
<tr><td>medium</td><td>herb</td></tr>
<tr><td>medium</td><td>tree</td></tr>
<tr><td>large</td><td>tree</td></tr>
<tr><td>large</td><td>tree</td></tr>
<tr><td>large</td><td>tree</td></tr>
</tbody>
</table>
<p>
A contingency table tells us how many times each combination of
seed.sizes and growth.forms occurs.
</p>
<pre class="src src-R">tbl <span style="color: #5f9ea0;"><-</span> table(species.traits)
</pre>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption></caption>
<colgroup><col align="left" /><col align="right" /><col align="right" /><col align="right" />
</colgroup>
<thead>
<tr><th scope="col">seed.size</th><th scope="col">herb</th><th scope="col">shrub</th><th scope="col">tree</th></tr>
</thead>
<tbody>
<tr><td>large</td><td>0</td><td>0</td><td>3</td></tr>
<tr><td>medium</td><td>1</td><td>2</td><td>1</td></tr>
<tr><td>small</td><td>2</td><td>2</td><td>1</td></tr>
</tbody>
</table>
<p>
The output contingency table is of class <code>table</code>. These
objects do not behave quite like data frames. In fact, trying to
convert one to a data frame gives a non-intuitive result.
</p>
<pre class="src src-R">as.data.frame(tbl)
</pre>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption></caption>
<colgroup><col align="left" /><col align="left" /><col align="right" />
</colgroup>
<thead>
<tr><th scope="col">seed.size</th><th scope="col">growth.form</th><th scope="col">Freq</th></tr>
</thead>
<tbody>
<tr><td>large</td><td>herb</td><td>0</td></tr>
<tr><td>medium</td><td>herb</td><td>1</td></tr>
<tr><td>small</td><td>herb</td><td>2</td></tr>
<tr><td>large</td><td>shrub</td><td>0</td></tr>
<tr><td>medium</td><td>shrub</td><td>2</td></tr>
<tr><td>small</td><td>shrub</td><td>2</td></tr>
<tr><td>large</td><td>tree</td><td>3</td></tr>
<tr><td>medium</td><td>tree</td><td>1</td></tr>
<tr><td>small</td><td>tree</td><td>1</td></tr>
</tbody>
</table>
<p>
Coercion of the table into a data frame puts each factor of the
contingency table into its own column, along with the frequency,
rather than keeping the structure of the original <code>table</code> object.
If we want to turn the table into a data frame while keeping the
original structure, we can use <code>as.data.frame.matrix</code>. This function is
not well documented in R, and this is probably the only situation in
which you would use it. But it works.
</p>
<pre class="src src-R">as.data.frame.matrix(tbl)
</pre>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption></caption>
<colgroup><col align="left" /><col align="right" /><col align="right" /><col align="right" />
</colgroup>
<thead>
<tr><th scope="col">seed.size</th><th scope="col">herb</th><th scope="col">shrub</th><th scope="col">tree</th></tr>
</thead>
<tbody>
<tr><td>large</td><td>0</td><td>0</td><td>3</td></tr>
<tr><td>medium</td><td>1</td><td>2</td><td>1</td></tr>
<tr><td>small</td><td>2</td><td>2</td><td>1</td></tr>
</tbody>
</table>
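<p>For completeness, the whole example can be run end to end; the two coercions differ only in shape, not in counts.</p>

```r
## Rebuild the example data and compare both coercions of the table.
seed.sizes <- c("small", "medium", "large")
growth.forms <- c("tree", "shrub", "herb")
species.traits <- data.frame(
  seed.size = seed.sizes[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3)],
  growth.form = growth.forms[c(3, 3, 2, 2, 1, 2, 2, 3, 1, 1, 1, 1)]
)
tbl <- table(species.traits)

long <- as.data.frame(tbl)          # 9 rows: one per factor combination, plus Freq
wide <- as.data.frame.matrix(tbl)   # 3 x 3: keeps the contingency-table layout
```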
<h2>Changing the font in Carbon Emacs (2010-07-29)</h2>
<p>My current monospaced font of choice is <a href="http://dejavu-fonts.org/wiki/index.php%3Ftitle=Main_Page">Deja Vu Sans Mono</a>. It is
pretty easy to change the default font in Carbon Emacs. It is not
so easy on Windows- or Linux-based
Emacs 22, which is the current release version. It should become
relatively straightforward in Emacs 23 via Xft support. But,
for you Mac folks out there…
</p>
<div id="outline-container-1" class="outline-3">
<h3 id="sec-1">Installing a Font on Mac OS X </h3>
<div class="outline-text-3" id="text-1">
<ol>
<li>
Download the font (Mac will accept almost any format).
</li>
<li>
Drag the font file into <code>/Library/Fonts</code> (or <code>~/Library/Fonts</code> for just your user)
</li>
</ol>
</div>
</div>
<div id="outline-container-2" class="outline-3">
<h3 id="sec-2">Figure out the long name of your font face in Carbon Emacs </h3>
<div class="outline-text-3" id="text-2">
<ul>
<li>
M-x <code>mac-font-panel-mode</code>. Then pick your font
</li>
<li>
M-x <code>describe-font</code> and copy the long name.
</li>
</ul>
</div>
</div>
<div id="outline-container-3" class="outline-3">
<h3 id="sec-3">Set the default font </h3>
<div class="outline-text-3" id="text-3">
<p>There are a couple of ways to do this.
</p>
</div>
<div id="outline-container-3_1" class="outline-4">
<h4 id="sec-3_1">Customization </h4>
<div class="outline-text-4" id="text-3_1">
<ul>
<li>
M-x <code>customize-face</code> RET <code>default</code> RET
</li>
<li>
Fill in the fields of the font with the long name.
</li>
</ul>
</div>
</div>
<div id="outline-container-3_2" class="outline-4">
<h4 id="sec-3_2"><code>.emacs.d</code> file </h4>
<div class="outline-text-4" id="text-3_2">
<pre class="src src-elisp"><span style="color: #8c8c8c;">(</span>set-default-font <span style="color: #bc8f8f;">"-apple-inconsolata-medium-r-normal--11-110-72-72-m-110-iso10646-1"</span><span style="color: #8c8c8c;">)</span>
</pre>
</div>
</div>
</div>
<div id="outline-container-4" class="outline-3">
<h3 id="sec-4">Monospaced fonts for the uninitiated </h3>
<div class="outline-text-3" id="text-4">
<p>Programmers tend to use different fonts for code than what we see
in books and on the web. Word processors typically use proportional
fonts, where the horizontal space allowed for narrow letters like
"l" is less than the space allowed for wide letters like "W". In
code, we want the columns of each line to be aligned, so
programmers tend to use monospaced fonts, where each letter takes
up the same amount of horizontal space. The best-known
fixed-width font is Courier. Another common fixed-width font for
Windows and Microsoft products is Consolas. <a href="http://www.codeproject.com/KB/work/FontSurvey.aspx">The Code Project</a> has
an exhaustive list of monospaced fonts.
</p>
<p>
One of the key attributes to look for in a monospaced font is a clear
distinction between the number one and the lowercase letter "l", and
between the number zero and the uppercase letter "O". When you are
trying to read variable names that mix numbers and letters, you will
be thankful for a font that distinguishes them.
</p></div>
</div>
<h2>Extracting Raster Values from Points in R and GRASS (2010-07-26)</h2>
<p>A common task in GIS analysis is to extract the value of a remotely
sensed environmental variable at a point location. For instance, we
may wish to extract the elevation of a field plot from a digital
elevation model. The elevation data are a raster (i.e., a grid) and the
plots are a point shapefile (or a simple text file of X, Y
locations). The command for doing this in ArcGIS is
<code>ExtractValuesToPoints</code>, available in the Spatial Analyst extension.
Situations may arise where ArcGIS is not the most efficient way of
extracting these values. So here I provide a brief overview of how to
extract raster values to points in R and GRASS.
</p>
<div id="outline-container-1" class="outline-3">
<h3 id="sec-1">Extract Values to Points in R </h3>
<div class="outline-text-3" id="text-1">
<p>
This is strikingly easy in R. My work usually requires more
statistical sophistication than is available in ArcGIS. As a
result, I have completely switched to doing the extraction in R. I
know I am going to end up in R eventually, and it is easier to automate
than writing a long Python script in ArcGIS.
</p>
</div>
<div id="outline-container-1_1" class="outline-4">
<h4 id="sec-1_1">Data required </h4>
<div class="outline-text-4" id="text-1_1">
<p>
For the purposes of this exercise, all the data must have the
same spatial projection.
</p>
<dl>
<dt><a href="data/gr.asc"><code>gr.asc</code></a></dt><dd>
an ESRI ASCII grid. This could also be an ArcGIS
binary grid if you know how to use <code>rgdal</code>; that will
perhaps be another post.
</dd>
<dt><a href="data/pt.zip"><code>pt.shp</code></a></dt><dd>
a point shapefile.
</dd>
</dl>
<p>You also need the <code>maptools</code> and <code>sp</code> packages.
</p>
</div>
</div>
<div id="outline-container-1_2" class="outline-4">
<h4 id="sec-1_2">The Code </h4>
<div class="outline-text-4" id="text-1_2">
<pre class="src src-R">library(maptools)  # autoloads sp
gr <- readAsciiGrid("www.unc.edu/~toddjobe/data/gr.asc")
pt <- readShapePoints("www.unc.edu/~toddjobe/data/pt.shp")
overlay(gr, pt)
</pre>
<pre class="src src-R">     coordinates          gr.asc
1497 (569292, 1224170)  6094.6080
539  (567718, 1227840)  7964.5331
1023 (564565, 1225810)  -293.6599
663  (562462, 1227260) -5351.7297
69   (563675, 1229710) -2923.9394
716  (563062, 1227180) -4241.8255
339  (567636, 1228780)  4781.6488
509  (561722, 1227870) -3958.7312
805  (560981, 1226690)   139.9091
155  (560884, 1229220) -2719.7353
</pre>
<p>
That is it: fast and easy.
</p>
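<p>Under the hood, this kind of extraction is just a nearest-cell lookup in a regular grid. Here is a toy base-R sketch of the arithmetic; the 3 × 3 grid, origin, and cell size are invented for illustration (the real work in <code>overlay</code> is handled by <code>sp</code>).</p>

```r
## Nearest-cell lookup: map point coordinates to row/column indices of a grid.
grid.values <- matrix(1:9, nrow = 3, byrow = TRUE)  # toy 3 x 3 raster
cellsize <- 1
xmin <- 0   # left edge of the grid
ymax <- 3   # top edge of the grid

pts <- data.frame(x = c(0.5, 2.5), y = c(2.5, 0.5))
col.idx <- floor((pts$x - xmin) / cellsize) + 1  # columns count left to right
row.idx <- floor((ymax - pts$y) / cellsize) + 1  # rows count from the top down
vals <- grid.values[cbind(row.idx, col.idx)]     # one value per point
```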
</div>
</div>
</div>
<div id="outline-container-2" class="outline-3">
<h3 id="sec-2">Extracting Values in GRASS </h3>
<div class="outline-text-3" id="text-2">
<p>
Extracting raster values in GRASS is somewhat faster than in R, but
it takes a little bit more planning in that you have to explicitly
create the column that the raster values will go into.
</p>
</div>
<div id="outline-container-2_1" class="outline-4">
<h4 id="sec-2_1">Data Required </h4>
<div class="outline-text-4" id="text-2_1">
<ul>
<li>
<code>gr</code> : A GRASS grid
</li>
<li>
<code>pt</code> : A GRASS point dataset
</li>
</ul>
</div>
</div>
<div id="outline-container-2_2" class="outline-4">
<h4 id="sec-2_2">The Code </h4>
<div class="outline-text-4" id="text-2_2">
<p>The basic flow is that you create an empty column in the
point dataset with the right data type (e.g., <code>varchar(10)</code> for strings
of length 10, <code>double precision</code> for floating-point numbers, <code>int</code> for
integers), and then fill the column with the raster values.
</p>
<pre class="src src-sh">v.db.addcol map=pt columns="grval double precision"
v.what.rast vector=pt raster=gr column=grval
</pre>
</div>
</div>
</div>
<h2>Power Analysis for mixed-effect models in R (2009-09-18)</h2>
<p>The power of a statistical test is the probability that the null
hypothesis will be rejected when the alternative hypothesis is true.
In lay terms, power is your ability to refine or "prove" your
expectations from the data you collect. The most frequent motivation
for estimating the power of a study is to figure out what sample size
will be needed to observe a treatment effect. Given a set of pilot
data or some other estimate of the variation in a sample, we can use
power analysis to inform how much additional data we should collect.
</p>
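<p>Before getting to the mixed-model version, the core idea can be sketched in a few lines of base R for a simple linear regression: simulate many datasets under an assumed effect size and count how often the slope comes out significant. All the numbers below are invented for illustration.</p>

```r
## Simulation-based power: the fraction of simulated datasets in which
## the slope of y ~ x is significant at the chosen alpha.
set.seed(1)
sim.power <- function(n, beta, sigma, n.sims = 200, alpha = 0.05) {
  x <- seq_len(n)
  p.vals <- replicate(n.sims, {
    y <- beta * x + rnorm(n, sd = sigma)     # fake data under the alternative
    summary(lm(y ~ x))$coefficients[2, 4]    # p-value for the slope
  })
  mean(p.vals < alpha)
}
pw <- sim.power(n = 20, beta = 0.5, sigma = 2)
```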
<p>
I recently did a power analysis on a set of pilot data for a long-term
monitoring study of the US National Park Service. I thought I would
share some of the things I learned and a bit of R code for others that
might need to do something like this. If you aren't into power
analysis, the code below may still be useful as examples of how to use
the error handling functions in R (<code>withCallingHandlers</code>,
<code>withRestarts</code>), parallel programming using the <code>snow</code>
package, and linear mixed effect regression using <code>nlme</code>. If you
have any suggestions for improvement or if I got something wrong on
the analysis, I'd love to hear from you.
</p><div id="outline-container-1" class="outline-2">
<h2 id="sec-1"><span class="section-number-2">1</span> The Study </h2>
<div class="outline-text-2" id="text-1">
<p>The study system was cobblebars along the Cumberland river in Big
South Fork National Park (Kentucky and Tennessee, United States).
Cobblebars are typically dominated by grassy vegetation that include
disjunct tall-grass prairie species. It is hypothesized that woody
species will encroach onto cobblebars if they are not seasonally
scoured by floods. The purpose of the NPS sampling was to observe
changes in woody cover through time. The study design consisted of
two stages of clustering: the first being cobblebars, and the second
being transects within cobblebars. The response variable was the
percentage of the transect that was woody vegetation. Because of
the clustered design, the inferential model for this study design
has mixed-effects. I used a simple varying intercept
model:
</p>
<p>
<img src="ltxpng/powerAnalysis_0001.png"/>
</p>
<p>
where <i>y</i> is the percent of each transect in woody vegetation sampled
<i>n</i> times within <i>J</i> cobblebars, each with <i>K</i> transects. The
parameter of inference for the purpose of monitoring change in woody
vegetation through time is <i>β</i>, the rate at which cover changes as
a function of time. <i>α</i>, <i>γ</i>, <i>σ<sup>2</sup><sub>γ</sub></i>, and
<i>σ<sup>2</sup><sub>y</sub></i> are hyper-parameters that describe the hierarchical
variance structure inherent in the clustered sampling design.
</p>
<p>
Below is the code I used to regress the pilot data. Note
that with this function you can log- or logit-transform
the response variable (the percentage of each transect that is woody). I put
this in because the responses are proportions (0,1) and the errors should
technically follow a beta distribution; log and logit transforms with
Gaussian errors can approximate this. I ran the models with both
transformed and untransformed responses, and the results did not differ
at all, so I stuck with the untransformed responses:
</p>
<pre class="src src-R"><span style="color: #0000ff;">Model</span> <span style="color: #5f9ea0;"><-</span> <span style="color: #a020f0;">function</span>(x = cobblebars,
type = c(<span style="color: #bc8f8f;">"normal"</span>,<span style="color: #bc8f8f;">"log"</span>,<span style="color: #bc8f8f;">"logit"</span>)){
<span style="color: #b22222;">## </span><span style="color: #b22222;">Transforms
</span> <span style="color: #a020f0;">if</span> (type[1] == <span style="color: #bc8f8f;">"log"</span>)
x$prop.woody <span style="color: #5f9ea0;"><-</span> log(x$prop.woody)
<span style="color: #a020f0;">else</span> <span style="color: #a020f0;">if</span> (type[1] == <span style="color: #bc8f8f;">"logit"</span>)
x$prop.woody <span style="color: #5f9ea0;"><-</span> log(x$prop.woody / (1 - x$prop.woody))
mod <span style="color: #5f9ea0;"><-</span> lme(prop.woody ~ year,
data = x,
random = ~ 1 | cobblebar/transect,
na.action = na.omit,
control = lmeControl(opt = <span style="color: #bc8f8f;">"optim"</span>,
maxIter = 800, msMaxIter = 800)
)
mod$type <span style="color: #5f9ea0;"><-</span> type[1]
<span style="color: #a020f0;">return</span>(mod)
}
</pre>
<p>
Here are the results from this regression of the pilot data:
</p>
<pre class="src src-R">Linear mixed-effects model fit by REML
Data: x
AIC BIC logLik
-134.4319 -124.1297 72.21595
Random effects:
Formula: ~1 | cobblebar
(Intercept)
StdDev: 0.03668416
Formula: ~1 | transect %<span style="color: #a020f0;">in</span>% cobblebar
(Intercept) Residual
StdDev: 0.02625062 0.05663784
Fixed effects: prop.woody ~ year
Value Std.Error DF t-value p-value
(Intercept) 0.12966667 0.01881983 29 6.889896 0.0000
year -0.00704598 0.01462383 29 -0.481815 0.6336
Correlation:
(Intr)
year -0.389
Number of Observations: 60
Number of Groups:
cobblebar transect %<span style="color: #a020f0;">in</span>% cobblebar
6 30
</pre>
</div>
</div>
<div id="outline-container-2" class="outline-2">
<h2 id="sec-2"><span class="section-number-2">2</span> We don't learn about power analysis and complex models </h2>
<div class="outline-text-2" id="text-2">
<p>When I decided upon the inferential model the first thing that
occurred to me was that I never learned in any statistics course I
had taken how to do such a power analysis on a multi-level model.
I've taken more statistics courses than I'd like to count and taught
my own statistics courses for undergrads and graduate students, and
the only exposure to power analysis that I had was in the context of
simple t-tests or ANOVA. You learn about it in your first two
statistics courses, then it rarely if ever comes up again until you
actually need it.
</p>
<p>
I was, however, able to find a great resource on power analysis from
a Bayesian perspective in the excellent book "Data Analysis Using
Regression and Multilevel/Hierarchical Models" by Andrew Gelman and
Jennifer Hill. Andrew Gelman has thought and debated about power
analysis, and you can get more from his <a href="http://www.stat.columbia.edu/~gelman/blog/">blog</a>. The approach in the
book is a simulation-based one and I have adopted it for this
analysis.
</p>
</div>
</div>
<div id="outline-container-3" class="outline-2">
<h2 id="sec-3"><span class="section-number-2">3</span> Analysis Procedure </h2>
<div class="outline-text-2" id="text-3">
<p>For the current analysis we needed to know three things: effect
size, sample size, and estimates of population variance. We set
effect size beforehand. In this context, the parameter of interest
is the rate of change in woody cover through time <i>β</i>, and
effect size is simply how large or small a value of <i>β</i> you want
to distinguish with a regression. Sample size is also set <i>a
priori</i>. In the analysis we vary sample size by varying the
number of cobblebars, the number of transects per cobblebar, and the
number of years the study is conducted.
</p>
<p>
The population variance cannot be known precisely, and this is where
the pilot data come in. By regressing the pilot data using the
model we can obtain estimates of all the different components of the
variance (cobblebars, transects within cobblebars, and the residual
variance). Below is the R function that will return all the
hyperparameters (and <i>β</i>) from the regression:
</p>
<pre class="src src-R"><span style="color: #0000ff;">GetHyperparam</span><span style="color: #5f9ea0;"><-</span><span style="color: #a020f0;">function</span>(x,b=<span style="color: #228b22;">NULL</span>){
<span style="color: #b22222;">## </span><span style="color: #b22222;">Get the hyperparameters from the mixed effect model
</span> fe <span style="color: #5f9ea0;"><-</span> fixef(x)
<span style="color: #a020f0;">if</span>(is.null(b))
b<span style="color: #5f9ea0;"><-</span>fe[2] <span style="color: #b22222;"># </span><span style="color: #b22222;">use the data effect size if not supplied
</span>
mu.a <span style="color: #5f9ea0;"><-</span> fe[1]
vc <span style="color: #5f9ea0;"><-</span> VarCorr(x)
sigma.y <span style="color: #5f9ea0;"><-</span> as.numeric(vc[5, 2]) <span style="color: #b22222;"># </span><span style="color: #b22222;">Residual StdDev
</span> sigma.a <span style="color: #5f9ea0;"><-</span> as.numeric(vc[2, 2]) <span style="color: #b22222;"># </span><span style="color: #b22222;">Cobblebar StdDev
</span> sigma.g <span style="color: #5f9ea0;"><-</span> as.numeric(vc[4, 2]) <span style="color: #b22222;"># </span><span style="color: #b22222;">Cobblebar:transect StdDev
</span>
hp<span style="color: #5f9ea0;"><-</span>c(b, mu.a, sigma.y, sigma.a, sigma.g)
names(hp)<span style="color: #5f9ea0;"><-</span>c(<span style="color: #bc8f8f;">"b"</span>, <span style="color: #bc8f8f;">"mu.a"</span>, <span style="color: #bc8f8f;">"sigma.y"</span>, <span style="color: #bc8f8f;">"sigma.a"</span>, <span style="color: #bc8f8f;">"sigma.g"</span>)
<span style="color: #a020f0;">return</span>(hp)
}
</pre>
<p>
To calculate power, we need to regress the simulated data in the same
way we did the pilot data and check for a significant <i>β</i>. Since
the optimization is done using numerical methods, there is always a
chance that it will fail. So, we make sure the regression
on the fake data catches and recovers from all errors. The solution
for error recovery is to simply try the regression on a new set of
fake data. This function is a pretty good example of using the R
error handling function <code>withCallingHandlers</code> and
<code>withRestarts</code>.
</p>
<pre class="src src-R"><span style="color: #0000ff;">fakeModWithRestarts</span> <span style="color: #5f9ea0;"><-</span> <span style="color: #a020f0;">function</span>(m.o, n = 100, ...){
<span style="color: #b22222;">## </span><span style="color: #b22222;">A Fake Model
</span> withCallingHandlers({
i <span style="color: #5f9ea0;"><-</span> 0
mod <span style="color: #5f9ea0;"><-</span> <span style="color: #228b22;">NULL</span>
<span style="color: #a020f0;">while</span> (i < n & is.null(mod)){
mod <span style="color: #5f9ea0;"><-</span> withRestarts({
f <span style="color: #5f9ea0;"><-</span> fake(m.orig = m.o, transform = F, ...)
<span style="color: #a020f0;">return</span>(update(m.o, data = f))
},
rs = <span style="color: #a020f0;">function</span>(){
i <span style="color: #5f9ea0;"><<-</span> i + 1
<span style="color: #a020f0;">return</span>(<span style="color: #228b22;">NULL</span>)
})
}
<span style="color: #a020f0;">if</span>(is.null(mod))
<span style="color: #a020f0;">warning</span>(<span style="color: #bc8f8f;">"ExceededIterations"</span>)
<span style="color: #a020f0;">return</span>(mod)
},
error = <span style="color: #a020f0;">function</span>(e){
invokeRestart(<span style="color: #bc8f8f;">"rs"</span>)
},
<span style="color: #a020f0;">warning</span> = <span style="color: #a020f0;">function</span>(w){
<span style="color: #a020f0;">if</span>(w$<span style="color: #a020f0;">message</span> == <span style="color: #bc8f8f;">"ExceededIterations"</span>)
cat(<span style="color: #bc8f8f;">"\n"</span>, w$<span style="color: #a020f0;">message</span>, <span style="color: #bc8f8f;">"\n"</span>)
<span style="color: #a020f0;">else</span>
invokeRestart(<span style="color: #bc8f8f;">"rs"</span>)
})
}
</pre>
<p>
To calculate the power of a particular design we run
<code>fakeModWithRestarts</code> 1000 times and look at the proportion of
significant <i>β</i> values:
</p>
<pre class="src src-R"><span style="color: #0000ff;">dt.power</span> <span style="color: #5f9ea0;"><-</span> <span style="color: #a020f0;">function</span> (m, n.sims = 1000, alpha=0.05, ...){
<span style="color: #b22222;">## </span><span style="color: #b22222;">Calculate power for a particular sampling design
</span> signif<span style="color: #5f9ea0;"><-</span>rep(<span style="color: #228b22;">NA</span>, n.sims)
<span style="color: #a020f0;">for</span>(i <span style="color: #a020f0;">in</span> 1:n.sims){
lme.power <span style="color: #5f9ea0;"><-</span> fakeModWithRestarts(m.o = m, ...)
<span style="color: #a020f0;">if</span>(!is.null(lme.power))
signif[i] <span style="color: #5f9ea0;"><-</span> summary(lme.power)$tTable[2, 5] < alpha
}
power <span style="color: #5f9ea0;"><-</span> mean(signif, na.rm = T)
<span style="color: #a020f0;">return</span>(power)
}
</pre>
<p>
Finally, we want to perform this analysis on many different sampling
designs. In my case I used all combinations of a set of effect sizes,
cobblebars, transects, and years. So, I generated the appropriate designs:
</p>
<pre class="src src-R"><span style="color: #0000ff;">factoredDesign</span> <span style="color: #5f9ea0;"><-</span> <span style="color: #a020f0;">function</span>(Elevs = 0.2/c(1,5,10,20),
Nlevs = seq(2, 10, by = 2),
Jlevs = seq(4, 10, by = 2),
Klevs = c(3, 5, 7), ...){
<span style="color: #b22222;">## </span><span style="color: #b22222;">Generates factored series of sampling designs for simulation
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">of data that follow a particular model.
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">Inputs:
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">Elevs - vector of effect sizes for the slope parameter.
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">Nlevs - vector of number of years to sample.
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">Jlevs - vector of number of cobblebars to sample.
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">Klevs - vector of number of transects to sample.
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">Results:
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">Data frame with where columns are the factors and
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">rows are the designs.
</span>
<span style="color: #b22222;"># </span><span style="color: #b22222;">Level lengths
</span> lE <span style="color: #5f9ea0;"><-</span> length(Elevs)
lN <span style="color: #5f9ea0;"><-</span> length(Nlevs)
lJ <span style="color: #5f9ea0;"><-</span> length(Jlevs)
lK <span style="color: #5f9ea0;"><-</span> length(Klevs)
<span style="color: #b22222;"># </span><span style="color: #b22222;">Generate repeated vectors for each factor
</span> E <span style="color: #5f9ea0;"><-</span> rep(Elevs, each = lN*lJ*lK)
N <span style="color: #5f9ea0;"><-</span> rep(rep(Nlevs, each = lJ*lK), times = lE)
J <span style="color: #5f9ea0;"><-</span> rep(rep(Jlevs, each = lK), times = lE*lN)
K <span style="color: #5f9ea0;"><-</span> rep(Klevs, times = lE*lN*lJ)
<span style="color: #a020f0;">return</span>(data.frame(E, N, J, K))
}
</pre>
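<p>As an aside, base R's <code>expand.grid</code> builds the same factorial set of designs in a single call. The rows come out in a different order than <code>factoredDesign</code> produces, but every combination is present.</p>

```r
## All combinations of effect size, years, cobblebars, and transects.
designs <- expand.grid(
  E = 0.2 / c(1, 5, 10, 20),   # effect sizes for the slope
  N = seq(2, 10, by = 2),      # number of years
  J = seq(4, 10, by = 2),      # number of cobblebars
  K = c(3, 5, 7)               # transects per cobblebar
)
```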
<p>
Once we know our effect sizes, the different sample sizes we want,
and the estimates of population variance, we can generate simulated
datasets that are similar to the pilot data. To calculate power we
simply simulate a large number of datasets and calculate the
proportion of slopes, <i>β</i>, that are significantly different from
zero (p-value < 0.05). This procedure is repeated for all the
effect sizes and sample sizes of interest. Here is the code for
generating a simulated dataset. It also does the work of applying the
inverse transform to the response variables if necessary.
</p>
<pre class="src src-R"><span style="color: #0000ff;">fake</span> <span style="color: #5f9ea0;"><-</span> <span style="color: #a020f0;">function</span>(N = 2, J = 6, K = 5, b = <span style="color: #228b22;">NULL</span>, m.orig = mod,
transform = <span style="color: #228b22;">TRUE</span>, ...){
<span style="color: #b22222;">## </span><span style="color: #b22222;">Simulated Data for power analysis
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">N = Number of years
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">J = Number of cobblebars
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">K = Number of transects within cobblebars
</span> year <span style="color: #5f9ea0;"><-</span> rep(0:(N-1), each = J*K)
cobblebar <span style="color: #5f9ea0;"><-</span> factor(rep(rep(1:J, each = K), times = N))
transect <span style="color: #5f9ea0;"><-</span> factor(rep(1:K, times = N*J))
<span style="color: #b22222;">## </span><span style="color: #b22222;">Simulated parameters
</span> hp<span style="color: #5f9ea0;"><-</span>GetHyperparam(x=m.orig)
<span style="color: #a020f0;">if</span>(is.null(b))
b <span style="color: #5f9ea0;"><-</span> hp[<span style="color: #bc8f8f;">'b'</span>]
g <span style="color: #5f9ea0;"><-</span> rnorm(J*K, 0, hp[<span style="color: #bc8f8f;">'sigma.g'</span>])
a <span style="color: #5f9ea0;"><-</span> rnorm(J*K, hp[<span style="color: #bc8f8f;">'mu.a'</span>] + g, hp[<span style="color: #bc8f8f;">'sigma.a'</span>])
<span style="color: #b22222;">## </span><span style="color: #b22222;">Simulated responses
</span> eta <span style="color: #5f9ea0;"><-</span> rnorm(J*K*N, a + b * year, hp[<span style="color: #bc8f8f;">'sigma.y'</span>])
<span style="color: #a020f0;">if</span> (transform){
<span style="color: #a020f0;">if</span> (m.orig$type == <span style="color: #bc8f8f;">"normal"</span>){
y <span style="color: #5f9ea0;"><-</span> eta
y[y > 1] <span style="color: #5f9ea0;"><-</span> 1 <span style="color: #b22222;"># </span><span style="color: #b22222;">Fix any boundary problems.
</span> y[y < 0] <span style="color: #5f9ea0;"><-</span> 0
}
<span style="color: #a020f0;">else</span> <span style="color: #a020f0;">if</span> (m.orig$type == <span style="color: #bc8f8f;">"log"</span>){
y <span style="color: #5f9ea0;"><-</span> exp(eta)
y[y > 1] <span style="color: #5f9ea0;"><-</span> 1
}
<span style="color: #a020f0;">else</span> <span style="color: #a020f0;">if</span> (m.orig$type == <span style="color: #bc8f8f;">"logit"</span>)
y <span style="color: #5f9ea0;"><-</span> exp(eta) / (1 + exp(eta))
}
<span style="color: #a020f0;">else</span>{
y <span style="color: #5f9ea0;"><-</span> eta
}
<span style="color: #a020f0;">return</span>(data.frame(prop.woody = y, year, transect, cobblebar))
}
</pre>
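<p>
The <code>dt.power</code> function called by <code>powerAnalysis</code> below is not reproduced in this post. As a rough sketch of what it must do &mdash; simulate many datasets with <code>fake</code> and return the fraction of significant year slopes &mdash; something like the following would suffice. The plain <code>lm</code> fit is my placeholder assumption; the original analysis presumably tested the slope with the fitted multilevel model.
</p>

```R
## Sketch of dt.power (not from the original post): estimate power as
## the proportion of simulated datasets whose year slope is
## significant at the 0.05 level.
dt.power <- function(N = 2, J = 6, K = 5, b = NULL, n.sims = 1000, ...){
  signif <- logical(n.sims)
  for (s in 1:n.sims){
    fake.data <- fake(N = N, J = J, K = K, b = b, ...)
    fit <- lm(prop.woody ~ year, data = fake.data)  # placeholder model
    signif[s] <- summary(fit)$coefficients["year", "Pr(>|t|)"] < 0.05
  }
  mean(signif)  # power = proportion of significant slopes
}
```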
<p>
Then I performed the power calculations on each of these designs. This
could take a long time, so I set this procedure to use parallel processing
if needed. Note that I had to re-<code>source</code> the file with all the
necessary functions for each processor.
</p>
<pre class="src src-R"><span style="color: #0000ff;">powerAnalysis</span> <span style="color: #5f9ea0;"><-</span> <span style="color: #a020f0;">function</span>(parallel = T, ...){
<span style="color: #b22222;">## </span><span style="color: #b22222;">Full Power Analysis
</span>
<span style="color: #b22222;">## </span><span style="color: #b22222;">Parallel
</span> <span style="color: #a020f0;">if</span>(parallel){
closeAllConnections()
cl <span style="color: #5f9ea0;"><-</span> makeCluster(7, type = <span style="color: #bc8f8f;">"SOCK"</span>)
on.exit(closeAllConnections())
clusterEvalQ(cl, <span style="color: #5f9ea0;">source</span>(<span style="color: #bc8f8f;">"cobblebars2.r"</span>))
}
<span style="color: #b22222;">## </span><span style="color: #b22222;">The simulations
</span> dat <span style="color: #5f9ea0;"><-</span> factoredDesign(...)
<span style="color: #a020f0;">if</span> (parallel){
dat$power <span style="color: #5f9ea0;"><-</span> parRapply(cl, dat, <span style="color: #a020f0;">function</span>(x,...){
dt.power(N = x[2], J = x[3], K = x[4], b = x[1], ...)
}, ...)
} <span style="color: #a020f0;">else</span> {
dat$power <span style="color: #5f9ea0;"><-</span> apply(dat, 1, <span style="color: #a020f0;">function</span>(x, ...){
dt.power(N = x[2], J = x[3], K = x[4], b = x[1], ...)
}, ...)
}
<span style="color: #a020f0;">return</span>(dat)
}
</pre>
<p>
The output of the <code>powerAnalysis</code> function is a data frame with
columns for the power and all the sample design settings. So, I wrote
a custom plotting function for this data frame:
</p>
<pre class="src src-R"><span style="color: #0000ff;">plotPower</span> <span style="color: #5f9ea0;"><-</span> <span style="color: #a020f0;">function</span>(dt){
xyplot(power~N|J*K, data = dt, groups = E,
panel = <span style="color: #a020f0;">function</span>(...){panel.xyplot(...)
panel.abline(h = 0.8, lty = 2)},
type = c(<span style="color: #bc8f8f;">"p"</span>, <span style="color: #bc8f8f;">"l"</span>),
xlab = <span style="color: #bc8f8f;">"sampling years"</span>,
ylab = <span style="color: #bc8f8f;">"power"</span>,
strip = strip.custom(var.name = c(<span style="color: #bc8f8f;">"C"</span>, <span style="color: #bc8f8f;">"T"</span>),
strip.levels = c(T, T)),
auto.key = T
)
}
</pre>
<p>
Below is the figure for the cobblebar power analysis. I won't go into
detail on what the results mean, since I am concerned here with
illustrating the technique and the R code. Obviously, as the number of
cobblebars and transects sampled per year increases, so does power. And
the larger the effect size, the easier it is to detect with a test.
</p>
<p>
<img src="ltxpng/powerAnalysis_0002.png"/>
</p>
</div>
</div>
<div id="postamble">
<p class="author"> Author: Todd Jobe
<a href="mailto:toddjobe@unc.edu">&lt;toddjobe@unc.edu&gt;</a>
</p>
<p class="date"> Date: 2009-09-18 Fri</p>
<p class="creator">HTML generated by org-mode 6.30trans in emacs 22</p>
</div>toddjobehttp://www.blogger.com/profile/10065244958204541807noreply@blogger.com163tag:blogger.com,1999:blog-8557918062988474595.post-8519377830588291802009-06-15T08:40:00.007-04:002009-06-15T12:03:44.608-04:00Ecology Journals LaTeX ClassLaTeX is a document markup language used to create manuscripts that have nice typesetting. Professional looking documents can be created from simple text files. It is particularly useful for creating manuscripts that have math equations in them.
Writing documents in LaTeX should allow authors to focus on the content of the manuscript rather than the formatting. All LaTeX documents are associated with a class file, which sets the formatting for the document. Users simply associate a document with a class and the text is formatted appropriately. Unfortunately, many academic journals have very specific formatting requirements that are not built into LaTeX.
The journals of the <a href="http://www.esa.org">Ecological Society of America</a> do not have LaTeX class files appropriate for their major journals: Ecology, Ecological Applications and Ecological Monographs (even though they will accept documents in LaTeX format). One user-created LaTeX class file for ecology is available <a href="http://floliveweb.free.fr/htm/tele.php?lang=fr">here</a>. However, this class formats a LaTeX file into a proof-style document, not a submission-style document.
I have recently been writing a submission for Ecological Applications that is in LaTeX. I had to write my own class file for formatting this document according to the ESA guidelines (<a href="http://esapubs.org/esapubs/preparation.htm">http://esapubs.org/esapubs/preparation.htm</a>). I am sure that others have run into the same problem. So, I am making my class file publicly available.
Please consider this class to be in alpha. I have not yet written support for multiple author addresses, simply because I didn't need it for the current manuscript. Feel free to improve the class; just let me know about it, so I can use it as well.
To use this class in a LaTeX document, you can either install the class file into your LaTeX distribution or simply place the file in the same folder as your LaTeX document. Bibliography formatting with BibTeX requires that the <a href="http://www.ctan.org/tex-archive/help/Catalogue/entries/natbib.html">natbib</a> package be installed, along with the <a href="http://www.math.ntnu.no/~jarlet/latex/ecology.bst">ecology.bst</a> bibliography style file. I have packaged the ecology.cls class along with an example LaTeX file, the proof pdf, an example figure, and an example bibliography into a tarball available below.
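A minimal manuscript skeleton using the class might look like the following. This is my sketch, not part of the tarball: the exact commands the class defines (title and author handling in particular) may differ, so check the bundled example file.

```latex
% Hypothetical minimal document for ecology.cls -- the commands
% actually supported by the class may differ; see the example
% file shipped in the tarball.
\documentclass{ecology}
\usepackage{natbib}

\begin{document}
\title{An Example Manuscript}
\author{R. Todd Jobe}
\maketitle

Species co-occurrence varies with elevation \citep{example2009}.

\bibliographystyle{ecology}
\bibliography{myrefs}
\end{document}
```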
Download: <a href="http://www.unc.edu/~toddjobe/src/ecologyLatexClass.tar.gz">ecologyLatexClass.tar.gz</a>toddjobehttp://www.blogger.com/profile/10065244958204541807noreply@blogger.com53tag:blogger.com,1999:blog-8557918062988474595.post-7502377211215338562009-06-01T15:40:00.006-04:002009-06-01T16:26:23.131-04:00Add a column to a text file from a raster<h2>Text files are great</h2>
I have mixed feelings about software interoperability. Sure GRASS interfaces with R, and R can interface with Excel and Access if you have the ODBC drivers. But, often users do not have the energy to learn the best techniques for moving data back and forth between programs. Further, there always seems to be some quirk with data that prevents the transfer from going as smoothly as the user's manual examples do. Then, I remember the lowly text file.
Whether you use comma separated values (csv) files or tab-delimited files, every software platform on the planet can read text files of records (lines) and fields (values on a line separated by commas, tabs, or some other character). If you're trying to send data to someone else, you can be confident that your recipient can read them. It may not be the flashiest or fastest way to move data between people and systems, but it always works. Finally, when you use plain text files for data, you have many more options for working with the files. Tools such as bash, awk, sed, perl, and any other programming language you can think of can be brought to bear on data analysis problems.
<h2>Attaching raster data to points in GIS</h2>
Switch gears for a moment from a common file format to a common analysis task. For researchers who work with spatial point patterns (like plots), a regular step in data analysis is to pull data from raster layers where they intersect the point data and add it as columns to the point data. Informally, these are known as "drill-down" operations. You have a stack of maps and you want to drill down through them vertically at point locations, grab the data, and add it to the points.
This is a relatively simple task in ArcGIS if you use <a href="http://www.spatialecology.com/htools/tooldesc.php">Hawth's tools</a>. It's also pretty easy in GRASS, though it involves a couple of steps. The function <a href="http://grass.itc.it/gdp/html_grass63/v.what.rast.html"><code>v.what.rast</code></a> will put raster values into a column of a point dataset. The column must already exist in the dataset, which can be done with the command <a href="http://grass.itc.it/gdp/html_grass63/v.db.addcol.html"><code>v.db.addcol</code></a>; <code>v.what.rast</code> then fills in the missing values where the raster intersects the points.
<h2>Adding raster values to text files</h2>
Now, back to text files. I recently worked with a graduate student who wanted elevation values from a lidar-derived digital elevation model (dem) for four samples totalling about 1100 plots. She is not a GRASS user, so she sent me the locations of all the plots along with an identifier in csv format. I wanted to send her back the same csv files with elevation simply added as a field. I wrote a bash script to accomplish this quickly and repeatably for her files. In this script, all the GRASS functions are hidden from the user. Users simply input a csv file and the raster whose values they wish to add, and a csv file is output with that column added. The script requires the awk script that I <a href="http://www.unc.edu/%7Etoddjobe/blog/2009/05/extract-variable-types-from-text-file.html">posted previously</a>, which recognizes the types of variables present in the input csv. I have included a portion of the full script below. GRASS has a very particular code style that it uses for running scripts natively (see <a href="http://grass.itc.it/gdp/html_grass63/g.parser.html"><code>g.parser</code></a> for details). This makes the script very long, so I've cut out the not-so-important parts. If you would like the full file, you can get it <a href="http://www.unc.edu/%7Etoddjobe/src/v.ascii.addrastcol">here</a>.
<h2>v.ascii.addrastcol</h2>
<pre class="prettyprint">
#!/bin/sh
##################################################################
#
# MODULE: v.ascii.addrastcol
# AUTHOR(S):    R. Todd Jobe &lt;toddjobe@unc.edu&gt;
# PURPOSE: Add a raster value field to a csv of points
# COPYRIGHT: (C) 2009 by R. Todd Jobe
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
##################################################################
# . . . Grass header information removed
# . . . The cut code creates variables GIS_OPT_* and GIS_FLAG_*
# . . . Get the full code at:
# . . . http://www.unc.edu/~toddjobe/src/v.ascii.addrastcol
##################################################################
# Need space-delimited arguments for looping
GIS_OPT_INPUT=${GIS_OPT_INPUT//,/ }
# test for input raster map
eval `g.findfile element=cell file="$GIS_OPT_RAST"`
if [ ! "$file" ] ; then
g.message -e "Raster map <$GIS_OPT_RAST> not found in mapset search path"
exit 1
fi
for i in $GIS_OPT_INPUT; do
# Check that input files exist
if [ ! "$i" ] ; then
echo "file $i does not exist" >&2
exit $E_NOTFOUND
fi
# Create header line if necessary and get variable types
if [ $GIS_FLAG_H -eq 1 ]; then
skip=1
else
skip=0
fi
# Intermediate variables
map=`basename $i .csv`
tmp=`tempfile -s ".csv"`
of="${map}${GIS_OPT_SUFFIX}"
col=`typeVars.awk -v h=$GIS_FLAG_H -F, $i`
# Get the raster type
if `r.info -t $GIS_OPT_RAST | grep -q "FCELL"`; then
rtype="double precision"
else
rtype="int"
fi
# Grass-specific calls
v.in.ascii --overwrite input="$i" output="$map" fs="$GIS_OPT_FS" \
skip=$skip column="$col" x="$GIS_OPT_X" y="$GIS_OPT_Y"
v.db.addcol map="$map" \
columns="$GIS_OPT_RAST $rtype"
v.what.rast vector="$map" column="$GIS_OPT_RAST" \
raster="$GIS_OPT_RAST"
v.out.ascii input="$map" fs="," \
columns="$GIS_OPT_RAST" | \
awk -F, -v meters="$GIS_FLAG_M" -v head="$GIS_FLAG_H" \
-v rast="$GIS_OPT_RAST" '
BEGIN{ if(head == 1) print rast;}
{
# Convert Ft. elevation to meters if necessary
if($4!="" && meters==1) $4=$4 * 0.30480060
print $4
}' >"$tmp"
# Join raster values to input and write to output
paste -d$GIS_OPT_FS "$i" "$tmp" >"$of"
# Cleanup
rm "$tmp"
if [ $GIS_FLAG_R -eq 1 ] ; then
g.remove -f --quiet vect=$map
fi
done
</pre>
To see the usage for the full script, you can simply start GRASS and from the command line type <code>v.ascii.addrastcol --help</code>. Here's a short example for a random set of 25 points on Ft. Bragg, North Carolina.
<pre class="prettyprint">
GRASS 6.4.0RC3 (ftBragg):~ > head locs.csv
"","V1","V2","V3"
"1",594927.510605283,153893.107849400,1
"2",590884.780751329,155855.275043186,2
"3",594095.580107841,155284.234116311,3
"4",605646.865006463,154397.274762958,4
"5",603002.815879479,153284.801461238,5
"6",603453.085431226,152633.248500899,6
"7",602981.046299442,152156.288326646,7
"8",604160.50828444,151921.617483228,8
"9",603752.189821169,152062.642227756,9
GRASS 6.4.0RC3 (ftBragg):~ > v.ascii.addrastcol --quiet -hr \
   input=locs.csv rast=iLandcover x=2 y=3
GRASS 6.4.0RC3 (ftBragg):~ > head locs_isect.csv
"","V1","V2","V3",iLandcover
"1",594927.510605283,153893.107849400,1,3
"2",590884.780751329,155855.275043186,2,3
"3",594095.580107841,155284.234116311,3,1
"4",605646.865006463,154397.274762958,4,3
"5",603002.815879479,153284.801461238,5,3
"6",603453.085431226,152633.248500899,6,3
"7",602981.046299442,152156.288326646,7,3
"8",604160.50828444,151921.617483228,8,1
"9",603752.189821169,152062.642227756,9,3
</pre>toddjobehttp://www.blogger.com/profile/10065244958204541807noreply@blogger.com234tag:blogger.com,1999:blog-8557918062988474595.post-73824801699720458492009-05-25T14:05:00.012-04:002009-05-26T10:48:36.851-04:00Extract variable types from a text file using awkThe most difficult tasks in data analysis are not, from my experience, the statistical or analytic tasks. Instead, the most time consuming and frustrating parts of analysis are simply extracting and transforming raw data into a format that can be analyzed by common software. A common task in this vein is to determine what type of variables are contained in a text file containing fields and records (e.g. a comma separated values file or .csv). Most commonly, these data types are one of either strings, integers or decimal numbers called doubles. Some software, such as <a href="http://www.blogger.com/www.r-project.org">R</a> does a good job at automatically recognizing what kinds of data are contained in a text file through commands like <a href="http://wiki.r-project.org/rwiki/doku.php?id=rdoc:base:read.table">read.table</a>. Other software, such as the GIS software <a href="http://grass.osgeo.org/">GRASS</a>, may require you to explicitly input data types of each column of data. I find that to be time consuming, especially when you may have dozens of columns or dozens of text files to work through.
I have written the following script in <a href="http://www.gnu.org/software/gawk/">awk</a> (technically, gawk) to automatically determine what type of data is contained in a columnar text file. Awk is a programming language specifically designed for parsing and transforming such text files. Awk scripts can often be developed as short snippets of larger bash scripts. Such is the case here. It could probably be done more easily in Perl, but I was already working on a bash script that this will go into, which I will post later.
The task accomplished by the script is a simple one. Create a comma separated string that lists the variable names and data types for each column in a text file. This string will ultimately be used as input to the command <a href="http://grass.itc.it/gdp/html_grass63/v.in.ascii.html">v.in.ascii</a> in GRASS, which creates a point vector dataset from a text file.
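The heart of the script is a three-way test applied to each field: anything containing a character other than a digit, minus sign, or dot is a string; otherwise, a value with a dot is a double and everything else is an integer. The same test can be tried in isolation from the shell (a standalone snippet, not part of typeVars.awk itself):

```shell
# Classify a single value using the same regexes as typeVars.awk
classify() {
  printf '%s\n' "$1" | awk '{
    if ($0 ~ /[^-.0-9]/)  print "var"               # non-numeric character
    else if ($0 ~ /\./)   print "double precision"  # contains a decimal point
    else                  print "integer"
  }'
}
classify abc123   # prints "var"
classify 3.14     # prints "double precision"
classify 42       # prints "integer"
```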
<h2>typeVars.awk</h2>
<pre class="prettyprint">#!/bin/awk -f
##
## file: typeVars.awk
## created by: R. Todd Jobe <toddjobe@unc.edu>
## date: 2009.05.25
## This script will print the variable names and types from a
## columnar text file.
## Specifically, it can generate a variable declaration for GRASS
## input.
{
# Set the types to default values if not defined
if(str=="") str="var"
if(it=="") it="integer"
if(dbl=="") dbl="double precision"
# Get the column name from the 1st row
if(NR==1){
for(i=1; i <= NF; i++){
type[i,1]=$(i)
}
}else if(NR==2){
# Get the column type from the 2nd row
j=0
for(i=1; i <= NF; i++){
if($i ~ /[^-.0-9]/){
type[i,2]=str
strf[++j]=i
if(str=="var"){
max[i]=length($i)
type[i,2]=sprintf("%s(%.0f)",str,max[i])
}
}else if($i ~ /\./){
type[i,2]=dbl
}else{
type[i,2]=it
}
}
}else{
# Get the maximum column width for strings
if(str=="var"){
for( k in strf ){
f=strf[k]
if(length($f) > max[f]) {
max[f] = length($f)
type[f,2]=sprintf("%s(%.0f)",str,max[f])
}
}
}
}
}
END{
ORS=""
out=sprintf("%s %s",type[1,1],type[1,2])
for(l=2;l<i;l++){
out=sprintf("%s, %s %s",out,type[l,1],type[l,2])
}
print out
}</pre>
<h2>An Example</h2>
Here's an example of using the typeVars.awk script. We begin with a csv file of different variable types. typeVars.awk is executed on this csv and the output is a single string of comma-separated column names with variable types. By default the output is formatted for use in GRASS and strings are labelled with their maximum length.
<pre class="prettyprint">bash-3.2$ cat trial.csv
astring,aint,adbl,plot,x,y
abc123,1234,1.23,zyx1234,54,1.45
abc1235,1234,2.587,z*x1234,360,1.45
abc12,12345,1.23,zy.1234,1,1.45
abc1,999999,1.23,zx1234,0,1.45
bash-3.2$ echo `typeVars.awk -F, trial.csv`
astring var(7), aint integer, adbl double precision, plot var(7), x integer, y double precision
</pre>
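Awk splits records on whitespace by default, so without <code>-F,</code> an entire csv line is seen as a single field. A one-line comparison makes the difference clear (standalone illustration):

```shell
echo 'a,b,c' | awk '{print NF}'      # prints 1: whole line is one field
echo 'a,b,c' | awk -F, '{print NF}'  # prints 3: comma-separated fields
```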
Notice that the -F parameter is specified in the call to typeVars.awk. Whitespace is the default field separator in awk, so in order to parse a csv it needs to be set to a ",".toddjobehttp://www.blogger.com/profile/10065244958204541807noreply@blogger.com348tag:blogger.com,1999:blog-8557918062988474595.post-22378184568605586422009-05-14T11:47:00.003-04:002009-05-26T11:22:27.658-04:00Time for a blog<div xmlns="http://www.w3.org/1999/xhtml">I've started this blog mainly as a place to post the various things I do with computers at my job. I am a postdoctoral associate in the Geography Department at the University of North Carolina at Chapel Hill. Though I work in a Geography department right now, I am really an ecologist. Specifically, I am a quantitative community ecologist. I think about how and why species occur together. The "quantitative" part of that description just means that I typically answer ecological questions with the tools of mathematics, computer simulation modelling, and statistics.<p>Anyway, I think a lot about the best ways to analyze and present ecological data, and people constantly ask me questions about it. I hope this blog can be a "pre-emptive strike" for answering these questions. I plan to post lots of source code and comment on my own particular style of doing things. Take what you like. Leave the rest.</p> </div>toddjobehttp://www.blogger.com/profile/10065244958204541807noreply@blogger.com66