2010-08-31

Introduction to GDAL

The Geospatial Data Abstraction Library (GDAL, www.gdal.org) is an open-source library for translating geospatial data between different formats. It is the main intermediary through which much open-source data analysis and GIS software interacts with ArcGIS datasets. For example, one can translate ArcGIS binary grids into GRASS rasters or import them as a SpatialGridDataFrame object in R. GDAL proper supports raster data types, but it has been effectively merged with another library, OGR, that supports vector data conversion.

There are some file formats that are not directly translatable by GDAL, notably ESRI proprietary Smart Data Compression (SDC) files. The GDAL website provides a list of supported vector formats and raster formats. ESRI binary grids, coverages, and personal geodatabases can be read but not written.

Since GDAL is a library, it is meant for developers who write code that references the library, rather than for end users who actually translate datasets. The library API is in C, but there are bindings for R, Perl, Python, VB6, Ruby, Java and C#. The GDAL package does come with a few command-line utilities for working with geospatial data: the GDAL Utilities. Among these, the tools I use most often are:

gdalinfo
Raster information
gdal_translate
Convert rasters between formats
gdal_merge.py
Merge tiled rasters into one
gdal2tiles.py
Makes a Google Maps- or Google Earth-compatible set of rasters
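
As a rough illustration of how the utilities listed above are invoked, here is a minimal sketch that calls gdalinfo and gdal_translate from an R session with system2(). The file names elevation.tif and elevation.asc and the helper translate_args() are hypothetical, and the sketch assumes the GDAL utilities are installed and on the PATH; if they are not, the guarded block simply does nothing.

```r
# Sketch: call GDAL command-line utilities from R via system2().
# "elevation.tif" and "elevation.asc" are hypothetical file names.

# Build the argument vector for gdal_translate.
# The -of flag selects the output format (driver) by its short name.
translate_args <- function(src, dst, format) {
  c("-of", format, shQuote(src), shQuote(dst))
}

src <- "elevation.tif"
dst <- "elevation.asc"

# Only run if the utilities are on the PATH and the input file exists.
if (nzchar(Sys.which("gdalinfo")) && file.exists(src)) {
  # Print the raster's metadata (size, projection, bands).
  system2("gdalinfo", shQuote(src))
  # Convert the raster to an Arc/Info ASCII grid.
  system2("gdal_translate", translate_args(src, dst, "AAIGrid"))
}
```

The same commands can be typed directly into a terminal; wrapping them in R only keeps the conversion step next to the analysis code.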

Binary distributions of GDAL are available for most platforms and can be installed without using the command line.

In the next post, we'll look at the R bindings for GDAL in the rgdal package built by Tim Keitt at the University of Texas, and I will show some examples of working with raster datasets in R.

2010-08-11

Converting R contingency tables to data frames

A contingency table presents the joint frequency distribution of two or more categorical variables. Each entry in a contingency table is a count of the number of times a particular combination of factor levels occurs in the dataset. For example, consider a list of plant species where each species is assigned a relative seed size (small, medium, or large) and a growth form (tree, shrub, or herb).

seed.sizes <- c("small", "medium", "large")
growth.forms <- c("tree", "shrub", "herb")
species.traits <- data.frame(
  seed.size = seed.sizes[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3)],
  growth.form = growth.forms[c(3, 3, 2, 2, 1, 2, 2, 3, 1, 1, 1, 1)]
)
seed.size  growth.form
small      herb
small      herb
small      shrub
small      shrub
small      tree
medium     shrub
medium     shrub
medium     herb
medium     tree
large      tree
large      tree
large      tree

A contingency table will tell us how many times each combination of seed.size and growth.form occurs.

tbl <- table(species.traits)
        herb  shrub  tree
large      0      0     3
medium     1      2     1
small      2      2     1
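
To make the counts in the table easier to work with, here is a minimal sketch that looks up a single cell by level name and computes the row and column totals. The variable names small.herbs, by.seed.size, by.growth.form and with.totals are only illustrative; the data are rebuilt at the top so the block runs on its own.

```r
# Rebuild the example data so this block is self-contained.
seed.sizes <- c("small", "medium", "large")
growth.forms <- c("tree", "shrub", "herb")
species.traits <- data.frame(
  seed.size = seed.sizes[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3)],
  growth.form = growth.forms[c(3, 3, 2, 2, 1, 2, 2, 3, 1, 1, 1, 1)]
)
tbl <- table(species.traits)

# Look up a single cell by level names: the number of small-seeded herbs.
small.herbs <- tbl["small", "herb"]

# Marginal totals: species per seed size (rows) and per growth form (columns).
by.seed.size <- margin.table(tbl, 1)
by.growth.form <- margin.table(tbl, 2)

# addmargins() appends a "Sum" row and a "Sum" column to the table.
with.totals <- addmargins(tbl)
print(with.totals)
```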

The resulting contingency table is of class table. These objects do not behave quite like data frames. In fact, converting one to a data frame gives a non-intuitive result.

as.data.frame(tbl)
seed.size  growth.form  Freq
large      herb            0
medium     herb            1
small      herb            2
large      shrub           0
medium     shrub           2
small      shrub           2
large      tree            3
medium     tree            1
small      tree            1
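
The long layout is not a loss of information. As a small sketch, the block below checks that the long data frame has one row per cell of the table and that xtabs() can rebuild the two-way table from it. The names long and rebuilt are illustrative only.

```r
# Rebuild the example data so this block is self-contained.
seed.sizes <- c("small", "medium", "large")
growth.forms <- c("tree", "shrub", "herb")
species.traits <- data.frame(
  seed.size = seed.sizes[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3)],
  growth.form = growth.forms[c(3, 3, 2, 2, 1, 2, 2, 3, 1, 1, 1, 1)]
)
tbl <- table(species.traits)

# as.data.frame() on a table gives one row per cell, including zero counts.
long <- as.data.frame(tbl)
nrow(long)   # 3 seed sizes x 3 growth forms = 9 rows
names(long)  # "seed.size" "growth.form" "Freq"

# xtabs() sums Freq over each factor combination, rebuilding the two-way table.
rebuilt <- xtabs(Freq ~ seed.size + growth.form, data = long)
all(rebuilt == tbl)
```

The long form is often the more convenient one for modelling or plotting functions that expect one observation per row.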

Coercing the table into a data frame puts each factor of the contingency table into its own column along with the frequency, rather than keeping the structure of the original table object. To turn the table into a data frame that keeps the original structure, use as.data.frame.matrix. This function is not well documented in R, and this is probably the only situation in which it would be called directly. But it works.

as.data.frame.matrix(tbl)
        herb  shrub  tree
large      0      0     3
medium     1      2     1
small      2      2     1
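
As a quick check that the wide result really is an ordinary data frame, here is a minimal sketch that inspects it and compares it with one alternative route: stripping the table class with unclass() so that the generic as.data.frame() sees a plain matrix. The names wide and wide2 are illustrative, and the unclass() route is my own suggestion rather than part of the approach above.

```r
# Rebuild the example data so this block is self-contained.
seed.sizes <- c("small", "medium", "large")
growth.forms <- c("tree", "shrub", "herb")
species.traits <- data.frame(
  seed.size = seed.sizes[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3)],
  growth.form = growth.forms[c(3, 3, 2, 2, 1, 2, 2, 3, 1, 1, 1, 1)]
)
tbl <- table(species.traits)

# Wide conversion: seed sizes become row names, growth forms become columns.
wide <- as.data.frame.matrix(tbl)
is.data.frame(wide)    # TRUE
rownames(wide)         # "large" "medium" "small"
wide["small", "herb"]  # 2
wide$tree              # 3 1 1

# Alternative route: without the "table" class the object is a plain matrix,
# so the generic as.data.frame() uses its matrix method.
wide2 <- as.data.frame(unclass(tbl))
all(as.matrix(wide) == as.matrix(wide2))
```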
