Computational Ecology: Todd Jobe's thoughts about computers and ecology. R code, ArcGIS and GRASS code, Bash scripts, and Matlab code.

<h2>Introduction to GDAL (2010-08-31)</h2>
<p>The Geospatial Data Abstraction Library (GDAL, <a href="http://www.gdal.org">www.gdal.org</a>) is an
open-source library for translating geospatial data between
different formats. It is the primary intermediary through which
open-source data analysis and GIS software interacts with
ArcGIS datasets. For example, one can translate ArcGIS binary grids
into GRASS rasters or import them as <a href="http://www.dpi.inpe.br/gilberto/tutorials/software/R-contrib/sp/html/SpatialGridDataFrame-class.html">SpatialGridDataFrame</a> objects
in R. GDAL proper supports raster data; it has
been effectively merged with a companion library, OGR, which handles
vector data conversion.
</p>
<p>
There are some file formats that GDAL cannot translate directly,
notably ESRI's proprietary Smart Data Compression (SDC) format.
The GDAL <a href="http://www.gdal.org">website</a> provides a list of supported <a href="http://www.gdal.org/ogr/ogr_formats.html">vector formats</a> and
<a href="http://www.gdal.org/formats_list.html">raster formats</a>. ESRI binary grids, coverages, and personal geodatabases
can be read but not written.
</p>
<p>
Since GDAL is a library, it is aimed at developers who write code
against it, rather than at end users who actually
translate datasets. The library API is in C, but there are bindings
for R, Perl, Python, VB6, Ruby, Java, and C#. The GDAL package does
come with a few command-line utilities for working with geospatial
data: <a href="http://www.gdal.org/gdal_utilities.html">GDAL Utilities</a>. Among those I commonly use:
</p><dl>
<dt>gdalinfo</dt><dd>
Raster information
</dd>
<dt>gdal_translate</dt><dd>
Convert rasters between formats
</dd>
<dt>gdal_merge.py</dt><dd>
Merge tiled rasters into one
</dd>
<dt>gdal2tiles.py</dt><dd>
Makes a Google Maps- or Google Earth-compatible
set of rasters
</dd>
</dl>
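<p>These utilities can also be scripted, for example from R via <code>system2()</code>. Here is a minimal sketch; the file names <code>elevation.asc</code> and <code>elevation.tif</code> are made-up examples, and the calls only run if GDAL is on your path and the input file exists.</p>

```r
## Call the GDAL command-line utilities from R. File names are
## illustrative; the calls are skipped if GDAL or the input is missing.
src <- "elevation.asc"   # an ESRI ASCII grid (hypothetical)
dst <- "elevation.tif"   # GeoTIFF output (hypothetical)

if (nzchar(Sys.which("gdal_translate")) && file.exists(src)) {
  system2("gdalinfo", src)                                # projection, extent, bands
  system2("gdal_translate", c("-of", "GTiff", src, dst))  # convert the format
}
```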
<p>Binary distributions of GDAL for most platforms are
available <a href="http://trac.osgeo.org/gdal/wiki/DownloadingGdalBinaries">here</a>, and they can be installed without using the command line.
</p>
<p>
In the next post, we'll look at the R bindings for GDAL in the <code>rgdal</code>
package, built by Tim Keitt at the University of Texas, and I will
show some examples of working with raster datasets in R.
</p>

<h2>Converting R contingency tables to data frames (2010-08-11)</h2>
<p>A contingency table presents the joint distribution of one or more
categorical variables. Each entry in a contingency table is a count
of the number of times a particular combination of factor levels occurs in
the dataset. For example, consider a list of plant species where
each species is assigned a relative seed size (small, medium, or
large) and a growth form (tree, shrub, or herb).
</p>
<pre class="src src-R">seed.sizes <span style="color: #5f9ea0;"><-</span> c(<span style="color: #bc8f8f;">"small"</span>, <span style="color: #bc8f8f;">"medium"</span>, <span style="color: #bc8f8f;">"large"</span>)
growth.forms <span style="color: #5f9ea0;"><-</span> c(<span style="color: #bc8f8f;">"tree"</span>, <span style="color: #bc8f8f;">"shrub"</span>, <span style="color: #bc8f8f;">"herb"</span>)
species.traits <span style="color: #5f9ea0;"><-</span> data.frame(
seed.size = seed.sizes[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3)],
growth.form = growth.forms[c(3, 3, 2, 2, 1, 2, 2, 3, 1, 1, 1, 1)]
)
</pre>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption></caption>
<colgroup><col align="left" /><col align="left" />
</colgroup>
<thead>
<tr><th scope="col">seed.size</th><th scope="col">growth.form</th></tr>
</thead>
<tbody>
<tr><td>small</td><td>herb</td></tr>
<tr><td>small</td><td>herb</td></tr>
<tr><td>small</td><td>shrub</td></tr>
<tr><td>small</td><td>shrub</td></tr>
<tr><td>small</td><td>tree</td></tr>
<tr><td>medium</td><td>shrub</td></tr>
<tr><td>medium</td><td>shrub</td></tr>
<tr><td>medium</td><td>herb</td></tr>
<tr><td>medium</td><td>tree</td></tr>
<tr><td>large</td><td>tree</td></tr>
<tr><td>large</td><td>tree</td></tr>
<tr><td>large</td><td>tree</td></tr>
</tbody>
</table>
<p>
A contingency table tells us how many times each combination of
seed.sizes and growth.forms occurs.
</p>
<pre class="src src-R">tbl <span style="color: #5f9ea0;"><-</span> table(species.traits)
</pre>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption></caption>
<colgroup><col align="left" /><col align="right" /><col align="right" /><col align="right" />
</colgroup>
<thead>
<tr><th scope="col">seed.size</th><th scope="col">herb</th><th scope="col">shrub</th><th scope="col">tree</th></tr>
</thead>
<tbody>
<tr><td>large</td><td>0</td><td>0</td><td>3</td></tr>
<tr><td>medium</td><td>1</td><td>2</td><td>1</td></tr>
<tr><td>small</td><td>2</td><td>2</td><td>1</td></tr>
</tbody>
</table>
<p>
The output contingency table is of class <code>table</code>. These
objects do not behave quite like data frames. In fact, trying to
convert one to a data frame gives a non-intuitive result.
</p>
<pre class="src src-R">as.data.frame(tbl)
</pre>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption></caption>
<colgroup><col align="left" /><col align="left" /><col align="right" />
</colgroup>
<thead>
<tr><th scope="col">seed.size</th><th scope="col">growth.form</th><th scope="col">Freq</th></tr>
</thead>
<tbody>
<tr><td>large</td><td>herb</td><td>0</td></tr>
<tr><td>medium</td><td>herb</td><td>1</td></tr>
<tr><td>small</td><td>herb</td><td>2</td></tr>
<tr><td>large</td><td>shrub</td><td>0</td></tr>
<tr><td>medium</td><td>shrub</td><td>2</td></tr>
<tr><td>small</td><td>shrub</td><td>2</td></tr>
<tr><td>large</td><td>tree</td><td>3</td></tr>
<tr><td>medium</td><td>tree</td><td>1</td></tr>
<tr><td>small</td><td>tree</td><td>1</td></tr>
</tbody>
</table>
<p>
Coercion of the table into a data frame puts each factor of the
contingency table into its own column, along with the frequency,
rather than keeping the structure of the original <code>table</code> object.
If we want to turn the table into a data frame while keeping the
original structure, we can use <code>as.data.frame.matrix</code>. This function is
not well documented in R, and this is probably the only situation in
which you would use it. But it works.
</p>
<pre class="src src-R">as.data.frame.matrix(tbl)
</pre>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption></caption>
<colgroup><col align="left" /><col align="right" /><col align="right" /><col align="right" />
</colgroup>
<thead>
<tr><th scope="col">seed.size</th><th scope="col">herb</th><th scope="col">shrub</th><th scope="col">tree</th></tr>
</thead>
<tbody>
<tr><td>large</td><td>0</td><td>0</td><td>3</td></tr>
<tr><td>medium</td><td>1</td><td>2</td><td>1</td></tr>
<tr><td>small</td><td>2</td><td>2</td><td>1</td></tr>
</tbody>
</table>
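<p>For completeness, the whole example can be run end to end; the two coercions differ only in shape, not in counts.</p>

```r
## Rebuild the example data and compare both coercions of the table.
seed.sizes <- c("small", "medium", "large")
growth.forms <- c("tree", "shrub", "herb")
species.traits <- data.frame(
  seed.size = seed.sizes[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3)],
  growth.form = growth.forms[c(3, 3, 2, 2, 1, 2, 2, 3, 1, 1, 1, 1)]
)
tbl <- table(species.traits)

long <- as.data.frame(tbl)          # 9 rows: one per factor combination, plus Freq
wide <- as.data.frame.matrix(tbl)   # 3 x 3: keeps the contingency-table layout
```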
<h2>Changing the font in Carbon Emacs (2010-07-29)</h2>
<p>My current monospaced font of choice is <a href="http://dejavu-fonts.org/wiki/index.php%3Ftitle=Main_Page">Deja Vu Sans Mono</a>. It is
pretty easy to change the default font in Carbon Emacs. It is not
so easy on Windows- or Linux-based
Emacs 22, which is the current release version. It should become
relatively straightforward in Emacs 23 via Xft support. But,
for you Mac folks out there…
</p>
<div id="outline-container-1" class="outline-3">
<h3 id="sec-1">Installing a Font on Mac OS X </h3>
<div class="outline-text-3" id="text-1">
<ol>
<li>
Download the font (Mac will accept almost any format).
</li>
<li>
Drag the font file into <code>/Library/Fonts</code> (or <code>~/Library/Fonts</code> for just your user)
</li>
</ol>
</div>
</div>
<div id="outline-container-2" class="outline-3">
<h3 id="sec-2">Figure out the long name of your font face in Carbon Emacs </h3>
<div class="outline-text-3" id="text-2">
<ul>
<li>
M-x <code>mac-font-panel-mode</code>. Then pick your font
</li>
<li>
M-x <code>describe-font</code> and copy the long name.
</li>
</ul>
</div>
</div>
<div id="outline-container-3" class="outline-3">
<h3 id="sec-3">Set the default font </h3>
<div class="outline-text-3" id="text-3">
<p>There are a couple of ways to do this.
</p>
</div>
<div id="outline-container-3_1" class="outline-4">
<h4 id="sec-3_1">Customization </h4>
<div class="outline-text-4" id="text-3_1">
<ul>
<li>
M-x <code>customize-face</code> RET <code>default</code> RET
</li>
<li>
Fill in the fields of the font with the long name.
</li>
</ul>
</div>
</div>
<div id="outline-container-3_2" class="outline-4">
<h4 id="sec-3_2"><code>.emacs.d</code> file </h4>
<div class="outline-text-4" id="text-3_2">
<pre class="src src-elisp"><span style="color: #8c8c8c;">(</span>set-default-font <span style="color: #bc8f8f;">"-apple-inconsolata-medium-r-normal--11-110-72-72-m-110-iso10646-1"</span><span style="color: #8c8c8c;">)</span>
</pre>
</div>
</div>
</div>
<div id="outline-container-4" class="outline-3">
<h3 id="sec-4">Monospaced fonts for the uninitiated </h3>
<div class="outline-text-3" id="text-4">
<p>Programmers tend to use different fonts for code than what we see
in books and on the web. Word processors typically use proportional
fonts, where the horizontal space allowed for narrow letters like
"l" is less than the space allowed for wide letters like "W". In
code, we want the columns of each line to be aligned, so
programmers tend to use monospaced fonts, where each letter takes
up the same amount of horizontal space. The best-known
fixed-width font is Courier. Another common fixed-width font for
Windows and Microsoft products is Consolas. <a href="http://www.codeproject.com/KB/work/FontSurvey.aspx">The Code Project</a> has
an exhaustive list of monospaced fonts.
</p>
<p>
One of the key attributes to look for in a monospaced font is a clear
distinction between the number one and the lowercase letter "l", and
between the number zero and the uppercase letter "O". When you are
trying to read variable names that mix numbers and letters, you will
be thankful for a font that distinguishes them.
</p></div>
</div>
<h2>Extracting Raster Values from Points in R and GRASS (2010-07-26)</h2>
<p>A common task in GIS analysis is to extract the value of a remotely
sensed environmental variable at a point location. For instance, we
may wish to extract the elevation of a field plot from a digital
elevation model. The elevation data are a raster (i.e., a grid) and the
plots are a point shapefile (or a simple text file of X, Y
locations). The command for doing this in ArcGIS is
<code>ExtractValuesToPoints</code>, available in the Spatial Analyst extension.
Situations may arise where ArcGIS is not the most efficient way of
extracting these values. So here I provide a brief overview of how to
extract raster values to points in R and GRASS.
</p>
<div id="outline-container-1" class="outline-3">
<h3 id="sec-1">Extract Values to Points in R </h3>
<div class="outline-text-3" id="text-1">
<p>
This is strikingly easy in R. My work usually requires more
statistical sophistication than is available in ArcGIS. As a
result, I have completely switched to doing the extraction in R. I
know I am going to end up in R eventually, and it is easier to automate
than writing a long Python script in ArcGIS.
</p>
</div>
<div id="outline-container-1_1" class="outline-4">
<h4 id="sec-1_1">Data required </h4>
<div class="outline-text-4" id="text-1_1">
<p>
For the purposes of this exercise, all the data must have the
same spatial projection.
</p>
<dl>
<dt><a href="data/gr.asc"><code>gr.asc</code></a></dt><dd>
an ESRI ASCII grid. This could also be an ArcGIS
binary grid if you know how to use <code>rgdal</code>; that will
perhaps be another post.
</dd>
<dt><a href="data/pt.zip"><code>pt.shp</code></a></dt><dd>
a point shapefile.
</dd>
</dl>
<p>You also need the <code>maptools</code> and <code>sp</code> packages.
</p>
</div>
</div>
<div id="outline-container-1_2" class="outline-4">
<h4 id="sec-1_2">The Code </h4>
<div class="outline-text-4" id="text-1_2">
<pre class="src src-R">library(maptools)  # autoloads sp
gr <- readAsciiGrid("www.unc.edu/~toddjobe/data/gr.asc")
pt <- readShapePoints("www.unc.edu/~toddjobe/data/pt.shp")
overlay(gr, pt)
</pre>
<pre class="src src-R">     coordinates          gr.asc
1497 (569292, 1224170)  6094.6080
539  (567718, 1227840)  7964.5331
1023 (564565, 1225810)  -293.6599
663  (562462, 1227260) -5351.7297
69   (563675, 1229710) -2923.9394
716  (563062, 1227180) -4241.8255
339  (567636, 1228780)  4781.6488
509  (561722, 1227870) -3958.7312
805  (560981, 1226690)   139.9091
155  (560884, 1229220) -2719.7353
</pre>
<p>
That is it: fast and easy.
</p>
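<p>Under the hood, this kind of extraction is just a nearest-cell lookup in a regular grid. Here is a toy base-R sketch of the arithmetic; the 3 × 3 grid, origin, and cell size are invented for illustration (the real work in <code>overlay</code> is handled by <code>sp</code>).</p>

```r
## Nearest-cell lookup: map point coordinates to row/column indices of a grid.
grid.values <- matrix(1:9, nrow = 3, byrow = TRUE)  # toy 3 x 3 raster
cellsize <- 1
xmin <- 0   # left edge of the grid
ymax <- 3   # top edge of the grid

pts <- data.frame(x = c(0.5, 2.5), y = c(2.5, 0.5))
col.idx <- floor((pts$x - xmin) / cellsize) + 1  # columns count left to right
row.idx <- floor((ymax - pts$y) / cellsize) + 1  # rows count from the top down
vals <- grid.values[cbind(row.idx, col.idx)]     # one value per point
```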
</div>
</div>
</div>
<div id="outline-container-2" class="outline-3">
<h3 id="sec-2">Extracting Values in GRASS </h3>
<div class="outline-text-3" id="text-2">
<p>
Extracting raster values in GRASS is somewhat faster than in R, but
it takes a little bit more planning in that you have to explicitly
create the column that the raster values will go into.
</p>
</div>
<div id="outline-container-2_1" class="outline-4">
<h4 id="sec-2_1">Data Required </h4>
<div class="outline-text-4" id="text-2_1">
<ul>
<li>
<code>gr</code> : A GRASS grid
</li>
<li>
<code>pt</code> : A GRASS point dataset
</li>
</ul>
</div>
</div>
<div id="outline-container-2_2" class="outline-4">
<h4 id="sec-2_2">The Code </h4>
<div class="outline-text-4" id="text-2_2">
<p>The basic flow is that you create an empty column in the
point dataset with the right data type (e.g., <code>varchar(10)</code> for strings
of length 10, <code>double precision</code> for floating-point numbers, <code>int</code> for
integers), and then fill the column with the raster values.
</p>
<pre class="src src-sh">v.db.addcol map=pt columns="grval double precision"
v.what.rast vector=pt raster=gr column=grval
</pre>
</div>
</div>
</div>
<h2>Power Analysis for mixed-effect models in R (2009-09-18)</h2>
<p>The power of a statistical test is the probability that the null
hypothesis will be rejected when the alternative hypothesis is true.
In lay terms, power is your ability to refine or "prove" your
expectations from the data you collect. The most frequent motivation
for estimating the power of a study is to figure out what sample size
will be needed to observe a treatment effect. Given a set of pilot
data or some other estimate of the variation in a sample, we can use
power analysis to inform how much additional data we should collect.
</p>
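<p>Before getting to the mixed-model version, the core idea can be sketched in a few lines of base R for a simple linear regression: simulate many datasets under an assumed effect size and count how often the slope comes out significant. All the numbers below are invented for illustration.</p>

```r
## Simulation-based power: the fraction of simulated datasets in which
## the slope of y ~ x is significant at the chosen alpha.
set.seed(1)
sim.power <- function(n, beta, sigma, n.sims = 200, alpha = 0.05) {
  x <- seq_len(n)
  p.vals <- replicate(n.sims, {
    y <- beta * x + rnorm(n, sd = sigma)     # fake data under the alternative
    summary(lm(y ~ x))$coefficients[2, 4]    # p-value for the slope
  })
  mean(p.vals < alpha)
}
pw <- sim.power(n = 20, beta = 0.5, sigma = 2)
```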
<p>
I recently did a power analysis on a set of pilot data for a long-term
monitoring study of the US National Park Service. I thought I would
share some of the things I learned and a bit of R code for others that
might need to do something like this. If you aren't into power
analysis, the code below may still be useful as examples of how to use
the error handling functions in R (<code>withCallingHandlers</code>,
<code>withRestarts</code>), parallel programming using the <code>snow</code>
package, and linear mixed effect regression using <code>nlme</code>. If you
have any suggestions for improvement or if I got something wrong on
the analysis, I'd love to hear from you.
</p><div id="outline-container-1" class="outline-2">
<h2 id="sec-1"><span class="section-number-2">1</span> The Study </h2>
<div class="outline-text-2" id="text-1">
<p>The study system was cobblebars along the Cumberland river in Big
South Fork National Park (Kentucky and Tennessee, United States).
Cobblebars are typically dominated by grassy vegetation that include
disjunct tall-grass prairie species. It is hypothesized that woody
species will encroach onto cobblebars if they are not seasonally
scoured by floods. The purpose of the NPS sampling was to observe
changes in woody cover through time. The study design consisted of
two stages of clustering: the first being cobblebars, and the second
being transects within cobblebars. The response variable was the
percentage of the transect that was woody vegetation. Because of
the clustered design, the inferential model for this study design
has mixed-effects. I used a simple varying intercept
model:
</p>
<p>
<img src="ltxpng/powerAnalysis_0001.png"/>
</p>
<p>
where <i>y</i> is the percent of each transect in woody vegetation sampled
<i>n</i> times within <i>J</i> cobblebars, each with <i>K</i> transects. The
parameter of inference for the purpose of monitoring change in woody
vegetation through time is <i>β</i>, the rate at which cover changes as
a function of time. <i>α</i>, <i>γ</i>, <i>σ<sup>2</sup><sub>γ</sub></i>, and
<i>σ<sup>2</sup><sub>y</sub></i> are hyper-parameters that describe the hierarchical
variance structure inherent in the clustered sampling design.
</p>
<p>
Below is the code I used to regress the pilot data. Note
that with this function you can log- or logit-transform
the response variable (the percentage of each transect that is woody). I put
this in because the responses are proportions (0,1) and the errors should
technically follow a beta distribution; log and logit transforms with
Gaussian errors can approximate this. I ran the models with both
transformed and untransformed responses, and the results did not differ
at all, so I stuck with the untransformed responses:
</p>
<pre class="src src-R"><span style="color: #0000ff;">Model</span> <span style="color: #5f9ea0;"><-</span> <span style="color: #a020f0;">function</span>(x = cobblebars,
type = c(<span style="color: #bc8f8f;">"normal"</span>,<span style="color: #bc8f8f;">"log"</span>,<span style="color: #bc8f8f;">"logit"</span>)){
<span style="color: #b22222;">## </span><span style="color: #b22222;">Transforms
</span> <span style="color: #a020f0;">if</span> (type[1] == <span style="color: #bc8f8f;">"log"</span>)
x$prop.woody <span style="color: #5f9ea0;"><-</span> log(x$prop.woody)
<span style="color: #a020f0;">else</span> <span style="color: #a020f0;">if</span> (type[1] == <span style="color: #bc8f8f;">"logit"</span>)
x$prop.woody <span style="color: #5f9ea0;"><-</span> log(x$prop.woody / (1 - x$prop.woody))
mod <span style="color: #5f9ea0;"><-</span> lme(prop.woody ~ year,
data = x,
random = ~ 1 | cobblebar/transect,
na.action = na.omit,
control = lmeControl(opt = <span style="color: #bc8f8f;">"optim"</span>,
maxIter = 800, msMaxIter = 800)
)
mod$type <span style="color: #5f9ea0;"><-</span> type[1]
<span style="color: #a020f0;">return</span>(mod)
}
</pre>
<p>
Here are the results from this regression of the pilot data:
</p>
<pre class="src src-R">Linear mixed-effects model fit by REML
Data: x
AIC BIC logLik
-134.4319 -124.1297 72.21595
Random effects:
Formula: ~1 | cobblebar
(Intercept)
StdDev: 0.03668416
Formula: ~1 | transect %<span style="color: #a020f0;">in</span>% cobblebar
(Intercept) Residual
StdDev: 0.02625062 0.05663784
Fixed effects: prop.woody ~ year
Value Std.Error DF t-value p-value
(Intercept) 0.12966667 0.01881983 29 6.889896 0.0000
year -0.00704598 0.01462383 29 -0.481815 0.6336
Correlation:
(Intr)
year -0.389
Number of Observations: 60
Number of Groups:
cobblebar transect %<span style="color: #a020f0;">in</span>% cobblebar
6 30
</pre>
</div>
</div>
<div id="outline-container-2" class="outline-2">
<h2 id="sec-2"><span class="section-number-2">2</span> We don't learn about power analysis and complex models </h2>
<div class="outline-text-2" id="text-2">
<p>When I decided upon the inferential model the first thing that
occurred to me was that I never learned in any statistics course I
had taken how to do such a power analysis on a multi-level model.
I've taken more statistics courses than I'd like to count and taught
my own statistics courses for undergrads and graduate students, and
the only exposure to power analysis that I had was in the context of
simple t-tests or ANOVA. You learn about it in your first two
statistics courses, then it rarely if ever comes up again until you
actually need it.
</p>
<p>
I was, however, able to find a great resource on power analysis from
a Bayesian perspective in the excellent book "Data Analysis Using
Regression and Multilevel/Hierarchical Models" by Andrew Gelman and
Jennifer Hill. Andrew Gelman has thought and debated about power
analysis, and you can get more from his <a href="http://www.stat.columbia.edu/~gelman/blog/">blog</a>. The approach in the
book is a simulation-based one and I have adopted it for this
analysis.
</p>
</div>
</div>
<div id="outline-container-3" class="outline-2">
<h2 id="sec-3"><span class="section-number-2">3</span> Analysis Procedure </h2>
<div class="outline-text-2" id="text-3">
<p>For the current analysis we needed to know three things: effect
size, sample size, and estimates of population variance. We set
effect size beforehand. In this context, the parameter of interest
is the rate of change in woody cover through time <i>β</i>, and
effect size is simply how large or small a value of <i>β</i> you want
to distinguish with a regression. Sample size is also set <i>a
priori</i>. In the analysis we vary sample size by varying the
number of cobblebars, the number of transects per cobblebar, and the
number of years the study is conducted.
</p>
<p>
The population variance cannot be known precisely, and this is where
the pilot data come in. By regressing the pilot data using the
model we can obtain estimates of all the different components of the
variance (cobblebars, transects within cobblebars, and the residual
variance). Below is the R function that will return all the
hyperparameters (and <i>β</i>) from the regression:
</p>
<pre class="src src-R"><span style="color: #0000ff;">GetHyperparam</span><span style="color: #5f9ea0;"><-</span><span style="color: #a020f0;">function</span>(x,b=<span style="color: #228b22;">NULL</span>){
<span style="color: #b22222;">## </span><span style="color: #b22222;">Get the hyperparameters from the mixed effect model
</span> fe <span style="color: #5f9ea0;"><-</span> fixef(x)
<span style="color: #a020f0;">if</span>(is.null(b))
b<span style="color: #5f9ea0;"><-</span>fe[2] <span style="color: #b22222;"># </span><span style="color: #b22222;">use the data effect size if not supplied
</span>
mu.a <span style="color: #5f9ea0;"><-</span> fe[1]
vc <span style="color: #5f9ea0;"><-</span> VarCorr(x)
sigma.y <span style="color: #5f9ea0;"><-</span> as.numeric(vc[5, 2]) <span style="color: #b22222;"># </span><span style="color: #b22222;">Residual StdDev
</span> sigma.a <span style="color: #5f9ea0;"><-</span> as.numeric(vc[2, 2]) <span style="color: #b22222;"># </span><span style="color: #b22222;">Cobblebar StdDev
</span> sigma.g <span style="color: #5f9ea0;"><-</span> as.numeric(vc[4, 2]) <span style="color: #b22222;"># </span><span style="color: #b22222;">Cobblebar:transect StdDev
</span>
hp<span style="color: #5f9ea0;"><-</span>c(b, mu.a, sigma.y, sigma.a, sigma.g)
names(hp)<span style="color: #5f9ea0;"><-</span>c(<span style="color: #bc8f8f;">"b"</span>, <span style="color: #bc8f8f;">"mu.a"</span>, <span style="color: #bc8f8f;">"sigma.y"</span>, <span style="color: #bc8f8f;">"sigma.a"</span>, <span style="color: #bc8f8f;">"sigma.g"</span>)
<span style="color: #a020f0;">return</span>(hp)
}
</pre>
<p>
To calculate power, we need to regress the simulated data in the same
way we did the pilot data and check for a significant <i>β</i>. Since
the optimization is done using numerical methods, there is always a
chance that it will fail. So, we make sure the regression
on the fake data catches and recovers from all errors. The solution
for error recovery is to simply try the regression on a new set of
fake data. This function is a pretty good example of using the R
error handling function <code>withCallingHandlers</code> and
<code>withRestarts</code>.
</p>
<pre class="src src-R"><span style="color: #0000ff;">fakeModWithRestarts</span> <span style="color: #5f9ea0;"><-</span> <span style="color: #a020f0;">function</span>(m.o, n = 100, ...){
<span style="color: #b22222;">## </span><span style="color: #b22222;">A Fake Model
</span> withCallingHandlers({
i <span style="color: #5f9ea0;"><-</span> 0
mod <span style="color: #5f9ea0;"><-</span> <span style="color: #228b22;">NULL</span>
<span style="color: #a020f0;">while</span> (i < n & is.null(mod)){
mod <span style="color: #5f9ea0;"><-</span> withRestarts({
f <span style="color: #5f9ea0;"><-</span> fake(m.orig = m.o, transform = F, ...)
<span style="color: #a020f0;">return</span>(update(m.o, data = f))
},
rs = <span style="color: #a020f0;">function</span>(){
i <span style="color: #5f9ea0;"><<-</span> i + 1
<span style="color: #a020f0;">return</span>(<span style="color: #228b22;">NULL</span>)
})
}
<span style="color: #a020f0;">if</span>(is.null(mod))
<span style="color: #a020f0;">warning</span>(<span style="color: #bc8f8f;">"ExceededIterations"</span>)
<span style="color: #a020f0;">return</span>(mod)
},
error = <span style="color: #a020f0;">function</span>(e){
invokeRestart(<span style="color: #bc8f8f;">"rs"</span>)
},
<span style="color: #a020f0;">warning</span> = <span style="color: #a020f0;">function</span>(w){
<span style="color: #a020f0;">if</span>(w$<span style="color: #a020f0;">message</span> == <span style="color: #bc8f8f;">"ExceededIterations"</span>)
cat(<span style="color: #bc8f8f;">"\n"</span>, w$<span style="color: #a020f0;">message</span>, <span style="color: #bc8f8f;">"\n"</span>)
<span style="color: #a020f0;">else</span>
invokeRestart(<span style="color: #bc8f8f;">"rs"</span>)
})
}
</pre>
<p>
To calculate the power of a particular design we run
<code>fakeModWithRestarts</code> 1000 times and look at the proportion of
significant <i>β</i> values:
</p>
<pre class="src src-R"><span style="color: #0000ff;">dt.power</span> <span style="color: #5f9ea0;"><-</span> <span style="color: #a020f0;">function</span> (m, n.sims = 1000, alpha=0.05, ...){
<span style="color: #b22222;">## </span><span style="color: #b22222;">Calculate power for a particular sampling design
</span> signif<span style="color: #5f9ea0;"><-</span>rep(<span style="color: #228b22;">NA</span>, n.sims)
<span style="color: #a020f0;">for</span>(i <span style="color: #a020f0;">in</span> 1:n.sims){
lme.power <span style="color: #5f9ea0;"><-</span> fakeModWithRestarts(m.o = m, ...)
<span style="color: #a020f0;">if</span>(!is.null(lme.power))
signif[i] <span style="color: #5f9ea0;"><-</span> summary(lme.power)$tTable[2, 5] < alpha
}
power <span style="color: #5f9ea0;"><-</span> mean(signif, na.rm = T)
<span style="color: #a020f0;">return</span>(power)
}
</pre>
<p>
Finally, we want to perform this analysis on many different sampling
designs. In my case I used all combinations of a set of effect sizes,
cobblebars, transects, and years. So, I generated the appropriate designs:
</p>
<pre class="src src-R"><span style="color: #0000ff;">factoredDesign</span> <span style="color: #5f9ea0;"><-</span> <span style="color: #a020f0;">function</span>(Elevs = 0.2/c(1,5,10,20),
Nlevs = seq(2, 10, by = 2),
Jlevs = seq(4, 10, by = 2),
Klevs = c(3, 5, 7), ...){
<span style="color: #b22222;">## </span><span style="color: #b22222;">Generates factored series of sampling designs for simulation
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">of data that follow a particular model.
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">Inputs:
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">Elevs - vector of effect sizes for the slope parameter.
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">Nlevs - vector of number of years to sample.
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">Jlevs - vector of number of cobblebars to sample.
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">Klevs - vector of number of transects to sample.
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">Results:
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">Data frame with where columns are the factors and
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">rows are the designs.
</span>
<span style="color: #b22222;"># </span><span style="color: #b22222;">Level lengths
</span> lE <span style="color: #5f9ea0;"><-</span> length(Elevs)
lN <span style="color: #5f9ea0;"><-</span> length(Nlevs)
lJ <span style="color: #5f9ea0;"><-</span> length(Jlevs)
lK <span style="color: #5f9ea0;"><-</span> length(Klevs)
<span style="color: #b22222;"># </span><span style="color: #b22222;">Generate repeated vectors for each factor
</span> E <span style="color: #5f9ea0;"><-</span> rep(Elevs, each = lN*lJ*lK)
N <span style="color: #5f9ea0;"><-</span> rep(rep(Nlevs, each = lJ*lK), times = lE)
J <span style="color: #5f9ea0;"><-</span> rep(rep(Jlevs, each = lK), times = lE*lN)
K <span style="color: #5f9ea0;"><-</span> rep(Klevs, times = lE*lN*lJ)
<span style="color: #a020f0;">return</span>(data.frame(E, N, J, K))
}
</pre>
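<p>As an aside, base R's <code>expand.grid</code> builds the same factorial set of designs in a single call. The rows come out in a different order than <code>factoredDesign</code> produces, but every combination is present.</p>

```r
## All combinations of effect size, years, cobblebars, and transects.
designs <- expand.grid(
  E = 0.2 / c(1, 5, 10, 20),   # effect sizes for the slope
  N = seq(2, 10, by = 2),      # number of years
  J = seq(4, 10, by = 2),      # number of cobblebars
  K = c(3, 5, 7)               # transects per cobblebar
)
```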
<p>
Once we know our effect sizes, the different sample sizes we want,
and the estimates of population variance, we can generate simulated
datasets that are similar to the pilot data. To calculate power we
simply simulate a large number of datasets and calculate the
proportion of slopes, <i>β</i>, that are significantly different from
zero (p-value < 0.05). This procedure is repeated for all the
effect sizes and sample sizes of interest. Here is the code for
generating a simulated dataset. It also does the work of applying the
inverse transform to the response variables if necessary.
</p>
<pre class="src src-R"><span style="color: #0000ff;">fake</span> <span style="color: #5f9ea0;"><-</span> <span style="color: #a020f0;">function</span>(N = 2, J = 6, K = 5, b = <span style="color: #228b22;">NULL</span>, m.orig = mod,
transform = <span style="color: #228b22;">TRUE</span>, ...){
<span style="color: #b22222;">## </span><span style="color: #b22222;">Simulated Data for power analysis
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">N = Number of years
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">J = Number of cobblebars
</span> <span style="color: #b22222;">## </span><span style="color: #b22222;">K = Number of transects within cobblebars
</span> year <span style="color: #5f9ea0;"><-</span> rep(0:(N-1), each = J*K)
cobblebar <span style="color: #5f9ea0;"><-</span> factor(rep(rep(1:J, each = K), times = N))
transect <span style="color: #5f9ea0;"><-</span> factor(rep(1:K, times = N*J))
<span style="color: #b22222;">## </span><span style="color: #b22222;">Simulated parameters
</span> hp<span style="color: #5f9ea0;"><-</span>GetHyperparam(x=m.orig)
<span style="color: #a020f0;">if</span>(is.null(b))
b <span style="color: #5f9ea0;"><-</span> hp[<span style="color: #bc8f8f;">'b'</span>]
g <span style="color: #5f9ea0;"><-</span> rnorm(J*K, 0, hp[<span style="color: #bc8f8f;">'sigma.g'</span>])
a <span style="color: #5f9ea0;"><-</span> rnorm(J*K, hp[<span style="color: #bc8f8f;">'mu.a'</span>] + g, hp[<span style="color: #bc8f8f;">'sigma.a'</span>])
<span style="color: #b22222;">## </span><span style="color: #b22222;">Simulated responses
</span> eta <span style="color: #5f9ea0;"><-</span> rnorm(J*K*N, a + b * year, hp[<span style="color: #bc8f8f;">'sigma.y'</span>])
<span style="color: #a020f0;">if</span> (transform){
<span style="color: #a020f0;">if</span> (m.orig$type == <span style="color: #bc8f8f;">"normal"</span>){
y <span style="color: #5f9ea0;"><-</span> eta
y[y > 1] <span style="color: #5f9ea0;"><-</span> 1 <span style="color: #b22222;"># </span><span style="color: #b22222;">Fix any boundary problems.
</span> y[y < 0] <span style="color: #5f9ea0;"><-</span> 0
}
<span style="color: #a020f0;">else</span> <span style="color: #a020f0;">if</span> (m.orig$type == <span style="color: #bc8f8f;">"log"</span>){
y <span style="color: #5f9ea0;"><-</span> exp(eta)
y[y > 1] <span style="color: #5f9ea0;"><-</span> 1
}
<span style="color: #a020f0;">else</span> <span style="color: #a020f0;">if</span> (m.orig$type == <span style="color: #bc8f8f;">"logit"</span>)
y <span style="color: #5f9ea0;"><-</span> exp(eta) / (1 + exp(eta))
}
<span style="color: #a020f0;">else</span>{
y <span style="color: #5f9ea0;"><-</span> eta
}
<span style="color: #a020f0;">return</span>(data.frame(prop.woody = y, year, transect, cobblebar))
}
</pre>
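<p>
The <code>dt.power</code> function called by <code>powerAnalysis</code> below is not reproduced in this post. As a rough sketch of what it must do &mdash; simulate many datasets with <code>fake</code> and return the fraction of significant year slopes &mdash; something like the following would suffice. The plain <code>lm</code> fit is my placeholder assumption; the original analysis presumably tested the slope with the fitted multilevel model.
</p>

```R
## Sketch of dt.power (not from the original post): estimate power as
## the proportion of simulated datasets whose year slope is
## significant at the 0.05 level.
dt.power <- function(N = 2, J = 6, K = 5, b = NULL, n.sims = 1000, ...){
  signif <- logical(n.sims)
  for (s in 1:n.sims){
    fake.data <- fake(N = N, J = J, K = K, b = b, ...)
    fit <- lm(prop.woody ~ year, data = fake.data)  # placeholder model
    signif[s] <- summary(fit)$coefficients["year", "Pr(>|t|)"] < 0.05
  }
  mean(signif)  # power = proportion of significant slopes
}
```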
<p>
Then I performed the power calculations on each of these designs. This
could take a long time, so I set this procedure to use parallel processing
if needed. Note that I had to re-<code>source</code> the file with all the
necessary functions for each processor.
</p>
<pre class="src src-R"><span style="color: #0000ff;">powerAnalysis</span> <span style="color: #5f9ea0;"><-</span> <span style="color: #a020f0;">function</span>(parallel = T, ...){
<span style="color: #b22222;">## </span><span style="color: #b22222;">Full Power Analysis
</span>
<span style="color: #b22222;">## </span><span style="color: #b22222;">Parallel
</span> <span style="color: #a020f0;">if</span>(parallel){
closeAllConnections()
cl <span style="color: #5f9ea0;"><-</span> makeCluster(7, type = <span style="color: #bc8f8f;">"SOCK"</span>)
on.exit(closeAllConnections())
clusterEvalQ(cl, <span style="color: #5f9ea0;">source</span>(<span style="color: #bc8f8f;">"cobblebars2.r"</span>))
}
<span style="color: #b22222;">## </span><span style="color: #b22222;">The simulations
</span> dat <span style="color: #5f9ea0;"><-</span> factoredDesign(...)
<span style="color: #a020f0;">if</span> (parallel){
dat$power <span style="color: #5f9ea0;"><-</span> parRapply(cl, dat, <span style="color: #a020f0;">function</span>(x,...){
dt.power(N = x[2], J = x[3], K = x[4], b = x[1], ...)
}, ...)
} <span style="color: #a020f0;">else</span> {
dat$power <span style="color: #5f9ea0;"><-</span> apply(dat, 1, <span style="color: #a020f0;">function</span>(x, ...){
dt.power(N = x[2], J = x[3], K = x[4], b = x[1], ...)
}, ...)
}
<span style="color: #a020f0;">return</span>(dat)
}
</pre>
<p>
The output of the <code>powerAnalysis</code> function is a data frame with
columns for the power and all the sample design settings. So, I wrote
a custom plotting function for this data frame:
</p>
<pre class="src src-R"><span style="color: #0000ff;">plotPower</span> <span style="color: #5f9ea0;"><-</span> <span style="color: #a020f0;">function</span>(dt){
xyplot(power~N|J*K, data = dt, groups = E,
panel = <span style="color: #a020f0;">function</span>(...){panel.xyplot(...)
panel.abline(h = 0.8, lty = 2)},
type = c(<span style="color: #bc8f8f;">"p"</span>, <span style="color: #bc8f8f;">"l"</span>),
xlab = <span style="color: #bc8f8f;">"sampling years"</span>,
ylab = <span style="color: #bc8f8f;">"power"</span>,
strip = strip.custom(var.name = c(<span style="color: #bc8f8f;">"C"</span>, <span style="color: #bc8f8f;">"T"</span>),
strip.levels = c(T, T)),
auto.key = T
)
}
</pre>
<p>
Below is the figure for the cobblebar power analysis. I won't go into
detail on what the results mean, since I am concerned here with
illustrating the technique and the R code. Obviously, as the number of
cobblebars and transects sampled per year increases, so does power. And
the larger the effect size, the easier it is to detect with a test.
</p>
<p>
<img src="ltxpng/powerAnalysis_0002.png"/>
</p>
</div>
</div>
<div id="postamble">
<p class="author"> Author: Todd Jobe
<a href="mailto:toddjobe@unc.edu">&lt;toddjobe@unc.edu&gt;</a>
</p>
<p class="date"> Date: 2009-09-18 Fri</p>
<p class="creator">HTML generated by org-mode 6.30trans in emacs 22</p>
</div>toddjobehttp://www.blogger.com/profile/10065244958204541807noreply@blogger.com163tag:blogger.com,1999:blog-8557918062988474595.post-8519377830588291802009-06-15T08:40:00.007-04:002009-06-15T12:03:44.608-04:00Ecology Journals LaTeX ClassLaTeX is a document markup language used to create manuscripts that have nice typesetting. Professional looking documents can be created from simple text files. It is particularly useful for creating manuscripts that have math equations in them.
Writing documents in LaTeX should allow authors to focus on the content of the manuscript rather than the formatting. All LaTeX documents are associated with a class file, which sets the formatting for the document. Users simply associate a document with a class and the text is formatted appropriately. Unfortunately, many academic journals have very specific formatting requirements that are not built into LaTeX.
The journals of the <a href="http://www.esa.org">Ecological Society of America</a> do not have LaTeX class files appropriate for their major journals: Ecology, Ecological Applications and Ecological Monographs (even though they will accept documents in LaTeX format). One user-created LaTeX class file for ecology is available <a href="http://floliveweb.free.fr/htm/tele.php?lang=fr">here</a>. However, this class formats a LaTeX file into a proof-style document, not a submission-style document.
I have recently been writing a submission for Ecological Applications that is in LaTeX. I had to write my own class file for formatting this document according to the ESA guidelines (<a href="http://esapubs.org/esapubs/preparation.htm">http://esapubs.org/esapubs/preparation.htm</a>). I am sure that others have run into the same problem. So, I am making my class file publicly available.
Please consider this class to be in alpha. I have not yet written support for multiple author addresses, simply because I didn't need it for the current manuscript. Feel free to improve the class; just let me know about it, so I can use it as well.
To use this class in a LaTeX document, you can either install the class file into your LaTeX distribution or simply place the file in the same folder as your LaTeX document. Bibliography formatting with BibTeX requires that the <a href="http://www.ctan.org/tex-archive/help/Catalogue/entries/natbib.html">natbib</a> package be installed, along with the <a href="http://www.math.ntnu.no/~jarlet/latex/ecology.bst">ecology.bst</a> bibliography style file. I have packaged the ecology.cls class along with an example LaTeX file, the proof pdf, an example figure, and an example bibliography into a tarball available below.
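A minimal manuscript skeleton using the class might look like the following. This is my sketch, not part of the tarball: the exact commands the class defines (title and author handling in particular) may differ, so check the bundled example file.

```latex
% Hypothetical minimal document for ecology.cls -- the commands
% actually supported by the class may differ; see the example
% file shipped in the tarball.
\documentclass{ecology}
\usepackage{natbib}

\begin{document}
\title{An Example Manuscript}
\author{R. Todd Jobe}
\maketitle

Species co-occurrence varies with elevation \citep{example2009}.

\bibliographystyle{ecology}
\bibliography{myrefs}
\end{document}
```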
Download: <a href="http://www.unc.edu/~toddjobe/src/ecologyLatexClass.tar.gz">ecologyLatexClass.tar.gz</a>toddjobehttp://www.blogger.com/profile/10065244958204541807noreply@blogger.com53tag:blogger.com,1999:blog-8557918062988474595.post-7502377211215338562009-06-01T15:40:00.006-04:002009-06-01T16:26:23.131-04:00Add a column to a text file from a raster<h2>Text files are great</h2>
I have mixed feelings about software interoperability. Sure GRASS interfaces with R, and R can interface with Excel and Access if you have the ODBC drivers. But, often users do not have the energy to learn the best techniques for moving data back and forth between programs. Further, there always seems to be some quirk with data that prevents the transfer from going as smoothly as the user's manual examples do. Then, I remember the lowly text file.
Whether you use comma separated values (csv) files or tab-delimited files, every software platform on the planet can read text files of records (lines) and fields (values on a line separated by commas, tabs, or some other character). If you're trying to send data to someone else, you can be confident that your recipient can read them. It may not be the flashiest or fastest way to move data between people and systems, but it always works. Finally, when you use plain text files for data, you have many more options for working with the files. Tools such as bash, awk, sed, perl, and any other programming language you can think of can be brought to bear on data analysis problems.
<h2>Attaching raster data to points in GIS</h2>
Switch gears for a moment from a common file format to a common analysis task. For researchers who work with spatial point patterns (like plots), a regular step in data analysis is to pull data from raster layers where they intersect the point data and add it as columns to the point data. Informally, these are known as "drill-down" operations. You have a stack of maps and you want to drill down through them vertically at point locations, grab the data, and add it to the points.
This is a relatively simple task in ArcGIS if you use <a href="http://www.spatialecology.com/htools/tooldesc.php">Hawth's tools</a>. It's also pretty easy in GRASS, though it involves a couple of steps. The function <a href="http://grass.itc.it/gdp/html_grass63/v.what.rast.html"><code>v.what.rast</code></a> will put raster values into a column of a point dataset. The column must already exist in the dataset, which can be done with the command <a href="http://grass.itc.it/gdp/html_grass63/v.db.addcol.html"><code>v.db.addcol</code></a>; <code>v.what.rast</code> then fills in the missing values where the raster intersects the points.
<h2>Adding raster values to text files</h2>
Now, back to text files. I recently worked with a graduate student who wanted elevation values from a lidar-derived digital elevation model (dem) for four samples totalling about 1100 plots. She is not a GRASS user, so she sent me the locations of all the plots along with an identifier in csv format. I wanted to send her back the same csv files with elevation simply added as a field. I wrote a bash script to accomplish this quickly and repeatably for her files. In this script, all the GRASS functions are hidden from the user. Users simply input a csv file and the raster whose values they wish to add, and a csv file is output with that column added. The script requires the awk script that I <a href="http://www.unc.edu/%7Etoddjobe/blog/2009/05/extract-variable-types-from-text-file.html">posted previously</a>, which recognizes the types of variables present in the input csv. I have included a portion of the full script below. GRASS has a very particular code style that it uses for running scripts natively (see <a href="http://grass.itc.it/gdp/html_grass63/g.parser.html"><code>g.parser</code></a> for details). This makes the script very long, so I've cut out the not-so-important parts. If you would like the full file, you can get it <a href="http://www.unc.edu/%7Etoddjobe/src/v.ascii.addrastcol">here</a>.
<h2>v.ascii.addrastcol</h2>
<pre class="prettyprint">
#!/bin/sh
##################################################################
#
# MODULE: v.ascii.addrastcol
# AUTHOR(S):    R. Todd Jobe &lt;toddjobe@unc.edu&gt;
# PURPOSE: Add a raster value field to a csv of points
# COPYRIGHT: (C) 2009 by R. Todd Jobe
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
##################################################################
# . . . Grass header information removed
# . . . The cut code creates variables GIS_OPT_* and GIS_FLAG_*
# . . . Get the full code at:
# . . . http://www.unc.edu/~toddjobe/src/v.ascii.addrastcol
##################################################################
# Need space-delimited arguments for looping
GIS_OPT_INPUT=${GIS_OPT_INPUT//,/ }
# test for input raster map
eval `g.findfile element=cell file="$GIS_OPT_RAST"`
if [ ! "$file" ] ; then
g.message -e "Raster map <$GIS_OPT_RAST> not found in mapset search path"
exit 1
fi
for i in $GIS_OPT_INPUT; do
# Check that input files exist
if [ ! "$i" ] ; then
echo "file $i does not exist" >&2
exit $E_NOTFOUND
fi
# Create header line if necessary and get variable types
if [ $GIS_FLAG_H -eq 1 ]; then
skip=1
else
skip=0
fi
# Intermediate variables
map=`basename $i .csv`
tmp=`tempfile -s ".csv"`
of="${map}${GIS_OPT_SUFFIX}"
col=`typeVars.awk -v h=$GIS_FLAG_H -F, $i`
# Get the raster type
if `r.info -t $GIS_OPT_RAST | grep -q "FCELL"`; then
rtype="double precision"
else
rtype="int"
fi
# Grass-specific calls
v.in.ascii --overwrite input="$i" output="$map" fs="$GIS_OPT_FS" \
skip=$skip column="$col" x="$GIS_OPT_X" y="$GIS_OPT_Y"
v.db.addcol map="$map" \
columns="$GIS_OPT_RAST $rtype"
v.what.rast vector="$map" column="$GIS_OPT_RAST" \
raster="$GIS_OPT_RAST"
v.out.ascii input="$map" fs="," \
columns="$GIS_OPT_RAST" | \
awk -F, -v meters="$GIS_FLAG_M" -v head="$GIS_FLAG_H" \
-v rast="$GIS_OPT_RAST" '
BEGIN{ if(head == 1) print rast;}
{
# Convert Ft. elevation to meters if necessary
if($4!="" && meters==1) $4=$4 * 0.30480060
print $4
}' >"$tmp"
# Join raster values to input and write to output
paste -d$GIS_OPT_FS "$i" "$tmp" >"$of"
# Cleanup
rm "$tmp"
if [ $GIS_FLAG_R -eq 1 ] ; then
g.remove -f --quiet vect=$map
fi
done
</pre>
To see the usage for the full script, you can simply start GRASS and from the command line type <code>v.ascii.addrastcol --help</code>. Here's a short example for a random set of 25 points on Ft. Bragg, North Carolina.
<pre class="prettyprint">
GRASS 6.4.0RC3 (ftBragg):~ > head locs.csv
"","V1","V2","V3"
"1",594927.510605283,153893.107849400,1
"2",590884.780751329,155855.275043186,2
"3",594095.580107841,155284.234116311,3
"4",605646.865006463,154397.274762958,4
"5",603002.815879479,153284.801461238,5
"6",603453.085431226,152633.248500899,6
"7",602981.046299442,152156.288326646,7
"8",604160.50828444,151921.617483228,8
"9",603752.189821169,152062.642227756,9
GRASS 6.4.0RC3 (ftBragg):~ > v.ascii.addrastcol --quiet -hr \
   input=locs.csv rast=iLandcover x=2 y=3
GRASS 6.4.0RC3 (ftBragg):~ > head locs_isect.csv
"","V1","V2","V3",iLandcover
"1",594927.510605283,153893.107849400,1,3
"2",590884.780751329,155855.275043186,2,3
"3",594095.580107841,155284.234116311,3,1
"4",605646.865006463,154397.274762958,4,3
"5",603002.815879479,153284.801461238,5,3
"6",603453.085431226,152633.248500899,6,3
"7",602981.046299442,152156.288326646,7,3
"8",604160.50828444,151921.617483228,8,1
"9",603752.189821169,152062.642227756,9,3
</pre>toddjobehttp://www.blogger.com/profile/10065244958204541807noreply@blogger.com234tag:blogger.com,1999:blog-8557918062988474595.post-73824801699720458492009-05-25T14:05:00.012-04:002009-05-26T10:48:36.851-04:00Extract variable types from a text file using awkThe most difficult tasks in data analysis are not, from my experience, the statistical or analytic tasks. Instead, the most time consuming and frustrating parts of analysis are simply extracting and transforming raw data into a format that can be analyzed by common software. A common task in this vein is to determine what type of variables are contained in a text file containing fields and records (e.g. a comma separated values file or .csv). Most commonly, these data types are one of either strings, integers or decimal numbers called doubles. Some software, such as <a href="http://www.blogger.com/www.r-project.org">R</a> does a good job at automatically recognizing what kinds of data are contained in a text file through commands like <a href="http://wiki.r-project.org/rwiki/doku.php?id=rdoc:base:read.table">read.table</a>. Other software, such as the GIS software <a href="http://grass.osgeo.org/">GRASS</a>, may require you to explicitly input data types of each column of data. I find that to be time consuming, especially when you may have dozens of columns or dozens of text files to work through.
I have written the following script in <a href="http://www.gnu.org/software/gawk/">awk</a> (technically, gawk) to automatically determine what type of data is contained in a columnar text file. Awk is a programming language specifically designed for parsing and transforming such text files. Awk scripts can often be developed as short snippets of larger bash scripts. Such is the case here. It could probably be done more easily in Perl, but I was already working on a bash script that this will go into, which I will post later.
The task accomplished by the script is a simple one. Create a comma separated string that lists the variable names and data types for each column in a text file. This string will ultimately be used as input to the command <a href="http://grass.itc.it/gdp/html_grass63/v.in.ascii.html">v.in.ascii</a> in GRASS, which creates a point vector dataset from a text file.
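The heart of the script is a three-way test applied to each field: anything containing a character other than a digit, minus sign, or dot is a string; otherwise, a value with a dot is a double and everything else is an integer. The same test can be tried in isolation from the shell (a standalone snippet, not part of typeVars.awk itself):

```shell
# Classify a single value using the same regexes as typeVars.awk
classify() {
  printf '%s\n' "$1" | awk '{
    if ($0 ~ /[^-.0-9]/)  print "var"               # non-numeric character
    else if ($0 ~ /\./)   print "double precision"  # contains a decimal point
    else                  print "integer"
  }'
}
classify abc123   # prints "var"
classify 3.14     # prints "double precision"
classify 42       # prints "integer"
```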
<h2>typeVars.awk</h2>
<pre class="prettyprint">#!/bin/awk -f
##
## file: typeVars.awk
## created by: R. Todd Jobe <toddjobe@unc.edu>
## date: 2009.05.25
## This script will print the variable names and types from a
## columnar text file.
## Specifically, it can generate a variable declaration for GRASS
## input.
{
# Set the types to default values if not defined
if(str=="") str="var"
if(it=="") it="integer"
if(dbl=="") dbl="double precision"
# Get the column name from the 1st row
if(NR==1){
for(i=1; i <= NF; i++){
type[i,1]=$(i)
}
}else if(NR==2){
# Get the column type from the 2nd row
j=0
for(i=1; i <= NF; i++){
if($i ~ /[^-.0-9]/){
type[i,2]=str
strf[++j]=i
if(str=="var"){
max[i]=length($i)
type[i,2]=sprintf("%s(%.0f)",str,max[i])
}
}else if($i ~ /\./){
type[i,2]=dbl
}else{
type[i,2]=it
}
}
}else{
# Get the maximum column width for strings
if(str=="var"){
for( k in strf ){
f=strf[k]
if(length($f) > max[f]) {
max[f] = length($f)
type[f,2]=sprintf("%s(%.0f)",str,max[f])
}
}
}
}
}
END{
ORS=""
out=sprintf("%s %s",type[1,1],type[1,2])
for(l=2;l<i;l++){
out=sprintf("%s, %s %s",out,type[l,1],type[l,2])
}
print out
}</pre>
<h2>An Example</h2>
Here's an example of using the typeVars.awk script. We begin with a csv file of different variable types. typeVars.awk is executed on this csv and the output is a single string of comma-separated column names with variable types. By default the output is formatted for use in GRASS and strings are labelled with their maximum length.
<pre class="prettyprint">bash-3.2$ cat trial.csv
astring,aint,adbl,plot,x,y
abc123,1234,1.23,zyx1234,54,1.45
abc1235,1234,2.587,z*x1234,360,1.45
abc12,12345,1.23,zy.1234,1,1.45
abc1,999999,1.23,zx1234,0,1.45
bash-3.2$ echo `typeVars.awk -F, trial.csv`
astring var(7), aint integer, adbl double precision, plot var(7), x integer, y double precision
</pre>
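Awk splits records on whitespace by default, so without <code>-F,</code> an entire csv line is seen as a single field. A one-line comparison makes the difference clear (standalone illustration):

```shell
echo 'a,b,c' | awk '{print NF}'      # prints 1: whole line is one field
echo 'a,b,c' | awk -F, '{print NF}'  # prints 3: comma-separated fields
```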
Notice that the -F parameter is specified in the call to typeVars.awk. Whitespace is the default field separator in awk, so in order to parse a csv it needs to be set to a ",".toddjobehttp://www.blogger.com/profile/10065244958204541807noreply@blogger.com348tag:blogger.com,1999:blog-8557918062988474595.post-22378184568605586422009-05-14T11:47:00.003-04:002009-05-26T11:22:27.658-04:00Time for a blog<div xmlns="http://www.w3.org/1999/xhtml">I've started this blog mainly as a place to post the various things I do with computers at my job. I am a postdoctoral associate in the Geography Department at the University of North Carolina at Chapel Hill. Though I work in a Geography department right now, I am really an ecologist. Specifically, I am a quantitative community ecologist. I think about how and why species occur together. The "quantitative" part of that description just means that I typically answer ecological questions with the tools of mathematics, computer simulation modelling, and statistics.<p>Anyway, I think a lot about the best ways to analyze and present ecological data, and people constantly ask me questions about it. I hope this blog can be a "pre-emptive strike" for answering these questions. I plan to post lots of source code and comment on my own particular style of doing things. Take what you like. Leave the rest.</p> </div>toddjobehttp://www.blogger.com/profile/10065244958204541807noreply@blogger.com66