2009-05-25

Extract variable types from a text file using awk

The most difficult tasks in data analysis are not, in my experience, the statistical or analytic ones. Instead, the most time-consuming and frustrating parts of analysis are simply extracting and transforming raw data into a format that can be analyzed by common software. A common task in this vein is determining what types of variables are contained in a text file of fields and records (e.g., a comma-separated values file, or .csv). Most commonly, these data types are strings, integers, or decimal numbers called doubles. Some software, such as R, does a good job of automatically recognizing what kinds of data are contained in a text file through commands like read.table. Other software, such as the GIS package GRASS, may require you to explicitly declare the data type of each column. I find that to be tedious, especially when you have dozens of columns or dozens of text files to work through.

I have written the following script in awk (technically, gawk) to automatically determine what type of data is contained in each column of a columnar text file. Awk is a programming language specifically designed to parse and transform such files, and awk scripts can often be developed as short snippets of larger bash scripts. Such is the case here. It could probably be done more easily in Perl, but I was already working on a bash script that this will go into, which I will post later. The task accomplished by the script is a simple one: create a comma-separated string that lists the variable name and data type of each column in a text file. This string will ultimately be used as input to the command v.in.ascii in GRASS, which creates a point vector dataset from a text file.

typeVars.awk

#!/bin/awk -f
##
## file: typeVars.awk
## created by: R. Todd Jobe <toddjobe@unc.edu>
## date: 2009.05.25
## This script will print the variable names and types from a 
## columnar text file.
## Specifically, it can generate a variable declaration for GRASS 
## input.
{
    # Set the type names to default values if not defined
    if(str=="") str="var"
    if(it=="") it="integer"
    if(dbl=="") dbl="double precision"

    # Get the column names from the 1st row
    if(NR==1){
        nf=NF
        for(i=1; i<=NF; i++){
            type[i,1]=$i
        }
    }else if(NR==2){
        # Infer the column types from the 2nd row
        j=0
        for(i=1; i<=NF; i++){
            if($i ~ /[^-.0-9]/){
                # Any character outside -.0-9 marks a string column
                type[i,2]=str
                strf[++j]=i
                if(str=="var"){
                    max[i]=length($i)
                    type[i,2]=sprintf("%s(%.0f)",str,max[i])
                }
            }else if($i ~ /\./){
                type[i,2]=dbl
            }else{
                type[i,2]=it
            }
        }
    }else{
        # Track the maximum column width for string columns
        if(str=="var"){
            for(k in strf){
                c=strf[k]
                if(length($c) > max[c]){
                    max[c]=length($c)
                    type[c,2]=sprintf("%s(%.0f)",str,max[c])
                }
            }
        }
    }
}
END{
    # Assemble the "name type" pairs into one comma-separated string
    ORS=""
    out=sprintf("%s %s",type[1,1],type[1,2])
    for(l=2; l<=nf; l++){
        out=sprintf("%s, %s %s",out,type[l,1],type[l,2])
    }
    print out
}

An Example

Here's an example of using the typeVars.awk script. We begin with a csv file containing several variable types. typeVars.awk is executed on this csv, and the output is a single string of comma-separated column names with their variable types. By default, the output is formatted for use in GRASS, and strings are labelled with their maximum length.
bash-3.2$ cat trial.csv
astring,aint,adbl,plot,x,y
abc123,1234,1.23,zyx1234,54,1.45
abc1235,1234,2.587,z*x1234,360,1.45
abc12,12345,1.23,zy.1234,1,1.45
abc1,999999,1.23,zx1234,0,1.45
bash-3.2$ echo `typeVars.awk -F, trial.csv`
astring var(7), aint integer, adbl double precision, plot var(7), x integer, y double precision
Notice that the -F parameter is specified in the call to typeVars.awk. Whitespace is the default field separator in awk, so to parse a csv it needs to be set to ",".
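To close the loop, here is a sketch of how that string might be handed to GRASS. I'm following the GRASS 6 manual page for v.in.ascii here (fs for the field separator, skip to skip the header row, x and y for the 1-based positions of the coordinate columns); check your version's manual, as parameter names can differ:

bash-3.2$ columns=`typeVars.awk -F, trial.csv`
bash-3.2$ v.in.ascii input=trial.csv output=trial fs="," skip=1 x=5 y=6 columns="$columns"

In trial.csv, x and y are the 5th and 6th columns, which is why x=5 and y=6.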


2009-05-14

Time for a blog

I've started this blog mainly as a place to post the various things I do with computers at my job. I am a postdoctoral associate in the Geography Department at the University of North Carolina at Chapel Hill. Though I work in a Geography department right now, I am really an ecologist. Specifically, I am a quantitative community ecologist. I think about how and why species occur together. The "quantitative" part of that description just means that I typically answer ecological questions with the tools of mathematics, computer simulation modelling, and statistics.

Anyway, I think a lot about the best ways to analyze and present ecological data, and people constantly ask me questions about it. I hope this blog can be a "pre-emptive strike" for answering these questions. I plan to post lots of source code and comment on my own particular style of doing things. Take what you like. Leave the rest.