Extract variable types from a text file using awk
The most difficult tasks in data analysis are not, in my experience, the statistical or analytic ones. Instead, the most time-consuming and frustrating parts of an analysis are simply extracting and transforming raw data into a format that common software can analyze. A common task in this vein is determining what types of variables are contained in a text file of fields and records (e.g. a comma-separated values file, or .csv). Most commonly, these data types are strings, integers, or decimal numbers (doubles). Some software, such as R, does a good job of automatically recognizing what kinds of data a text file contains through commands like read.table. Other software, such as the GIS package GRASS, may require you to explicitly specify the data type of each column. I find that to be time-consuming, especially when you have dozens of columns or dozens of text files to work through.
I have written the following script in awk (technically, gawk) to automatically determine what type of data is contained in a columnar text file. Awk is a programming language specifically designed to parse and transform such text files. Awk scripts can often be developed as short snippets to be included in larger bash scripts. Such is the case here. It could probably be done more easily in Perl, but I was already working on a bash script that this will go into, which I will post later.
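As a minimal illustration of that pattern (the file name data.csv is just a placeholder), a one-line awk program can be embedded directly in a bash script, for instance to count the columns of a csv before deciding how to process it:

#!/bin/bash
# Report how many comma-separated columns the first line of a file contains.
ncols=$(awk -F, 'NR==1{print NF; exit}' data.csv)
echo "data.csv has $ncols columns"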
The task accomplished by the script is a simple one: create a comma-separated string that lists the variable name and data type of each column in a text file. This string will ultimately be used as input to the command v.in.ascii in GRASS, which creates a point vector dataset from a text file.
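For context, v.in.ascii accepts such a declaration through its columns option. The call below is only a hypothetical sketch: the file name, the x/y column positions, and some option names (e.g. fs in older GRASS releases versus separator in newer ones) depend on your data and GRASS version.

# Hypothetical import: points.csv has x,y coordinates in columns 1 and 2,
# plus a string and an integer attribute; skip=1 skips the header row.
v.in.ascii input=points.csv output=mypoints fs=, skip=1 x=1 y=2 \
    columns="x double precision, y double precision, plot var(10), elev integer"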
typeVars.awk
#!/bin/awk -f
##
## file: typeVars.awk
## created by: R. Todd Jobe <toddjobe@unc.edu>
## date: 2009.05.25
## This script will print the variable names and types from a
## columnar text file.
## Specifically, it can generate a variable declaration for GRASS
## input.
{
    # Set the types to default values if not defined
    if(str=="") str="var"
    if(it=="") it="integer"
    if(dbl=="") dbl="double precision"

    # Get the column name from the 1st row
    if(NR==1){
        for(i=1; i <= NF; i++){
            type[i,1]=$(i)
        }
    }else if(NR==2){
        # Get the column type from the 2nd row
        j=0
        for(i=1; i <= NF; i++){
            if($i ~ /[^-.0-9]/){
                # Non-numeric characters present: treat as a string column
                type[i,2]=str
                strf[++j]=i
                if(str=="var"){
                    max[i]=length($i)
                    type[i,2]=sprintf("%s(%.0f)",str,max[i])
                }
            }else if($i ~ /\./){
                # Numeric with a decimal point: double
                type[i,2]=dbl
            }else{
                # Numeric without a decimal point: integer
                type[i,2]=it
            }
        }
    }else{
        # Get the maximum column width for strings
        if(str=="var"){
            for( k in strf ){
                # strf[k] holds the field number of the k-th string column
                if(length($(strf[k])) > max[strf[k]]){
                    max[strf[k]] = length($(strf[k]))
                    type[strf[k],2]=sprintf("%s(%.0f)",str,max[strf[k]])
                }
            }
        }
    }
}
END{
    ORS=""
    out=sprintf("%s %s",type[1,1],type[1,2])
    # i still holds NF+1 from the field loops above
    for(l=2;l<i;l++){
        out=sprintf("%s, %s %s",out,type[l,1],type[l,2])
    }
    print out
}
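In the example that follows, the script is called by name, which assumes it has been made executable and placed somewhere on your PATH (the ~/bin location below is just one common choice); alternatively, it can always be run as gawk -f typeVars.awk.

bash-3.2$ chmod +x typeVars.awk
bash-3.2$ mv typeVars.awk ~/bin/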
An Example
Here's an example of using the typeVars.awk script. We begin with a csv file containing several different variable types. typeVars.awk is run on this csv, and the output is a single string of comma-separated column names and variable types. By default the output is formatted for use in GRASS, and strings are labelled with their maximum length.

bash-3.2$ cat trial.csv
astring,aint,adbl,plot,x,y
abc123,1234,1.23,zyx1234,54,1.45
abc1235,1234,2.587,z*x1234,360,1.45
abc12,12345,1.23,zy.1234,1,1.45
abc1,999999,1.23,zx1234,0,1.45
bash-3.2$ echo `typeVars.awk -F, trial.csv`
astring var(7), aint integer, adbl double precision, plot var(7), x integer, y double precision

Notice that the -F parameter is specified in the call to typeVars.awk. Whitespace is the default field separator in awk, so in order to parse a csv it needs to be set to a ",".
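Because the defaults are only applied when str, it, and dbl are empty, they can also be overridden from the command line with awk's -v option. As a sketch, the SQLite-style labels below are just an illustration; note that the maximum-width suffix is only appended for the default "var" label, so an overridden string label is emitted without a length.

bash-3.2$ echo `typeVars.awk -F, -v str=TEXT -v it=INT -v dbl=REAL trial.csv`
astring TEXT, aint INT, adbl REAL, plot TEXT, x INT, y REAL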