Monday, February 14, 2011

Handling Data

When you start an experiment using a PEBL task, you should give some thought to how you will eventually save and analyze the data.  I frequently get requests for help understanding PEBL output after the data were collected, when it is too late for easy elegant fixes and only difficult ugly things are possible.  Importantly, especially for the pre-packaged PEBL tasks, there is no guarantee that a task will save data in a format you know how to use, or save all the right information in the right way.  Some of the tasks in the battery were built as basic demos that researchers could adapt to suit their needs, and so may not record any data at all!  Before you conduct an experiment using a PEBL task, you should always verify that its output is what you want.

Most of the standard PEBL tasks record a trial-by-trial data record into a text file (sometimes a comma-separated .csv file, other times just space separated).  Some of the files also include a header indicating the meaning of each column.  Note that each row will be tagged with all of its identifying information, even things like subject code, which won't change at all for a subject.
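One quick way to check a pilot file before running real subjects is sketched below. It assumes a space-separated file called pilot-bcst.txt; the file name, column names, and contents are all made up, and are created here only so the example runs on its own. The idea is to look at the first few lines, then count how many fields each row has. If every row reports the same number of fields, the file should read cleanly into a stats package.

```shell
# Made-up pilot data (hypothetical columns), created only for the demo.
workdir=$(mktemp -d)
cd "$workdir"
printf 'sub trial resp rt\n1 1 3 812\n1 2 1 640\n1 3 2 701\n' > pilot-bcst.txt

# Look at the first few lines to see whether there is a header.
head -n 3 pilot-bcst.txt

# Tally rows by number of fields (for a .csv file, use awk -F, instead).
# A single output line means every row has the same number of columns.
field_counts=$(awk '{ print NF }' pilot-bcst.txt | sort -n | uniq -c)
echo "$field_counts"
```

If the tally shows more than one field count, some rows are short or long, and that is much easier to fix in the script now than in 100 data files later.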

Some of the PEBL tasks include a report as well.  This is a human-readable text file containing basic summary data.  This might provide simple statistics for each subject, but not in a standardized format, so you need to have somebody enter them all by hand or use awk or some other text processing language to make the data obey. For many users, neither of these is ideal.  Here are some tools and tricks you can use to spend less time managing data after your experiment is complete.




First, let's think about what to do with data. Let's say you are about to run an experiment with 100 subjects, doing something like PEBL's WCST task.  Here  are some possibilities for a data analysis plan you should think about BEFORE you even run one subject.

1. Read in files to data analysis package individually
One approach you might take (and one I often do) is to simply read in data files to your stats package programmatically.  So you have 100 files, labeled something like bcst-1.txt through bcst-100.txt.  In the statistical computing language R, you simply need to do something like:

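subs <- 1:100
master <- NULL
for(i in subs) {
    ## read one subject's file (add header=TRUE if your files have a header line)
    dat <- read.table(paste("bcst-", i, ".txt", sep=""))
    ## now, add dat to master file, or extract relevant data from dat.
    master <- rbind(master, dat)
}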
If you are using something like SAS or SPSS, you can probably figure out a way to do something similar, but it is really simple in R.  Now, you can sometimes avoid this step if you also save your data to a master experiment file, but even then if you are running the study on multiple computers, you will need to combine data files or read them in individually, which leads me to...


2. Concatenate files
Sometimes it is simpler to concatenate the data files prior to reading into your stats package.  This can be done (tediously) by sitting a research assistant in front of a computer, and having them copy-and-paste files together in a text editor, or into an excel spreadsheet.  But there are easier ways.

Concatenating files is REALLY simple from the command line (on linux or osx, or if you install MinGW on windows) or even from a batch file on Windows.  In fact, it is so simple and fast that even though there are 'data merging' window-based applications out there, they are probably all inferior.  Here is a simple tutorial.


Unix-based file concatenation
You can use this approach on linux, OSX, or if you install MinGW/Cygwin on windows.  Using a combination of 'cat', 'grep', 'echo', and 'head', you can do most of what you want; if you need to, you can also use things like awk, sed, paste, colrm, and tail to do absolutely anything you need.  These are unix utilities designed to process logfiles, which is essentially what your data files are.

1. Navigate to the directory your data files are in.  The $ indicates the command prompt--don't type it.
$ cd Data/exp222/data

2. Concatenate data files.
If your data files have no header line, you are in luck.  Just do the following to create merged-bcst.txt:

$ cat bcst-*.txt > merged-bcst.txt

This will not impact the original files at all, and will create a new file called merged-bcst.txt which will just contain a copy of every file. One trick is to name your merged file with a  different root name (or in a different subdirectory) so that if you collect more data and run the command again, it won't accidentally include your merged data set (duplicating your data).  Thus, I try to avoid doing things like cat bcst-*.txt > bcst-all.txt, because I've been burned by it in the past.
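The precaution above can be sketched as follows: write the merged file into its own subdirectory (here called merged, an arbitrary name) so the bcst-*.txt pattern can never pick it up, then compare line counts as a sanity check. The three small data files are made up so the example runs on its own.

```shell
# Made-up, header-less data files, created only for the demo.
workdir=$(mktemp -d)
cd "$workdir"
printf '1 1 812\n1 2 640\n' > bcst-1.txt
printf '2 1 755\n2 2 698\n' > bcst-2.txt
printf '3 1 901\n' > bcst-3.txt

# Keep the merged output out of the directory that the wildcard searches.
mkdir -p merged
cat bcst-*.txt > merged/merged-bcst.txt

# The merged file should have as many lines as all the raw files combined.
raw_lines=$(cat bcst-*.txt | wc -l)
merged_lines=$(wc -l < merged/merged-bcst.txt)
echo "raw: $raw_lines merged: $merged_lines"
```

Running the cat line again after collecting more data simply rebuilds the merged file from the raw files, with nothing duplicated.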

But what if you have a header on each file?  You can use the 'tail' command to remove the first row:
$  tail -n+2 -q bcst-*.txt > merged-bcst.txt

The -n+2 starts output at the second line of each file, and the -q keeps tail from printing a file-name banner between files.

You can also use grep for this, which searches for matches.  With the -v option, it does reverse matches, so grep -v for something that only appears in the headers, and you get everything else. grep is much more powerful, and will allow you to extract just certain conditions from your data, so it has many other uses which I may cover some other day.
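Here is a minimal sketch of the grep approach. It assumes each header line begins with the word subnum and that no data row does; subnum is a made-up column name, so substitute text that appears only in your own headers. The -h option stops grep from prefixing each line with its file name when it is given several files. The data files are made up so the example runs on its own.

```shell
# Made-up data files, each with a header line, created only for the demo.
workdir=$(mktemp -d)
cd "$workdir"
printf 'subnum trial rt\n1 1 812\n1 2 640\n' > bcst-1.txt
printf 'subnum trial rt\n2 1 755\n' > bcst-2.txt

# -v keeps only lines that do NOT match; ^ anchors the match to the line start.
mkdir -p merged
grep -h -v '^subnum' bcst-*.txt > merged/merged-bcst.txt
cat merged/merged-bcst.txt
```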

Add a header

But what if you want to put the header back on, so your data file will have column headers and will read into your stats package with readable variable names?  The simplest way to do it is make a file containing just the header, and concatenate it with the master data file.  You can do this by hand (save in header.txt), or using the 'head' command on a valid data file:


To get the header out of one of the files:
$ head -n 1 bcst-1.txt > header.txt

Then, concatenate the header and the data:
$ cat header.txt merged-bcst.txt > mergedhead-bcst.txt
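If you would rather strip the extra headers and keep one in a single pass, awk (mentioned above) can print the header from the first file and skip it in the rest. This is only a sketch, not the only way to do it. It assumes every file has exactly one header line, and the data files are made up so the example runs on its own.

```shell
# Made-up data files, each with a header line, created only for the demo.
workdir=$(mktemp -d)
cd "$workdir"
printf 'subnum trial rt\n1 1 812\n1 2 640\n' > bcst-1.txt
printf 'subnum trial rt\n2 1 755\n' > bcst-2.txt

# FNR is the line number within the current file; NR counts across all files.
# Skip line 1 of every file except the very first one, and print the rest.
mkdir -p merged
awk 'FNR == 1 && NR != 1 { next } { print }' bcst-*.txt > merged/mergedhead-bcst.txt
cat merged/mergedhead-bcst.txt
```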

Windows-based file concatenation
Some of the same things can be done in Windows, but its built-in command-line tools are clunkier and less powerful unless you install something like PowerShell.  Again, if you install MinGW, you can do any of the file merging described above. But you can sometimes get by with a batch file that uses some of Windows' primitive file merging commands.  Create a batch file by opening Notepad and saving it as something like process.bat. (Make sure file extensions are not hidden, so that you can be sure your file is not actually named process.bat.txt.)

Windows concatenation is similar, with slightly different semantics. The easiest thing to do is use the copy command.  Type the following into your process.bat file, then save it.

copy bcst-*.txt  merged-bcst.txt

Then double-click on the .bat file to run it.  The * wildcard will ensure it matches all your raw data files.

This won't remove header lines, but to do that:
1. Read the merged-bcst.txt file into a spreadsheet.
2. Add a new first column, and number the cells in that column 1....N (to keep track of the original order).
3. Sort the spreadsheet by the second (or some other) column.  Now, all the headers should be together.
4. Delete all but one of the header rows.
5. Sort by the first column (the numbering you added), restoring the original order.

!!!Update!!!



Bluefive software distributes a neat windows utility called TxtCollector.  It will find all the files of a specific type in your directory and combine them all together into a single file.  Try it!


3. Create Other Files

These steps are not really burdensome, but why do it if you don't have to?

First, consider adding code to a PEBL script that makes master logs of one or another type.  So, along with the 'standard' data saving method where each participant gets his or her own file, and each line contains a  specific record of each trial, you might want to do things like:
  1. Save a single master logfile with timestamps and subject codes, indicating when the study started and ended.
  2. Save a 'demographics' file collecting the types of information that NIMH wants you to collect (the function GetNIMHDemographics() will do that for you)
  3. Save a master data record which saves every trial of every participant to the same file
  4. Save a master summary file which computes and records a few key IVs and DVs per participant
Any of these can be accomplished by just understanding a few PEBL commands.  In PEBL, files are typically read and written by opening up the file, then writing to the object created.  A file can be opened in one of two ways using FileOpenWrite() and FileOpenAppend().  FileOpenWrite() opens a file fresh, deleting anything that was already in the file.  FileOpenAppend() opens so that writing will simply append to the end of the current file.  If you are creating files tied to subject codes, it is a good fallback to use FileOpenAppend(), because then you won't lose data if you happen to re-use a subject code accidentally.

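Then, to save data to the file, use the FilePrint() or FilePrint_() functions.  These functions take two arguments: a file object (returned by FileOpenWrite() or FileOpenAppend()) and a text string to write.  FilePrint() ends the line after the text; FilePrint_() does not, which is handy when you build up a row piece by piece.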

For example to create a basic log file that records subject code and the time of the study, add lines like this to an experiment after the subject code has been collected:

  file <- FileOpenAppend("test-logfile.txt")
  FilePrint(file, gSubNum + " " + TimeStamp())
  FileClose(file)
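Because the log is opened with FileOpenAppend(), a re-used subject code shows up as a repeated entry rather than overwriting anything. A quick shell check for that is sketched below. It assumes the subject code is the first space-separated field on each line, as in the example above. The log contents are made up so the sketch runs on its own, and the timestamp format shown is illustrative only.

```shell
# Made-up log file, created only for the demo; subject 101 appears twice.
workdir=$(mktemp -d)
cd "$workdir"
printf '101 Mon Feb 14 10:02:11 2011\n102 Mon Feb 14 11:15:40 2011\n101 Tue Feb 15 09:30:05 2011\n' > test-logfile.txt

# Pull out the first field, sort, and print each code that occurs more than once.
dupes=$(cut -d' ' -f1 test-logfile.txt | sort | uniq -d)
echo "re-used subject codes: $dupes"
```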


So, these are just a few tricks to make handling data easier with PEBL, both before and after you run the experiment.   A couple lines on the command-line or in a batch file can literally save you hours of manual copy-pasting, as can a few carefully-placed lines within a script file.

3 comments:

DrT said...

Your writing is crystal clear and I enjoy your posts. However, I am Looking for very basic info on how to read/interpret the "human readable" reports (I am a MD treater, not a psychologist or programmer)I would gladly buy the PEBL Manual, but I can see it will not tell me what the headings in the reports mean. Where can I go for this basic info? Thanks in advance.DrT

Shane Mueller said...

Thanks. Some, but not all, of the tests have high-level real human-readable reports. Most record raw data files, which you really wouldn't be able to make sense out of in a clinical situation. The PEBL manual doesn't include anything about the test battery, but many of the tests and their output data are documented in the PEBL wiki (see links from http://pebl.sf.net/battery.html).

If you are interested in better documentation or reporting, you can request specifics here or on the PEBL email list, or request better documentation or reporting via fundry.com.

TheCanadianExperience said...

Dr. M,

With the tests that I'm using with Dr. Piper, I am continuing to "clean" the data handling in the programs. I'm working on maintaining coherent data sets that are both excel and spss ready. I will continue to update you as I have time. Thanks for all your responsiveness and help.
-Reid