On Splitting Files…

Edit (2012/03/10):  Updated the awk script to use “$0” instead of “$1” to prevent problems with spaces in FASTQ identifiers.  Hat tip to Thasso via his comment below.

Note:  this is an older post I’ve had sitting around for a while–hope it helps someone.

In bioinformatics, we often have to deal with huge FASTQ-formatted files.  And by “huge,” I mean on the order of several gigabytes.  For instance, one of my current projects has me working with 16 pairs of Illumina read files, each of which is around 8 GB.  With the way the various sequencing companies keep squeezing more data out of each sequencing run, those files are only going to keep growing.  Usually, our tools take care of things quite well: BWA and the various other alignment programs out there regularly take those huge FASTQ files and convert them into SAM files over the course of several hours or days.

However, a few days ago a group mate and I were trying to pull a bunch of sequences out of one of those huge Illumina files, but it was taking forever to get through everything.  One of us then had an epiphany–if we split the Illumina file into pieces, we could pull the reads we wanted out in parallel, and thus would get our results much faster than just waiting on the serial script we’d been using.   Computationally speaking, it would probably take more processing time, but that’s an acceptable cost for us.
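The pulling-in-parallel half of that plan is straightforward once the pieces exist.  Here's a rough sketch of the idea; the file names, the wanted-ids.txt list, and the toy records it creates are all hypothetical, just so the sketch runs end to end:

```shell
# Hypothetical sketch: search each split file for wanted read IDs in its
# own background job, then merge the hits.  Toy inputs are created here;
# with real data you'd already have the split files and an ID list.
printf '@r1 x\nACGT\n+\nIIII\n@r2 y\nTTTT\n+\nJJJJ\n' > 1-split.fa
printf '@r3 z\nGGGG\n+\nKKKK\n' > 2-split.fa
printf '@r1\n@r3\n' > wanted-ids.txt

for piece in *-split.fa; do
    # -F -f: match any line containing one of the listed IDs;
    # -A 3: also print the record's remaining three lines;
    # --no-group-separator (GNU grep) suppresses the "--" dividers.
    grep --no-group-separator -A 3 -F -f wanted-ids.txt "$piece" \
        > "$piece.hits" &
done
wait    # block until every background grep finishes
cat *-split.fa.hits > wanted-reads.fastq
```

Note this is only a sketch: fixed-string matching on IDs can over-match (e.g. an ID that is a prefix of another), so you'd want to tighten the patterns for real use.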

Problem is, neither of us had a good way to accurately split the files without writing a script.  In retrospect, we could have used split, but we didn’t think of it at the time, especially since I’d already been doing a similar task on one of my other projects.  There I’d been using awk to split a list of files into pieces almost exactly like we needed, except that I could switch output files on every single line.  With FASTQ files, however, you have to keep a block of 4 lines together before changing files, which adds a little complexity.  I felt up to the challenge, though, and pretty soon I had a handy little script working.  Here it is: a round-robin style FASTQ splitter written in awk.

[sourcecode language="bash"]
# splits a FASTQ file into pieces, with each read being added to the files in a
# round-robin fashion
# pass the FASTQ file you want to split as the script's only argument
# change the NUM_FILES variable to be the number of files you want to split it
# into

# to verify no reads were lost, compare the original against the merged output
# of the following; the round-robin split means the reads come back in a
# different order, so sort both files before diffing them
# ls | grep -e "-split.fa" | sort -n | xargs -I{} cat {} > merged
awk '
# Change these to the number of files you want and the number of lines to
#  print in each file before moving on to the next file
BEGIN {
    NUM_FILES = 250;
    NUM_LINES = 4;
}
{
    lines += 1;
    # first line of a record: move on to the next output file
    if (lines == 1)
        files += 1;
    print $0 > (files "-split.fa");
    # last line of the record: reset the line count, and wrap back around
    # to the first file once the last one has been written to
    if (lines == NUM_LINES) {
        lines = 0;
        if (files == NUM_FILES)
            files = 0;
    }
}' $1
[/sourcecode]

There are a couple of ways you can use it.

  1. Drop the full script into a new file (say, fastq-splitter.sh) and then call it like you would any other shell script.  The file you want to process should be the only parameter.
  2. Paste the script straight into your terminal (you are running this on a Unix-like box, right?), change the “$1” to your file name, and you’re done.

Oh, and one warning:  Don’t forget to adjust the number of files you need and the number of lines to keep together in each file by setting the NUM_FILES and NUM_LINES variables at the top of the awk script.
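And for completeness, since I mentioned split above: the stock coreutils split would have worked too, as long as you give -l a multiple of four so no read is torn across two chunks.  A toy sketch (sequential chunks rather than round-robin, but just as good for processing in parallel):

```shell
# Toy demonstration of split on FASTQ: three 4-line records, chopped into
# sequential chunks of two reads (8 lines) each.  On real data you would
# use something like -l 4000000; the only requirement is that the line
# count be a multiple of 4 so no record is split across two chunks.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nJJJJ\n@r3\nGGGG\n+\nKKKK\n' > toy.fastq
split -l 8 toy.fastq toy-chunk-
# toy-chunk-aa holds reads 1-2, toy-chunk-ab holds read 3
wc -l toy-chunk-*
```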

2 thoughts on “On Splitting Files…”

  1. Hey,

    thanks for the snippet!  Works nicely.  Minor suggestion: change the “$1” in the awk script to “$0” to print the full line instead.  FASTQ read IDs *might* contain spaces, and if the FASTQs are produced with Casava 1.8, the ID WILL contain a space and you’ll lose information (see http://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers )

    • Very good call, Thasso. The files I was working with did not have spaces in the ids, so it wasn’t a problem for me, but I do see how that could cause trouble. I’ll update the script. Thanks!
