This is the second part of a series of posts on the RAMMCAP suite of bioinformatics tools.

Last time, we left off with a freshly compiled version of RAMMCAP ready for testing.  Like last time, we’ll start with the README in the current directory, which is the rammcap directory inside the main directory (named RAMMCAP-20091106 in my case) from the RAMMCAP download.

The new README has its own test script, so I’m going to follow this one. The first thing to run is the CD-HIT-EST program.


The CD-HIT-EST program is the clustering program. It takes in a FASTA-formatted file and clusters that data according to a greedy algorithm that uses simple word counting and indexing to help speed things up considerably. Basically, it:

  1. Sorts the sequences from longest to shortest.
  2. Begins clustering:
    1. Compare the current sequence to the representative sequence of each existing cluster.
      1. If it matches an existing cluster, it is added to that cluster. Because the sequences are processed longest first, the cluster’s existing “representative” sequence is always at least as long as the new one, so it stays the representative.
      2. If it doesn’t match an existing cluster, a new cluster is made with this sequence as the representative sequence.
    2. Repeat until all the sequences are clustered.

(None of which I knew at the time I ran this for the first time, but it’s information that makes sense here.)
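The steps above can be sketched in a few lines of Python. This is only an illustration of the greedy idea, not CD-HIT’s actual code: the `identity` function is a stand-in built on Python’s `difflib` (CD-HIT itself uses word counting plus alignment), and the names `identity` and `greedy_cluster` are made up for this example.

```python
from difflib import SequenceMatcher


def identity(rep: str, seq: str) -> float:
    """Stand-in similarity score: matched characters / length of the shorter sequence."""
    matcher = SequenceMatcher(None, rep, seq, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(rep), len(seq))


def greedy_cluster(seqs, threshold=0.95):
    """Greedy incremental clustering: longest sequences first, first match wins."""
    clusters = []  # each cluster: {"rep": representative, "members": [sequences]}
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if identity(cluster["rep"], seq) >= threshold:
                cluster["members"].append(seq)  # the representative is unchanged
                break
        else:
            # No existing representative was similar enough: start a new cluster.
            clusters.append({"rep": seq, "members": [seq]})
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACG", "TTTTTGGGGGCCCCCAAAAA"]
    for cluster in greedy_cluster(reads):
        print(cluster["rep"], len(cluster["members"]))
```

Note that this sketch puts each sequence in the first cluster it matches. Picking the most similar cluster instead would mean checking every representative before deciding.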

I pull out the really long test command (written below), prepare myself, and hit Enter. It takes a long time to complete.

../cd-hit/cdhit-est -i testdb -o testdb_95 -M 50 -B 1 -d 0 -n 10 \
     -l 11 -r 1 -p 1 -g 1 -G 0 -c 0.95 -aS 0.8 > testdb_95.log

While that’s going, here’s what that command means:

  • ../cd-hit/cd-hit-est – use the CD-HIT-EST command, which is the CD-HIT command for DNA/RNA comparison tasks. The original CD-HIT was written for protein comparison.
  • -i testdb – use the testdb file as the input file. This file is a FASTA file with 1,000,000 sequences at most 361 bases long pulled from various metagenomic samples by the author.
  • -o testdb_95 – write the output to a file called testdb_95
  • -M 50 – limit memory use to 50 MB. The README describes this as the amount of free RAM (which I don’t think really makes sense), but according to the rest of the documentation it’s actually the maximum memory the program may use, not the amount of free memory.
  • -B 1 – sequences are stored on the hard drive (1) instead of in RAM (0)
  • -d 0 – the length of the description in the .clstr file. Since it’s 0, it just takes the FASTA sequence description up to the first space.
  • -n 10 – the word length
  • -l 11 – the length cutoff for throw-away sequences; anything this short is discarded instead of clustered
  • -r 1 – compare both strands of DNA
  • -p 1 – print the alignment overlap in the .clstr file (if 0, it’s not printed)
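  • -g 1 – puts each sequence into the most similar cluster, not the first one CD-HIT finds, which is the default (0).
  • -G 0 – don’t use global sequence identity; use local identity instead, i.e., the number of identical positions divided by the length of the alignment rather than by the full length of the shorter sequence. This is why the command also sets an alignment coverage cutoff (-aS, below).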
  • -c 0.95 – cluster at 95% sequence identity
  • -aS 0.8 – alignment coverage for the shorter sequence. In this case, it means that the alignment between the longer and shorter sequence must cover at least 80% of the shorter sequence.
  • > testdb_95.log – write the standard output to the testdb_95.log file
  • There are other options available, which I’m not going to go into much for right now.

Long story short, it takes around 200 minutes to complete its test data processing. That’s roughly 3 hours and 20 minutes, which I think is pretty long for a test, though I did limit the amount of memory it could use to 50 MB. Upping that limit would probably speed things up substantially, but going by what I read in the README I didn’t think I could. In retrospect, though, I think upping the limit is just fine, especially with the information I’m reading in the other files. (UPDATE: Upping the memory limit to 400 MB, the default, drops the execution time down to about 67 minutes. An eightfold memory increase for a threefold time decrease: not great, but not too shabby.)
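For the record, here is that back-of-the-envelope comparison worked out, using only the two timings quoted above (the variable names are just for illustration):

```python
# Memory limit (-M, in MB) mapped to the observed run time in minutes.
runs = {50: 200, 400: 67}

memory_ratio = 400 / 50        # how much more memory the second run was allowed
speedup = runs[50] / runs[400]  # how many times faster the second run finished

print(f"{memory_ratio:.0f}x the memory, {speedup:.2f}x the speed")
```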


The next program to run is ORF_FINDER. As the name suggests, it scans the sequences for open reading frames (ORFs): stretches of sequence, uninterrupted by stop codons, that could code for a protein. ORFs are (kind of) what most people mean when they talk about their “genes” and their DNA.

../orf_finder/orf_finder -l 30 -L 30 -t 11  -i testdb -o testorf

This command takes far fewer options.

  • ../orf_finder/orf_finder – runs the orf_finder command.
  • -l 30 – the minimal length of the ORF.
  • -L 30 – the minimal length of the ORF between stop codons. I’m not sure exactly how this differs from the -l option. Maybe you need to move at least 30 bases past the end of the previous stop codon before starting to look for another stop codon?
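  • -t 11 – the translation table. I wasn’t sure what this represented at first. Assuming orf_finder follows NCBI’s numbering, table 11 is the bacterial, archaeal, and plant plastid genetic code, which would fit metagenomic data.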
  • -i testdb – Use the testdb file as input.
  • -o testorf – Write the output to the testorf file.
  • Again, there are other options I’m not going to talk about much.
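To make the idea concrete, here is a rough sketch of what an ORF scan involves: translate all six reading frames and keep the stretches between stop codons that are long enough. This is not orf_finder’s code. The function names are invented, the codon table uses the standard NCBI amino acid layout (which table 11 shares), lengths are counted in amino acids as an assumption, and start codons are ignored entirely.

```python
from itertools import product

# NCBI layout: codons enumerated over bases in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def find_orfs(dna: str, min_len: int = 30) -> list:
    """Collect stop-to-stop stretches of at least min_len amino acids in all six frames."""
    dna = dna.upper()
    orfs = []
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            protein = translate(strand[frame:])
            orfs.extend(piece for piece in protein.split("*") if len(piece) >= min_len)
    return orfs


if __name__ == "__main__":
    print(find_orfs("ATGAAATTTGGGTAACCC", min_len=4))
```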

In addition to having far fewer options, ORF_FINDER takes far less time to execute. Roughly a minute and a half on the same dataset, which is a pretty huge upgrade. ORF finding is a much simpler task than sequence clustering, obviously. Now for the fun part–clustering the ORFs with the real CD-HIT program.


Now, CD-HIT works basically the same way that CD-HIT-EST does, except it compares protein (amino acid) sequences instead of nucleotide sequences. The full commands I’m running are:

../cd-hit/cdhit -i testorf -o testorf_95 -M 400 -d 0 -n 5 -p 1 \
     -g 1 -G 0 -c 0.95 -aS 0.8 > testorf_95.log
../cd-hit/cdhit -i testorf_95 -o testorf_60 -M 400 -d 0 -n 4 -p 1 \
     -g 1 -G 0 -c 0.60 -aS 0.8 > testorf_60.log
../cd-hit/clstr_rev.pl testorf_95.clstr testorf_60.clstr \
     > testorf_60full.clstr

It has much the same options as CD-HIT-EST, so I’m not going to go into a huge amount of detail on those. Instead, notice how I’m going to:

  1. Cluster the ORFs at 95% identity.
  2. Re-cluster the representative sequences from that run (testorf_95) with less stringent criteria (60% identity and a shorter word length) to group the non-redundant sequences.
  3. Combine the two clustering runs into a single cluster file.

Running things this way (according to the documentation) can help generate a sort of hierarchical structure of clusters. This makes sense, since the ones at 95% identity at the very least are closely related and may actually be redundant sequences while those at 60% are more distant relatives and may be homologs from divergent species or something similar.
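As a toy illustration of what combining the two runs means, here is a sketch that expands each 60% cluster with the full membership of the 95% clusters it absorbed. It works on plain dictionaries rather than real .clstr files, the function name `merge_levels` and the sequence names are invented, and it makes no claim about how clstr_rev.pl is actually implemented.

```python
def merge_levels(fine: dict, coarse: dict) -> dict:
    """Expand coarse clusters (whose members are fine-level representatives)
    into the full list of original sequences.

    fine:   representative -> all members of its 95% cluster
    coarse: representative -> the 95% representatives grouped at 60%
    """
    merged = {}
    for coarse_rep, fine_reps in coarse.items():
        members = []
        for rep in fine_reps:
            # A representative missing from `fine` is treated as a singleton.
            members.extend(fine.get(rep, [rep]))
        merged[coarse_rep] = members
    return merged


if __name__ == "__main__":
    clusters_95 = {"orf1": ["orf1", "orf2"], "orf3": ["orf3"], "orf4": ["orf4", "orf5"]}
    clusters_60 = {"orf1": ["orf1", "orf3"], "orf4": ["orf4"]}
    print(merge_levels(clusters_95, clusters_60))
```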

The first run of CD-HIT took around 98 minutes, roughly half an hour longer than the 67 minutes it took to cluster the raw sequences. I’m guessing that the ORFs are more similar to each other than the raw sequences are, so they take longer to cluster.

The second run of CD-HIT took around 235 minutes, probably because the less stringent criteria took longer to process. Combining the two cluster files only took around 15 seconds, so at least that’s an easy task.

That takes care of the basic tools included with RAMMCAP.  I’ll explore some of the graphical tools in a later post.

RAMMCAP: Introduction, Download, and Install

This is the first part of a series of posts on the RAMMCAP suite of bioinformatics tools.

The Rapid Analysis of Multiple Metagenomes with a Clustering and Annotation Pipeline (RAMMCAP) is a tool for analyzing metagenomic data. It tries to cluster and functionally annotate a set of metagenomic data, which means it groups like pieces of data together into clusters and then tries to figure out what the various clusters do. It’s made up of several tools, only two of which I’ve actually used; I’ll talk about those later. CAMERA, the organization behind RAMMCAP, provides a web service where you can use RAMMCAP without installing it, but it has data limits that my datasets will break very easily. It also requires registration, which seems non-functional right now (at least, I can’t get a new account set up, and I’ve tried several times over the last few days). So, I downloaded it a few weeks ago and worked on getting it to run over the course of several days.  This post, and the others in this series, is a record of that process, including some initial missteps.  If you have any questions or see other places where I stepped wrongly, leave a comment and let me know.

First Impressions

First off, the RAMMCAP download (found on the page linked above) is huge–the source code alone was a roughly 760 MB download, which extracts to around 3 gigs. That three gigs might contain some duplicate data–the folder structure is pretty disorganized and a casual glance shows a lot of folders with the same names. There are a lot of symlinks, though, so I could be wrong there. (The more I see, the more I’m convinced I’m right, though.)

Second, it looks like the source bundled a bunch of tools along with the main RAMMCAP code, including versions of BLAST, HMMER, and Metagene. That added a lot to the bulk of the download (roughly 450 MB). There’s also a huge amount of data here, including a version of the Pfam and TIGRFAM libraries (772 and 444 MB, respectively), and a couple other tools I haven’t heard of before that might be part of RAMMCAP.

Compiling, Phase 1

The README file in the main directory contains basic information on how to compile some of the tools, including CD-HIT, ORF_FINDER, and CD-HIT-454, as well as the optional HMMERHEAD extension to HMMER. The instructions are pretty basic: just do the standard “make clean;make” and things should be good. I wrote a little build script to handle this, just in case I need to do this again for some reason. Everything seems to build fine with the exception of HMMERHEAD, but I’m just going to ignore that for now. Time for testing this puppy out.
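A build script along these lines could look like the sketch below. It is not the script from this post. The directory names in `TOOL_DIRS` are guesses for illustration (only cd-hit and orf_finder appear in the test commands later in the series), and `build_commands` and `build_all` are made-up names.

```python
import subprocess
from pathlib import Path

# Hypothetical tool directories; adjust to match the actual download layout.
TOOL_DIRS = ["cd-hit", "orf_finder", "cd-hit-454"]


def build_commands(root, tool_dirs=TOOL_DIRS):
    """List the (directory, argv) pairs for 'make clean' then 'make' in each tool dir."""
    commands = []
    for name in tool_dirs:
        directory = Path(root) / name
        commands.append((directory, ["make", "clean"]))
        commands.append((directory, ["make"]))
    return commands


def build_all(root, tool_dirs=TOOL_DIRS):
    """Run every build step in order, stopping at the first failure."""
    for directory, argv in build_commands(root, tool_dirs):
        print(f"[{directory}] {' '.join(argv)}")
        subprocess.run(argv, cwd=directory, check=True)


if __name__ == "__main__":
    build_all(".")
```

One difference from typing “make clean;make” by hand: with `check=True`, a failed `make clean` stops the script, whereas the shell one-liner would carry on to `make` regardless.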

Testing, Phase 1

The README indicates that there should be an examples folder somewhere with some basic test data I can use, but I don’t see it anywhere. Looking around…not seeing it. Turns out, it’s inside the rammcap directory inside the main directory.

Compiling, Phase 2

Inside the rammcap directory, I find a new README with some major differences from the one outside this directory, plus what looks like symlinks with the same names as some of the directories outside. Looks like they point to the same directories as the other ones, but I’ll recompile things, anyway, just in case. Good thing I wrote that build script.

Except that the build instructions aren’t the same: I don’t have to build CD-HIT-454, but I do need to make sure gnuplot and ImageMagick are installed. They are, which is good, because otherwise I’d either have to contact one of the tech guys to install them on this machine or install them to my user directory, which I’ve done with several tools I need or don’t want to do without. Once I pull out that CD-HIT-454 reference, the build works fine.

Since my second round of testing was a much larger task, I’ll leave that for a later post.

Linux Screen Command

Normally, when you log into a remote server via SSH, you get one command prompt and so there’s essentially only one thing you can do at a time without a lot of extra hassle.  However, most of the time I (and presumably others like me) need to do lots of things, potentially in different directories spread out all over the place.  Since switching back and forth between directories all the time is only one of those hassles I mentioned, more often than not I end up with multiple SSH connections to the same server.

Like a couple days ago: I had a connection open to edit my files, one to handle my version control (Mercurial, if you’re curious; more on that later), and one to keep tabs on my data.  Juggling multiple connections like that is easy in a GUI, where a couple of quick “Alt-Tabs” take you from your editor to your data directory to your version control and back again.  However, all those connections can eat up your bandwidth and cause many other less-than-fun things, especially when there’s a handy little tool built right into Linux (and presumably other Unix-like OSes) that does the same kind of thing: the screen command.

The screen command essentially allows you to create multiple “windows” or “screens” you can use within the same connection.  You get multiple command prompts you can work with, and a handy keyboard shortcut similar to but slightly less convenient than “Alt-Tab” to switch between your “screens”.  Another benefit to using screen is that your sessions can survive across SSH sessions: you can disconnect from your screen, log off the server, go away to do something more fun for a couple hours, log back in, and resume right where you left off, with the same environment variables, same directory, same command history, etc.  You can even use screen to run long-running commands (say, a command that takes hours or days to run) without having to stay logged in the whole time.  Pretty cool, huh?

You can get most of the information you need to use the screen command effectively out of the man page (this one’s not bad), but I’ll summarize the pieces I find most useful here.

First off, you need to create a new screen, ideally with a name, so you can resume it more easily later and not have to remember a weird ID number.  To make a screen named “screenTest” for example, run the following command:

     screen -S screenTest

You should be switched over to a new command prompt and see something like “[screen 0: bash]” in your SSH session’s title, assuming your SSH client supports that kind of change.

The next thing you need to know is how to create what I call a “subscreen,” though I don’t think that’s an official term.  (I’d run a simple command like “ls” now so you can tell the difference between your “subscreens.”)  Hit CTRL+A, release them both, and hit C.  Once again, you should see a new command prompt and something like “[screen 1: bash]” in your window’s title.  Switch back and forth between your “subscreens” by hitting CTRL+A twice just to see that you can, or until you’re convinced there’s a difference between the two.  If you have more than two subscreens, you need to know a couple more things about subscreen navigation:

  1. CTRL+A twice navigates to the most recently used subscreen.
  2. CTRL+A, N navigates to the “next” subscreen in your list of subscreens.
  3. CTRL+A, P navigates to the “previous” subscreen in your list.
  4. CTRL+A, 0 navigates to the screen numbered 0.  You can use any number between 0-9 for the same effect.
  5. CTRL+A, " (the plain double-quote key) opens up a list of the screens you can move between.
  6. CTRL+A, ' (the plain single-quote key) prompts you for a screen number and allows you to bounce straight to that screen.

Now, it’s time to drop out of your screens as if you were going to log out.  Hit CTRL+A followed by D, and you should be “detached” from your current screen.  You’ll have returned to your original command prompt, with a little “[detached]” message right after your screen command.

Let’s pretend you wandered away from your computer and came back, ready to start work again.  You need to resume your previous session, so run this command.

     screen -r screenTest

Voilà!  Your screen should open up, just like it was when you detached from it earlier (unless you were running a long command, in which case the output is probably different, but that’s beside the point).  Having fun yet?

One more thing that you might need to know is how to deal with a screen that’s still attached somewhere else, or that didn’t get detached properly somehow (I’ve had it happen once or twice).  This one’s pretty obvious, as well: the following command detaches it from wherever it’s attached.

     screen -d screenTest

If you pass the “-r” flag to that command, as well (screen -d -r screenTest), you will detach the other session and connect to it yourself, which can be helpful if you’re moving between computers or something like that.
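If you ever drive these from a script, the four invocations covered so far boil down to a small lookup. This is just a cheat sheet in code form; the function name `screen_argv` and the action labels are invented for the example, while the flags are the ones used above.

```python
import subprocess

# Action label (made up for this sketch) -> screen flags used in this post.
SCREEN_FLAGS = {
    "create": ["-S"],            # start a new named session
    "resume": ["-r"],            # reattach to a detached session
    "detach": ["-d"],            # detach a session that is attached elsewhere
    "take_over": ["-d", "-r"],   # detach it elsewhere and reattach here
}


def screen_argv(action: str, name: str) -> list:
    """Build the argument list for one of the screen invocations above."""
    return ["screen", *SCREEN_FLAGS[action], name]


if __name__ == "__main__":
    # Detaching is safe to run non-interactively; the others need a real terminal.
    subprocess.run(screen_argv("detach", "screenTest"))
```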

When it’s time to nuke a screen completely (like if you were a smart person and set up separate screens for each project you’re working on, and just finished one of them), getting rid of it is really easy.  Attach to it, and hit CTRL+D until you close out all of the subscreens.  Be careful you don’t nuke your real session, though, because it’s really easy to do.

One caveat–if you use screen a lot, get used to piping things through less, since the scroll bar of your window will do strange things while you’re running screen.

Ok, that takes care of the basics.  Have fun!