EA-UTILS is a collection of command-line tools for processing biological
sequencing data.

EA-UTILS was written for Illumina-based pipelines, but should
work with any data in the FASTQ format.

EA-UTILS includes:

  • fastq-mcf: Scans a sequence file for adapters, and,
    based on a log-scaled threshold, determines a set of clipping
    parameters and performs clipping. Also performs skew detection
    and quality filtering;
  • fastq-multx: Demultiplexes a FASTQ file. Capable of
    auto-determining barcode IDs based on a master set of fields. Keeps
    multiple reads in sync during demultiplexing. Can also verify that
    the reads are in sync, and fail if they're not;
  • fastq-join: Similar to audy's stitch program, but written in C,
    more efficient, and with support for some automatic benchmarking
    and tuning. It uses the same "squared distance for anchored
    alignment" as other tools;
  • varcall: Takes a pileup and calculates variants in a more
    easily parameterized manner than some other tools;
  • sam-stats: Basic SAM/BAM stats. Like other tools, but
    produces what I want to look at, in a format suitable for passing
    to other programs;
  • fastq-stats: Basic FASTQ stats. Counts duplicates. Option
    for per-cycle stats, or not (irrelevant for many sequencers);
  • determine-phred: Returns the PHRED scale of the input file.
    Works with SAM, FASTQ, or pileup files, including gzipped files;
  • Chrdex.pm and Sqldex.pm: obsoleted by the CPAN module
    Text::Tidx. Sqldex may not actually be obsolete, because Tidx uses
    more RAM and is slower for very small jobs. But for Exome and
    RNA-Seq work, Text::Tidx beats both;
  • qsh: Runs a BASH script file like a "cluster aware
    makefile": it processes only newer things, dies if things go
    wrong, and sends jobs to a queue manager if they're big. That way
    you don't have to write makefiles, or wrap things in "qsub" calls
    for every little program. Not really ready yet;
  • grun: Fast, lightweight grid queue software. Keeps the job
    queue on disk at all times. Very fast. Works well by now;
  • gwrap: BASH wrapper shell that downloads all dependencies
    that are not on the local system. Good for EC2 nodes. Linux only.
    Will use it if we ever go to EC2;
  • gtf2bed: Converter that bundles up a GFF's exons and makes a
    UCSC-styled BED file with thin/thick properly set from the
    start/stop sites;
  • randomFQ: Takes a FASTQ (can be gzipped or paired-end) and
    randomly subsets it to a user-defined number of reads.

Note that EA-UTILS includes an executable command called qsub,
which conflicts with the command used on ARC clusters to submit jobs.
Thus, a user who interactively loads EA-UTILS will temporarily lose
the ability to submit jobs to the queue!
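
One way to see which qsub the shell would pick up is to list every
match on the search path in lookup order. The sketch below is only an
illustration: find_on_path is a hypothetical helper name, not part of
EA-UTILS or of the cluster software, and it assumes a POSIX-style shell
such as BASH.

```shell
# Hypothetical helper (not part of EA-UTILS): print every executable file
# called NAME found in a colon-separated search path, in lookup order.
# Usage: find_on_path NAME [SEARCH_PATH]   (SEARCH_PATH defaults to $PATH)
# The body runs in a subshell, so the IFS and globbing changes stay local.
find_on_path() (
    name=$1
    IFS=:
    set -f                      # no filename globbing while splitting
    for dir in ${2:-$PATH}; do
        dir=${dir:-.}           # an empty entry means the current directory
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
)

# The first line printed is the qsub that the shell would actually run.
find_on_path qsub
```

If the first path listed belongs to EA-UTILS rather than to the queue
system, running "module purge" or opening a fresh login shell should
restore the usual qsub.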

Web Site:

The EA-UTILS home page at GitHub:




On any ARC cluster, check the installation details
by typing "module spider ea-utils".

EA-UTILS requires that the appropriate modules be loaded before it can
be used. One version of the appropriate commands for use on Cascades is:

module purge
module load gcc/5.2.0
module load ea-utils/1.04.807
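
After the modules are loaded, it can be useful to confirm that the
tools are actually on the search path before relying on them. The
following is a minimal sketch under that assumption; require_cmds is a
hypothetical helper name, while sam-stats and fastq-mcf are two of the
tools listed above.

```shell
# Hypothetical helper: succeed only if every named command is on PATH.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Placed after the module load lines, this reports whether the tools resolve.
if require_cmds sam-stats fastq-mcf; then
    echo "EA-UTILS tools found"
else
    echo "EA-UTILS tools not found; check the module load lines" >&2
fi
```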


The following batch file demonstrates the use of EA-UTILS:

#! /bin/bash
#PBS -l walltime=0:05:00
#PBS -l nodes=1:ppn=1
#PBS -W group_list=cascades
#PBS -q open_q
#PBS -j oe
module purge
module load gcc/5.2.0
module load ea-utils/1.04.807
sam-stats unmapped_first.sam

A complete set of files to carry out a similar process is available in