EA-UTILS is a collection of command-line tools for processing biological
sequencing data. EA-UTILS was written for Illumina-based pipelines, but should
work with any data in the FASTQ format.
fastq-mcf: Scans a sequence file for adapters, and,
based on a log-scaled threshold, determines a set of clipping
parameters and performs clipping. Also does skewing detection
and quality filtering;
fastq-multx: Demultiplexes a FASTQ file. Capable of
auto-determining barcode IDs based on a master set of fields. Keeps
multiple reads in sync during demultiplexing. Can also verify that the
reads are in sync, and fail if they are not;
fastq-join: Similar to audy's stitch program, but written in C,
more efficient, and with support for some automatic benchmarking and
tuning. It uses the same "squared distance for anchored alignment"
as other tools;
varcall: Takes a pileup and calculates variants in a more
easily parameterized manner than some other tools;
sam-stats: Basic SAM/BAM stats. Like other tools, but
produces what I want to look at, in a format suitable for passing
to other programs;
fastq-stats: Basic FASTQ stats. Counts duplicates. Option
for per-cycle stats, or not (irrelevant for many sequencers);
determine-phred: Returns the PHRED scale of the input file.
Works with SAM files, FASTQ files, pileups, and gzipped files;
Chrdex.pm and Sqldex.pm: Obsoleted by the CPAN module
Text::Tidx. Sqldex may not actually be obsolete, because Tidx uses
more RAM and is slower for very small jobs. But for Exome and
RNA-Seq work, Text::Tidx beats both;
qsh: Runs a BASH script file like a "cluster aware
makefile": it processes only newer things, dies if things go wrong,
and sends jobs to a queue manager if they're big. That way you
don't have to write makefiles, or wrap things in "qsub" calls for
every little program. Not really ready yet;
grun: Fast, lightweight grid queue software. Keeps the job
queue on disk at all times. Very fast. Works well by now;
gwrap: BASH wrapper shell that downloads all dependencies
that are not on the local system. Good for EC2 nodes. Linux only.
Will use it if we ever go to EC2;
gtf2bed: Converter that bundles up a GFF's exons and makes a
UCSC-styled BED file with thin/thick properly set from the
randomFQ: Takes a FASTQ (can be gzipped or paired-end) and
randomly subsets it to a user-defined number of reads.
Note that EA-UTILS includes an executable command called qsub,
which conflicts with the command used on ARC clusters to submit jobs.
Thus, a user who interactively loads EA-UTILS will temporarily lose
the ability to submit jobs to the queue!
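One way to notice the conflict before it causes trouble is to check which qsub the shell would actually run. The following is a minimal sketch, not an ARC-provided tool: the function name qsub_is_shadowed is hypothetical, and it assumes that the EA-UTILS installation directory contains the string "ea-utils" somewhere in its path.

```shell
#!/bin/bash
# Sketch: detect whether the first qsub on PATH looks like the EA-UTILS copy.
# Assumption: the EA-UTILS install directory has "ea-utils" in its path.
# qsub_is_shadowed is a hypothetical helper name, not part of EA-UTILS.
qsub_is_shadowed() {
    local qsub_path
    qsub_path=$(command -v qsub) || return 2   # no qsub on PATH at all
    case "$qsub_path" in
        *ea-utils*) return 0 ;;                # looks like the EA-UTILS qsub
        *)          return 1 ;;                # looks like the scheduler's qsub
    esac
}

if qsub_is_shadowed; then
    echo "qsub resolves to EA-UTILS; unload the module before submitting jobs." >&2
fi
```

If the check reports a conflict, running "module purge" (or unloading the EA-UTILS module) in the interactive session should restore the scheduler's qsub.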
The EA-UTILS home page at github:
Comparison of Sequencing Utility Programs,
The Open Bioinformatics Journal,
Volume 7, Number 1, 2013,
On any ARC cluster, check the installation details
by typing "module spider ea-utils".
EA-UTILS requires that the appropriate modules be loaded before it can
be used. One version of the appropriate commands for use on Cascades is:
module purge
module load gcc/5.2.0
module load ea-utils/1.04.807
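After loading the modules, it can be useful to confirm that the tools are actually on the PATH. The sketch below is one possible check, not an official procedure; the function name missing_tools is hypothetical, and the command names are taken from the tool list above, on the assumption that each tool is installed under exactly that name.

```shell
#!/bin/bash
# Sketch: print each named command that cannot be found on PATH.
# Returns 0 only if every command was found.
# missing_tools is a hypothetical helper name, not part of EA-UTILS.
missing_tools() {
    local tool status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Assumed command names, copied from the tool descriptions on this page.
if ! missing=$(missing_tools fastq-mcf fastq-multx fastq-join \
                             sam-stats fastq-stats determine-phred); then
    echo "Not found on PATH:" $missing >&2
fi
```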
The following batch file demonstrates the use of EA-UTILS:
#! /bin/bash
#
#PBS -l walltime=0:05:00
#PBS -l nodes=1:ppn=1
#PBS -W group_list=cascades
#PBS -q open_q
#PBS -j oe
#
cd $PBS_O_WORKDIR
#
module purge
module load gcc/5.2.0
module load ea-utils/1.04.807
#
sam-stats unmapped_first.sam
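When a job has several SAM files to summarize, the last line of the batch file could be replaced by a small loop. The following is only a sketch under stated assumptions: stats_for_each and the SAM_STATS override variable are hypothetical names, and the sketch assumes that sam-stats prints its report to standard output when given a single input file, as the batch file above suggests.

```shell
#!/bin/bash
# Sketch: run a stats command on each SAM file, writing one report per input.
# SAM_STATS is a hypothetical override variable; it defaults to sam-stats.
# Assumption: the command prints its report to standard output for one file.
SAM_STATS=${SAM_STATS:-sam-stats}

stats_for_each() {
    local sam
    for sam in "$@"; do
        # Skip names that do not refer to an existing regular file.
        [ -f "$sam" ] || { echo "skipping missing file: $sam" >&2; continue; }
        # Write the report next to the input, e.g. reads.sam -> reads.stats.
        "$SAM_STATS" "$sam" > "${sam%.sam}.stats" || return 1
    done
}
```

Inside the batch file, after the module commands, a call such as "stats_for_each *.sam" would then produce one .stats file per SAM file in the working directory.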
A complete set of files to carry out a similar process is available in