Manual for Hobbes 1.x

Hobbes is a software package for efficiently mapping DNA snippets (reads) against a reference DNA sequence. It can map short and long reads, and supports Hamming distance (substitutions only) and edit distance (substitutions/insertions/deletions). Hobbes accepts both single-end and paired-end reads for alignment, and can run on multiple CPU cores using multithreading. It supports three input formats (FASTQ, FASTA, and plain text) and two output formats (a Bowtie-like format and SAM). Ambiguous bases such as the 'N' character are treated as mismatches.
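The difference between the two distance measures can be illustrated with a small sketch. The functions hamming and edit_distance below are hypothetical helpers, not part of Hobbes. They assume uppercase input and assume that an 'N' never matches anything, including another 'N'; the manual only states that ambiguous bases are treated as mismatches.

```shell
# Hypothetical helper: Hamming distance of two equal-length sequences.
# Assumption: 'N' counts as a mismatch, even against another 'N'.
hamming() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    if (length(a) != length(b)) {
      print "sequences differ in length" > "/dev/stderr"
      exit 1
    }
    d = 0
    for (i = 1; i <= length(a); i++) {
      x = substr(a, i, 1); y = substr(b, i, 1)
      if (x != y || x == "N") d++
    }
    print d
  }'
}

# Hypothetical helper: edit (Levenshtein) distance via dynamic programming,
# keeping only the previous and the current row of the table.
edit_distance() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = length(a); m = length(b)
    for (j = 0; j <= m; j++) prev[j] = j
    for (i = 1; i <= n; i++) {
      cur[0] = i
      for (j = 1; j <= m; j++) {
        x = substr(a, i, 1); y = substr(b, j, 1)
        cost = (x == y && x != "N") ? 0 : 1
        best = prev[j-1] + cost                     # substitution or match
        if (prev[j] + 1 < best) best = prev[j] + 1  # deletion
        if (cur[j-1] + 1 < best) best = cur[j-1] + 1 # insertion
        cur[j] = best
      }
      for (j = 0; j <= m; j++) prev[j] = cur[j]
    }
    print prev[m]
  }'
}

# The second sequence is the first one shifted by one base.
hamming ACGTACGT CGTACGTA
edit_distance ACGTACGT CGTACGTA
```

For the shifted pair, the Hamming distance is 8 while the edit distance is 2, which shows why reads containing insertions or deletions need edit distance rather than Hamming distance.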

System Requirements

  • GNU/Linux. We developed and tested Hobbes under Ubuntu.
  • GCC. Hobbes uses GCC builtin functions.
  • CMake. Required for compiling Hobbes (sudo apt-get install cmake).
  • libbz2 and libz (sudo apt-get install libbz2-dev zlib1g-dev).
  • Boost-iostreams (sudo apt-get install libboost-iostreams-dev).
  • A CPU supporting the popcount instruction (check your flags in /proc/cpuinfo).
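The popcount requirement can be checked with a short sketch like the one below. It assumes an x86 Linux system, where the flag appears as popcnt in /proc/cpuinfo; has_popcnt is a hypothetical helper name, and the optional argument exists only so the check can be pointed at another file.

```shell
# Hypothetical helper: succeeds if the word "popcnt" appears in the given
# cpuinfo file (default: /proc/cpuinfo).
has_popcnt() {
  grep -qw popcnt "${1:-/proc/cpuinfo}" 2>/dev/null
}

if has_popcnt; then
  echo "popcnt: supported"
else
  echo "popcnt: not found in /proc/cpuinfo"
fi
```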

Compiling Hobbes

  • Download the Hobbes source from here.
  • Extract the contents of the archive (tar -xf file).
  • cd into the Hobbes source root directory (hobbes1.3).
  • Run "compile_16bitvector.sh" to use 16 bits for Hobbes' bitvectors.
  • Run "compile_32bitvector.sh" to use 32 bits for Hobbes' bitvectors.
  • Alternatively, run "cmake ." followed by "make".
  • The Hobbes binaries are placed in the "build" directory.

Constructing a Hobbes Index

    Usage

    ./hobbes-index --sref <input fasta file> -i <output index file> -p <number of threads> -g <gram length>

    Example

    ./hobbes-index --sref hg18.fa -i hg18.hix -p 4 -g 11

    Options

    -g <int> Use given gram length to build a Hobbes index. We recommend a gram length of 11. Gram lengths up to 16 are supported, but the index size grows dramatically beyond a gram length of 13.
    -i <file> Write the Hobbes index to the given file.
    --sref <file> Reference sequence file to index in fasta format.
    --dref <dir> Uses all fasta files in given directory as reference sequence. File names become chromosome names.
    -p <int> Use given number of parallel pthreads to construct the index.
    --noprogress Disable progress indicator.
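One way to get a feeling for the effect of the gram length is to count the distinct grams of length g over the alphabet {A, C, G, T}, which is 4^g. The sketch below only computes that number. How the index size actually depends on it is not described in this manual, so treat the link to index size as an assumption; gram_count is a hypothetical helper name.

```shell
# Hypothetical helper: number of distinct grams of the given length
# over a 4-letter alphabet, i.e. 4^length.
gram_count() {
  _len=$1
  _n=1
  while [ "$_len" -gt 0 ]; do
    _n=$((_n * 4))
    _len=$((_len - 1))
  done
  echo "$_n"
}

for g in 11 13 16; do
  echo "gram length $g: $(gram_count "$g") possible grams"
done
```

Each additional base multiplies the number of possible grams by four, so going from 11 to 13 already means 16 times as many.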

    Mapping Reads with Hobbes

    Single-End Reads

    1) Hamming distance (substitutions only):

    ./hobbes -q <input fastq file> --sref <fasta reference file> -i <hobbes index file> -a --hamming -v <hamming distance> -u <number of reads> -p <number of threads>

    2) Edit distance (substitutions/insertions/deletions):

    ./hobbes -q <input fastq file> --sref <fasta reference file> -i <hobbes index file> -a --indel -v <edit distance> -u <number of reads> -p <number of threads>

    Examples:

    ./hobbes -q reads.fq --sref hg18.fa -i hg18.hix -a --hamming -v 2 -u 10000 -p 4
    ./hobbes -q reads.fq --sref hg18.fa -i hg18.hix -a --indel -v 2 -u 10000 -p 4

    Paired-End Reads

    1) Hamming distance (substitutions only):

    ./hobbes --pe --seqfq1 <first read fastq file> --seqfq2 <second read fastq file> --sref <fasta reference file> -i <hobbes index file> -a --hamming -v <hamming distance> --min <minimum insert size> --max <maximum insert size> -u <number of reads> -p <number of threads>

    2) Edit distance (substitutions/insertions/deletions):

    ./hobbes --pe --seqfq1 <first read fastq file> --seqfq2 <second read fastq file> --sref <fasta reference file> -i <hobbes index file> -a --indel -v <edit distance> --min <minimum insert size> --max <maximum insert size> -u <number of reads> -p <number of threads>

    Examples:

    ./hobbes --pe --seqfq1 reads1.fq --seqfq2 reads2.fq --sref hg18.fa -i hg18.hix -a --hamming -v 2 --min 50 --max 150 -u 10000 -p 4
    ./hobbes --pe --seqfq1 reads1.fq --seqfq2 reads2.fq --sref hg18.fa -i hg18.hix -a --indel -v 2 --min 50 --max 150 -u 10000 -p 4
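When paired-end runs are scripted, it can help to assemble the command line in one place and check the insert-size bounds first. The following is a minimal sketch that only prints the command instead of running it. hobbes_pe_cmd is a hypothetical wrapper, the two mate files are named reads1.fq and reads2.fq purely for illustration, and only flags listed in this manual are used.

```shell
# Hypothetical helper: print a paired-end, edit-distance Hobbes command line.
# Arguments: reads1 reads2 reference index distance min_insert max_insert
hobbes_pe_cmd() {
  _r1=$1; _r2=$2; _ref=$3; _idx=$4; _dist=$5; _min=$6; _max=$7
  if [ "$_min" -gt "$_max" ]; then
    echo "minimum insert size exceeds maximum insert size" >&2
    return 1
  fi
  echo "./hobbes --pe --seqfq1 $_r1 --seqfq2 $_r2 --sref $_ref -i $_idx -a --indel -v $_dist --min $_min --max $_max"
}

hobbes_pe_cmd reads1.fq reads2.fq hg18.fa hg18.hix 2 50 150
```

Options such as -u and -p can be appended to the printed command as needed.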

    Read Input Options

    -q <file> Map single-end reads in given fastq file.
    -r <file> Map single-end reads in given line-by-line text file.
    -f <file> Map single-end reads in given fasta file.
    -c <string> Map given single-end read (only maps a single read).
    --seqfq1 <file> First fastq file for paired-end reads. Requires --pe.
    --seqfq2 <file> Second fastq file for paired-end reads. Requires --pe.
    --gzip Reads file is compressed with gzip.
    --bzip2 Reads file is compressed with bzip2.
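The line-by-line text format accepted by -r is not specified further in this manual. Assuming it means one read sequence per line, and assuming FASTQ records of exactly four lines (no wrapped sequences), a FASTQ file could be converted with a sketch like the following. The function name fastq_to_lines and the sample records are illustrative.

```shell
# Hypothetical helper: print only the sequence line (the 2nd line of every
# 4-line record) of a FASTQ file or of standard input.
fastq_to_lines() {
  awk 'NR % 4 == 2' "$@"
}

# Demo on two made-up records.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | fastq_to_lines
```

A typical use would be to redirect the output to a file, for example fastq_to_lines reads.fq > reads.txt, and pass that file to -r.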

    Reference Sequence Options

    --sref <file> Reference sequence file in fasta format.
    --dref <dir> Uses all fasta files in given directory as reference sequence. File names become chromosome names.

    Hobbes Index Options

    -i <file> Use given Hobbes index to perform mapping.

    Mapping Options

    Hobbes can find either all mappings or at most k mappings per read. Note that the running time varies accordingly.
    If a read has exact mappings, Hobbes is guaranteed to find them. Otherwise, it finds mapping(s) within the specified distance.
    By default, Hobbes maps against both the forward and the reverse reference (see --norc and --nofw).

    -a Find all mapping locations.
    -k <int> Find up to 'k' mappings per read (k-mapping mode is only supported in single-end mapping).
    --hamming Map reads using Hamming distance. This is the fastest mode of Hobbes.
    --indel Map reads using edit distance. Uses heuristics to speed up the search, and is not guaranteed to find the best possible mapping locations (but very often it does).
    -v <int> Distance threshold. Finds mappings within the given distance threshold (use --hamming for Hamming distance and --indel for edit distance).
    --pe Enable paired-end mapping mode. See --seqfq1 and --seqfq2.
    --min <int> Minimum insert size for paired-end mappings.
    --max <int> Maximum insert size for paired-end mappings.
    -u <int> Aligns only the given number of reads, starting from the first. By default, all reads are aligned. Required for useful progress indication.
    --norc Maps against forward reference only.
    --nofw Maps against reverse reference only.
    --cigar Provides the cigar of mapping results. Slower than the regular mode.
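The option name --norc suggests that the "reverse reference" is the reverse-complement strand. Under that assumption, the transformation involved can be sketched as follows. revcomp is a hypothetical helper, not part of Hobbes, and it leaves characters other than A, C, G, T (such as 'N') unchanged.

```shell
# Hypothetical helper: reverse-complement a DNA sequence.
revcomp() {
  printf '%s\n' "$1" | awk '{
    out = ""
    # Build the reversed string one character at a time.
    for (i = length($0); i >= 1; i--) out = out substr($0, i, 1)
    print out
  }' | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp ACGTTN
```

A read that matches the output of revcomp for some reference segment would be reported on the reverse strand; with --norc such matches would not be searched for.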

    Output Options

    Hobbes supports two output formats: SAM (the default) and a Bowtie-like format. Currently, paired-end mapping supports only the SAM format.

    -B Enables the Bowtie-like output format.
    --sam-nohead Enables the SAM output format. Suppresses the header lines (starting with '@').
    --sam-nosq Enables the SAM output format. Suppresses the @SQ header lines.
    --mapout <file> Prints the mappings to the specified file. By default, they are printed to the SAM output.

    Other Options

    -p <int> Runs given number of parallel pthreads to perform the mapping.
    --noprogress Disable progress indicator.
    --version Prints version information.
    --help Prints usage information.

    SAM Output Format

    In this mode, Hobbes writes its results in the SAM format. Each line holds the mappings of one read. Unmapped reads are also printed, with the appropriate flags set. Each line has the following tab-separated fields:

    1. Name of the read mapped.

    2. SAM bitwise FLAG.

    3. RNAME: Reference sequence name of the mapping. If @SQ header lines are present, RNAME must be present in one of the SQ-SN tags. '*' if there is no mapping.

    4. POS: 1-based leftmost mapping POSition of the first matching base. The first base in a reference sequence has coordinate 1. POS is set as 0 for an unmapped read.

    5. MAPQ: Mapping Quality. A value 255 indicates that the mapping quality is not available. Since we don't support this yet, it's set to 255.

    6. CIGAR: CIGAR string.

    7. RNEXT: Reference sequence name of the NEXT fragment in the template. Currently unavailable and hence set to '*'.

    8. PNEXT: Position of the NEXT fragment in the template. In single-end alignment, it is set to 0; in paired-end alignment, it is the position of the read's mate.

    9. TLEN: Signed insert size. It is set to 0 for single-end reads or when the information is unavailable.

    10. SEQ: Read sequence

    11. QUAL: ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ format). Currently, the values are set only if the input file is a FASTQ file. Otherwise, it is set to '*'.

    12. Optional fields: Fields are tab-separated. For descriptions of all possible optional fields, see the SAM format specification. The fields relevant to Hobbes are:

      1. NM:i:<N> - Mapped read has a Hamming/edit distance of <N>.

      2. XA:Z:<format> - Alternative hits; <format>: (chr,pos,CIGAR,NM;)*. pos has a prefix of '+' if the mapping is reported on the forward reference strand, and '-' for the reverse reference strand.
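The field layout above can be read with standard text tools. The following sketch lists the primary hit and any alternative hits of each read. It assumes the optional fields start at column 12, that there is no space after the XA:Z: prefix, and that unmapped reads carry '*' in RNAME as described above. sam_hits is a hypothetical helper and the two sample records are made up.

```shell
# Hypothetical helper: summarize SAM records as "read chr pos NM",
# followed by one line per alternative hit taken from the XA:Z: field.
sam_hits() {
  awk -F '\t' '
    /^@/ { next }                                  # skip header lines
    $3 == "*" { print $1, "unmapped"; next }       # no mapping reported
    {
      nm = "NA"
      for (i = 12; i <= NF; i++)
        if ($i ~ /^NM:i:/) nm = substr($i, 6)
      print $1, $3, $4, nm
      for (i = 12; i <= NF; i++) {
        if ($i ~ /^XA:Z:/) {
          n = split(substr($i, 6), alt, ";")
          for (j = 1; j <= n; j++) {
            if (alt[j] == "") continue             # trailing ";" leaves an empty item
            split(alt[j], f, ",")                  # chr,pos,CIGAR,NM
            print $1, f[1], f[2], f[4], "(alternative)"
          }
        }
      }
    }
  ' "$@"
}

# Demo on made-up records: one mapped read with two alternative hits,
# and one unmapped read.
printf 'read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\t*\tNM:i:1\tXA:Z:chr2,+500,4M,1;chr3,-700,4M,2;\n' > /tmp/hobbes_demo.sam
printf 'read2\t4\t*\t0\t255\t*\t*\t0\t0\tTTGA\t*\n' >> /tmp/hobbes_demo.sam
sam_hits /tmp/hobbes_demo.sam
```

The '+' or '-' prefix of the alternative positions is kept as is, so the strand of each alternative hit stays visible in the summary.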

    Bowtie Output Format

    This format outputs one mapping per line. Each line has the following tab-separated fields:

    1. Name of the read mapped.

    2. SAM bitwise FLAG.

    3. RNAME: Reference sequence name of the mapping. If @SQ header lines are present, RNAME must be present in one of the SQ-SN tags. '*' if there is no mapping.

    4. POS: 1-based leftmost mapping POSition of the first matching base. The first base in a reference sequence has coordinate 1. POS is set as 0 for an unmapped read.

    5. MAPQ: MAPping Quality. A value 255 indicates that the mapping quality is not available. Since we don't support this yet, it's set to 255.

    6. CIGAR: CIGAR string.

    7. RNEXT: Reference sequence name of the NEXT fragment in the template. Currently unavailable and hence set to '*'.

    8. PNEXT: Position of the NEXT fragment in the template. In single-end alignment, it is set to 0; in paired-end alignment, it is the position of the read's mate.

    9. TLEN: Signed insert size. It is set to 0 for single-end reads or when the information is unavailable.

    10. SEQ: Read sequence

    11. QUAL: ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ format). Currently, the values are set only if the input file is a FASTQ file. Otherwise, it is set to '*'.

    12. NM:i:<N> - Mapped read has a Hamming/edit distance of <N>.
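Because the Bowtie-like format writes one mapping per line, the number of mappings per read can be obtained by counting lines per read name. A minimal sketch, assuming the read name is the first tab-separated field as listed above and that the output contains no header lines; mappings_per_read is a hypothetical helper and the sample lines are made up and truncated to three fields.

```shell
# Hypothetical helper: print "read_name count" for each read in a
# Bowtie-like output file (or standard input).
mappings_per_read() {
  cut -f 1 "$@" | sort | uniq -c | awk '{ print $2, $1 }'
}

# Demo: read r1 has two mappings, read r2 has one.
printf 'r1\t0\tchr1\nr1\t0\tchr2\nr2\t16\tchr1\n' | mappings_per_read
```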

    2015 ISG | Website maintained by Jongik Kim | Created by Yun Huang | Original design Andreas Viklund