Getting Started: Files and Data Types

What kind of data can ipyrad assemble?

ipyrad can assemble any type of data that is generated using a restriction digest method (RAD, ddRAD, GBS) or related amplification-based process (e.g., NextRAD, RApture), all of which yield data that is (mostly) anchored on at least one side so that reads align fairly closely. ipyrad is not optimized for constructing long contigs from shotgun sequence data (i.e., genome assembly), but can construct reasonably sized contigs from partially merged paired-end reads, or partially overlapping reads. ipyrad is flexible to different data types and can combine reads of various lengths, so that data from different sequencing runs or projects can be easily combined.

Filtering/Trimming data

It is generally good practice to run the program fastqc on your raw data when you first get it to obtain an idea of the quality of your reads, and the presence of adapter contamination. You do not need to trim your reads before starting an assembly, since ipyrad includes a built-in and recommended trimming step during step 2 of assembly (using the software tool cutadapt). If you do choose to trim your data beforehand, however, it should not cause any problems.
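As a rough illustration of what a quality check looks at, the sketch below computes the mean Phred score of each read in a FASTQ stream. It is a minimal example and not a substitute for fastqc. It assumes Phred+33 encoded qualities and simple four-line records, and the function name mean_phred_per_read is hypothetical.

```python
import io

def mean_phred_per_read(handle, offset=33):
    """Yield (read_name, mean_quality) for each 4-line FASTQ record.

    Assumes Phred+33 encoding (offset=33) and unwrapped 4-line records.
    """
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        handle.readline()                      # sequence line (unused here)
        handle.readline()                      # '+' separator line
        qual = handle.readline().rstrip("\n")  # quality string
        scores = [ord(ch) - offset for ch in qual]
        mean = sum(scores) / len(scores) if scores else 0.0
        yield header[1:].strip(), mean

# Tiny in-memory example; a real file would be opened with open() or gzip.open()
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!II\n")
results = list(mean_phred_per_read(example))
```

Reads with a low mean score, or with quality dropping toward the 3' end, are the ones the trimming in step 2 is meant to handle.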

Step 2 of the ipyrad assembly filters and trims data based on quality scores and/or the occurrence of barcode+adapter combinations, with the exact filters applied depending on your parameter settings. For paired-end data ipyrad will merge overlapping reads (using vsearch for denovo assembly or simply based on mapping positions for reference-mapped assembly).
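To make the idea of merging overlapping paired-end reads concrete, here is a toy sketch. It is not how vsearch or ipyrad merges reads: it assumes error-free reads and exact matching in the overlap, whereas real mergers use alignment and quality scores. The names merge_pair and min_overlap are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def merge_pair(r1, r2, min_overlap=4):
    """Merge R1 with the reverse complement of R2 if they overlap exactly.

    Tries the longest overlap first. Returns the merged sequence,
    or None if no overlap of at least min_overlap bases is found.
    """
    r2rc = reverse_complement(r2)
    for olen in range(min(len(r1), len(r2rc)), min_overlap - 1, -1):
        if r1[-olen:] == r2rc[:olen]:
            return r1 + r2rc[olen:]
    return None

# A 14 bp fragment sequenced 9 bp from each end, giving a 4 bp overlap
merged = merge_pair("AAACCCGGG", "GTAAACCCG")
not_merged = merge_pair("AAACCCGGG", "TTTTTTTTT")
```

Pairs that do not overlap are left unmerged.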

Fastq Data Files and File Names

Depending on how and where your sequence data were generated, you may receive data as one giant file or as many smaller files. The files may contain data from all of your individuals mixed together, or there may be a separate file for each Sample. If they are mixed together, then the data need to be demultiplexed based on barcodes or indices. Step 1 of ipyrad can take data in either format, and will either demultiplex the reads or simply count/load the pre-demultiplexed data. See the Demultiplexing section for details.
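The following sketch shows the basic idea behind demultiplexing: reads are assigned to Samples by the barcode found at the start of each read. It is an illustration only and does not reproduce ipyrad's step 1. It assumes inline barcodes with exact matching, and the sample names, barcodes, and function name are made up.

```python
def demultiplex(reads, barcodes):
    """Sort reads into per-sample bins by their leading barcode.

    reads:    list of sequence strings
    barcodes: dict mapping sample name -> barcode sequence
    Returns (bins, unmatched); the barcode is stripped from binned reads.
    """
    bins = {name: [] for name in barcodes}
    unmatched = []
    for seq in reads:
        for name, barcode in barcodes.items():
            if seq.startswith(barcode):
                bins[name].append(seq[len(barcode):])
                break
        else:
            # no barcode matched this read
            unmatched.append(seq)
    return bins, unmatched

barcodes = {"sampleA": "AAGG", "sampleB": "CCTT"}  # hypothetical barcodes
reads = ["AAGGTGCAGACT", "CCTTTGCAGGGA", "GGGGTGCAGTTT"]
bins, unmatched = demultiplex(reads, barcodes)
```

If your data arrive already split into one file per Sample, this sorting has been done for you and step 1 only needs to load and count the reads.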

Supported data types

There is an increasingly large variety of ways to generate reduced representation genomic data sets using either restriction digestion or primer sets, and ipyrad aims to be flexible enough to handle all of these types. Because it is difficult to keep up with all of the names, we use our own terminology, described below, to group together data types that can be analyzed using the same bioinformatic methods. If you have a data type that is not described below and you’re not sure if it can be analyzed in ipyrad, let us know here.

rad – This category includes data types that use a single cutter to generate DNA fragments for sequencing based on a single cut site. e.g., RAD-seq, NextRAD.

ddrad – This category is for very similar data types, which select fragments that were digested by two different restriction enzymes, one cutting each end of the fragment. During assembly this type of data is analyzed differently from the rad data type by more stringent filtering that looks for occurrences of the second (usually more common) cutter. e.g., double-digest RAD-seq.

gbs – This category includes any data type which selects fragments that were digested by a single enzyme that cuts both ends of DNA fragments. This data type requires reverse-complement clustering because the forward vs reverse adapters can attach to either end of each fragment, and thus when shorter fragments are sequenced from either end the resulting reads often overlap partially or completely. When analyzing GBS data we strongly recommend using a stringent setting for the filter_adapters parameter. e.g., genotyping-by-sequencing (Elshire et al.), EZ-RAD (Toonen et al.).
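The reason reverse-complement clustering matters can be seen in a small sketch: two reads from the same fragment may be sequenced from opposite strands, so they only match once one of them is reverse-complemented. The example below groups reads under a strand-independent key. It uses exact matching only and is not ipyrad's clustering method, which allows for sequence divergence. The function names are hypothetical.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def canonical(seq):
    """Strand-independent key: the smaller of seq and its reverse complement."""
    return min(seq, reverse_complement(seq))

def group_by_strand_agnostic_key(reads):
    """Group reads that are identical in either orientation."""
    groups = defaultdict(list)
    for seq in reads:
        groups[canonical(seq)].append(seq)
    return dict(groups)

# "ACGTT" is the reverse complement of "AACGT", so the two share a group
groups = group_by_strand_agnostic_key(["AACGT", "ACGTT", "GGGCC"])
```

Without the reverse-complement step, the first two reads would be treated as two different loci.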

pairddrad – This category is for paired-end data from fragments that were generated through restriction digestion using two different enzymes. During step 3 the paired reads are checked for partial overlap and merged if they overlap. Because two different cutters are used, reverse-complement clustering is not necessary. e.g., double-digest RAD-seq (w/ paired-end sequencing).

pairgbs – This category is for paired-end data from fragments that were generated by digestion with a single enzyme that cuts both ends of the fragment. Because the forward adapter might bind to either end of these fragments, approximately half of the matches are expected to be reverse-complemented with perfect overlap. Paired reads are checked for merging before clustering/mapping. e.g., genotyping-by-sequencing, EZ-RAD (w/ paired-end sequencing).

2brad – This category is for a special class of sequenced fragments generated using a type IIb restriction enzyme. The reads are usually very short, and are treated slightly differently in steps 1, 3, and 6. Essentially, this data type is treated like ‘gbs’ during steps 3 and 6 (reverse-complement matching). (We are looking for people to do more testing of this method on empirical data.)

pair3rad – This category is for 3Rad/RadCap data that uses combinatorial barcodes and unique identifiers for removing PCR duplicates. These data are always paired-end, since one barcode is ligated to each read. PCR clones are removed in step 3, after merging but before dereplication. The pair3rad datatype is used for both 3Rad and RadCap types because these datatypes differ only in how they are generated, not in how they are demultiplexed and filtered. See Glenn et al. 2016 and Hoffberg et al. 2016.