Eaton & Ree (2013) single-end RAD data set

Here we demonstrate a denovo assembly for an empirical RAD data set to give a general idea of the results you might expect to recover. This example was run on a 8-core laptop with 16GB RAM, and takes about 1 hour to run completely.

We will use the 13 taxa Pedicularis data set from Eaton and Ree (2013) (open access link). This data set is composed of single-end 75bp reads from a RAD-seq library prepared with the PstI enzyme. This data set also serves as an example for several of our analysis tools to demonstrate methods for analyzing RAD-seq results. So after you finish this assembly head over there to check out fun ways to analyze the data.

Download the data set (Pedicularis)

These data are archived on the NCBI sequence read archive (SRA) under accession id SRP021469. We’ve written a convenient wrapper for sra-tools that allows ipyrad to download data from SRA, SRP, ERA, etc., IDs easily (more info here). Run the code below to download and decompress the fastq data files, which will save them into a directory called example_empirical_data/, or whatever you wish to name it. The directory will be created if it doesn’t already exist. The compressed file size is approximately 1.1GB.

# first we need to download an additional tool
>>> conda install sra-tools -c bioconda

# then, use ipyrad to download the fastq data from the SRA database
>>> ipyrad --download SRP021469 example_empirical_data/

The –download function will print the following output:

Fetching project data...
           Run    spots  mates                ScientificName              SampleName
0   SRR1754715   696994      0           Pedicularis superba           29154_superba
1   SRR1754720  1452316      0       Pedicularis thamnophila            30556_thamno
2   SRR1754730  1253109      0      Pedicularis cyathophylla      30686_cyathophylla
3   SRR1754729   964244      0       Pedicularis przewalskii       32082_przewalskii
4   SRR1754728   636625      0       Pedicularis thamnophila            33413_thamno
5   SRR1754727  1002923      0       Pedicularis przewalskii       33588_przewalskii
6   SRR1754731  1803858      0               Pedicularis rex               35236_rex
7   SRR1754726  1409843      0               Pedicularis rex               35855_rex
8   SRR1754725  1391175      0               Pedicularis rex               38362_rex
9   SRR1754723   822263      0               Pedicularis rex               39618_rex
10  SRR1754724  1707942      0               Pedicularis rex               40578_rex
11  SRR1754722  2199740      0  Pedicularis cyathophylloides  41478_cyathophylloides
12  SRR1754721  2199613      0  Pedicularis cyathophylloides  41954_cyathophylloides
Parallel connection | latituba: 8 cores
[####################] 100% 0:01:43 | downloading/extracting fastq data

13 fastq files downloaded to /home/deren/Documents/ipyrad/sandbox/pedicularis/example_empirical_data

Setup a params file

Always start an ipyrad assembly by using the -n {name} argument to create a new named Assembly. I’ll use the name pedicularis to indicate taxa being assembled.

>>> ipyrad -n pedicularis

This will print the message:

New file 'params-pedicularis.txt' created in /home/deren/Documents/ipyrad/sandbox

In this example, the data come to us already demultiplexed so we are going to simply set the sorted_fastq_path to tell ipyrad the location of our data files. Published data sets will typically be available in this way, already demultiplexed. You can select multiple files at once using regular expressions, in this example we use an asterisk (*.gz) to select all files in the directory ending in .gz. We also set a project_dir, which is useful for grouping all our results into a single directory. For this we’ll use name the project directory “analysis-ipyrad”. If this folder doesn’t exist then ipyrad will create it. Take note when entering the values below into your params file that they properly correspond to parameters 1 and 4, respectively. Use any text editor to edit the params file:

# Use your text editor to enter the following values:
# The wildcard (*) tells ipyrad to select all files ending in .gz
analysis-ipyrad                    ## [1] [project_dir] ...
example_empirical_data/*.fastq     ## [4] [sorted_fastq_path] ...

We’ll add a few additional options as well to: filter for adapters (param 16); trim the 3’ edge of R1 aligned loci by 5bp (param 26; this is optional, but helps to remove poorly aligned 3’ edges); and produce all output formats (param 27):

# enter the following params as well
2                                ## [16] [filter_adapters] ...
0, 5, 0, 0                       ## [26] [trim_loci] ...
*                                ## [27] [output_formats] ...

We’ll leave the remaining parameters at their default values.

Step 1: Load the fastq data

Start an ipyrad assembly by running step 1. When the data location is entered as a sorted_fastq_path (param 4), as opposed to the raw_fastq_path (param 2), step 1 simply counts the number of reads for each Sample and parses the file names to extract names for each Sample. For example, the file 29154_superba.fastq.gz will be assigned to Sample 29154_superba. We use the -s argument followed by 1 to tell ipyrad to run step 1. We also pass it the -r argument so that it will print a results summary when finished.

>>> ipyrad -p params-pedicularis.txt -s 1 -r

Shows the current state of the assembly:

-------------------------------------------------------------
 ipyrad [v.0.9.14]
 Interactive assembly and analysis of RAD-seq data
-------------------------------------------------------------
 Parallel connection | latituba: 8 cores

 Step 1: Loading sorted fastq data to Samples
 [####################] 100% 0:00:06 | loading reads
 13 fastq files loaded to 13 Samples.

 Parallel connection closed.

 Summary stats of Assembly pedicularis
 ------------------------------------------------
                                    state  reads_raw
 29154_superba_SRR1754715               1     696994
 30556_thamno_SRR1754720                1    1452316
 30686_cyathophylla_SRR1754730          1    1253109
 32082_przewalskii_SRR1754729           1     964244
 33413_thamno_SRR1754728                1     636625
 33588_przewalskii_SRR1754727           1    1002923
 35236_rex_SRR1754731                   1    1803858
 35855_rex_SRR1754726                   1    1409843
 38362_rex_SRR1754725                   1    1391175
 39618_rex_SRR1754723                   1     822263
 40578_rex_SRR1754724                   1    1707942
 41478_cyathophylloides_SRR1754722      1    2199740
 41954_cyathophylloides_SRR1754721      1    2199613


 Full stats files
 ------------------------------------------------
 step 1: ./analysis-ipyrad/pedicularis_s1_demultiplex_stats.txt
 step 2: None
 step 3: None
 step 4: None
 step 5: None
 step 6: None
 step 7: None

Run the remaining assembly steps

Because the point of this tutorial is to demonstrate run times and statistics, I will leave the rest of the parameters at their defaults and simply run all remaining steps. Further below I will explain in more detail the stats files for each step and the meaning of the stats values.

## run steps 2-7
>>> ipyrad -p params-pedicularis.txt -s 234567 -r

Which produces abundant progress messages:

-------------------------------------------------------------
 ipyrad [v.0.9.14]
 Interactive assembly and analysis of RAD-seq data
-------------------------------------------------------------
 Parallel connection | latituba: 8 cores

 Step 2: Filtering and trimming reads
 [####################] 100% 0:01:59 | processing reads

 Step 3: Clustering/Mapping reads within samples
 [####################] 100% 0:00:12 | dereplicating
 [####################] 100% 0:07:34 | clustering/mapping
 [####################] 100% 0:00:01 | building clusters
 [####################] 100% 0:00:00 | chunking clusters
 [####################] 100% 0:12:11 | aligning clusters
 [####################] 100% 0:00:04 | concat clusters
 [####################] 100% 0:00:03 | calc cluster stats

 Step 4: Joint estimation of error rate and heterozygosity
 [####################] 100% 0:00:31 | inferring [H, E]

 Step 5: Consensus base/allele calling
 Mean error  [0.00314 sd=0.00090]
 Mean hetero [0.02171 sd=0.00385]
 [####################] 100% 0:00:03 | calculating depths
 [####################] 100% 0:00:05 | chunking clusters
 [####################] 100% 0:09:12 | consens calling
 [####################] 100% 0:00:06 | indexing alleles

 Step 6: Clustering/Mapping across samples
 [####################] 100% 0:00:03 | concatenating inputs
 [####################] 100% 0:02:06 | clustering across
 [####################] 100% 0:00:02 | building clusters
 [####################] 100% 0:01:31 | aligning clusters

 Step 7: Filtering and formatting output files
 [####################] 100% 0:00:14 | applying filters
 [####################] 100% 0:00:09 | building arrays
 [####################] 100% 0:00:10 | writing conversions
 [####################] 100% 0:01:27 | indexing vcf depths
 [####################] 100% 0:00:24 | writing vcf output

 Parallel connection closed.

 Summary stats of Assembly pedicularis
 ------------------------------------------------
                                    state  reads_raw  ...  error_est  reads_consens
 29154_superba_SRR1754715               6     696994  ...   0.003211          29903
 30556_thamno_SRR1754720                6    1452316  ...   0.003184          43870
 30686_cyathophylla_SRR1754730          6    1253109  ...   0.003297          45856
 32082_przewalskii_SRR1754729           6     964244  ...   0.003079          34733
 33413_thamno_SRR1754728                6     636625  ...   0.003317          26228
 33588_przewalskii_SRR1754727           6    1002923  ...   0.003267          38137
 35236_rex_SRR1754731                   6    1803858  ...   0.002206          46683
 35855_rex_SRR1754726                   6    1409843  ...   0.004316          46234
 38362_rex_SRR1754725                   6    1391175  ...   0.002350          46081
 39618_rex_SRR1754723                   6     822263  ...   0.003636          37259
 40578_rex_SRR1754724                   6    1707942  ...   0.002229          48255
 41478_cyathophylloides_SRR1754722      6    2199740  ...   0.001721          47976
 41954_cyathophylloides_SRR1754721      6    2199613  ...   0.005028          64654

 [13 rows x 8 columns]


 Full stats files
 ------------------------------------------------
 step 1: ./analysis-ipyrad/pedicularis_s1_demultiplex_stats.txt
 step 2: ./analysis-ipyrad/pedicularis_edits/s2_rawedit_stats.txt
 step 3: ./analysis-ipyrad/pedicularis_clust_0.85/s3_cluster_stats.txt
 step 4: ./analysis-ipyrad/pedicularis_clust_0.85/s4_joint_estimate.txt
 step 5: ./analysis-ipyrad/pedicularis_consens/s5_consens_stats.txt
 step 6: ./analysis-ipyrad/pedicularis_across/pedicularis_clust_database.fa
 step 7: ./analysis-ipyrad/pedicularis_outfiles/pedicularis_stats.txt

Take a look at the stats summary

Each assembly that finishes step 7 will create a stats.txt output summary in the ‘assembly_name’_outfiles/ directory. This includes information about which filters removed data from the assembly, how many loci were recovered per sample, how many samples had data for each locus, and how many variable sites are in the assembled data.

>>> cat ./analysis-ipyrad/pedicularis_outfiles/pedicularis_stats.txt

Shows the contents of the stats file:

## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters

                            total_filters  applied_order  retained_loci
total_prefiltered_loci                  0              0          80481
filtered_by_rm_duplicates             828            828          79653
filtered_by_max_indels               1290           1290          78363
filtered_by_max_SNPs                  946            914          77449
filtered_by_max_shared_het            718            699          76750
filtered_by_min_sample              35889          35672          41078
total_filtered_loci                 39671          39403          41078


## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

                                   sample_coverage
29154_superba_SRR1754715                     21095
30556_thamno_SRR1754720                      31418
30686_cyathophylla_SRR1754730                26754
32082_przewalskii_SRR1754729                 14507
33413_thamno_SRR1754728                      18504
33588_przewalskii_SRR1754727                 16928
35236_rex_SRR1754731                         32588
35855_rex_SRR1754726                         32462
38362_rex_SRR1754725                         33372
39618_rex_SRR1754723                         27673
40578_rex_SRR1754724                         33260
41478_cyathophylloides_SRR1754722            31255
41954_cyathophylloides_SRR1754721            28381


## The number of loci for which N taxa have data.
## ipyrad API location: [assembly].stats_dfs.s7_loci

    locus_coverage  sum_coverage
1                0             0
2                0             0
3                0             0
4             5347          5347
5             3809          9156
6             3488         12644
7             3096         15740
8             3330         19070
9             4217         23287
10            5102         28389
11            5422         33811
12            4562         38373
13            2705         41078


The distribution of SNPs (var and pis) per locus.
## var = Number of loci with n variable sites (pis + autapomorphies)
## pis = Number of loci with n parsimony informative site (minor allele in >1 sample)
## ipyrad API location: [assembly].stats_dfs.s7_snps
## The "reference" sample is included if present unless 'exclude_reference=True'

     var  sum_var    pis  sum_pis
0   1806        0  10232        0
1   3528     3528   9935     9935
2   4758    13044   7601    25137
3   5297    28935   5109    40464
4   5183    49667   3212    53312
5   4650    72917   2029    63457
6   3893    96275   1225    70807
7   3340   119655    757    76106
8   2529   139887    456    79754
9   1967   157590    302    82472
10  1480   172390    132    83792
11  1145   184985     68    84540
12   796   194537     16    84732
13   541   201570      4    84784
14   151   203684      0    84784
15    13   203879      0    84784
16     0   203879      0    84784
17     0   203879      0    84784
18     0   203879      0    84784
19     1   203898      0    84784


## Final Sample stats summary
                                   state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  hetero_est  error_est  reads_consens  loci_in_assembly
29154_superba_SRR1754715               7     696994               689996          126896             34145    0.024641   0.003211          29903             21095
30556_thamno_SRR1754720                7    1452316              1440314          192920             50491    0.022635   0.003184          43870             31418
30686_cyathophylla_SRR1754730          7    1253109              1206947          225144             52464    0.020622   0.003297          45856             26754
32082_przewalskii_SRR1754729           7     964244               955480          142366             41046    0.027211   0.003079          34733             14507
33413_thamno_SRR1754728                7     636625               626084          165338             30754    0.024820   0.003317          26228             18504
33588_przewalskii_SRR1754727           7    1002923               993873          148920             44642    0.025917   0.003267          38137             16928
35236_rex_SRR1754731                   7    1803858              1787366          401906             52694    0.019709   0.002206          46683             32588
35855_rex_SRR1754726                   7    1409843              1397068          164312             54484    0.025071   0.004316          46234             32462
38362_rex_SRR1754725                   7    1391175              1379626          124417             51061    0.016379   0.002350          46081             33372
39618_rex_SRR1754723                   7     822263               813990          138973             42451    0.022817   0.003636          37259             27673
40578_rex_SRR1754724                   7    1707942              1695523          210842             54539    0.019760   0.002229          48255             33260
41478_cyathophylloides_SRR1754722      7    2199740              2185364          162093             53191    0.015180   0.001721          47976             31255
41954_cyathophylloides_SRR1754721      7    2199613              2176210          286667             72791    0.017415   0.005028          64654             28381


## Alignment matrix statistics:
snps matrix size: (13, 203898), 35.26% missing sites.
sequence matrix size: (13, 2840602), 35.60% missing sites.

Take a peek at the .loci output

This is the first place I look when an assembly finishes. It provides a clean view of the data with variable sites (-) and parsimony informative SNPs (*) highlighted. Use the unix commands less or head to look at this file briefly. Each locus is labelled with a number corresponding to the locus order before filters are applied in step 7. If you branch this assembly and run step 7 again with a different set of parameters you may recover fewer or more total loci.

## head -n 50 prints just the first 50 lines of the file to stdout
>>> head -n 50 analysis-ipyrad/pedicularis_outfiles/pedicularis.loci
29154_superba_SRR1754715              TAGGGTGGGTCTCGTTCAAGGTATTCGAACAACAGGGTACCCTGCGAACTTCCAAATTCACCCTCATCG
30556_thamno_SRR1754720               TAGGGTGGGTCKCGTTCAAGGTATTCGAACAACAGAGTACCCTGCGAACTTCCAAATTCACCCTCATCG
30686_cyathophylla_SRR1754730         TAGGGTGGGTCTCGTTCAAGGTATTCGAACAACAGGGTACCCTGCGAACTTCCAAATTCACCCTCATCG
32082_przewalskii_SRR1754729          TAGGGTGGGTCTCGTTCAAGGTATTCGAACAAGAGGGTACCCTGCGAACTTCCAAATTCACCCTCNTCG
33413_thamno_SRR1754728               TAGGGTGGGTCTCGTTCAAGGTATTCGAACAACAGAGTACCCTGCGAACTTCCAAATTCACCCTCATCG
33588_przewalskii_SRR1754727          TAGGGTGGGTCTCNTTCAAGGTATTCGAACAASAGGGTACCCTGCGAACTTCCAAATTCACCCTCATCG
35236_rex_SRR1754731                  TAGGGTGGGTCTCGTTCAAGGTATTCGAACAACAGAGTACCCTGCGAACTTCCAAATTCACCCTCATCG
35855_rex_SRR1754726                  TAGGGTGGGTCTCGTTCAAGGTATTCGAACAACAGAGTACCCTGCGAACTTCCAAATTCACCCTCATCG
38362_rex_SRR1754725                  TAGGGTGGGTCTCGTTCAAGGTATTCGAACAACAGAGTACCCTGCGAACTTCCAAATTCACCCTCATCG
39618_rex_SRR1754723                  TAGGGTGGGTCTCGTTCAAGGTATTCGAACAACAGAGTACCCTGCGAACTTCCAAATTCACCCTCATCG
40578_rex_SRR1754724                  TAGGGTGGGTCTCGTTCAAGGTATTCGAACAACAGAGTACCCTGCGAACTTCCAAATTCACCCTCATCA
41478_cyathophylloides_SRR1754722     TAGGGTGGGTCTCGTTCAAGGTATTCGAACAATAGGGTACCCAGCGAACTTCCAAATTCACCCTCATCG
41954_cyathophylloides_SRR1754721     TAGGGTGGGTCTCGTTCAAGGTATTCGAACAATAGGGTACCCAGCGAACTTCCAAATTCACCCTCATCG
//                                               -                    *  *      *                         -|0|
29154_superba_SRR1754715              ACGACGTCTCTCCCCGAGCCGGCTATCAGGAGACGGATTTTCGAGATGGGGGGTCGTTTTGCTGTTTGT
30556_thamno_SRR1754720               ACGACGTCTCTCCCCGAGCCGGCTATCAGGAGACGGATTTTCGAGATGGGGGGTCGTTTTGCTGTTTGT
33413_thamno_SRR1754728               ACGACGTCTCTCCCCGAGMCGGCTATCAGGAGACGGATTTTCGAGATGGGGGGTCGTTTTGCTGTTTGT
40578_rex_SRR1754724                  ACGACGTCTCTCCCCGAGCCGGCTATCAGGAGACGGATTTTCGAGATGGGGGGTCGTTTTGCTGTTTGT
//                                                      -                                                  |1|
29154_superba_SRR1754715              CTTGGCACTGAATTAGCAGAACTTCAACAATTAAGTCTCCAGTATAATTGAATTYGATTTAATTTAATT
30686_cyathophylla_SRR1754730         CTTGGCACTGAATTAGCAGAACTTCAACAATTAAGTCTCCAGTATAACTGAATTTGATTTAATTTAATT
35236_rex_SRR1754731                  CTTGGCACTGAAGTAGCAGAACTTCAACAATTAAGTCTCCAGTATAATTGAATTTGATTTAATTTAATG
35855_rex_SRR1754726                  CTTGGCACTGAAGTAGCAGAACTTCAACAATTAAGTCTCCGGTATAATTGAATTTGATTTAATTTAATG
38362_rex_SRR1754725                  CTTGGCACTGAAGTAGCAGAACTTCAACAATTAAGTCTCAGGTATAATTGAATTTGATTTAATTTAATT
40578_rex_SRR1754724                  CTTGGCACTCAAGTAGCAGAACTTCAACAATTAAGTCTCCGGTATAATTGAATTTGATTTAATTTAATG
41478_cyathophylloides_SRR1754722     CTTGGCACTGAAGTAGCAGAACTTCAACAATTAAGTCTCCAGTATAATTGAATTTGATTTAATTTAATT
41954_cyathophylloides_SRR1754721     CTTGGCACTGAAGTAGCAGAACTTCAACAATTAAGTCTCCAGTATAATTGAATTTGATTTAATTTAATT
//                                             -  *                          -*      -      -             *|2|
29154_superba_SRR1754715              GATCCTGAAATGACAASAAACATAACANGGGGGTAATTTTTTGTAATTAT---CCCTTAGA-TAAACTATACA
33413_thamno_SRR1754728               GATCCTGAAACGACAACAAACATAACACGGGGGTAATYTTTTGTAATTAT---CCCTTMGA-TAAACTATACA
35236_rex_SRR1754731                  GATCCTGAAACGACAACAAACATAACACGGGGGTAATTTTTTGTAATTAT---CCCTTAGA-TAAACTATACA
38362_rex_SRR1754725                  GATCCTGAAACGACAACAAACATAACACGGGGGTAATTTTTTGTAATTAT---CCCTTAGA-TAAACTATACA
39618_rex_SRR1754723                  GATCCTGAAACGACAACAAACATAACACGGGGGTAATTTTTTGTAATTAT---CCCTTAGA-TAAACTATACA
40578_rex_SRR1754724                  GATCCTGAAACGACAACAAACATAACAYGGGGGTAATTTTTTGTAATTAY---CCCTTAGA-TAAACTATACA
41478_cyathophylloides_SRR1754722     GATCCTGAAATGACAACAAACATAACAGGGGGGTAATTTTTTGTAATTATCCCCCCTTAGATTAAACTA----
41954_cyathophylloides_SRR1754721     GATCCTGAAATGACAACAAACATAACAGGGGGGTAATTTTTTGTAATTATCCCCCCTTAGATTAAACT-----
//                                              *     -          *         -           -        -              |3|
29154_superba_SRR1754715              AAAACAGGATGAGTGCATATCTCTCGTTCTAACTACTGCAATGCTAGGNAAATAAAATACAGACTAAAA
30686_cyathophylla_SRR1754730         AAAACAGGATGAGTGCATATCTCTCGTTCTAACTACTGCAATGCTAGGTAAATAAAATACAGACTAAAA
32082_przewalskii_SRR1754729          AAAACAGGATGAGTGCATATCTCTCGTACTAACTACTGCAATGCTAGGTAAATAAAATACAGACTAAAA
33588_przewalskii_SRR1754727          AAAACAGGATGAGTGCATATCTCTCGTACTAACTACTGCAATGCTAGGTAAATAAAATACAGACTAAAA
41478_cyathophylloides_SRR1754722     AAAACAGGATGAGTGCATATCTCTCGTTTTAACTACTGCAATGCTAGGTAAATAAAATAGAGACTAAAA
41954_cyathophylloides_SRR1754721     AAAACAGGATGAGTGCATATCTCTCGTTTTAACTACTGCAATGCTAGGTAAATAAAATAGAGACTAAAA
//                                                               **                              *         |4|

peek at the .phy files

This is the concatenated sequence file of all loci in the data set. It is typically used in phylogenetic analyses, like in the program raxml. This super matrix is 13 taxon deep by 2.44 Mbp long.

## cut -c 1-80 prints only the first 80 characters of the file
>>> cut -c 1-80 analysis-ipyrad/pedicularis_outfiles/pedicularis.phy

Shows a small piece of the data:

13 2840602
29154_superba_SRR1754715              TAGGGTGGGTCTCGTTCAAGGTATTCGAACAACAGGGTACCC
30556_thamno_SRR1754720               TAGGGTGGGTCKCGTTCAAGGTATTCGAACAACAGAGTACCC
30686_cyathophylla_SRR1754730         TAGGGTGGGTCTCGTTCAAGGTATTCGAACAACAGGGTACCC
32082_przewalskii_SRR1754729          TAGGGTGGGTCTCGTTCAAGGTATTCGAACAAGAGGGTACCC
33413_thamno_SRR1754728               TAGGGTGGGTCTCGTTCAAGGTATTCGAACAACAGAGTACCC
33588_przewalskii_SRR1754727          TAGGGTGGGTCTCNTTCAAGGTATTCGAACAASAGGGTACCC
35236_rex_SRR1754731                  TAGGGTGGGTCTCGTTCAAGGTATTCGAACAACAGAGTACCC
35855_rex_SRR1754726                  TAGGGTGGGTCTCGTTCAAGGTATTCGAACAACAGAGTACCC
38362_rex_SRR1754725                  TAGGGTGGGTCTCGTTCAAGGTATTCGAACAACAGAGTACCC
39618_rex_SRR1754723                  TAGGGTGGGTCTCGTTCAAGGTATTCGAACAACAGAGTACCC
40578_rex_SRR1754724                  TAGGGTGGGTCTCGTTCAAGGTATTCGAACAACAGAGTACCC
41478_cyathophylloides_SRR1754722     TAGGGTGGGTCTCGTTCAAGGTATTCGAACAATAGGGTACCC
41954_cyathophylloides_SRR1754721     TAGGGTGGGTCTCGTTCAAGGTATTCGAACAATAGGGTACCC

peek at the .snps.phy file

This is similar to the phylip file format, but only variable site columns are included. All SNPs are in the file, in contrast to the .u.snps.phy file, which randomly selects only a single SNP per locus.

## cut -c 1-80 prints only the first 80 characters of the file
>>> cut -c 1-80 analysis-ipyrad/pedicularis_outfiles/pedicularis.snps.phy

Just the snps:

13 203898
29154_superba_SRR1754715              TCGTGCGTCATYTTSNTTATCCGAYYACTGTGTAAGCCGGGG
30556_thamno_SRR1754720               KCATGCNNNNNNNNNNNNNNNNGACCGNNGTGTAAGCCGGGG
30686_cyathophylla_SRR1754730         TCGTGNGTCACTTNNNNNNTCCGACCACTGTGTAAGCCAGGG
32082_przewalskii_SRR1754729          TGGTGNNNNNNNNNNNNNNACCNNNNNNNNNNNNNNNTGCAA
33413_thamno_SRR1754728               TCATGMNNNNNNNCCCYTMNNNGACCGNNGTGTAAGCNNNNN
33588_przewalskii_SRR1754727          TSGTGNNNNNNNNNNNNNNACCATCCGNNACATAAGCTGCAA
35236_rex_SRR1754731                  TCATGNGGCATTGCCCTTANNNGACCGYTGTGTRRGCNNNNN
35855_rex_SRR1754726                  TCATGNGGCGTTGNNNNNNNNNGACCGCTGTGTAAGCNNNNN
38362_rex_SRR1754725                  TCATGNGGAGTTTCCCTTANNNGACCGNNTTGTAAGANNNNN
39618_rex_SRR1754723                  TCATGNNNNNNNNCCCTTANNNGACCGNNNNNNNNNNNNNNN
40578_rex_SRR1754724                  TCATACCGCGTTGCCYTYANNNGACCGCWGTGAAARCNNNNN
41478_cyathophylloides_SRR1754722     TTGAGNGGCATTTTCGTTATTGGACCGCTGTGTAAGCCGGGG
41954_cyathophylloides_SRR1754721     TTGAGNGGCATTTTCGTTATTGGACCGCTGTGTAAGCCGGGG

peek at the .vcf.gz file

The VCF output for ipyrad contains the full sequence information for all samples as well as the sequencing depth information for all base calls that were made. This file should be easily parsable if users want to extract information or modify it so that this file can be used in other software such as GATK. We are working on developing our own population-aware genotype caller that will correct low-depth base calls at this stage. Stay tuned.

## gunzip -c decompresses the file and passes it to the pipe (|)
## head -n 50 reads data from the pipe and show the first 50 lines.
## and we pipe this to 'cut', which shows only the first 80 rows of data
## for easier viewing.

>>> head -n 50 analysis-ipyrad/pedicularis_outfiles/pedicularis.vcf | cut -c 1-80
##fileformat=VCFv4.0
##fileDate=2017/02/14
##source=ipyrad_v.0.7.28
##reference=pseudo-reference (most common base at site)
##phasing=unphased
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Dat
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=CATG,Number=1,Type=String,Description="Base Counts (CATG)">
#CHROM  POS ID  REF ALT QUAL  FILTER  INFO  FORMAT  29154_superba 30556_thamno  30
locus_1 2 . A C,G 13  PASS  NS=5;DP=49  GT:DP:CATG  0/0:9:0,9,0,0 1/1:7:7,0,0,0
locus_1 3 . T A 13  PASS  NS=5;DP=49  GT:DP:CATG  0/0:9:0,0,9,0 1/1:7:0,7,0,0 0
locus_1 19  . A G 13  PASS  NS=5;DP=49  GT:DP:CATG  0/0:9:0,9,0,0 1/1:7:0,0,0,7
locus_1 21  . C T 13  PASS  NS=5;DP=49  GT:DP:CATG  1/1:9:0,0,9,0 0/0:7:7,0,0,0
locus_1 30  . T C 13  PASS  NS=5;DP=49  GT:DP:CATG  0/0:9:0,0,9,0 0/0:7:0,0,7,0
locus_1 36  . C T 13  PASS  NS=5;DP=49  GT:DP:CATG  0/0:9:9,0,0,0 1/1:7:0,0,7,0
locus_2 15  . A G 13  PASS  NS=11;DP=210  GT:DP:CATG  0/0:12:0,12,0,0 0/0:24:0,2
locus_2 16  . A T,C 13  PASS  NS=11;DP=210  GT:DP:CATG  0/0:12:0,12,0,0 0/0:24:0
locus_2 18  . A T 13  PASS  NS=11;DP=210  GT:DP:CATG  0/0:12:0,12,0,0 0/0:24:0,2
locus_2 20  . T C 13  PASS  NS=11;DP=210  GT:DP:CATG  1/1:12:12,0,0,0 0/0:24:0,0
locus_2 29  . T C 13  PASS  NS=11;DP=209  GT:DP:CATG  0/0:12:0,0,12,0 0/0:23:0,0
locus_2 30  . A G 13  PASS  NS=11;DP=209  GT:DP:CATG  0/0:12:0,12,0,0 0/0:23:0,2
locus_2 47  . T A 13  PASS  NS=11;DP=210  GT:DP:CATG  0/0:12:0,0,12,0 0/0:24:0,0
locus_3 46  . A T 13  PASS  NS=6;DP=69  GT:DP:CATG  1/1:10:0,0,10,0 ./.:0:0,0,0,
locus_3 62  . C A 13  PASS  NS=6;DP=68  GT:DP:CATG  0/0:10:10,0,0,0 ./.:0:0,0,0,
locus_6 11  . G A 13  PASS  NS=4;DP=67  GT:DP:CATG  1/1:11:0,11,0,0 0/0:7:0,0,0,
locus_6 29  . G A 13  PASS  NS=4;DP=67  GT:DP:CATG  1/1:11:0,11,0,0 0/0:7:0,0,0,
locus_6 34  . C A 13  PASS  NS=4;DP=67  GT:DP:CATG  1/1:11:0,11,0,0 0/0:7:7,0,0,
locus_6 35  . G T 13  PASS  NS=4;DP=67  GT:DP:CATG  0/0:11:0,0,0,11 0/0:7:0,0,0,
locus_6 40  . T C 13  PASS  NS=4;DP=67  GT:DP:CATG  0/0:11:0,0,11,0 1/1:7:7,0,0,
locus_9 19  . A C 13  PASS  NS=13;DP=224  GT:DP:CATG  0/0:12:0,12,0,0 0/0:22:0,2
locus_9 25  . A G 13  PASS  NS=13;DP=224  GT:DP:CATG  0/0:12:0,12,0,0 0/0:22:0,2
locus_11  4 . C T 13  PASS  NS=10;DP=137  GT:DP:CATG  0/0:11:11,0,0,0 0/0:17:17,
locus_11  13  . C T 13  PASS  NS=10;DP=137  GT:DP:CATG  1/1:11:0,1,10,0 0/0:17:17
locus_11  21  . G A 13  PASS  NS=10;DP=137  GT:DP:CATG  0/0:11:0,0,0,11 0/0:17:0,
locus_11  23  . A G 13  PASS  NS=10;DP=137  GT:DP:CATG  0/0:11:0,11,0,0 1/1:17:0,
locus_11  24  . A T 13  PASS  NS=10;DP=137  GT:DP:CATG  1/1:11:0,0,11,0 0/0:17:0,
locus_11  38  . G T 13  PASS  NS=10;DP=137  GT:DP:CATG  0/0:11:0,0,0,11 0/0:17:0,
locus_11  42  . A G 13  PASS  NS=10;DP=137  GT:DP:CATG  0/0:11:0,11,0,0 0/0:17:0,
locus_11  54  . A G 13  PASS  NS=10;DP=137  GT:DP:CATG  0/0:11:0,11,0,0 0/0:17:0,
locus_11  55  . A G 13  PASS  NS=10;DP=137  GT:DP:CATG  0/0:11:0,11,0,0 0/0:17:0,
locus_12  6 . C T 13  PASS  NS=7;DP=94  GT:DP:CATG  1/0:7:4,0,3,0 ./.:0:0,0,0,0
locus_12  20  . C T 13  PASS  NS=7;DP=94  GT:DP:CATG  0/0:7:7,0,0,0 ./.:0:0,0,0,0
locus_12  31  . T C 13  PASS  NS=7;DP=94  GT:DP:CATG  0/0:7:0,0,7,0 ./.:0:0,0,0,0
locus_12  33  . G A 13  PASS  NS=7;DP=94  GT:DP:CATG  1/1:7:0,7,0,0 ./.:0:0,0,0,0
locus_12  37  . G T 13  PASS  NS=5;DP=52  GT:DP:CATG  0/0:7:0,0,0,7 ./.:0:0,0,0,0
locus_12  43  . C G 13  PASS  NS=7;DP=94  GT:DP:CATG  1/1:7:0,0,0,7 ./.:0:0,0,0,0
locus_12  45  . G C,A 13  PASS  NS=7;DP=94  GT:DP:CATG  0/0:7:0,0,0,7 ./.:0:0,0,0
locus_12  47  . G T 13  PASS  NS=7;DP=94  GT:DP:CATG  0/0:7:0,0,0,7 ./.:0:0,0,0,0