Output Formats

By default ipyrad writes every output format it is capable of generating. Converting between the various formats is very fast, but if you want to save CPU time and disk space, you can enable only specific output formats with the output_formats parameter.

Variant Call Format *.vcf.gz

VCF is a standard format for storing and manipulating sequence variation data. The format is too complicated to describe here, but you can find a good explanation on the 1000 Genomes Project site. The VCF output by ipyrad includes full genotype information for all bases in all loci, including information about genotype quality. Many useful conversions and filtering options for this format are available in the software VCFtools.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  1A_0    1B_0    1C_0    1D_0    2E_0    2F_0    2G_0    2H_0    3I_0    3J_0    3K_0    3L_0
0  0       .       G       .       13      PASS    NS=12;DP=235    GT:CATG 0/0:0,0,0,19    0/0:0,0,0,22    0/0:0,0,0,20    0/0:0,0,0,19    0/0:0,0,0,18    0/0:0,0,0,22    0/0:0,0,0,20    0/0:0,0,0,21    0/0:0,0,0,15    0/0:0,0,0,14    0/0:0,0,0,24    0/0:0,0,0,21
0  1       .       T       .       13      PASS    NS=12;DP=235    GT:CATG 0/0:0,0,19,0    0/0:0,0,22,0    0/0:0,0,20,0    0/0:0,0,19,0    0/0:0,0,18,0    0/0:0,0,22,0    0/0:0,0,20,0    0/0:0,0,21,0    0/0:0,0,15,0    0/0:0,0,14,0    0/0:0,0,24,0    0/0:0,0,21,0
0  2       .       T       .       13      PASS    NS=12;DP=235    GT:CATG 0/0:0,0,19,0    0/0:0,0,22,0    0/0:0,0,20,0    0/0:0,0,19,0    0/0:0,0,18,0    0/0:0,0,22,0    0/0:0,0,19,1    0/0:0,0,21,0    0/0:0,0,15,0    0/0:0,0,14,0    0/0:0,0,24,0    0/0:0,0,21,0
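
As a rough illustration of how the per-sample fields in the rows above can be read, here is a minimal Python sketch. The function name parse_vcf_line is hypothetical, and the C, A, T, G ordering of the CATG depth counts is an assumption inferred from the example rows (the REF=G site has its depth in the fourth slot, the REF=T sites in the third). The input line is the third data row above, shortened to its first three samples.

```python
# Assumed order of the four depth counts in the CATG field (inferred from
# the example rows, not taken from a specification).
CATG_ORDER = "CATG"

def parse_vcf_line(line, sample_names):
    """Split one VCF data line into site info and per-sample genotype/depths."""
    fields = line.split()
    chrom, pos, _id, ref, alt, _qual, _filt, _info, fmt = fields[:9]
    keys = fmt.split(":")  # e.g. ["GT", "CATG"]
    samples = {}
    for name, entry in zip(sample_names, fields[9:]):
        values = dict(zip(keys, entry.split(":")))
        depths = [int(x) for x in values["CATG"].split(",")]
        samples[name] = {
            "GT": values["GT"],
            "depth": dict(zip(CATG_ORDER, depths)),
        }
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
            "samples": samples}

# Third data row from the example, first three samples only.
names = ["1A_0", "1B_0", "1C_0"]
line = ("0  2  .  T  .  13  PASS  NS=12;DP=235  GT:CATG  "
        "0/0:0,0,19,0  0/0:0,0,22,0  0/0:0,0,20,0")
site = parse_vcf_line(line, names)
print(site["ref"], site["samples"]["1C_0"]["depth"])
```

For real analyses a dedicated VCF parser or VCFtools is likely a better choice; the sketch is only meant to show how the GT:CATG column maps onto each sample.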

ipyrad format *.loci

This is a custom format that is easy to read, showing each individual locus with its variable sites indicated. Custom scripts can easily parse this file to find loci with a given amount of taxon coverage or number of variable sites. It is also the most easily readable file for checking that your analyses are working properly. A (-) indicates a variable site, and a (*) indicates that the site is phylogenetically informative. The integer enclosed by | characters is the locus number. Example:

1A_0     GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
1B_0     GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
1C_0     GTTATCCGTAGCGATTATTACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
1D_0     GTTATCCGTAGCGATTATCACCTCAGTTAKATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGGCGGACGCAGCTAGTC
2E_0     GTTATCCGTAGCGATTATTACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
2F_0     GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGKACGCAGCTAGTC
2G_0     GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
2H_0     GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
3I_0     GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
3J_0     GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGSGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
3K_0     GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
3L_0     GTTATCGGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACACAGCTAGTC
//             -           *        * -                     -                     -  -  -         |0|
1A_0     ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
1B_0     ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAATAGGGCTCCATATCAAGTGATMAGCTAGGCTTCGAGTCGTATC
1C_0     ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
1D_0     ACAGCTCTGTTACATRCATCTGTCCATACTCCCTGGTTCGTAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
2E_0     ACAGCTCTATTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
2F_0     ACAGCTCTATTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
2G_0     ACAGCTCTATTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
2H_0     ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGYGATCAGCTAGGCTTCGAGTCGTATS
3I_0     ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
3J_0     ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
3K_0     ACAGCTCTGTTACATGCATCTGTCMATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
3L_0     ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTAYC
//               *      -        -               -   *                  -   -                   --|1|
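
The kind of custom script mentioned above might look like the following Python sketch. The function name parse_loci and the toy input are made up for illustration. The sketch assumes every locus ends with a line starting with //, and it counts both - and * markers as variable sites, which is one reading of the description above.

```python
def parse_loci(lines):
    """Yield one summary dict per locus from the lines of a .loci file."""
    seqs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("//"):
            # Marker line: SNP markers, then the locus number between | characters.
            markers, locus_id = line.split("|")[:2]
            yield {
                "locus": int(locus_id),
                "ntaxa": len(seqs),
                "nvariable": markers.count("-") + markers.count("*"),
                "ninformative": markers.count("*"),
                "seqs": seqs,
            }
            seqs = {}
        else:
            name, seq = line.split()
            seqs[name] = seq

# Toy locus (not real ipyrad output): four samples, six bases.
toy = [
    "1A_0     GTTATC",
    "1B_0     GTTATC",
    "1C_0     GTAATC",
    "1D_0     GTAATG",
    "//         *  -|0|",
]
loci = list(parse_loci(toy))

# Keep loci with at least 4 taxa and at least one informative site.
kept = [loc["locus"] for loc in loci
        if loc["ntaxa"] >= 4 and loc["ninformative"] >= 1]
print(kept)
```

On a real file the same generator can be fed an open file handle, for example parse_loci(open(path)), where path is whatever .loci file you want to inspect.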

For paired-end data the two linked loci are shown separated by an ‘nnnn’ separator. Merged reads will not contain the ‘nnnn’:

1A0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTAnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
1B0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTAnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
1C0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGAAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
1D0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
2E0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTSnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCGGTATCCGACCT
2F0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCGGTATCCGACCT
2G0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTTTAAGATACCAAACCCTGTCCCAGCATTACGTCCCGGTATCCGACCT
2H0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACTTCCCGGTATCCGACCT
3I0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGACCCAGCATTACGTCCCTGTATCCGACCT
3J0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGACCCAGCATTACGTCCCTGTATCCGACCT
3K0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGACCCAGCATTACGTCCCTGTATCCGACCT
3L0     GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGAGACYAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
//                                                                 -                          *       -     -  -        *           -    *           |0|
1A0     GACAAATCTTACATTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
1B0     GACAAATCTTAGATTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
1C0     GACAAATCTTAGATTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTAATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
1D0     GACAAATCTTAGTTTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGAACCAAACGCAGGTGGAGGACCCAAGAAC
2E0     GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
2F0     GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
2G0     GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
2H0     GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
3I0     GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
3J0     GACAAATCTCAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
3K0     GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
3L0     GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
//               - -- *                                         -                                                       -                            |1|

PHYLIP *.phy

This is a PHYLIP-formatted data file that contains all of the loci from the .loci file concatenated into a supermatrix, with missing data for any sample filled in with N’s. This format is used by RAxML, among other phylogenetic programs. The header here indicates that there are 12 samples and 89023 bases in each sequence. Because the sequences are so long, the output is truncated here for clarity (indicated by the ellipses).

12 89023
1A_0     GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTCACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAAT...
1B_0     GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTCACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAAT...
1C_0     GTTATCCGTAGCGATTATTACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTCACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAAT...
1D_0     GTTATCCGTAGCGATTATCACCTCAGTTAKATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGGCGGACGCAGCTAGTCACAGCTCTGTTACATRCATCTGTCCATACTCCCTGGTTCGTAATCAT...
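
To make the header and supermatrix layout concrete, the following Python sketch reads a sequential PHYLIP file of this shape and reports how much of each sample is missing. The function name read_phylip and the toy alignment are illustrative only; the sketch assumes one sample per line, as in the example above.

```python
def read_phylip(lines):
    """Parse sequential PHYLIP lines into (ntaxa, nsites, {name: sequence})."""
    rows = [ln.split() for ln in lines if ln.strip()]
    ntaxa, nsites = int(rows[0][0]), int(rows[0][1])
    seqs = {name: seq for name, seq in rows[1:]}
    # Check the data against the header counts.
    if len(seqs) != ntaxa:
        raise ValueError(f"header says {ntaxa} samples, found {len(seqs)}")
    for name, seq in seqs.items():
        if len(seq) != nsites:
            raise ValueError(f"{name}: expected {nsites} bases, found {len(seq)}")
    return ntaxa, nsites, seqs

# Toy alignment (not real ipyrad output): 3 samples, 8 bases each.
toy = [
    "3 8",
    "1A_0     GTTATCCG",
    "1B_0     GTTANNNN",
    "1C_0     GTTATTAC",
]
ntaxa, nsites, seqs = read_phylip(toy)

# Fraction of each sample's supermatrix that is missing (N).
missing = {name: seq.count("N") / nsites for name, seq in seqs.items()}
print(missing)
```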

*.snps.phy & *.u.snps.phy

Additionally, we provide two PHYLIP-formatted files that include only variable sites (SNPs). Paired loci are treated as a single locus, meaning SNPs from the two reads are not separated in these files (they’re linked). The *.snps.phy file contains all SNPs from all loci concatenated together, with missing values filled by N’s. The *.u.snps.phy file contains one SNP sampled from each locus. If a locus has multiple SNPs, the SNP site with the least missing data across taxa is sampled; if sites have equal amounts of missing data, one of them is sampled at random. The header indicates that this file contains 12 samples and 990 bases per sample. The output below is truncated for clarity.

12 990
1A_0     GAATGACATCCTCAAACACCCTGGATACGGACAACGAAATTGCACTCATCAGACAAAGAAATTACWGAGGAACCCATGAGAGACCGCCTYCARYA...
1B_0     GAAASRCATACTCAAACACCCTKGATACGGACAACGAAATTGCACTCATCAGACAAAGAAATTACAGAGGAACCCAAGAGAGACCGCCTTCAATA...
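
The sampling rule for the unlinked SNP file can be sketched in Python as follows. This is not ipyrad's own code, only an illustration of the rule as described: the function name pick_unlinked_snp is hypothetical, each SNP is represented as a string holding one base per sample, and only N is counted as missing.

```python
import random

def pick_unlinked_snp(columns, rng=random):
    """Return the index of one SNP for a locus.

    columns: one string per SNP in the locus, one character per sample.
    The SNP with the fewest N's wins; ties are broken at random.
    """
    missing = [col.upper().count("N") for col in columns]
    fewest = min(missing)
    candidates = [i for i, m in enumerate(missing) if m == fewest]
    return rng.choice(candidates)

# Toy locus with three SNPs across four samples (made-up data).
locus_snps = ["AANA", "CCCT", "GNNG"]
chosen = pick_unlinked_snp(locus_snps)
print(chosen, locus_snps[chosen])
```

Passing a seeded random.Random instance as rng makes the tie-breaking reproducible, which can be useful when comparing runs.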

MAP/PARTITION (*.snps.map)

Because the concatenated SNPs file does not record which SNPs come from which locus, we provide a map file with this information. It is used by the program tetrad to randomly sample single SNPs from among loci.

1       rad0_snp0       0       1
1       rad0_snp1       0       2
1       rad0_snp2       0       3
1       rad0_snp3       0       4
1       rad0_snp4       0       5
2       rad1_snp0       0       6
2       rad1_snp1       0       7
2       rad1_snp2       0       8
2       rad1_snp3       0       9
3       rad2_snp0       0       10
3       rad2_snp1       0       11
3       rad2_snp2       0       12
3       rad2_snp3       0       13
3       rad2_snp4       0       14
3       rad2_snp5       0       15
3       rad2_snp6       0       16
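
A short Python sketch of how a map file like this could be used to draw one SNP per locus. The column meanings are inferred from the example above and are an assumption: the first column is taken as the locus number and the last as the SNP's position in the concatenated SNP matrix. The function name read_snps_map is hypothetical, and this is not tetrad's own sampling code.

```python
import random
from collections import defaultdict

def read_snps_map(lines):
    """Group SNP column positions by locus number."""
    by_locus = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        locus, _name, _unused, column = line.split()
        by_locus[int(locus)].append(int(column))
    return dict(by_locus)

# First nine lines of the example above.
map_lines = [
    "1       rad0_snp0       0       1",
    "1       rad0_snp1       0       2",
    "1       rad0_snp2       0       3",
    "1       rad0_snp3       0       4",
    "1       rad0_snp4       0       5",
    "2       rad1_snp0       0       6",
    "2       rad1_snp1       0       7",
    "2       rad1_snp2       0       8",
    "2       rad1_snp3       0       9",
]
by_locus = read_snps_map(map_lines)

# Draw one SNP column per locus with a seeded generator.
rng = random.Random(42)
picks = {locus: rng.choice(cols) for locus, cols in by_locus.items()}
print(picks)
```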

EIGENSTRAT *.geno & *.u.geno

This is a SNP-based format. Each line corresponds to one SNP, with one column per sample. The value in each sample column indicates the number of copies of the reference allele that individual has; 9 indicates missing data. Below you will see standard .geno output from the simulated data, so there are 12 columns, one per sample. This format is used by EIGENSTRAT, SMARTPCA, and ADMIXTURE, among other programs.

There is an additional *.u.geno file that includes only unlinked SNPs, with one SNP randomly chosen per locus and the rest ignored.

222222222220
220202222222
000222222222
222122222222
222222222122
222022222222
222221222222
222222222220
222200022222
222122222222
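
As an illustration of the encoding, this Python sketch reads .geno lines into a matrix and computes the reference allele frequency at each SNP, skipping missing calls. The function names are hypothetical and the calculation assumes diploid samples (values 0, 1, or 2, with 9 for missing). The input is the first three lines of the example above.

```python
def read_geno(lines):
    """Return a list of rows (one per SNP), each a list of ints (one per sample)."""
    return [[int(ch) for ch in line.strip()] for line in lines if line.strip()]

def ref_allele_freq(row):
    """Reference allele frequency at one SNP, ignoring missing (9) calls."""
    called = [g for g in row if g != 9]
    if not called:
        return None  # every sample is missing at this SNP
    # Assumes diploids: each called sample carries two allele copies.
    return sum(called) / (2 * len(called))

geno = read_geno([
    "222222222220",
    "220202222222",
    "000222222222",
])
freqs = [ref_allele_freq(row) for row in geno]
print(len(geno[0]), freqs)
```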

G-PhoCS *.gphocs

This is a full sequence based format that is very similar to the native ipyrad .loci format. It is appropriate for use with the Bayesian MCMC demographic inference program G-PhoCS: http://compgen.cshl.edu/GPhoCS/

499

locus0 10 90
A_0    CTACGATAGAGAAATCACTCTTTTCTTCAGGGSTAGACTCACACGGCGGCGCAATTGTCACGAAAGTAAACCAATAGTCACGT
B_0    CTACGATAGAGAAATCACTCTTTTCTTCAGGGGTAGACTCACACGGCGGCGCAATTGTCACGAAAGTAAACCAATAGTCACGT
C_0    CTACGATAGAGAAATCACTCTTTTCTTCAGGGGTAGACTCACACGGCGGCGCAATTGTCACGAAAGTAAACCAATAGTCACGT
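
The layout above can be read with a few lines of Python. The sketch assumes the structure suggested by the example: a first line giving the number of loci, then for each locus a header with the locus name, the number of samples, and the sequence length, followed by one line per sample. The function name read_gphocs and the toy input are made up for illustration.

```python
def read_gphocs(lines):
    """Parse G-PhoCS sequence-file lines into {locus: {sample: sequence}}."""
    rows = [ln.split() for ln in lines if ln.strip()]
    nloci = int(rows[0][0])
    loci = {}
    i = 1
    while i < len(rows):
        name, nsamples, length = rows[i][0], int(rows[i][1]), int(rows[i][2])
        block = rows[i + 1 : i + 1 + nsamples]
        seqs = {sample: seq for sample, seq in block}
        # Check each sequence against the length given in the locus header.
        for sample, seq in seqs.items():
            if len(seq) != length:
                raise ValueError(f"{name}/{sample}: expected {length} bases")
        loci[name] = seqs
        i += 1 + nsamples
    if len(loci) != nloci:
        raise ValueError(f"header says {nloci} loci, found {len(loci)}")
    return loci

# Toy file (not real ipyrad output): two loci, two samples each.
toy = [
    "2",
    "",
    "locus0 2 6",
    "A_0    CTACGA",
    "B_0    CTACGS",
    "",
    "locus1 2 4",
    "A_0    GGTA",
    "B_0    GGTA",
]
loci = read_gphocs(toy)
print(sorted(loci), loci["locus0"]["B_0"])
```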

STRUCTURE *.str & *.u.str

This is another SNP-based format that includes either all variable sites (*.str) or one randomly selected variable site per locus (*.u.str). These files are suitable input for the population structure analysis program STRUCTURE, as well as a few others. The output below is truncated for clarity.

1A_0                        3   3   0   2   2   1   2   2   2   2   3   3   0   1   3   1   3   0   ...
1A_0                        3   3   0   2   2   1   2   2   2   2   3   3   0   1   3   1   3   0   ...
1B_0                        3   3   0   2   2   1   2   2   2   2   3   3   0   1   3   1   3   0   ...
1B_0                        3   3   0   2   2   1   2   2   2   2   3   3   0   1   0   1   3   0   ...
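
In the example above each sample appears on two consecutive rows, which this Python sketch treats as the two allele copies of a diploid individual (an assumption based on the example). It groups rows by sample name and lists the sites where the two rows differ. The function names are hypothetical, and the input is the four truncated rows shown above with the trailing ellipses dropped.

```python
from collections import defaultdict

def read_structure(lines):
    """Group allele rows by sample name: {name: [row1, row2]}."""
    alleles = defaultdict(list)
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        name, values = parts[0], parts[1:]
        # Skip the "..." used to mark truncation in the documentation example.
        alleles[name].append([int(v) for v in values if v != "..."])
    return dict(alleles)

def heterozygous_sites(rows):
    """Return 0-based site indices where a sample's two rows differ."""
    first, second = rows
    return [i for i, (a, b) in enumerate(zip(first, second)) if a != b]

str_lines = [
    "1A_0   3   3   0   2   2   1   2   2   2   2   3   3   0   1   3   1   3   0",
    "1A_0   3   3   0   2   2   1   2   2   2   2   3   3   0   1   3   1   3   0",
    "1B_0   3   3   0   2   2   1   2   2   2   2   3   3   0   1   3   1   3   0",
    "1B_0   3   3   0   2   2   1   2   2   2   2   3   3   0   1   0   1   3   0",
]
data = read_structure(str_lines)
for name, rows in data.items():
    print(name, heterozygous_sites(rows))
```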

NEXUS *.nex

This is a NEXUS-formatted data file that contains all of the loci from the .loci file concatenated into a supermatrix, printed in an interleaved format, with missing data for any sample filled in with N’s and with data information added at the beginning. This format is used by BEAST, among other phylogenetic programs.

<TODO: Unimplemented>