Output Formats
By default, ipyrad writes every output format it is capable of
generating. Converting between the various formats is very fast, but
if you want to save CPU time and disk space, you can enable
only specific output formats with the output_formats parameter.
Variant Call Format *.vcf.gz
VCF is a standard format for storing and manipulating sequence variation data. The format is too complicated to describe here, but a good explanation is available on the 1000 Genomes Project site. The VCF output by ipyrad includes full genotype information for all bases in all loci, including information about genotype quality. Many useful conversion and filtering options for this format are available in the software vcftools.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1A_0 1B_0 1C_0 1D_0 2E_0 2F_0 2G_0 2H_0 3I_0 3J_0 3K_0 3L_0
0 0 . G . 13 PASS NS=12;DP=235 GT:CATG 0/0:0,0,0,19 0/0:0,0,0,22 0/0:0,0,0,20 0/0:0,0,0,19 0/0:0,0,0,18 0/0:0,0,0,22 0/0:0,0,0,20 0/0:0,0,0,21 0/0:0,0,0,15 0/0:0,0,0,14 0/0:0,0,0,24 0/0:0,0,0,21
0 1 . T . 13 PASS NS=12;DP=235 GT:CATG 0/0:0,0,19,0 0/0:0,0,22,0 0/0:0,0,20,0 0/0:0,0,19,0 0/0:0,0,18,0 0/0:0,0,22,0 0/0:0,0,20,0 0/0:0,0,21,0 0/0:0,0,15,0 0/0:0,0,14,0 0/0:0,0,24,0 0/0:0,0,21,0
0 2 . T . 13 PASS NS=12;DP=235 GT:CATG 0/0:0,0,19,0 0/0:0,0,22,0 0/0:0,0,20,0 0/0:0,0,19,0 0/0:0,0,18,0 0/0:0,0,22,0 0/0:0,0,19,1 0/0:0,0,21,0 0/0:0,0,15,0 0/0:0,0,14,0 0/0:0,0,24,0 0/0:0,0,21,0
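As a rough illustration of how the per-sample fields in the records above can be read, the sketch below splits one record on whitespace and totals the four values in each CATG entry. It assumes the four values are read depths for the bases C, A, T, and G in that order, which is consistent with the REF column of the records shown but is not stated explicitly here. The function name parse_record is hypothetical and is not part of ipyrad.

```python
# One data record copied from the example above. Real VCF files are
# tab-delimited; str.split() with no argument handles tabs and spaces alike.
record = (
    "0 0 . G . 13 PASS NS=12;DP=235 GT:CATG "
    "0/0:0,0,0,19 0/0:0,0,0,22 0/0:0,0,0,20 0/0:0,0,0,19 "
    "0/0:0,0,0,18 0/0:0,0,0,22 0/0:0,0,0,20 0/0:0,0,0,21 "
    "0/0:0,0,0,15 0/0:0,0,0,14 0/0:0,0,0,24 0/0:0,0,0,21"
)


def parse_record(line):
    """Split a VCF data line into its fixed columns and per-sample depths."""
    fields = line.split()
    chrom, pos, _id, ref, alt, qual, filt, info, fmt = fields[:9]
    keys = fmt.split(":")  # here: ["GT", "CATG"]
    samples = []
    for entry in fields[9:]:
        values = dict(zip(keys, entry.split(":")))
        # Assumption: CATG holds read depths in the order C, A, T, G.
        depths = dict(zip("CATG", map(int, values["CATG"].split(","))))
        samples.append({"GT": values["GT"], "depths": depths})
    # Keep only key=value INFO entries; flag-style entries are skipped.
    info_dict = dict(
        item.split("=", 1) for item in info.split(";") if "=" in item
    )
    return {"chrom": chrom, "pos": int(pos), "ref": ref,
            "info": info_dict, "samples": samples}


parsed = parse_record(record)
total_depth = sum(sum(s["depths"].values()) for s in parsed["samples"])
print(parsed["ref"], len(parsed["samples"]), total_depth)
```

For this record the summed per-sample depths equal the DP value in the INFO column (235).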
ipyrad format *.loci
This is a custom format that is easy to read, showing each individual locus
with variable sites indicated. Custom scripts can easily parse this file for
loci containing certain amounts of taxon coverage or variable sites. It is
also the most easily readable file for checking that your analyses are working
properly. A (-) indicates a variable site, and a (*) indicates the site is
phylogenetically informative. Integers enclosed by | indicate the locus
number. Example:
1A_0 GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
1B_0 GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
1C_0 GTTATCCGTAGCGATTATTACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
1D_0 GTTATCCGTAGCGATTATCACCTCAGTTAKATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGGCGGACGCAGCTAGTC
2E_0 GTTATCCGTAGCGATTATTACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
2F_0 GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGKACGCAGCTAGTC
2G_0 GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
2H_0 GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
3I_0 GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
3J_0 GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGSGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
3K_0 GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
3L_0 GTTATCGGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACACAGCTAGTC
// - * * - - - - - |0|
1A_0 ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
1B_0 ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAATAGGGCTCCATATCAAGTGATMAGCTAGGCTTCGAGTCGTATC
1C_0 ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
1D_0 ACAGCTCTGTTACATRCATCTGTCCATACTCCCTGGTTCGTAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
2E_0 ACAGCTCTATTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
2F_0 ACAGCTCTATTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
2G_0 ACAGCTCTATTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
2H_0 ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGYGATCAGCTAGGCTTCGAGTCGTATS
3I_0 ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
3J_0 ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
3K_0 ACAGCTCTGTTACATGCATCTGTCMATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
3L_0 ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTAYC
// * - - - * - - --|1|
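The kind of custom script mentioned above might look like the following minimal sketch. It reads sample lines until it reaches a line starting with //, counts the (-) and (*) markers on that line, and takes the locus number from between the | characters. The data in LOCI_TEXT is a hypothetical, shortened toy example, and parse_loci is an illustrative name rather than an ipyrad function.

```python
import re

# Hypothetical, shortened .loci content in the layout described above:
# sample lines, then a "//" line with SNP markers and the locus number.
LOCI_TEXT = """\
1A_0     GTTATCCGTA
1B_0     GTTATCCGTA
1C_0     GTTATTCGTA
1D_0     GTTATTCGTK
//            *   -|0|
1A_0     ACAGCTCTGT
1B_0     ACAGCTCTAT
//               -|1|
"""


def parse_loci(text):
    """Yield one dict per locus: number, sample sequences, marker counts."""
    samples = {}
    for line in text.splitlines():
        if line.startswith("//"):
            number = int(re.search(r"\|(\d+)\|", line).group(1))
            yield {
                "locus": number,
                "samples": samples,
                # Informative sites (*) are also variable, so count both.
                "variable": line.count("-") + line.count("*"),
                "informative": line.count("*"),
            }
            samples = {}
        elif line.strip():
            name, seq = line.split()
            samples[name] = seq


# Keep loci with at least 4 samples and at least 1 informative site.
loci = list(parse_loci(LOCI_TEXT))
kept = [loc["locus"] for loc in loci
        if len(loc["samples"]) >= 4 and loc["informative"] >= 1]
print(kept)
```

The sketch only counts the markers. A script that needs the position of each SNP would instead use the column offsets of the markers, which line up under the sequences in the real file.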
For paired-end data the two linked loci are shown separated by a ‘nnnn’ separator. Merged reads do not contain the ‘nnnn’:
1A0 GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTAnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
1B0 GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTAnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
1C0 GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGAAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
1D0 GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
2E0 GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTSnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCGGTATCCGACCT
2F0 GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCGGTATCCGACCT
2G0 GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTTTAAGATACCAAACCCTGTCCCAGCATTACGTCCCGGTATCCGACCT
2H0 GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACTTCCCGGTATCCGACCT
3I0 GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGACCCAGCATTACGTCCCTGTATCCGACCT
3J0 GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGACCCAGCATTACGTCCCTGTATCCGACCT
3K0 GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGACCCAGCATTACGTCCCTGTATCCGACCT
3L0 GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGAGACYAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
// - * - - - * - * |0|
1A0 GACAAATCTTACATTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
1B0 GACAAATCTTAGATTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
1C0 GACAAATCTTAGATTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTAATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
1D0 GACAAATCTTAGTTTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGAACCAAACGCAGGTGGAGGACCCAAGAAC
2E0 GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
2F0 GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
2G0 GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
2H0 GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
3I0 GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
3J0 GACAAATCTCAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
3K0 GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
3L0 GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
// - -- * - - |1|
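When processing paired-end loci in a script, the two parts can be recovered by splitting on the separator. The sketch below uses the 1A0 sequence from the first paired locus above. The helper name split_pair is hypothetical, and returning None for the second part of a merged read is a design choice of this example only.

```python
SEPARATOR = "nnnn"

# Sequence for sample 1A0, copied from the first paired locus above.
seq = (
    "GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGG"
    "ACTGGGAGGTGCTATTA"
    "nnnn"
    "ACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT"
)


def split_pair(sequence, sep=SEPARATOR):
    """Return (first, second); second is None when no separator is present."""
    first, found, second = sequence.partition(sep)
    return (first, second) if found else (first, None)


first, second = split_pair(seq)
print(first[-7:], second[:8])
```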
PHYLIP *.phy
This is a PHYLIP-formatted data file that contains all of the loci from the .loci file concatenated into a supermatrix, with missing data for any sample filled in with N’s. This format is used by RAxML, among other phylogenetic programs. The header here indicates that there are 12 samples and 89023 bases in each sequence. Because the sequences are so long, the output is truncated here for clarity (indicated by the ellipses).
12 89023
1A_0 GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTCACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAAT...
1B_0 GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTCACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAAT...
1C_0 GTTATCCGTAGCGATTATTACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTCACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAAT...
1D_0 GTTATCCGTAGCGATTATCACCTCAGTTAKATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGGCGGACGCAGCTAGTCACAGCTCTGTTACATRCATCTGTCCATACTCCCTGGTTCGTAATCAT...
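A small reader for this layout can check the header against the matrix and report how much of each sample is missing data. This is a sketch that assumes one line per sample (name, whitespace, sequence), as in the example above. PHY_TEXT is hypothetical toy data and read_phylip is an illustrative name.

```python
# Hypothetical toy supermatrix: 3 samples, 12 sites, N marks missing data.
PHY_TEXT = """\
3 12
1A_0      GTTATCCGTAGC
1B_0      GTTATCCGTAGC
1C_0      NNNNNNCGTAGC
"""


def read_phylip(text):
    """Parse a one-line-per-sample PHYLIP file and validate its header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ntaxa, nsites = map(int, lines[0].split())
    matrix = {}
    for ln in lines[1:]:
        name, seq = ln.split()
        if len(seq) != nsites:
            raise ValueError(f"{name}: expected {nsites} sites, got {len(seq)}")
        matrix[name] = seq
    if len(matrix) != ntaxa:
        raise ValueError(f"expected {ntaxa} samples, got {len(matrix)}")
    return matrix


matrix = read_phylip(PHY_TEXT)
# Fraction of each sample's sequence that is missing (filled with N).
missing = {name: seq.count("N") / len(seq) for name, seq in matrix.items()}
print(missing)
```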
*.snps.phy & *.u.snps.phy
Additionally, we provide two different PHYLIP-formatted files that
include only variable sites (SNPs). Paired loci are treated as a single
locus, meaning SNPs from the two reads are not separated in these files
(they’re linked). The *.snps.phy file contains all SNPs from all
loci concatenated together, with missing values filled in with N’s. The
*.u.snps.phy file contains one SNP sampled from each locus. If a locus has
multiple SNPs, the SNP site with the least missing data across
taxa is sampled; if several sites have equal amounts of missing data, one of
them is sampled at random. The header indicates this file contains 12 samples and 990
bases per sample. The output below is truncated for clarity.
12 990
1A_0 GAATGACATCCTCAAACACCCTGGATACGGACAACGAAATTGCACTCATCAGACAAAGAAATTACWGAGGAACCCATGAGAGACCGCCTYCARYA...
1B_0 GAAASRCATACTCAAACACCCTKGATACGGACAACGAAATTGCACTCATCAGACAAAGAAATTACAGAGGAACCCAAGAGAGACCGCCTTCAATA...
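The sampling rule described above (least missing data, ties broken at random) can be sketched as follows. This illustrates the rule as stated and is not ipyrad's own implementation. The toy matrix, the locus_columns grouping, and the function name sample_unlinked are all hypothetical.

```python
import random


def sample_unlinked(snps, locus_columns, rng):
    """Pick one SNP column per locus: fewest N's, ties broken at random."""
    chosen = []
    for columns in locus_columns:
        missing = {
            c: sum(seq[c] == "N" for seq in snps.values()) for c in columns
        }
        fewest = min(missing.values())
        best = [c for c in columns if missing[c] == fewest]
        chosen.append(rng.choice(best))
    return chosen


# Hypothetical toy SNP matrix: 3 samples, 5 SNPs drawn from two loci.
snps = {
    "1A_0": "GANTC",
    "1B_0": "CTGTC",
    "1C_0": "NTACG",
}
# Columns 0-2 belong to the first locus, columns 3-4 to the second.
locus_columns = [[0, 1, 2], [3, 4]]

chosen = sample_unlinked(snps, locus_columns, random.Random(0))
unlinked = {name: "".join(seq[c] for c in chosen) for name, seq in snps.items()}
print(chosen, unlinked)
```

In the first locus only column 1 has no missing data, so it is always selected. In the second locus both columns are complete, so the choice between them depends on the random generator.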
MAP/PARTITION (*.snps.map)
Because the concatenated SNPs file does not include information about which SNPs come from which locus, we provide a map file with this information. This is used by the program tetrad to randomly sample single SNPs from among loci.
1 rad0_snp0 0 1
1 rad0_snp1 0 2
1 rad0_snp2 0 3
1 rad0_snp3 0 4
1 rad0_snp4 0 5
2 rad1_snp0 0 6
2 rad1_snp1 0 7
2 rad1_snp2 0 8
2 rad1_snp3 0 9
3 rad2_snp0 0 10
3 rad2_snp1 0 11
3 rad2_snp2 0 12
3 rad2_snp3 0 13
3 rad2_snp4 0 14
3 rad2_snp5 0 15
3 rad2_snp6 0 16
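To show how such a file can drive per-locus sampling, the sketch below groups the rows above by locus and draws one SNP from each group. The column interpretation (1-based locus index, SNP name, a constant 0, and 1-based column in the concatenated SNP matrix) is inferred from the example and should be treated as an assumption. This is not tetrad's code.

```python
import random
from collections import defaultdict

# Rows copied from the example map file above.
MAP_TEXT = """\
1 rad0_snp0 0 1
1 rad0_snp1 0 2
1 rad0_snp2 0 3
1 rad0_snp3 0 4
1 rad0_snp4 0 5
2 rad1_snp0 0 6
2 rad1_snp1 0 7
2 rad1_snp2 0 8
2 rad1_snp3 0 9
3 rad2_snp0 0 10
3 rad2_snp1 0 11
3 rad2_snp2 0 12
3 rad2_snp3 0 13
3 rad2_snp4 0 14
3 rad2_snp5 0 15
3 rad2_snp6 0 16
"""


def read_map(text):
    """Group (snp_name, column) pairs by locus index."""
    by_locus = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        locus, name, _unused, column = line.split()
        by_locus[int(locus)].append((name, int(column)))
    return dict(by_locus)


by_locus = read_map(MAP_TEXT)
counts = {locus: len(snps) for locus, snps in by_locus.items()}
print(counts)  # {1: 5, 2: 4, 3: 7}

# Draw one SNP per locus; a fixed seed keeps the example repeatable.
rng = random.Random(42)
picked = {locus: rng.choice(snps) for locus, snps in by_locus.items()}
print(picked)
```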
EIGENSTRAT *.geno & *.u.geno
This is a SNP-based format. Each line corresponds to one SNP, with one column per
sample. The value in the sample column indicates the number of copies of the
reference allele each individual has; 9 indicates missing data. Below you will
see standard .geno output from the simulated data, so there are 12
columns, one per sample. This format is used by EIGENSTRAT, SMARTPCA, and
ADMIXTURE, among other programs.
There is an additional *.u.geno file that includes only unlinked
SNPs, with one SNP randomly chosen per locus and the rest ignored.
222222222220
220202222222
000222222222
222122222222
222222222122
222022222222
222221222222
222222222220
222200022222
222122222222
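Because each character is a count of reference alleles, simple summaries such as the per-SNP reference allele frequency can be computed directly from the rows above. The sketch below assumes diploid samples (two allele copies per individual) and skips the missing-data code 9. The function name ref_allele_frequency is hypothetical.

```python
# Rows copied from the example .geno output above (one SNP per row,
# one character per sample).
GENO_LINES = [
    "222222222220",
    "220202222222",
    "000222222222",
    "222122222222",
    "222222222122",
    "222022222222",
    "222221222222",
    "222222222220",
    "222200022222",
    "222122222222",
]

MISSING = 9


def ref_allele_frequency(row):
    """Reference allele frequency for one SNP, ignoring missing calls."""
    calls = [int(ch) for ch in row if int(ch) != MISSING]
    if not calls:
        return None  # every sample is missing at this SNP
    # Assumption: diploid samples, so each call covers two allele copies.
    return sum(calls) / (2 * len(calls))


freqs = [ref_allele_frequency(row) for row in GENO_LINES]
for row, freq in zip(GENO_LINES, freqs):
    print(row, round(freq, 3))
```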
G-PhoCS *.gphocs
This is a full-sequence format that is very similar to the native ipyrad .loci format. It is appropriate for use with the Bayesian MCMC demographic inference program G-PhoCS: http://compgen.cshl.edu/GPhoCS/
499
locus0 10 90
A_0 CTACGATAGAGAAATCACTCTTTTCTTCAGGGSTAGACTCACACGGCGGCGCAATTGTCACGAAAGTAAACCAATAGTCACGT
B_0 CTACGATAGAGAAATCACTCTTTTCTTCAGGGGTAGACTCACACGGCGGCGCAATTGTCACGAAAGTAAACCAATAGTCACGT
C_0 CTACGATAGAGAAATCACTCTTTTCTTCAGGGGTAGACTCACACGGCGGCGCAATTGTCACGAAAGTAAACCAATAGTCACGT
STRUCTURE *.str & *.u.str
This is another SNP-based format that includes either all variable
sites (*.str) or one randomly selected variable site per locus
(*.u.str). These files are suitable input for the population
structure analysis program STRUCTURE, as well as a few others. The output
below is truncated for clarity.
1A_0 3 3 0 2 2 1 2 2 2 2 3 3 0 1 3 1 3 0 ...
1A_0 3 3 0 2 2 1 2 2 2 2 3 3 0 1 3 1 3 0 ...
1B_0 3 3 0 2 2 1 2 2 2 2 3 3 0 1 3 1 3 0 ...
1B_0 3 3 0 2 2 1 2 2 2 2 3 3 0 1 0 1 3 0 ...
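In the rows above each sample name appears twice, once per allele copy of a diploid individual. The sketch below groups rows by sample name and lists the sites where the two rows differ (heterozygous sites). It uses only the 18 values visible before the ellipses, and the function names are hypothetical.

```python
# Rows copied from the truncated example above (ellipses dropped).
STR_TEXT = """\
1A_0 3 3 0 2 2 1 2 2 2 2 3 3 0 1 3 1 3 0
1A_0 3 3 0 2 2 1 2 2 2 2 3 3 0 1 3 1 3 0
1B_0 3 3 0 2 2 1 2 2 2 2 3 3 0 1 3 1 3 0
1B_0 3 3 0 2 2 1 2 2 2 2 3 3 0 1 0 1 3 0
"""


def read_structure(text):
    """Group allele rows by sample name (two rows per diploid sample)."""
    rows = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, *alleles = line.split()
        rows.setdefault(name, []).append([int(a) for a in alleles])
    return rows


def heterozygous_sites(allele_rows):
    """Return 0-based indices where a sample's two allele rows differ."""
    first, second = allele_rows
    return [i for i, (a, b) in enumerate(zip(first, second)) if a != b]


genotypes = read_structure(STR_TEXT)
het = {name: heterozygous_sites(rows) for name, rows in genotypes.items()}
print(het)  # {'1A_0': [], '1B_0': [14]}
```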
NEXUS *.nex
This is a NEXUS-formatted data file that contains all of the loci from the .loci file concatenated into a supermatrix, but printed in an interleaved format, with missing data for any sample filled in with N’s, and with data information added at the beginning. This format is used by BEAST, among other phylogenetic programs.
<TODO: Unimplemented>