ipyrad-analysis toolkit: tetrad

tetrad is a species tree inference tool based on the SVDQuartets algorithm of Chifman and Kubatko. It uses the theory of phylogenetic invariants to resolve quartet trees from SNPs for all sets of quartets in a larger tree, and then joins the quartets together into a supertree using QMC. Here I demonstrate how to call tetrad from the ipyrad-analysis tools, which is convenient for keeping all of your analyses in jupyter notebooks. Alternatively, you could also call tetrad as a command-line tool (see the tetrad docs).

The following features of tetrad are highlighted:

Parallelization: tetrad can be massively parallelized on an HPC cluster. It approximately scales linearly with the number of available cores. Using the flexible ipyparallel backend, you can even parallelize over multiple nodes using MPI ([more details coming soon]).

Bootstrap sampling: In contrast to SVDquartets, tetrad is designed particularly to work well with RAD-seq data, though it can apply to any SNP data set. Bootstrap replicates resample the number of loci with replacement to the same size as the original data set.

SNP subsampling: The underlying model assumes that the examined SNPs are unlinked, and tetrad uses a very efficient method to maximize the number of unlinked SNPs used in the analysis. A very crude way to achieve unlinked SNPs would be to sample one SNP per locus before beginning the analysis. However, a site that is variable in the context of all your samples is not necessarily informative for any given quartet of samples. Instead, tetrad subsamples a single SNP from every locus separately for every quartet set in the analysis, and repeats this independently in every bootstrap replicate. This way the maximum amount of unlinked SNP information is used in every quartet inference.

Required software

[1]:
# conda install ipyrad -c bioconda
# conda install tetrad -c eaton-lab -c conda-forge
[2]:
import ipyrad.analysis as ipa
import toytree

Input data

The input data file should be the .snps.hdf5 file produced by ipyrad (v.0.9.13 or newer). If you did not assemble your data in ipyrad then you can convert your SNPs data to this format from any VCF using the vcf_to_hdf5 tool.

[3]:
# the path to your sequence data in HDF5 format
data = "/home/deren/Documents/virentes-reference/analysis-ipyrad/ref_min4_outfiles/ref_min4.snps.hdf5"

Initialize the analysis object

Here you can enter parameters for the analysis. By default only a subsample of the total quartets will be inferred. To instead infer all quartets simply enter a very large number for the nquartets parameter and it will use the maximum. When you initialize the object it will print the size of your dataset, the number of loci, and the number of quartets. The number of loci is of interest because this is the maximum number of SNPs that will be used in an analysis.

[7]:
# init analysis object with input data and (optional) parameter options
tet = ipa.tetrad(
    name="virentes-min4",
    data=data,
    nquartets=1e6,
    nboots=16,
)
loading snps array [37 taxa x 1182005 snps]
max unlinked SNPs per quartet [nloci]: 88938
quartet sampler [full]: 66045 / 66045

Call run

[8]:
tet.run(auto=True)
Parallel connection | d9118d19223a: 40 cores
initializing quartet sets database
[####################] 100% 0:01:48 | inferring full tree * | mean SNPs/quartet: 78381
[####################] 100% 0:01:32 | bootstrap inference 1 | mean SNPs/quartet: 77664
[####################] 100% 0:01:33 | bootstrap inference 2 | mean SNPs/quartet: 78121
[####################] 100% 0:01:33 | bootstrap inference 3 | mean SNPs/quartet: 78356
[####################] 100% 0:01:33 | bootstrap inference 4 | mean SNPs/quartet: 78399
[####################] 100% 0:01:32 | bootstrap inference 5 | mean SNPs/quartet: 77797
[####################] 100% 0:01:33 | bootstrap inference 6 | mean SNPs/quartet: 78836
[####################] 100% 0:01:32 | bootstrap inference 7 | mean SNPs/quartet: 78550
[####################] 100% 0:01:32 | bootstrap inference 8 | mean SNPs/quartet: 77943
[####################] 100% 0:01:32 | bootstrap inference 9 | mean SNPs/quartet: 77753
[####################] 100% 0:01:33 | bootstrap inference 10 | mean SNPs/quartet: 78646
[####################] 100% 0:01:33 | bootstrap inference 11 | mean SNPs/quartet: 78200
[####################] 100% 0:01:32 | bootstrap inference 12 | mean SNPs/quartet: 78073
[####################] 100% 0:01:32 | bootstrap inference 13 | mean SNPs/quartet: 78056
[####################] 100% 0:01:33 | bootstrap inference 14 | mean SNPs/quartet: 78491
[####################] 100% 0:01:32 | bootstrap inference 15 | mean SNPs/quartet: 78709
[####################] 100% 0:01:32 | bootstrap inference 16 | mean SNPs/quartet: 78777

Show the full tree with bootstrap supports

[22]:
tre = toytree.tree(tet.trees.tree).root(["HE", "NI"])
tre.draw(node_labels="support", use_edge_lengths=False);
FLSF33SCCU3FLBA140TXWV2LALC2FLSA185FLCK216FLMO62FLSF54FLAB109FLWO6FLCK18FLSF47HNDA09CRL0030BZBB1MXSA3017CUVN10CRL0001CUCA4CUSV6CUMM5BJVL19BJSL25BJSB3MXED8MXGT4TXMD3TXGR3ENARDODUreferenceCHHENI1810018251003737754343621001001006837100100100128110010081100100100100100100100100100100100100

Show the majority-rule consensus tree with bootstrap supports

[23]:
tre = toytree.tree(tet.trees.cons).root(["HE", "NI"])
tre.draw(node_labels="support", use_edge_lengths=False);
FLSF33LALC2FLBA140SCCU3TXWV2FLCK216FLMO62FLSA185FLWO6FLSF54FLAB109FLCK18FLSF47CRL0030HNDA09BZBB1MXSA3017CRL0001CUVN10CUCA4CUSV6CUMM5BJVL19BJSL25BJSB3MXED8MXGT4TXGR3TXMD3ENARDODUreferenceCHNIHE3825381003862751005662100100100563810010010069318110010081100100100100100100100100100100100100

Show variation over the bootstrap replicates

[21]:
mtre = toytree.mtree(tet.trees.boots)
mtre.treelist = [i.root(["HE", "NI"]) for i in mtre.treelist]
mtre.draw_cloud_tree(
    height=600,
    width=400,
    use_edge_lengths=False,
    html=True,
);
FLBA140LALC2FLSF33SCCU3TXWV2FLMO62FLCK216FLSA185FLWO6FLCK18FLSF54FLAB109FLSF47HNDA09CRL0030BZBB1MXSA3017CRL0001CUVN10CUSV6CUCA4CUMM5BJVL19BJSB3BJSL25MXED8MXGT4TXGR3TXMD3DODUENARreferenceCHHENI

Continuing from a checkpoint

If you want to run more bootstrap replicates you can simply reinit an analysis object with the same name and set the number of bootstrap replicates to a higher value and call .run() again. If you try calling .run() again and you have already completed all of the requested results then it will simply print that it is finished. If you want it to overwrite existing results with the same name then you can use the force arg in run.

[24]:
# analysis is finished so it will not run
tet.run()
16 bootstrap result trees already exist for virentes-min4.

Here I set the number of requested bootstrap replicates to 20 and call .run() again. You can see that the analysis continues from 17, since we already completed 16 bootstrap replicates earlier, and will go until it completes 20 bootstraps.

[26]:
# increase nboots and continue from existing analysis object
tet.params.nboots = 20
tet.run(auto=True)
Parallel connection | d9118d19223a: 40 cores
[####################] 100% 0:01:49 | bootstrap inference 17 | mean SNPs/quartet: 78350
[####################] 100% 0:01:34 | bootstrap inference 18 | mean SNPs/quartet: 77653
[####################] 100% 0:01:32 | bootstrap inference 19 | mean SNPs/quartet: 78560
[####################] 100% 0:01:32 | bootstrap inference 20 | mean SNPs/quartet: 78380

Alternatively, maybe you are returning to this analysis after a while and decide you want to do more bootstraps. You can re-load the analysis object by entering the same name and working_dir as in the original analysis, and adding the load=True argument. I set the number of bootstraps to 25 now. This will load the results from before and add new results when you call .run().

[29]:
# # re-init analysis object (will load existing results at this name)
# tet = ipa.tetrad(
#     name="virentes-min4",
#     data=data,
#     nquartets=1e6,
#     nboots=25,
#     load=True,
# )
# tet.run(auto=True)