ipyrad-analysis toolkit: abba-baba

The baba tool can be used to measure abba-baba statistics across many different hypotheses on a tree, to easily group individuals into populations for measuring abba-baba using allele frequencies, and to summarize or plot the results of many analyses.

Load packages

[4]:

import ipyrad.analysis as ipa
import ipyparallel as ipp
import toytree
import toyplot

[2]:

print(ipa.__version__)
print(toyplot.__version__)
print(toytree.__version__)

0.9.51
0.18.0
1.1.2

Set up and connect to the ipyparallel cluster

Depending on the number of tests, abba-baba analysis can be computationally intensive, so we will first set up a clustering backend and attach to it.

[14]:

# In a terminal on your computer you must launch the ipcluster instance by hand, like this:
# `ipcluster start -n 40 --cluster-id="baba" --daemonize`

# Now you can create a client for the running ipcluster
ipyclient = ipp.Client(cluster_id="baba")

# How many cores are you attached to?
len(ipyclient)

[14]:

A tree-based hypothesis

An abba-baba test is explicitly tree-based, so the baba tool takes a tree hypothesis in the form of a newick file and uses it to auto-generate hypotheses to test.

Load in your .loci data file and a tree hypothesis

We are going to use the shape of our tree topology hypothesis to generate the 4-taxon tests to perform. Therefore, we start by looking at our tree and making sure it is properly rooted.

[5]:

## ipyrad and raxml output files
locifile = "./analysis-ipyrad/pedic_outfiles/pedic.loci"
newick = "./analysis-raxml/RAxML_bipartitions.pedic"

[6]:

## parse the newick tree, re-root it, and plot it.
rtre = toytree.tree(newick).root(wildcard="prz")
rtre.draw(
    height=350,
    width=400,
    node_labels=rtre.get_node_values("support")
    )

## store rooted tree back into a newick string.
newick = rtre.write()

Short tutorial: calculating abba-baba statistics

To give the gist of what this code can do, here is a quick tutorial version, each step of which we explain in greater detail below. We first create a 'baba' analysis object that is linked to our data file; in this example we name the variable bb. Then we tell it which tests to perform, here by automatically generating a number of tests using the generate_tests_from_tree() function. Finally, we calculate the results and plot them.

[7]:

## create a baba object linked to a data file and newick tree
bb = ipa.baba(data=locifile, newick=newick)

[ ]:

## generate all possible abba-baba tests meeting a set of constraints
bb.generate_tests_from_tree(
    constraint_dict={
        "p4": ["32082_przewalskii", "33588_przewalskii"],
        "p3": ["33413_thamno"],
    })

[8]:

## show the first 3 tests
bb.tests[:3]

[8]:

[{'p1': ['41478_cyathophylloides'],
  'p2': ['29154_superba', '30686_cyathophylla'],
  'p3': ['33413_thamno'],
  'p4': ['32082_przewalskii', '33588_przewalskii']},
 {'p1': ['41954_cyathophylloides'],
  'p2': ['29154_superba', '30686_cyathophylla'],
  'p3': ['33413_thamno'],
  'p4': ['32082_przewalskii', '33588_przewalskii']},
 {'p1': ['41478_cyathophylloides'],
  'p2': ['29154_superba'],
  'p3': ['33413_thamno'],
  'p4': ['32082_przewalskii', '33588_przewalskii']}]

[9]:

## run all tests linked to bb
bb.run(ipyclient)

[####################] 100%  calculating D-stats  | 0:02:58 |

[10]:

## show first 5 results
bb.results_table.head()

[10]:

	dstat	bootmean	bootstd	Z	ABBA	BABA	nloci
0	0.089	0.089	0.036	2.485	436.125	365.000	8721
1	0.096	0.098	0.037	2.640	425.062	350.250	8267
2	0.101	0.102	0.044	2.301	329.938	269.375	6573
3	0.114	0.114	0.043	2.623	319.312	254.125	6255
4	0.124	0.124	0.039	3.188	400.250	312.188	8026

Look at the results

By default we do not attach the names of the samples included in each test to the results table, since doing so makes the table much harder to read. However, this information is readily available in the .tests attribute of the baba object, and as a data frame in the .taxon_table attribute, as shown below. We have also made plotting functions that show this information clearly.

[11]:

## save all results table to a tab-delimited CSV file
bb.results_table.to_csv("bb.abba-baba.csv", sep="\t")

## show the results table sorted by Z-score (largest first)
sorted_results = bb.results_table.sort_values(by="Z", ascending=False)
sorted_results.head()

[11]:

	dstat	bootmean	bootstd	Z	ABBA	BABA	nloci
16	0.290	0.290	0.030	9.531	606.312	333.812	8937
15	0.239	0.238	0.028	8.492	608.281	373.365	9266
17	0.199	0.199	0.032	6.311	550.062	367.312	9033
19	0.204	0.205	0.033	6.120	545.375	360.938	8925
20	0.160	0.161	0.030	5.383	499.766	362.047	9351

[12]:

## get taxon names in the sorted results order
sorted_taxa = bb.taxon_table.iloc[sorted_results.index]

## show taxon names in the first few sorted tests
sorted_taxa.head()

[12]:

	p1	p2	p3	p4
16	[35236_rex]	[30556_thamno]	[33413_thamno]	[32082_przewalskii, 33588_przewalskii]
15	[35236_rex, 39618_rex, 38362_rex]	[30556_thamno]	[33413_thamno]	[32082_przewalskii, 33588_przewalskii]
17	[39618_rex, 38362_rex]	[30556_thamno]	[33413_thamno]	[32082_przewalskii, 33588_przewalskii]
19	[38362_rex]	[30556_thamno]	[33413_thamno]	[32082_przewalskii, 33588_przewalskii]
20	[35236_rex]	[40578_rex, 35855_rex]	[33413_thamno]	[32082_przewalskii, 33588_przewalskii]

Plotting and interpreting results

Interpreting the results of D-statistic tests is complicated. You cannot treat every test as if it were independent, because introgression between one pair of species may cause one or both of those species to appear as if they have also introgressed with other taxa in your data set. This problem is described in detail in Eaton et al. (2015). A good place to start, then, is to perform many tests and focus on those that have the strongest signal of admixture. Then perform additional tests, such as partitioned D-statistics (described further below), to tease apart whether a single introgression event or multiple events are likely to have occurred.

In the example plot below we find evidence of admixture between the sample 33413_thamno (black) and several other samples, but the signal is strongest with respect to 30556_thamno (tests 12-19). It also appears that admixture is consistently detected with samples of (40578_rex & 35855_rex) when contrasted against 35236_rex (tests 20, 24, 28, 34, and 35). Note that the tests are indexed starting at 0.

[13]:

## plot results on the tree
bb.plot(height=850, width=700, pct_tree_y=0.2, pct_tree_x=0.5, alpha=4.0);

Generating tests

Because tests are generated from a tree file, only tests that fit the topology of the tree will be generated. For example, the entries below generate zero possible tests because the two samples entered for P3 (the two thamnophila subspecies) are paraphyletic on the tree topology, and therefore cannot form a clade together.

[14]:

## this is expected to generate zero tests
aa = bb.copy()
aa.generate_tests_from_tree(
    constraint_dict={
        "p4": ["32082_przewalskii", "33588_przewalskii"],
        "p3": ["33413_thamno", "30556_thamno"],
    })

0 tests generated from tree
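The clade requirement can be illustrated with a small sketch. The toy tree below is hypothetical (nested tuples with made-up tip names, not the Pedicularis tree), and `tips` and `is_clade` are illustrative helpers, not ipyrad functions. It only shows why a set of samples that does not form a clade on the tree cannot be used together as a single tip.

```python
def tips(node):
    """Return the set of tip names below a node (a plain string is a tip)."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= tips(child)
    return found


def is_clade(tree, names):
    """Return True if some node in the tree has exactly `names` as its tips."""
    names = set(names)
    if tips(tree) == names:
        return True
    if isinstance(tree, str):
        return False
    return any(is_clade(child, names) for child in tree)


# hypothetical rooted tree: the two 'thamno' tips are NOT each other's
# closest relatives, so they cannot be grouped into one tip of a test
toy_tree = (((("rex_a", "rex_b"), "thamno_a"), "thamno_b"), "prz")

print(is_clade(toy_tree, ["rex_a", "rex_b"]))        # True
print(is_clade(toy_tree, ["thamno_a", "thamno_b"]))  # False
```

In this toy case any constraint that asks for both thamno samples in one tip matches no node, which mirrors the zero-test result above.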

If you want to get results for a test that does not fit on your tree, you can always write the test out by hand instead of auto-generating it from the tree. Doing it this way is fine when you have few tests to run, but becomes burdensome when writing many tests.

[15]:

## writing tests by hand for a new object
aa = bb.copy()
aa.tests = [
    {"p4": ["32082_przewalskii", "33588_przewalskii"],
     "p3": ["33413_thamno", "30556_thamno"],
     "p2": ["40578_rex", "35855_rex"],
     "p1": ["39618_rex", "38362_rex"]},
    {"p4": ["32082_przewalskii", "33588_przewalskii"],
     "p3": ["33413_thamno", "30556_thamno"],
     "p2": ["40578_rex", "35855_rex"],
     "p1": ["35236_rex"]},
    ]
## run the tests
aa.run(ipyclient)
aa.results_table

[####################] 100%  calculating D-stats  | 0:00:23 |

[15]:

	dstat	bootmean	bootstd	Z	ABBA	BABA	nloci
0	0.050	0.050	0.022	2.291	939.172	850.500	15820
1	0.163	0.163	0.021	7.900	984.953	708.797	15576

Further investigating results with 5-part tests

You can also perform partitioned D-statistic tests, as shown below. Here we are testing the direction of introgression. If the two thamnophila subspecies are in fact sister species, then they are expected to share derived alleles that arose in their common ancestor, and these shared alleles would be introduced together if either one of them introgressed into a P. rex taxon. As you can see, test 0 shows no evidence of introgression, whereas test 1 shows that the two thamno subspecies share introgressed alleles that are present in two samples of rex relative to sample “35236_rex”.

More on this further below in this notebook.

[16]:

## further investigate with a 5-part test
cc = bb.copy()
cc.tests = [
    {"p5": ["32082_przewalskii", "33588_przewalskii"],
     "p4": ["33413_thamno"],
     "p3": ["30556_thamno"],
     "p2": ["40578_rex", "35855_rex"],
     "p1": ["39618_rex", "38362_rex"]},
    {"p5": ["32082_przewalskii", "33588_przewalskii"],
     "p4": ["33413_thamno"],
     "p3": ["30556_thamno"],
     "p2": ["40578_rex", "35855_rex"],
     "p1": ["35236_rex"]},
    ]
cc.run(ipyclient)

[####################] 100%  calculating D-stats  | 0:00:23 |

[17]:

## the partitioned D results for two tests
cc.results_table

[17]:

		Dstat	bootmean	bootstd	Z	ABxxA	BAxxA	nloci
0	p3	-0.037	-0.035	0.041	0.885	230.852	248.352	8933
	p4	0.044	0.044	0.053	0.840	160.125	146.531	8933
	shared	0.020	0.020	0.025	0.801	449.754	431.895	8933
1	p3	0.176	0.178	0.046	3.862	252.953	177.109	8840
	p4	0.135	0.134	0.052	2.612	159.172	121.266	8840
	shared	0.177	0.177	0.025	7.060	514.859	359.703	8840

[17]:

## and view the 5-part test taxon table
cc.taxon_table

[17]:

	p1	p2	p3	p4	p5
0	[39618_rex, 38362_rex]	[40578_rex, 35855_rex]	[30556_thamno]	[33413_thamno]	[32082_przewalskii, 33588_przewalskii]
1	[35236_rex]	[40578_rex, 35855_rex]	[30556_thamno]	[33413_thamno]	[32082_przewalskii, 33588_przewalskii]

Full Tutorial

Creating a `baba` object

The fundamental object for running abba-baba tests is the ipa.baba() object. It stores all of the information about the data, tests, and results of your analysis, and is used to generate plots. If you only have one data file that you want to run many tests on, then you only need to enter the path to your data once. The data file must be a '.loci' file from an ipyrad analysis. In general, you will probably want to use the largest data file possible for these tests (min_samples_locus=4), to maximize the amount of data available for any test. Once an initial baba object is created you can create copies of that object that inherit its parameter settings, and use them to perform different tests, as shown below.

[19]:

## create an initial object linked to your data in 'locifile'
aa = ipa.baba(data=locifile)

## create two other copies
bb = aa.copy()
cc = aa.copy()

## print these objects
print(aa)
print(bb)
print(cc)

<ipyrad.analysis.baba.Baba object at 0x7fc55634a8d0>
<ipyrad.analysis.baba.Baba object at 0x7fc55634ab50>
<ipyrad.analysis.baba.Baba object at 0x7fc55634a110>

Linking tests to the baba object

The next thing we need to do is to link a 'test' to each of these objects, or a list of tests. In the Short tutorial above we auto-generated a list of tests from an input tree, but to be more explicit about how things work we will write out each test by hand here. A test is described by a Python dictionary that tells it which samples (individuals) should represent the ‘p1’, ‘p2’, ‘p3’, and ‘p4’ taxa in the ABBA-BABA test. You can see in the example below that we set two samples to represent the outgroup taxon (p4). This means that the SNP frequency for those two samples combined will represent the p4 taxon. For the baba object named 'cc' below we enter two tests using a list to show how multiple tests can be linked to a single baba object.

[20]:

aa.tests = {
    "p4": ["32082_przewalskii", "33588_przewalskii"],
    "p3": ["29154_superba"],
    "p2": ["33413_thamno"],
    "p1": ["40578_rex"],
}

bb.tests = {
    "p4": ["32082_przewalskii", "33588_przewalskii"],
    "p3": ["30686_cyathophylla"],
    "p2": ["33413_thamno"],
    "p1": ["40578_rex"],
}

cc.tests = [
    {
     "p4": ["32082_przewalskii", "33588_przewalskii"],
     "p3": ["41954_cyathophylloides"],
     "p2": ["33413_thamno"],
     "p1": ["40578_rex"],
    },
    {
     "p4": ["32082_przewalskii", "33588_przewalskii"],
     "p3": ["41478_cyathophylloides"],
     "p2": ["33413_thamno"],
     "p1": ["40578_rex"],
    },
]
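Because tests are plain dictionaries, they are easy to sanity-check before running. The sketch below is a hypothetical helper (`check_test` is not part of ipyrad) that assumes a test must have tips named p1 to p4 (or p1 to p5), each given as a non-empty list of sample names, with no sample assigned to more than one tip.

```python
def check_test(test):
    """Validate a test dictionary and return its number of tips.

    Hypothetical helper for illustration only; not an ipyrad function.
    """
    ntips = len(test)
    if ntips not in (4, 5):
        raise ValueError("a test needs 4 or 5 tips, found {}".format(ntips))

    # tips must be named p1..p4 (or p1..p5 for a partitioned test)
    expected = {"p{}".format(i) for i in range(1, ntips + 1)}
    if set(test) != expected:
        raise ValueError("tip names must be {}".format(sorted(expected)))

    # every tip is a non-empty list, and no sample is used twice
    seen = set()
    for tip in sorted(test):
        samples = test[tip]
        if not isinstance(samples, list) or not samples:
            raise ValueError("{} must be a non-empty list".format(tip))
        for name in samples:
            if name in seen:
                raise ValueError("{} appears in more than one tip".format(name))
            seen.add(name)
    return ntips


example = {
    "p4": ["32082_przewalskii", "33588_przewalskii"],
    "p3": ["29154_superba"],
    "p2": ["33413_thamno"],
    "p1": ["40578_rex"],
}
print(check_test(example))  # 4
```

A check like this catches a common slip, such as passing a bare string instead of a list for a single-sample tip.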

Other parameters

Each baba object has a set of parameters associated with it that are used to filter the loci that will be used in the test and to set some other optional settings. If the 'mincov' parameter is set to 1 (the default) then loci in the data set will only be used in a test if there is at least one sample from every tip of the tree that has data for that locus. For example, in the tests above where we entered two samples to represent “p4” only one of those two samples needs to be present for the locus to be included in our analysis. If you want to require that both samples have data at the locus in order for it to be included in the analysis then you could set mincov=2. However, for the test above setting mincov=2 would filter out all of the data, since it is impossible to have a coverage of 2 for ‘p3’, ‘p2’, and ‘p1’, since they each have only one sample. Therefore, you can also enter the mincov parameter as a dictionary setting a different minimum for each tip taxon, which we demonstrate below for the baba object 'bb'.

[21]:

## print params for object aa
aa.params

[21]:

database   None
mincov     1
nboots     1000
quiet      False

[22]:

## set the mincov value as a dictionary for object bb
bb.params.mincov = {"p4":2, "p3":1, "p2":1, "p1":1}
bb.params

[22]:

database   None
mincov     {'p2': 1, 'p3': 1, 'p1': 1, 'p4': 2}
nboots     1000
quiet      False
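The filtering rule described above can be sketched in a few lines. This is only an illustration of the logic under the stated description, not ipyrad's implementation; `locus_passes` and the `samples_with_data` argument are hypothetical names.

```python
def locus_passes(test, mincov, samples_with_data):
    """Return True if every tip meets its minimum sample coverage.

    test: dict mapping tip names to lists of sample names
    mincov: an int applied to all tips, or a dict with one value per tip
    samples_with_data: set of sample names that have data at this locus
    """
    for tip, names in test.items():
        required = mincov[tip] if isinstance(mincov, dict) else mincov
        present = sum(1 for name in names if name in samples_with_data)
        if present < required:
            return False
    return True


test = {
    "p4": ["32082_przewalskii", "33588_przewalskii"],
    "p3": ["30686_cyathophylla"],
    "p2": ["33413_thamno"],
    "p1": ["40578_rex"],
}

# a hypothetical locus where only one of the two outgroup samples has data
locus = {"32082_przewalskii", "30686_cyathophylla", "33413_thamno", "40578_rex"}

print(locus_passes(test, 1, locus))                                   # True
print(locus_passes(test, 2, locus))                                   # False
print(locus_passes(test, {"p4": 2, "p3": 1, "p2": 1, "p1": 1}, locus))  # False
```

Note that with mincov=2 as a single integer no locus could ever pass for this test, because p1, p2, and p3 each contain only one sample; the per-tip dictionary avoids that problem.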

Running the tests

When you execute the 'run()' command all of the tests for the object will be distributed to run in parallel on your cluster (or the cores available on your machine) as connected to your ipyclient object. The results of the tests will be stored in your baba object under the attributes 'results_table' and 'results_boots'.

[23]:

## run tests for each of our objects
aa.run(ipyclient)
bb.run(ipyclient)
cc.run(ipyclient)

[####################] 100%  calculating D-stats  | 0:00:07 |
[####################] 100%  calculating D-stats  | 0:00:06 |
[####################] 100%  calculating D-stats  | 0:00:10 |

The results table

The results of the tests are stored as a data frame (pandas.DataFrame) in results_table, which can be easily accessed and manipulated. The tests are listed in order and can be referenced by their 'index' (the number in the left-most column). For example, below we see the results for object 'cc' tests 0 and 1. You can see which taxa were used in each test by accessing them from the .tests attribute as a dictionary, or as .taxon_table which returns it as a dataframe. An even better way to see which individuals were involved in each test, however, is to use our plotting functions, which we describe further below.

[31]:

## you can sort the results by Z-score
cc.results_table.sort_values(by="Z", ascending=False)

## save the table to a file
cc.results_table.to_csv("cc.abba-baba.csv")

## show the results in notebook
cc.results_table

[31]:

	dstat	bootmean	bootstd	Z	ABBA	BABA	nloci
0	-0.007	-0.009	0.044	0.152	238.688	241.875	8313
1	-0.008	-0.008	0.041	0.193	248.250	252.250	8822

Auto-generating tests

Entering all of the tests by hand can be a pain, which is why we wrote functions to auto-generate tests given a rooted input tree and a number of constraints on the tests to generate from that tree. It is important to add constraints, otherwise the number of tests that can be produced becomes very large very quickly. Calculating results runs fairly fast, but summarizing and interpreting thousands of results is nearly impossible, so it is generally better to limit the tests to those that make some intuitive sense to run. You can see in this example that implementing a few constraints reduces the number of tests from 2006 to 14.

[32]:

## create a new 'copy' of your baba object and attach a treefile
dd = bb.copy()
dd.newick = newick

## generate all possible tests
dd.generate_tests_from_tree()

## a dict of constraints
constraint_dict={
        "p4": ["32082_przewalskii", "33588_przewalskii"],
        "p3": ["40578_rex", "35855_rex"],
    }

## generate tests with constraints
dd.generate_tests_from_tree(
    constraint_dict=constraint_dict,
    constraint_exact=False,
)

## 'exact' constraints are even more constrained
dd.generate_tests_from_tree(
    constraint_dict=constraint_dict,
    constraint_exact=True,
)

2006 tests generated from tree
126 tests generated from tree
14 tests generated from tree
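To get a feel for why unconstrained test generation grows so quickly, here is a rough counting sketch. It makes simplifying assumptions that differ from what ipyrad does: the tree is a hypothetical binary tree written as nested tuples, every tip of a test is a single sample, and swapping p1 and p2 is not counted as a new test. Since the baba tool also builds tips from clades of several samples, its counts are larger than this.

```python
from math import comb


def count_tips(node):
    """Number of tips below a node (a plain string is a tip)."""
    if isinstance(node, str):
        return 1
    return sum(count_tips(child) for child in node)


def count_tests(node, ntotal):
    """Count single-sample (((p1,p2),p3),p4) tests in a binary tree.

    At each internal node, p1 and p2 come from one child, p3 from the
    other child, and p4 from anywhere outside the node.
    """
    if isinstance(node, str):
        return 0
    left, right = node
    nleft, nright = count_tips(left), count_tips(right)
    outside = ntotal - (nleft + nright)
    here = (comb(nleft, 2) * nright + comb(nright, 2) * nleft) * outside
    return here + count_tests(left, ntotal) + count_tests(right, ntotal)


# hypothetical 5-tip and 6-tip ladder-shaped trees
tree5 = (((("a", "b"), "c"), "d"), "e")
tree6 = (tree5, "f")

print(count_tests(tree5, count_tips(tree5)))  # 5
print(count_tests(tree6, count_tips(tree6)))  # 15
```

Adding a single tip triples the count in this toy case, which is why constraints on p3 and p4 are so useful for keeping the set of tests interpretable.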

Running the tests

The .run() command will run the tests linked to your analysis object. An ipyclient object is required to distribute the jobs in parallel. The .plot() function can then optionally be used to visualize the results on a tree. Or, you can simply look at the results in the .results_table attribute.

[33]:

## run the dd tests
dd.run(ipyclient)
dd.plot(height=500, pct_tree_y=0.2, alpha=4);
dd.results_table

[####################] 100%  calculating D-stats  | 0:01:00 |

[33]:

	dstat	bootmean	bootstd	Z	ABBA	BABA	nloci
0	0.071	0.071	0.034	2.082	415.266	360.406	9133
1	0.120	0.121	0.035	3.400	421.000	330.484	8611
2	0.085	0.088	0.041	2.044	327.828	276.609	6849
3	0.129	0.129	0.044	2.967	326.953	252.047	6505
4	0.096	0.097	0.037	2.558	376.078	310.266	8413
5	0.135	0.135	0.038	3.519	380.672	290.359	7939
6	-0.092	-0.090	0.040	2.299	278.641	335.234	6863
7	-0.109	-0.109	0.037	2.916	310.672	386.297	8439
8	-0.085	-0.083	0.044	1.948	276.609	327.828	6849
9	-0.096	-0.096	0.038	2.506	310.266	376.078	8413
10	-0.129	-0.130	0.043	3.009	252.047	326.953	6505
11	-0.135	-0.134	0.038	3.556	290.359	380.672	7939
12	-0.023	-0.023	0.032	0.714	435.562	455.750	8208
13	-0.013	-0.014	0.030	0.434	509.906	523.438	9513

More about input file paths (i/o)

The default (required) input data file is the .loci file produced by ipyrad. When performing D-statistic calculations this file will be parsed to retain the maximal amount of information useful for each test.

An additional (optional) file to provide is a newick tree file. While you do not need a tree in order to run ABBA-BABA tests, you do need at least a hypothesis for how your samples are related in order to set up meaningful tests. By loading in a tree for your data set we can use it to easily set up hypotheses to test, and to plot results on the tree.

[20]:

## path to a locifile created by ipyrad
locifile = "./analysis-ipyrad/pedicularis_outfiles/pedicularis.loci"

## path to an unrooted tree inferred with tetrad
newick = "./analysis-tetrad/tutorial.tree"

(optional): root the tree

For abba-baba tests you will pretty much always want your tree to be rooted, since the test relies on an assumption about which alleles are ancestral. You can use our simple tree plotting library toytree to root your tree. This library uses Toyplot as its plotting backend, and ete3 as its tree manipulation backend.

Below I load in a newick string and root the tree on the two P. przewalskii samples using the root() function. You can either enter the names of the outgroup samples explicitly or enter a wildcard to select them. We show the rooted tree from a tetrad analysis below. The newick string of the rooted tree can be saved or accessed by the .newick attribute, like below.

[39]:

## load in the tree
tre = toytree.tree(newick)

## set the outgroup either as a list of names...
## (root() returns a rooted tree, as in the short tutorial above,
## so store the result)
tre = tre.root(names=["32082_przewalskii", "33588_przewalskii"])

## ...or using a wildcard selector
tre = tre.root(wildcard="prz")

## draw the tree
tre.draw(width=400)

## save the rooted newick string back to a variable
newick = tre.newick

Interpreting results

You can see in the results_table below that the absolute D-statistic values range from about 0.0 to 0.15 in these tests. These values are not very informative on their own, so we generally focus on the Z-score, which represents how far the distribution of D-statistic values across bootstrap replicates deviates from its expected value of zero. The default number of bootstrap replicates per test is 1000. Each replicate resamples nloci loci with replacement.

In these tests ABBA and BABA occurred with roughly equal frequency. The values are calculated using SNP frequencies, which is why they are floats instead of integers, and this is also why we were able to combine multiple samples to represent a single tip in the tree (e.g., see the tests we set up above).
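To see why frequency-based counts are floats, here is a minimal sketch assuming the commonly used frequency formulation of the ABBA and BABA site patterns, where each tip contributes its derived-allele frequency at a SNP. Whether ipyrad computes the values in exactly this way is an assumption, and `site_abba_baba` is a hypothetical name.

```python
def site_abba_baba(p1, p2, p3, p4):
    """Return (ABBA, BABA) weights for one SNP.

    Each argument is the frequency of the derived allele (B) in that
    tip, pooled over all samples assigned to the tip.
    """
    abba = (1 - p1) * p2 * p3 * (1 - p4)
    baba = p1 * (1 - p2) * p3 * (1 - p4)
    return abba, baba


# single samples with fixed alleles give whole-number counts
print(site_abba_baba(0.0, 1.0, 1.0, 0.0))  # (1.0, 0.0)

# a tip with two samples that disagree (frequency 0.5) gives a fraction
print(site_abba_baba(0.0, 1.0, 0.5, 0.0))  # (0.5, 0.0)
```

Summing these per-site weights over all SNPs gives totals like the ABBA and BABA columns in the results table.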

[41]:

## show the results table
print(dd.results_table)

    dstat  bootmean  bootstd      Z     ABBA     BABA  nloci
0   0.071     0.071    0.034  2.082  415.266  360.406   9133
1   0.120     0.121    0.035  3.400  421.000  330.484   8611
2   0.085     0.088    0.041  2.044  327.828  276.609   6849
3   0.129     0.129    0.044  2.967  326.953  252.047   6505
4   0.096     0.097    0.037  2.558  376.078  310.266   8413
5   0.135     0.135    0.038  3.519  380.672  290.359   7939
6  -0.092    -0.090    0.040  2.299  278.641  335.234   6863
7  -0.109    -0.109    0.037  2.916  310.672  386.297   8439
8  -0.085    -0.083    0.044  1.948  276.609  327.828   6849
9  -0.096    -0.096    0.038  2.506  310.266  376.078   8413
10 -0.129    -0.130    0.043  3.009  252.047  326.953   6505
11 -0.135    -0.134    0.038  3.556  290.359  380.672   7939
12 -0.023    -0.023    0.032  0.714  435.562  455.750   8208
13 -0.013    -0.014    0.030  0.434  509.906  523.438   9513
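The columns of the table above can be related to each other with a short sketch. It assumes the standard definition D = (ABBA - BABA) / (ABBA + BABA) and that Z is the absolute D divided by the bootstrap standard deviation, which appears consistent with the rounded values shown. The bootstrap function is a simplified illustration with made-up per-locus values, not ipyrad's implementation.

```python
import random
import statistics


def dstat(abba, baba):
    """D-statistic from summed ABBA and BABA values."""
    return (abba - baba) / (abba + baba)


def bootstrap_d(loci, nboots=1000, seed=0):
    """Resample loci with replacement and return (mean, std) of D.

    loci: list of (abba, baba) sums, one pair per locus
    """
    rng = random.Random(seed)
    boots = []
    for _ in range(nboots):
        sample = [rng.choice(loci) for _ in loci]
        abba = sum(a for a, _ in sample)
        baba = sum(b for _, b in sample)
        boots.append(dstat(abba, baba))
    return statistics.mean(boots), statistics.stdev(boots)


# test 0 in the table above: ABBA=415.266, BABA=360.406, bootstd=0.034
d = dstat(415.266, 360.406)
print(round(d, 3))  # 0.071
z = abs(d) / 0.034  # close to the reported 2.082 (bootstd is rounded)

# hypothetical per-locus values, only to show the resampling step
toy_loci = [(3.0, 1.0), (2.0, 2.0), (4.0, 1.5), (1.0, 0.5), (2.5, 2.0)]
bootmean, bootstd = bootstrap_d(toy_loci, nboots=200)
```

Because the test is about deviation from zero in either direction, the sign of D only tells you which of p1 or p2 shares more derived alleles with p3.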

Running 5-taxon (partitioned) D-statistics

Performing partitioned D-statistic tests is no harder than running the standard four-taxon D-statistic tests. You simply enter your tests with 5 taxa in them, listed as p1-p5. We have not developed a function to generate 5-taxon tests from a phylogeny, as this test is more appropriately applied to a smaller number of tests to further tease apart the meaning of significant 4-taxon results. See the example above in the short tutorial. A simulation example will be added here soon…
