ipyrad-analysis toolkit: window_extracter¶
Extract all sequence data within a genomic window, concatenate, and write to a phylip file. Useful for inferring the phylogeny near a specific gene/region of interest. Follow up with downstream phylogenetic analysis of the region.
Key features:
- Automatically concatenates ref-mapped RAD loci in sliding windows.
- Filter to remove sites by missing data.
- Optionally remove samples from alignments.
- Optionally use consensus seqs to represent clades of multiple samples.
Required software¶
[1]:
# conda install ipyrad -c bioconda
# conda install raxml -c bioconda
# conda install toytree -c eaton-lab
[23]:
data = ip.load_json("../../tests/ped-2019.json")
loading Assembly: ped-2019
from saved path: ~/Documents/ipyrad/tests/ped-2019.json
[26]:
data.params
[26]:
0 assembly_name ped-2019
1 project_dir ~/Documents/ipyrad/tests
2 raw_fastq_path
3 barcodes_path
4 sorted_fastq_path ~/Documents/ipyrad/tests/example_empirical_rad/*.gz
5 assembly_method denovo
6 reference_sequence
7 datatype rad
8 restriction_overhang ('TGCAG', '')
9 max_low_qual_bases 5
10 phred_Qscore_offset 33
11 mindepth_statistical 6
12 mindepth_majrule 6
13 maxdepth 10000
14 clust_threshold 0.85
15 max_barcode_mismatch 0
16 filter_adapters 0
17 filter_min_trim_len 35
18 max_alleles_consens 2
19 max_Ns_consens 0.05
20 max_Hs_consens 0.05
21 min_samples_locus 4
22 max_SNPs_locus 0.2
23 max_Indels_locus 8
24 max_shared_Hs_locus 0.5
25 trim_reads (0, 0, 0, 0)
26 trim_loci (0, 0, 0, 0)
27 output_formats ['p', 's', 'l', 'v']
28 pop_assign_file
29 reference_as_filter
[ ]:
[ ]:
[2]:
import ipyrad.analysis as ipa
import toytree
from ipyrad.analysis.window_extracter import *
Required input data files¶
Your input data should be a .seqs.hdf
database file produced by ipyrad. This file contains the full sequence alignment for your samples as well as associated meta-data of the genomic positions of RAD loci relative to a reference genome.
[3]:
# the path to your HDF5 formatted seqs file
data1 = "../../tests/ped-2019_outfiles/ped-2019.seqs.hdf5"
data2 = "../../tests/refsim_outfiles/refsim.seqs.hdf5"
[4]:
with h5py.File(data1) as io5:
print(io5.keys())
print(list(io5["phymap"].attrs.keys()))
print(io5["phymap"][:])
print(io5["phymap"][:, 2] - io5["phymap"][:, 1])
<KeysViewHDF5 ['phy', 'phymap', 'scaffold_lengths', 'scaffold_names']>
['columns', 'phynames', 'reference']
[[ 1 0 69 0 0]
[ 2 69 138 0 0]
[ 3 138 207 0 0]
...
[ 40794 2835157 2835226 0 0]
[ 40795 2835226 2835295 0 0]
[ 40796 2835295 2835364 0 0]]
[69 69 69 ... 69 69 69]
Short Tutorial:¶
The window_extracter()
tool takes the .seqs.hdf5
database file from ipyrad as its input file. You select scaffolds by their index (integer) which can be found in the .scaffold_table
. The first step is to load the data file to see which scaffolds are in your data set and their size. These will be in the same order they appear in your reference genome.
Load data file to get scaffold information¶
[11]:
# check scaffold idx (row) against scaffold names
ext = ipa.window_extracter(data1, scaffold_idx=0)#.scaffold_table.head(15)
No data in selected window.
[ ]:
[15]:
with h5py.File(ext.data) as io5:
colnames = io5["phymap"].attrs["columns"]
print(colnames)
print(io5["phymap"][:])
mask = io5["phymap"][:, 0] = ext.scaffold_idx + 1
d = io5["phymap"][:][mask]
[b'chroms' b'phy0' b'phy1' b'pos0' b'pos1']
[[ 1 0 69 0 0]
[ 1 69 138 0 0]
[ 1 138 207 0 0]
...
[ 1 2835157 2835226 0 0]
[ 1 2835226 2835295 0 0]
[ 1 2835295 2835364 0 0]]
[18]:
# next: init_stats
[ ]:
[ ]:
We can see from the table above that this genome has 12 large scaffolds (chromosome-scale linkage blocks) and many other smaller unplaced scaffolds. If you are working with a high quality reference genome then it will likely look similar to this, whereas many other reference genomes will be composed of many more scaffolds that are mostly smaller in size. Here I will focus just on the large chromosomes.
Load tool and select window¶
Enter the data
file, the workdir
where files will be written to, and the scaffold_idx
that you want to extract sequence data from. Use start
and end
to select the window. You can filter sites from the alignment by using mincov
or minmap
along with an imap
dictionary. Similarly, you can reduce missing data among samples by either excluding samples (using exclude
or by using an imap
dict) or by applying consensus_sample
with an imap dictionary which will
produce a single consensus sequence to represent all of the samples grouped to the same name in an imap dictionary. Examples of each of these options are demonstraed below.
The .stats
attribute shows the information content of the selected window before and after filtering. When creating alignments this tool excludes any sites that have no data (e.g., the space between RAD markers, or the space between paired reads). In this case, we selected a 5Mb genomic window which contains 51,474bp of RAD sequence data and 1,687 SNPs. After filtering based on the mincov
value this was reduced to 42,397bp and 1,416 SNPs. The number of samples remained the same because
taxon sampling occurs before the filters are applied. To write the resulting alignment to a file call the .run()
function.
[7]:
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
data=data,
workdir="analysis-window_extracter",
scaffold_idx=0,
start=0,
end=5000000,
exclude=["CUMM5"],
mincov=20,
)
# show stats of the window
ext.stats
[7]:
scaffold | start | end | sites | snps | missing | samples | |
---|---|---|---|---|---|---|---|
prefilter | Qrob_Chr01 | 0 | 5000000 | 51474 | 1687 | 0.19 | 29 |
postfilter | Qrob_Chr01 | 0 | 5000000 | 42397 | 1416 | 0.11 | 29 |
Write selected window to a file¶
[8]:
ext.run(force=True)
Wrote data to /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-window_extracter/scaf0-0-5000000.phy
Advanced: Infer tree from phy output
You can pass in the file path that was created above to the .raxml
analysis object in ipyrad, or use it any other phylogenetic software that accepts phylip format. We can see from the stats table above that this alignment contains 11,713 sites with 397 SNPs and about 25% missing data.
[9]:
# run raxml on the phylip file
rax = ipa.raxml(data=ext.outfile, name="test", N=50, T=4)
# show the raxml command
print(rax.command)
raxmlHPC-PTHREADS-SSE3 -f a -T 4 -m GTRGAMMA -n test -w /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-raxml -s /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-window_extracter/scaf0-0-5000000.phy -p 54321 -N 50 -x 12345
[10]:
# run job and wait to finish
rax.run(force=True)
job test finished successfully
[11]:
# plot the tree for this genome window
print(rax.trees.bipartitions)
tre = toytree.tree(rax.trees.bipartitions)
rtre = tre.root("reference").collapse_nodes(min_support=50)
rtre.draw(node_labels="support");
/home/deren/Documents/ipyrad/newdocs/cookbook/analysis-raxml/RAxML_bipartitions.test
Advanced: Population/species sampling
When you have multiple samples per species you can use an imap
dictionary to define them as a clade to create a consensus sequence to represent each clade as a single taxon. This can be useful for reducing the amount of missing data, and reducing the number of tips in the tree.
[12]:
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
data = "/home/deren/Downloads/ref_pop2.seqs.hdf5",
workdir="analysis-window_extracter",
scaffold_idx=0,
start=0,
end=5000000,
mincov=4,
imap={
"reference": ["reference"],
"virg": ["TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140"],
"mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
"gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
"bran": ["BJSL25", "BJSB3", "BJVL19"],
"fusi": ["MXED8", "MXGT4", "TXGR3", "TXMD3"],
"sagr": ["CUVN10", "CUCA4", "CUSV6"],
"oleo": ["CRL0030", "HNDA09", "BZBB1", "MXSA3017"],
},
)
[13]:
# write the phylip file
ext.run(force=True)
Wrote data to /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-window_extracter/scaf0-0-5000000.phy
[14]:
# filtering now reduced from 30 to 8 samples
ext.stats
[14]:
scaffold | start | end | sites | snps | missing | samples | |
---|---|---|---|---|---|---|---|
prefilter | Qrob_Chr01 | 0 | 5000000 | 51474 | 1690 | 0.21 | 30 |
postfilter | Qrob_Chr01 | 0 | 5000000 | 50183 | 452 | 0.02 | 8 |
[15]:
# infer tree on imap
rax = ipa.raxml(data=ext.outfile, name="test2", N=50, T=4)
rax.run(force=True)
job test2 finished successfully
[16]:
# plot the tree for this genome window
print(rax.trees.bipartitions)
tre = toytree.tree(rax.trees.bipartitions)
rtre = tre.root("reference").collapse_nodes(min_support=50)
rtre.draw(node_labels="support");
/home/deren/Documents/ipyrad/newdocs/cookbook/analysis-raxml/RAxML_bipartitions.test2