ipyrad-analysis toolkit: window_extracter

Extract all sequence data within a genomic window, concatenate, and write to a phylip file. Useful for inferring the phylogeny near a specific gene/region of interest. Follow up with downstream phylogenetic analysis of the region.

Key features:

  1. Automatically concatenates ref-mapped RAD loci in sliding windows.
  2. Filter to remove sites by missing data.
  3. Optionally remove samples from alignments.
  4. Optionally use consensus seqs to represent clades of multiple samples.

Required software

[1]:
# conda install ipyrad -c bioconda
# conda install raxml -c bioconda
# conda install toytree -c eaton-lab
[23]:
data = ip.load_json("../../tests/ped-2019.json")
loading Assembly: ped-2019
from saved path: ~/Documents/ipyrad/tests/ped-2019.json
[26]:
data.params
[26]:
0   assembly_name               ped-2019
1   project_dir                 ~/Documents/ipyrad/tests
2   raw_fastq_path
3   barcodes_path
4   sorted_fastq_path           ~/Documents/ipyrad/tests/example_empirical_rad/*.gz
5   assembly_method             denovo
6   reference_sequence
7   datatype                    rad
8   restriction_overhang        ('TGCAG', '')
9   max_low_qual_bases          5
10  phred_Qscore_offset         33
11  mindepth_statistical        6
12  mindepth_majrule            6
13  maxdepth                    10000
14  clust_threshold             0.85
15  max_barcode_mismatch        0
16  filter_adapters             0
17  filter_min_trim_len         35
18  max_alleles_consens         2
19  max_Ns_consens              0.05
20  max_Hs_consens              0.05
21  min_samples_locus           4
22  max_SNPs_locus              0.2
23  max_Indels_locus            8
24  max_shared_Hs_locus         0.5
25  trim_reads                  (0, 0, 0, 0)
26  trim_loci                   (0, 0, 0, 0)
27  output_formats              ['p', 's', 'l', 'v']
28  pop_assign_file
29  reference_as_filter
[ ]:

[ ]:

[2]:
import ipyrad.analysis as ipa
import toytree


from ipyrad.analysis.window_extracter import *

Required input data files

Your input data should be a .seqs.hdf database file produced by ipyrad. This file contains the full sequence alignment for your samples as well as associated meta-data of the genomic positions of RAD loci relative to a reference genome.

[3]:
# the path to your HDF5 formatted seqs file
data1 = "../../tests/ped-2019_outfiles/ped-2019.seqs.hdf5"
data2 = "../../tests/refsim_outfiles/refsim.seqs.hdf5"
[4]:
with h5py.File(data1) as io5:
    print(io5.keys())
    print(list(io5["phymap"].attrs.keys()))
    print(io5["phymap"][:])
    print(io5["phymap"][:, 2] - io5["phymap"][:, 1])
<KeysViewHDF5 ['phy', 'phymap', 'scaffold_lengths', 'scaffold_names']>
['columns', 'phynames', 'reference']
[[      1       0      69       0       0]
 [      2      69     138       0       0]
 [      3     138     207       0       0]
 ...
 [  40794 2835157 2835226       0       0]
 [  40795 2835226 2835295       0       0]
 [  40796 2835295 2835364       0       0]]
[69 69 69 ... 69 69 69]

Short Tutorial:

The window_extracter() tool takes the .seqs.hdf5 database file from ipyrad as its input file. You select scaffolds by their index (integer) which can be found in the .scaffold_table. The first step is to load the data file to see which scaffolds are in your data set and their size. These will be in the same order they appear in your reference genome.

Load data file to get scaffold information

[11]:
# check scaffold idx (row) against scaffold names
ext = ipa.window_extracter(data1, scaffold_idx=0)#.scaffold_table.head(15)
No data in selected window.
[ ]:

[15]:
with h5py.File(ext.data) as io5:
    colnames = io5["phymap"].attrs["columns"]
    print(colnames)
    print(io5["phymap"][:])

    mask = io5["phymap"][:, 0] = ext.scaffold_idx + 1

    d = io5["phymap"][:][mask]
[b'chroms' b'phy0' b'phy1' b'pos0' b'pos1']
[[      1       0      69       0       0]
 [      1      69     138       0       0]
 [      1     138     207       0       0]
 ...
 [      1 2835157 2835226       0       0]
 [      1 2835226 2835295       0       0]
 [      1 2835295 2835364       0       0]]
[18]:
# next: init_stats
[ ]:

[ ]:

We can see from the table above that this genome has 12 large scaffolds (chromosome-scale linkage blocks) and many other smaller unplaced scaffolds. If you are working with a high quality reference genome then it will likely look similar to this, whereas many other reference genomes will be composed of many more scaffolds that are mostly smaller in size. Here I will focus just on the large chromosomes.

Load tool and select window

Enter the data file, the workdir where files will be written to, and the scaffold_idx that you want to extract sequence data from. Use start and end to select the window. You can filter sites from the alignment by using mincov or minmap along with an imap dictionary. Similarly, you can reduce missing data among samples by either excluding samples (using exclude or by using an imap dict) or by applying consensus_sample with an imap dictionary which will produce a single consensus sequence to represent all of the samples grouped to the same name in an imap dictionary. Examples of each of these options are demonstraed below.

The .stats attribute shows the information content of the selected window before and after filtering. When creating alignments this tool excludes any sites that have no data (e.g., the space between RAD markers, or the space between paired reads). In this case, we selected a 5Mb genomic window which contains 51,474bp of RAD sequence data and 1,687 SNPs. After filtering based on the mincov value this was reduced to 42,397bp and 1,416 SNPs. The number of samples remained the same because taxon sampling occurs before the filters are applied. To write the resulting alignment to a file call the .run() function.

[7]:
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=data,
    workdir="analysis-window_extracter",
    scaffold_idx=0,
    start=0,
    end=5000000,
    exclude=["CUMM5"],
    mincov=20,
)

# show stats of the window
ext.stats
[7]:
scaffold start end sites snps missing samples
prefilter Qrob_Chr01 0 5000000 51474 1687 0.19 29
postfilter Qrob_Chr01 0 5000000 42397 1416 0.11 29

Write selected window to a file

[8]:
ext.run(force=True)
Wrote data to /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-window_extracter/scaf0-0-5000000.phy

Advanced: Infer tree from phy output

You can pass in the file path that was created above to the .raxml analysis object in ipyrad, or use it any other phylogenetic software that accepts phylip format. We can see from the stats table above that this alignment contains 11,713 sites with 397 SNPs and about 25% missing data.

[9]:
# run raxml on the phylip file
rax = ipa.raxml(data=ext.outfile, name="test", N=50, T=4)

# show the raxml command
print(rax.command)
raxmlHPC-PTHREADS-SSE3 -f a -T 4 -m GTRGAMMA -n test -w /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-raxml -s /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-window_extracter/scaf0-0-5000000.phy -p 54321 -N 50 -x 12345
[10]:
# run job and wait to finish
rax.run(force=True)
job test finished successfully
[11]:
# plot the tree for this genome window
print(rax.trees.bipartitions)
tre = toytree.tree(rax.trees.bipartitions)
rtre = tre.root("reference").collapse_nodes(min_support=50)
rtre.draw(node_labels="support");
/home/deren/Documents/ipyrad/newdocs/cookbook/analysis-raxml/RAxML_bipartitions.test
FLSF33LALC2FLBA140TXWV2SCCU3CUSV6FLSF54FLMO62FLWO6FLAB109FLSA185FLCK216CUCA4CUVN10FLSF47FLCK18HNDA09BZBB1CRL0030CRL0001MXSA3017MXED8BJSB3BJVL19BJSL25MXGT4TXMD3TXGR3reference9480525692549456100525884100100100100

Advanced: Population/species sampling

When you have multiple samples per species you can use an imap dictionary to define them as a clade to create a consensus sequence to represent each clade as a single taxon. This can be useful for reducing the amount of missing data, and reducing the number of tips in the tree.

[12]:
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data = "/home/deren/Downloads/ref_pop2.seqs.hdf5",
    workdir="analysis-window_extracter",
    scaffold_idx=0,
    start=0,
    end=5000000,
    mincov=4,
    imap={
        "reference": ["reference"],
        "virg": ["TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140"],
        "mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
        "gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
        "bran": ["BJSL25", "BJSB3", "BJVL19"],
        "fusi": ["MXED8", "MXGT4", "TXGR3", "TXMD3"],
        "sagr": ["CUVN10", "CUCA4", "CUSV6"],
        "oleo": ["CRL0030", "HNDA09", "BZBB1", "MXSA3017"],
    },
)
[13]:
# write the phylip file
ext.run(force=True)
Wrote data to /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-window_extracter/scaf0-0-5000000.phy
[14]:
# filtering now reduced from 30 to 8 samples
ext.stats
[14]:
scaffold start end sites snps missing samples
prefilter Qrob_Chr01 0 5000000 51474 1690 0.21 30
postfilter Qrob_Chr01 0 5000000 50183 452 0.02 8
[15]:
# infer tree on imap
rax = ipa.raxml(data=ext.outfile, name="test2", N=50, T=4)
rax.run(force=True)
job test2 finished successfully
[16]:
# plot the tree for this genome window
print(rax.trees.bipartitions)
tre = toytree.tree(rax.trees.bipartitions)
rtre = tre.root("reference").collapse_nodes(min_support=50)
rtre.draw(node_labels="support");
/home/deren/Documents/ipyrad/newdocs/cookbook/analysis-raxml/RAxML_bipartitions.test2
virgsagrminigemioleofusibranreference647210096100100100