ipyrad-analysis toolkit: vcf_to_hdf5


Many genome assembly tools write variant SNP calls in the VCF format (variant call format). This is a plain-text file that stores variant calls relative to a reference genome in tabular form. It includes a lot of additional information about the quality of SNP calls, etc., but it is not easy to read and is inefficient to parse. To make analyses run faster, ipyrad stores this information in a simplified HDF5 database. You can easily convert any VCF file to this HDF5 format using the ipa.vcf_to_hdf5() tool.

This tool also accepts an optional ld_block_size argument that stores linkage information in the file, which many downstream tools can use to subsample SNPs and perform bootstrap resampling in a way that reduces the effects of linkage among SNPs. If your data are de novo assembled RAD data then ld_block_size is not required, since we can simply use the RAD loci as the linkage blocks. But if you want to treat reference-mapped RAD loci located near each other in the genome as belonging to the same linkage block, you can enter a value such as 50,000 to create 50 Kb linkage blocks; this joins many RAD loci together and samples only one SNP per block in each bootstrap replicate. If your data are not RAD data, e.g., whole genome data, then the ld_block_size argument is required in order to encode linkage information as discrete blocks in your database.
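
For reference, here is a minimal sketch of the two usage patterns described above. The file and analysis names are hypothetical placeholders, not files from this tutorial.

[ ]:
import ipyrad.analysis as ipa

# RAD-seq data: each RAD locus serves as a linkage block, so no ld_block_size is needed
converter_rad = ipa.vcf_to_hdf5(
    name="rad_example",
    data="./rad_example.vcf.gz",
)

# whole genome data: encode 50 Kb linkage blocks with ld_block_size
converter_wgs = ipa.vcf_to_hdf5(
    name="wgs_example_LD50K",
    data="./wgs_example.vcf.gz",
    ld_block_size=50000,
)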

Required software

If you are converting a VCF file produced by some other tool (e.g., GATK, FreeBayes, etc.) then you will need to install the htslib and bcftools software and use them as described below.

[1]:
# conda install ipyrad -c bioconda
# conda install htslib -c bioconda
# conda install bcftools -c bioconda
[2]:
import ipyrad.analysis as ipa
import pandas as pd

Pre-filter data from other programs (e.g., FreeBayes, GATK)

You can use the program bcftools to pre-filter your data to exclude indels and low-quality SNPs. If you ran the conda install commands above then you already have all of the required tools installed. To achieve the format that ipyrad expects you will need to exclude indel-containing variants (this may change in the future). Further quality filtering is optional.

The example below reduced the size of a VCF data file from 29 Gb to 80 Mb! A VCF file contains a lot of information that you do not need to retain through all of your analyses; here we keep only the final genotype calls.

Note that the code below is a bash script. You can run it from a terminal, or in a jupyter notebook by adding the %%bash cell magic at the top of the cell, as shown below.

[ ]:
%%bash

# compress the VCF file if not already done (creates .vcf.gz)
bgzip data.vcf

# tabix index the compressed VCF (creates .vcf.gz.tbi)
tabix data.vcf.gz

# remove multi-allelic SNPs and INDELs and PIPE to next command
bcftools view -m2 -M2 -i'CIGAR="1X" & QUAL>30' data.vcf.gz -Ou |

    # remove extra annotations/formatting info and save to new .vcf
    bcftools annotate -x FORMAT,INFO  > data.cleaned.vcf

# recompress the final file (creates .vcf.gz)
bgzip data.cleaned.vcf

A peek at the cleaned VCF file

[3]:
# load the cleaned VCF as a pandas dataframe in chunks
# (skiprows is set high enough here to skip past the VCF header lines)
dfchunks = pd.read_csv(
    "/home/deren/Documents/ipyrad/sandbox/Macaque-Chr1.clean.vcf.gz",
    sep="\t",
    skiprows=1000,
    chunksize=1000,
)

# show first few rows of first dataframe chunk
next(dfchunks).head()
[3]:
NC_018152.2 51273 . G A 280.482 ..1 ..2 GT 0/0 ... 0/0.9 0/0.10 0/0.11 0/0.12 0/0.13 0/0.14 0/0.15 0/0.16 0/0.17 0/1.1
0 NC_018152.2 51292 . A G 16750.300 . . GT 1/1 ... 1/1 . 1/1 1/1 1/1 1/1 0/0 1/1 1/1 1/1
1 NC_018152.2 51349 . A G 628.563 . . GT 0/0 ... 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
2 NC_018152.2 51351 . C T 943.353 . . GT 0/0 ... 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
3 NC_018152.2 51352 . G A 607.681 . . GT 0/0 ... 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
4 NC_018152.2 51398 . C T 510.120 . . GT 0/0 ... 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0

5 rows × 29 columns
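
The skiprows value above was chosen simply to jump past the VCF header lines, which is why the column names were lost and a data record was used as the header. As a hedged alternative sketch, you can count the '##' meta lines explicitly and keep the '#CHROM' line as the column header, regardless of how long the header is:

[ ]:
import gzip
import pandas as pd

vcf_path = "/home/deren/Documents/ipyrad/sandbox/Macaque-Chr1.clean.vcf.gz"

# count the '##' meta lines so the '#CHROM' line is used as the column header
with gzip.open(vcf_path, "rt") as infile:
    n_meta = 0
    for line in infile:
        if line.startswith("##"):
            n_meta += 1
        else:
            break

# read the variant records in chunks and show the first few rows
dfchunks = pd.read_csv(vcf_path, sep="\t", skiprows=n_meta, chunksize=1000)
next(dfchunks).head()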

Converting clean VCF to HDF5

Here I am using a VCF file of whole genome data for 20 monkeys from an unpublished study (in progress). It contains >6M SNPs, all from chromosome 1. Because many SNPs are close together, and thus tightly linked, we will likely wish to take linkage into account in our downstream analyses.

The ipyrad analysis tools can do this by encoding linkage-block information into the HDF5 file. Here we encode an ld_block_size of 20,000 bp, which breaks the single scaffold (chromosome 1) into about 10K linkage blocks. See the example below where this information is used in an ipyrad PCA analysis.

[4]:
# init a conversion tool
converter = ipa.vcf_to_hdf5(
    name="Macaque_LD20K",
    data="/home/deren/Documents/ipyrad/sandbox/Macaque-Chr1.clean.vcf.gz",
    ld_block_size=20000,
)

# run the converter
converter.run()
Indexing VCF to HDF5 database file
VCF: 6094152 SNPs; 1 scaffolds
[####################] 100% 0:02:22 | converting VCF to HDF5
HDF5: 6094152 SNPs; 10845 linkage group
SNP database written to ./analysis-vcf2hdf5/Macaque_LD20K.snps.hdf5
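
If you want to check the result of the conversion you can open the new HDF5 file directly with h5py and list its contents. This is a minimal sketch only; the exact dataset names and shapes stored in the file may differ across ipyrad versions.

[ ]:
import h5py

# open the database written by the converter and list the datasets it contains
with h5py.File("./analysis-vcf2hdf5/Macaque_LD20K.snps.hdf5", "r") as io5:
    for key, dataset in io5.items():
        print(key, dataset)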

Downstream analyses

The data file now contains ~6M SNPs across 20 samples, grouped into ~10.8K linkage blocks (see the conversion output above). By default the PCA tool subsamples a single SNP per linkage block. To explore variation over multiple random subsamplings we can use the nreplicates argument.

[5]:
# init a PCA tool and filter to allow no missing data
pca = ipa.pca(
    data="./analysis-vcf2hdf5/Macaque_LD20K.snps.hdf5",
    mincov=1.0,
)
Samples: 20
Sites before filtering: 6094152
Filtered (indels): 0
Filtered (bi-allel): 0
Filtered (mincov): 794597
Filtered (minmap): 0
Filtered (combined): 794597
Sites after filtering: 5299555
Sites containing missing values: 0 (0.00%)
Missing values in SNP matrix: 0 (0.00%)
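
Here mincov=1.0 excludes any site with missing data across the 20 samples. As a hedged sketch (the population names and sample names below are hypothetical), the same tool also accepts an imap dictionary assigning samples to groups and a minmap dictionary of minimum per-group coverage, which lets you allow some missing data overall while still requiring each group to be represented at every site:

[ ]:
# hypothetical population assignments; replace with your own sample names
imap = {
    "pop1": ["sample-1", "sample-2", "sample-3"],
    "pop2": ["sample-4", "sample-5", "sample-6"],
}

# require each site to be present in >=50% of samples overall and in
# >=75% of the samples within each population
pca2 = ipa.pca(
    data="./analysis-vcf2hdf5/Macaque_LD20K.snps.hdf5",
    imap=imap,
    minmap={"pop1": 0.75, "pop2": 0.75},
    mincov=0.5,
)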

Run a single PCA analysis from subsampled unlinked SNPs

[6]:
pca.run_and_plot_2D(0, 1, seed=123);
Subsampling SNPs: 10841/5299555
[Scatterplot of the 20 samples on PC0 (25.4% variance explained) vs. PC1 (14.4% variance explained), labeled by sample name.]

Run multiple PCAs over replicates of subsampled SNPs

Here you can see results where a different ~10K SNPs are subsampled in each replicate iteration. If the signal in the data is robust then we should expect the points to cluster in a similar place across replicates. Internally ipyrad rotates the axes to ensure that the replicate plots align despite axis swapping (which is arbitrary in PCA space). You can see that this provides a better view of the uncertainty in our estimates than the single plot above (and it looks cool!).

[7]:
pca.run_and_plot_2D(0, 1, seed=123, nreplicates=25);
Subsampling SNPs: 10841/5299555
[Scatterplot of the 20 samples on PC0 (25.4% variance explained) vs. PC1 (14.4% variance explained), with 25 replicate subsamples plotted per sample.]

More details on running PCAs, toggling options, and styling plots can be found in our ipyrad.analysis PCA tutorial.