ipyrad-analysis toolkit: PCA¶

Principal component analysis is a dimensionality reduction method used to transform and project data points onto fewer orthogonal axes that can explain the greatest amount of variance in the data. While there are many tools available to implement PCA, the ipyrad tool has many options available specifically to deal with missing data. PCA analyses are very sensitive to missing data. The ipyrad.pca tool makes it easy to perform PCA on RAD-seq data by filtering and/or imputing missing data, and allowing for easy subsampling of individuals to include in analyses.

Required software¶

[1]:

# conda install ipyrad -c bioconda
# conda install scikit-learn -c bioconda
# conda install toyplot -c eaton-lab

[2]:

import ipyrad.analysis as ipa
import toyplot

Required input data files¶

Your input data should be a .snps.hdf database file produced by ipyrad. If you do not have this you can generate it from any VCF file following the vcf2hdf5 tool tutorial. The database file contains the genotype calls information as well as linkage information that is used for subsampling unlinked SNPs and bootstrap resampling.

[3]:

# the path to your .snps.hdf5 database file
data = "/home/deren/Downloads/ref_pop2.snps.hdf5"

Input data file and population assignments¶

If you are using the “sample” input method then population assignments (imap dictionary) are used for for filtering, color coding plots, and for imputation. If you are using the “kmeans” imputing method then population assignments are only used for filtering and color coding plots.

[4]:

# group individuals into populations
imap = {
    "virg": ["TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140"],
    "mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
    "gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
    "bran": ["BJSL25", "BJSB3", "BJVL19"],
    "fusi": ["MXED8", "MXGT4", "TXGR3", "TXMD3"],
    "sagr": ["CUVN10", "CUCA4", "CUSV6"],
    "oleo": ["CRL0030", "HNDA09", "BZBB1", "MXSA3017"],
}

# require that 50% of samples have data in each group
minmap = {i: 0.5 for i in imap}

Enter data file and params¶

The pca analysis object takes input data as the .snps.hdf5 file produced by ipyrad. All other parameters are optional. The imap dictionary groups individuals into populations and minmap can be used to filter SNPs to only include those that have data for at least some proportion of samples in every group. The mincov option works similarly, it filters SNPs that are shared across less than some proportion of all samples (in contrast to minmap this does not use imap groupings).

When you init the object it will load the data and apply filtering. The printed output tells you how many SNPs were removed by each filter and the remaining amount of missing data after filtering. These remaining missing values are the ones that will be filled with imputation.

[5]:

# init pca object with input data and (optional) parameter options
pca = ipa.pca(
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.25,
    impute_method="sample",
)

Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 9517
Filtered (minmap): 112898
Filtered (combined): 121048
Sites after filtering: 228866
Sites containing missing values: 201371 (87.99%)
Missing values in SNP matrix: 640419 (10.36%)
Imputation: 'sampled'; (0, 1, 2) = 77.0%, 10.7%, 12.3%

Quick tutorial: run PCA¶

Call .run() and to get the PC axes and the variance explained by each axis. Feel free to take these data and plot them as you see fit, or, see below for built in plotting options.

[6]:

# run the PCA analysis
pcs, var = pca.run()

Subsampling SNPs: 29695/228866

Run PCA and plot results.¶

When you call .run() a PCA model will be fit to the data and two results are returned: (1) samples weightings on the component axes; (2) the proportion of variance explained by each axis. For convenience we have developed a plotting function that can be called as .run_and_plot_2D() to return the results as a toytree plot. The first two arguments to this are the two axes to be plotted.

By default this plotting function will return use the imap information to color points and create a legend.

[7]:

# plot PC axes 0 and 2 with no subsampling
pca.run_and_plot_2D(0, 2, subsample=False);

Subsampling SNPs¶

By default run() will randomly subsample one SNP per RAD locus to reduce the effect of linkage on your results. This can be turned off by setting subsample=False, like in the plot above. When using subsampling you can set the random seed to make your results repeatable. The results here subsample 29K SNPs from a possible 228K SNPs, but the final results are quite similar to above.

[8]:

# plot PC axes 0 and 2 with no subsampling
pca.run_and_plot_2D(0, 2, seed=123, subsample=True);

Subsampling SNPs: 29695/228866

Subsampling with replication¶

Subsampling unlinked SNPs is generally a good idea for PCA analyses since you want to remove the effects of linkage from your data. It also presents a convenient way to explore the confidence in your results. By using the option nreplicates you can run many replicate analyses that subsample a different random set of unlinked SNPs each time. The replicate results are drawn with a lower opacity and the centroid of all the points for each sample is plotted as a black point. You can hover over the points with your cursor to see the sample names pop-up.

[9]:

# plot PC axes 0 and 2 with no subsampling
pca.run_and_plot_2D(0, 2, seed=123, subsample=True, nreplicates=25);

Subsampling SNPs: 29695/228866

Advanced: Imputation algorithms:

We offer three algorithms for imputing missing data:

sample: Randomly sample genotypes based on the frequency of alleles within (user-defined) populations (imap).
kmeans: Randomly sample genotypes based on the frequency of alleles in (kmeans cluster-generated) populations.
None: All missing values are imputed with zeros (ancestral allele).

No imputation¶

The None option will almost always be a bad choice when there is any reasonable amount of missing data. Missing values will all be filled as zeros (ancestral allele) – this is what many other PCA tools do as well. I show it here for comparison to the imputed results, which are better. The two points near the top of the plot are samples with the most missing data that are erroneously grouped together. The rest of the samples also form much less clear clusters than in the other examples where we use imputation or stricter filtering options.

[10]:

# init pca object with input data and (optional) parameter options
no_imputation = ipa.pca(
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.25,
    impute_method=None,
)
no_imputation.run_and_plot_2D(0, 2, seed=123, subsample=False);

Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 9517
Filtered (minmap): 112898
Filtered (combined): 121048
Sites after filtering: 228866
Sites containing missing values: 201371 (87.99%)
Missing values in SNP matrix: 640419 (10.36%)
Imputation (null; sets to 0): 100.0%, 0.0%, 0.0%

No imputation but stricter filtering¶

Here I do not allow for any missing data (mincov=1.0). You can see that this reduces the number of total SNPs from 349K to 10K. The final reslult is not too different from our first example, but seems a little less smooth. In most data sets it is probably better to include more data by imputing some values, though. Many data sets may not have as many SNPs without missing data as this one.

[11]:

# init pca object with input data and (optional) parameter options
strict_filtering = ipa.pca(
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=1.0,
    impute_method=None,
)
strict_filtering.run_and_plot_2D(0, 2, seed=123, subsample=False);

Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 321628
Filtered (minmap): 112898
Filtered (combined): 322419
Sites after filtering: 27495
Sites containing missing values: 0 (0.00%)
Missing values in SNP matrix: 0 (0.00%)

Kmeans imputation¶

The kmeans clustering method allows imputing values based on population allele frequencies (like the sample method) but without having to a priori assign individuals to populations. In other words, it is meant to reduce the bias introduced by assigning individuals yourself. Instead, this method uses kmeans clustering to group individuals into “populations” and then imputes values based on those population assignments. This is accomplished through iterative clustering, starting by using only SNPs that are present across 90% of all samples (this can be changed with the topcov param) and then allowing more missing data in each iteration until it reaches the mincov parameter value.

This method works great especially if you have a lot of missing data and fear that user-defined population assignments will bias your results. Here it gives super similar results to our first plots using the “sample” impute method, suggesting that our population assignments are not greatly biasing the results. To use K=7 clusters you simply enter impute_method=7.

[12]:

# kmeans imputation
kmeans_imputation = ipa.pca(
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.5,
    impute_method=7,
)
kmeans_imputation.run_and_plot_2D(0, 2, seed=123);

Kmeans clustering: iter=0, K=7, mincov=0.9, minmap={'global': 0.5}
Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 222081
Filtered (minmap): 29740
Filtered (combined): 225958
Sites after filtering: 123956
Sites containing missing values: 96461 (77.82%)
Missing values in SNP matrix: 142937 (4.27%)
Imputation: 'sampled'; (0, 1, 2) = 76.7%, 14.9%, 8.4%
{0: ['CUCA4', 'CUSV6', 'CUVN10'], 1: ['FLBA140', 'FLSF33', 'LALC2', 'SCCU3', 'TXWV2'], 2: ['BJSB3', 'BJSL25', 'BJVL19'], 3: ['TXGR3', 'TXMD3'], 4: ['FLAB109', 'FLCK18', 'FLCK216', 'FLMO62', 'FLSA185', 'FLSF47', 'FLSF54', 'FLWO6'], 5: ['BZBB1', 'CRL0030', 'HNDA09', 'MXSA3017'], 6: ['MXED8', 'MXGT4']}

Kmeans clustering: iter=1, K=7, mincov=0.8, minmap={0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5}
Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 131220
Filtered (minmap): 115143
Filtered (combined): 152672
Sites after filtering: 197242
Sites containing missing values: 169747 (86.06%)
Missing values in SNP matrix: 420499 (7.90%)
Imputation: 'sampled'; (0, 1, 2) = 77.4%, 10.4%, 12.2%
{0: ['CUCA4', 'CUSV6', 'CUVN10'], 1: ['FLAB109', 'FLCK18', 'FLCK216', 'FLMO62', 'FLSA185', 'FLSF47', 'FLSF54', 'FLWO6'], 2: ['TXGR3', 'TXMD3'], 3: ['BJSB3', 'BJSL25', 'BJVL19'], 4: ['BZBB1', 'CRL0030', 'HNDA09', 'MXSA3017'], 5: ['FLBA140', 'FLSF33', 'LALC2', 'SCCU3', 'TXWV2'], 6: ['MXED8', 'MXGT4']}

Kmeans clustering: iter=2, K=7, mincov=0.7, minmap={0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5}
Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 76675
Filtered (minmap): 115143
Filtered (combined): 127588
Sites after filtering: 222326
Sites containing missing values: 194831 (87.63%)
Missing values in SNP matrix: 588943 (9.81%)
Imputation: 'sampled'; (0, 1, 2) = 77.1%, 10.6%, 12.3%
{0: ['BZBB1', 'CRL0030', 'HNDA09', 'MXSA3017'], 1: ['FLAB109', 'FLCK18', 'FLCK216', 'FLMO62', 'FLSA185', 'FLSF47', 'FLSF54', 'FLWO6'], 2: ['BJSB3', 'BJSL25', 'BJVL19'], 3: ['MXED8', 'MXGT4'], 4: ['TXGR3', 'TXMD3'], 5: ['FLBA140', 'FLSF33', 'LALC2', 'SCCU3', 'TXWV2'], 6: ['CUCA4', 'CUSV6', 'CUVN10']}

Kmeans clustering: iter=3, K=7, mincov=0.6, minmap={0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5}
Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 52105
Filtered (minmap): 115143
Filtered (combined): 123642
Sites after filtering: 226272
Sites containing missing values: 198777 (87.85%)
Missing values in SNP matrix: 625543 (10.24%)
Imputation: 'sampled'; (0, 1, 2) = 77.1%, 10.6%, 12.3%
{0: ['FLAB109', 'FLCK18', 'FLCK216', 'FLMO62', 'FLSA185', 'FLSF47', 'FLSF54', 'FLWO6'], 1: ['BZBB1', 'CRL0030', 'HNDA09', 'MXSA3017'], 2: ['BJSB3', 'BJSL25', 'BJVL19'], 3: ['MXED8', 'MXGT4'], 4: ['CUCA4', 'CUSV6', 'CUVN10'], 5: ['FLBA140', 'FLSF33', 'LALC2', 'SCCU3', 'TXWV2'], 6: ['TXGR3', 'TXMD3']}

Kmeans clustering: iter=4, K=7, mincov=0.5, minmap={0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5}
Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 29740
Filtered (minmap): 115143
Filtered (combined): 123207
Sites after filtering: 226707
Sites containing missing values: 199212 (87.87%)
Missing values in SNP matrix: 630388 (10.30%)
Imputation: 'sampled'; (0, 1, 2) = 77.1%, 10.5%, 12.3%
Subsampling SNPs: 29470/226707

Advanced: Save plot to PDF

You can save the figure as a PDF or SVG using a toyplot render function like below. The .plot_and_run() function returns a toyplot Canvas object which you can store as a variable and then save to file.

[13]:

import toyplot.pdf

# save returned values of the plot command
canvas, axes, mark = kmeans_imputation.run_and_plot_2D(0, 2, seed=123)

# pass the canvas object to render function
toyplot.pdf.render(canvas, "PCA-kmeans-7.pdf")

Subsampling SNPs: 29470/226707

Advanced: Missing data per sample

You can view the proportion of missing data per sample by accessing the .missing data table from your pca analysis object. You can see that most samples in this data set had 10% missing data or less, but a few had 20-50% missing data. You can hover your cursor over the plot above to see the sample names. It seems pretty clear that samples with huge amounts of missing data do not stand out at outliers in these plots like they did in the no-imputation plot. Which is great!

[14]:

# .missing is a pandas DataFrame
kmeans_imputation.missing.sort_values(by="missing")

[14]:

	missing
BJSL25	0.03
BJVL19	0.03
FLBA140	0.03
CRL0030	0.04
LALC2	0.04
FLSF54	0.04
CUVN10	0.06
FLAB109	0.06
MXGT4	0.07
MXED8	0.08
CUSV6	0.08
HNDA09	0.08
FLSF33	0.08
BJSB3	0.09
FLSF47	0.09
MXSA3017	0.09
FLMO62	0.10
TXMD3	0.10
FLWO6	0.11
FLCK18	0.11
TXGR3	0.11
BZBB1	0.11
FLCK216	0.11
FLSA185	0.13
CUCA4	0.14
SCCU3	0.23
TXWV2	0.55

Advanced: Accessing component loadings

You can easily access the PCA results and use them in other plotting tools, or with other downstream analyses. As an example, a relatively new popular dimensionality reduction method called t-SNE is increasingly being used as a method for further dimensionality reduction after performing a PCA. This is a manifold learning method that aims to spread points across a plane while preserving the distances between them. See the scikit-learn docs for more details. Using TSNE we can represent the similarity of samples across all of the PCA component loadings, instead of just looking at a few axes pairs at a time.

[15]:

import numpy as np
import pandas as pd

[16]:

# run() returns the components and variances
comps, var = pca.run()

Subsampling SNPs: 29695/228866

[17]:

# view components as a dataframe
df = pd.DataFrame(comps, columns=pca.names)

# show the first few component loadings
df.head()

[17]:

	BJSB3	BJSL25	BJVL19	BZBB1	CRL0030	CUCA4	CUSV6	CUVN10	FLAB109	FLBA140	...	FLWO6	HNDA09	LALC2	MXED8	MXGT4	MXSA3017	SCCU3	TXGR3	TXMD3	TXWV2
0	32.914522	62.209996	-11.425539	-36.160892	-3.748261	-0.645594	1.654549	3.977300	3.315715	0.265890	...	3.070203	3.802013	-1.599840	0.013287	-4.887306	-4.183931	46.987450	2.691919	-9.281167	-5.151435e-13
1	31.103561	58.685620	-10.228913	-32.171740	-3.764746	0.239929	3.220804	0.518112	1.481364	0.510498	...	-1.307076	-1.097346	1.495217	0.650045	1.155887	2.386474	-17.042467	9.084550	44.189804	-6.000755e-14
2	30.894172	60.115790	-11.705797	-32.785239	-3.661601	0.741221	0.114627	2.118460	2.945673	0.741840	...	-2.426063	-1.711741	0.042619	-0.201941	3.086794	2.121274	-33.114363	-11.950209	-33.419024	-4.308776e-13
3	50.973466	-40.960433	4.692376	-2.309443	11.169090	-22.761889	3.450243	-0.029285	2.913106	-1.336157	...	-2.728792	-11.296191	-4.882758	0.509511	-30.870089	42.385129	1.178756	-2.748648	-0.208088	6.520895e-13
4	51.696678	-40.128684	3.270133	-3.148091	8.837426	-14.675287	-0.609130	-0.676379	0.558236	0.349628	...	-1.181166	-21.335528	-12.327858	3.732828	11.260503	-18.968773	-4.344499	41.091110	-9.408211	-3.098632e-13

5 rows × 27 columns

Advanced: TSNE and other dimensionality reduction methods

[18]:

from sklearn.manifold import TSNE

[19]:

# get a list of colors for n populations
colors = toyplot.color.brewer.palette("Spectral", count=len(pca.imap))

# get ordered list of colors for names in pca.names
nidxs = [
    np.where([i in pca.imap[key] for key in pca.imap])[0][0]
    for i in pca.names
]
colors = [toyplot.color.Palette()[i] for i in nidxs]

The results of this plot distinct clusters that capture the species assignments for each individual, and it recovers three broader groups of clusters, which correspond to the three major clades we recover in the phylogeny. This does not represent the fine structure as well as PCA, where for example we pick up the differences between samples in Mexico and Texas, but it does identify the distinct species very clearly (i.e., species delimitation). Try playing with the perplexity value of the model to find a best projection of your data.

[20]:

# init TSNE model object with params (sensitive)
tsne_model = TSNE(
    perplexity=4,
    init='random',
    n_iter=100000,
    random_state=123,
)

# fit TSNE model to the PCA axes
tsne_data = tsne_model.fit_transform(comps)

# draw scatterplot of the 2 tsne dimensions
toyplot.scatterplot(
    tsne_data[:, 0], tsne_data[:, 1],
    width=350,
    height=300,
    size=10,
    xlabel="TSNE component 1",
    ylabel="TSNE component 2",
    color=colors,
    mstyle={"stroke": "black"},
    title=pca.names,
);