ipyrad-analysis toolkit: sratools

For reproducibility, it helps to be able to download the raw data for your analysis from an online repository such as NCBI with a short script at the top of your notebook. We’ve written a simple wrapper around the sratools command-line program (which is notoriously difficult to use and poorly documented) to make this easier.

Required software

[1]:
# conda install ipyrad -c bioconda
# conda install sra-tools -c bioconda
[2]:
import ipyrad.analysis as ipa

Fetch info for a published data set by its accession ID

You can find the study ID or individual sample IDs in published papers or by searching NCBI or related databases. ipyrad takes as input one or more accession IDs for individual Runs or Studies (SRR or SRP, and similarly ERR or ERP, etc.).
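Before passing IDs to the wrapper, it can help to check whether each one looks like a Run or a Study accession. The sketch below is not part of ipyrad: `classify_accession` is a hypothetical helper, and the pattern assumes the common three-letter prefixes (SRR/ERR/DRR for runs, SRP/ERP/DRP for studies) followed by digits.

```python
import re

# Assumed pattern: S/E/D (archive) + R + R (run) or P (study), then digits.
ACCESSION_RE = re.compile(r"^([SED]R)([RP])(\d+)$")


def classify_accession(accession):
    """Return 'run' or 'study' for an SRA-style accession, else raise ValueError."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"unrecognized accession: {accession!r}")
    return "run" if match.group(2) == "R" else "study"


# The first two IDs come from this notebook; the ERR ID is made up for illustration.
for acc in ["SRP065788", "SRR2895732", "ERR000001"]:
    print(acc, classify_accession(acc))
```

A check like this only catches malformed strings; it does not confirm that the accession exists in the database.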

[3]:
# init sratools object with an accessions argument
sra = ipa.sratools(accessions="SRP065788")
[4]:
# fetch info for all samples from this study, save as a dataframe
stable = sra.fetch_runinfo()

Fetching project data...
[5]:
# the dataframe has all information about this study
stable.head()
[5]:
Run ReleaseDate LoadDate spots bases spots_with_mates avgLength size_MB AssemblyName download_path ... SRAStudy BioProject Study_Pubmed_id ProjectID Sample BioSample SampleType TaxID ScientificName SampleName
0 SRR2895732 2015-11-04 15:50:01 2015-11-04 17:19:15 2009174 182834834 0 91 116 NaN https://sra-download.ncbi.nlm.nih.gov/sos/sra-... ... SRP065788 PRJNA299402 NaN 299402 SRS1146158 SAMN04202163 simple 224736 Viburnum betulifolium Lib1_betulifolium
1 SRR2895743 2015-11-04 15:50:01 2015-11-04 17:18:35 2452970 223220270 0 91 140 NaN https://sra-download.ncbi.nlm.nih.gov/sos/sra-... ... SRP065788 PRJNA299402 NaN 299402 SRS1146171 SAMN04202164 simple 1220044 Viburnum bitchiuense Lib1_bitchiuense_combined
2 SRR2895755 2015-11-04 15:50:01 2015-11-04 17:18:46 4640732 422306612 0 91 264 NaN https://sra-download.ncbi.nlm.nih.gov/sos/sra-... ... SRP065788 PRJNA299402 NaN 299402 SRS1146182 SAMN04202165 simple 237927 Viburnum carlesii Lib1_carlesii_D1_BP_001
3 SRR2895756 2015-11-04 15:50:01 2015-11-04 17:20:18 3719383 338463853 0 91 214 NaN https://sra-download.ncbi.nlm.nih.gov/sos/sra-... ... SRP065788 PRJNA299402 NaN 299402 SRS1146183 SAMN04202166 simple 237928 Viburnum cinnamomifolium Lib1_cinnamomifolium_PWS2105X
4 SRR2895757 2015-11-04 15:50:01 2015-11-04 17:20:06 3745852 340872532 0 91 213 NaN https://sra-download.ncbi.nlm.nih.gov/sos/sra-... ... SRP065788 PRJNA299402 NaN 299402 SRS1146181 SAMN04202167 simple 237929 Viburnum clemensae Lib1_clemensiae_DRY6_PWS_2135

5 rows × 30 columns
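The run info is also useful for a quick sanity check before downloading anything. The sketch below copies the `spots`, `bases`, and `size_MB` values of the five rows shown above into plain Python and sums the sizes. It assumes `size_MB` reports the size of the archived run, so the extracted fastq files will generally be larger.

```python
# Values copied from the first five rows of the run-info table above.
runs = [
    {"Run": "SRR2895732", "spots": 2009174, "bases": 182834834, "size_MB": 116},
    {"Run": "SRR2895743", "spots": 2452970, "bases": 223220270, "size_MB": 140},
    {"Run": "SRR2895755", "spots": 4640732, "bases": 422306612, "size_MB": 264},
    {"Run": "SRR2895756", "spots": 3719383, "bases": 338463853, "size_MB": 214},
    {"Run": "SRR2895757", "spots": 3745852, "bases": 340872532, "size_MB": 213},
]

# Total archive size for these runs (assumption: size_MB is the archive size).
total_mb = sum(r["size_MB"] for r in runs)
print(f"{len(runs)} runs, about {total_mb} MB to fetch")

# Mean read length per run; this should agree with the avgLength column (91).
for r in runs:
    print(r["Run"], r["bases"] / r["spots"])
```

With the full dataframe the same total could be taken from the `size_MB` column directly.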

File names

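You can choose which columns to use for file names by their index number. The cell below shows three candidates: Run, ScientificName, and SampleName, at 0-based positions 0, 28, and 29.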

[8]:
stable.iloc[:5, [0, 28, 29]]
[8]:
Run ScientificName SampleName
0 SRR2895732 Viburnum betulifolium Lib1_betulifolium
1 SRR2895743 Viburnum bitchiuense Lib1_bitchiuense_combined
2 SRR2895755 Viburnum carlesii Lib1_carlesii_D1_BP_001
3 SRR2895756 Viburnum cinnamomifolium Lib1_cinnamomifolium_PWS2105X
4 SRR2895757 Viburnum clemensae Lib1_clemensiae_DRY6_PWS_2135

Download the data

From an sratools object you can fetch just the info, or you can download the files as well. Here we call .run() to download the data into a designated workdir. There are arguments for how to name the files according to name fields in the fetch_runinfo table. The accessions argument here is a list of the first five SRR sample IDs in the table above.

[10]:
# select first 5 samples
list_of_srrs = stable.Run[:5]
list_of_srrs
[10]:
0    SRR2895732
1    SRR2895743
2    SRR2895755
3    SRR2895756
4    SRR2895757
Name: Run, dtype: object
[11]:
# new sra object
sra2 = ipa.sratools(accessions=list_of_srrs, workdir="downloaded")

# call download (run) function; name_fields appear to be 1-indexed,
# so (1,30) selects the Run and SampleName columns
sra2.run(auto=True, name_fields=(1,30))
Parallel connection | oud: 4 cores
[####################] 100% 0:02:07 | downloading/extracting fastq data

5 fastq files downloaded to /home/deren/Documents/ipyrad/newdocs/cookbook/downloaded
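To make the effect of `name_fields` concrete, here is a minimal sketch of how a file name could be assembled from the selected columns. It is not the ipyrad implementation: `build_fastq_name` is a hypothetical helper, the 1-based indexing is inferred from this notebook's example, and replacing spaces with underscores is a choice made here for illustration.

```python
def build_fastq_name(row, name_fields, sep="_"):
    """Join the selected fields of one run-info row into a fastq file name.

    name_fields is treated as 1-indexed (an assumption inferred from the
    example in this notebook, not taken from the ipyrad source).
    """
    parts = [str(row[i - 1]) for i in name_fields]
    # Replace spaces so values such as species names give safe file names.
    return sep.join(parts).replace(" ", "_") + ".fastq"


# Stand-in for the first table row: 30 columns, with only the three columns
# shown in this notebook filled in (0-based positions 0, 28, and 29).
row = ["SRR2895732"] + [None] * 27 + ["Viburnum betulifolium", "Lib1_betulifolium"]

print(build_fastq_name(row, (1, 30)))  # Run + SampleName
print(build_fastq_name(row, (1, 29)))  # Run + ScientificName
```

Under these assumptions, `(1, 30)` combines the Run ID with the SampleName, while `(1, 29)` would use the ScientificName instead.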

Check the data files

You can see that the files were named according to the Run (SRR) ID and the SampleName in the table. The intermediate .sra files were removed and only the fastq files were saved.

[12]:
! ls -l downloaded
total 6174784
-rw-rw-r-- 1 deren deren 1372440058 Aug 17 16:36 SRR2895732_Lib1_betulifolium.fastq
-rw-rw-r-- 1 deren deren 1422226640 Aug 17 16:36 SRR2895743_Lib1_bitchiuense_combined.fastq
-rw-rw-r-- 1 deren deren  759216310 Aug 17 16:37 SRR2895755_Lib1_carlesii_D1_BP_001.fastq
-rw-rw-r-- 1 deren deren 1812215534 Aug 17 16:36 SRR2895756_Lib1_cinnamomifolium_PWS2105X.fastq
-rw-rw-r-- 1 deren deren  956848184 Aug 17 16:36 SRR2895757_Lib1_clemensiae_DRY6_PWS_2135.fastq
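Beyond listing the files, you may want to count the reads in each one. The sketch below is plain Python, independent of ipyrad: `count_fastq_reads` and `summarize_fastqs` are hypothetical helpers, and they assume uncompressed fastq files with exactly four lines per record. The demo runs on a tiny made-up file so it works without the real downloads.

```python
import tempfile
from pathlib import Path


def count_fastq_reads(path):
    """Count records in an uncompressed fastq file (assumes 4 lines per record)."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def summarize_fastqs(workdir):
    """Map each *.fastq file name in workdir to its read count."""
    return {p.name: count_fastq_reads(p) for p in sorted(Path(workdir).glob("*.fastq"))}


# Demo on a tiny made-up file with two records.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fastq"
    demo.write_text("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
    demo_counts = summarize_fastqs(tmp)
    print(demo_counts)
```

On the real data, `summarize_fastqs("downloaded")` could be compared against the `spots` column of the run-info table, though whether the counts match exactly depends on how the reads were extracted.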