Dev -> Master for v1.12.0 release
```
aws s3 cp \\
    --region us-east-1 \\
    --no-sign-request \\
    ${args} \\
    s3://sra-pub-run-odp/sra/${run_accession}/${run_accession} \\
    ${prefix}.sra
```
Nextflow has a native method of downloading from AWS using the SDK; do we think this will be more efficient than using that?
Using Nextflow:
- No worker processes
- No copying files around
- Lower latency
Using a process:
- Can offload downloading to a worker process
- Leaves the Nextflow 'head' process more available
- Could supply faster networking to worker nodes (which may be in a different environment)
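If the native route were taken, the equivalent of `--no-sign-request` for Nextflow's built-in S3 client is anonymous access. A sketch of the `nextflow.config` fragment this would need (region value taken from the command in the diff; this assumes a Nextflow version that supports `aws.client.anonymous`):

```nextflow
// nextflow.config fragment: let the built-in AWS SDK client read the public
// SRA mirror without credentials (the SDK equivalent of --no-sign-request)
aws {
    region = 'us-east-1'
    client {
        anonymous = true   // unsigned requests for public buckets
    }
}
```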
```
# Verify download
if [ ! -f "${prefix}.sra" ]; then
    echo "ERROR: Failed to download ${run_accession} from AWS S3"
    exit 1
fi
```
This shouldn't be necessary; `aws s3 cp` will exit with 1 if it fails.
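The point about exit codes can be illustrated without the AWS CLI at all. Nextflow executes task scripts with `/bin/bash -ue` by default, so a failing download command aborts the script before any manual "verify download" check runs. In this sketch, `false` stands in for a failing `aws s3 cp`:

```shell
# Sketch: `false` stands in for a failing `aws s3 cp ...` download.
# Under `bash -ue` (Nextflow's default task shell) the script aborts at the
# first non-zero exit, so the manual existence check is never reached.
status=0
bash -ue -c '
    false                       # stand-in for: aws s3 cp ... (download fails)
    echo "verify step reached"  # never printed under -e
' || status=$?
echo "simulated task exit status: ${status}"
```

The failed stand-in command propagates its non-zero status to the task, which is exactly why the explicit `[ ! -f ... ]` check is redundant.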
```
    --region us-east-1 \\
    --no-sign-request \\
    ${args} \\
    s3://sra-pub-run-odp/sra/${run_accession}/${run_accession} \\
```
If this is just doing `aws s3 cp`, why not use the file operator in native Nextflow, e.g. `file()`/`fromPath("s3://sra-pub-run-odp/sra/${run_accession}/${run_accession}")`?
Thinking about this, we can probably write a function that returns `meta, file("s3://etc")` and make that into a pseudo-process, then call it exactly the same way. This wouldn't actually copy the file, just a pointer to it, so we would only ever move the file once, which will be infinitely more efficient than using a process (literally!). It would still copy the file to the publishDir via the normal Nextflow mechanisms.
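A rough sketch of what that pseudo-process might look like (illustrative names throughout: `sraPointer`, `ch_accessions`, and the meta-map layout are assumptions; the bucket path is the one from the diff):

```nextflow
// Sketch of the "pseudo-process": a plain function returning the same
// [ meta, file ] tuple shape a download process would emit, but holding a
// pointer to the S3 object rather than a locally copied file.
def sraPointer(meta) {
    def acc = meta.run_accession
    return [ meta, file("s3://sra-pub-run-odp/sra/${acc}/${acc}") ]
}

workflow {
    // ch_accessions is a hypothetical channel of meta maps
    ch_sra = ch_accessions.map { meta -> sraPointer(meta) }
    // Downstream processes consume ch_sra exactly as before; the file is
    // only transferred when a task stages it or publishDir copies it.
}
```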
Per Issue #354, I've taken a stab at adapting my direct AWS download method for S3-mirrored SRA files. For ease of integration, I've so far just specified environment options with the AWS CLI installed, but perhaps it would be more appropriate to use the existing Nextflow AWS integrations instead?
I've tested this on a couple of datasets of various sizes that I'm working with right now, and it seems to be working quite reliably.
I did some very basic benchmarking of the performance differences.

On a study of 28 samples with around 10 Gbp of sequence, the AWS method was around 5 times faster for the download step than the SRA Prefetch approach. The bulk of the time was still taken up by unpacking, though.
PR checklist

Changes:
- Addition of an AWS CLI download method for mirrored SRA files

If you've fixed a bug or added code that should be tested, add tests!
- Tests added (plus nf-test snapshots) for the revised workflow, the new subworkflow, and the added module
- Uses existing test data and approaches from the sratools downloads

Make sure your code lints (`nf-core lint`). There are linting errors in the dev branch, but no new linting errors introduced.

Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`). Tests passing.

Check for unexpected warnings in debug mode (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).

- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).