Dev -> Master for v1.12.0 release
```
aws s3 cp \\
    --region us-east-1 \\
    --no-sign-request \\
    ${args} \\
    s3://sra-pub-run-odp/sra/${run_accession}/${run_accession} \\
    ${prefix}.sra
```
Nextflow has a native method of downloading from AWS using the SDK; do we think this will be more efficient than using that?
Using Nextflow:
- No worker processes
- No copying files around
- Lower latency
Using a process:
- Can offload downloading to a worker process
- Leaves the Nextflow 'head' process more available
- Could supply faster networking to worker nodes (which may be in a different environment)
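If the native route were taken, the equivalent of `--no-sign-request` for Nextflow's built-in S3 client is anonymous access. A sketch of the `nextflow.config` fragment this would need (region value taken from the command in the diff; this assumes a Nextflow version that supports `aws.client.anonymous`):

```nextflow
// nextflow.config fragment: let the built-in AWS SDK client read the public
// SRA mirror without credentials (the SDK equivalent of --no-sign-request)
aws {
    region = 'us-east-1'
    client {
        anonymous = true   // unsigned requests for public buckets
    }
}
```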
```
# Verify download
if [ ! -f "${prefix}.sra" ]; then
    echo "ERROR: Failed to download ${run_accession} from AWS S3"
    exit 1
fi
```
This shouldn't be necessary; `aws s3 cp` will exit with 1 if it fails.
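The point about exit codes can be illustrated without the AWS CLI at all. Nextflow executes task scripts with `/bin/bash -ue` by default, so a failing download command aborts the script before any manual "verify download" check runs. In this sketch, `false` stands in for a failing `aws s3 cp`:

```shell
# Sketch: `false` stands in for a failing `aws s3 cp ...` download.
# Under `bash -ue` (Nextflow's default task shell) the script aborts at the
# first non-zero exit, so the manual existence check is never reached.
status=0
bash -ue -c '
    false                       # stand-in for: aws s3 cp ... (download fails)
    echo "verify step reached"  # never printed under -e
' || status=$?
echo "simulated task exit status: ${status}"
```

The failed stand-in command propagates its non-zero status to the task, which is exactly why the explicit `[ ! -f ... ]` check is redundant.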
```
    --region us-east-1 \\
    --no-sign-request \\
    ${args} \\
    s3://sra-pub-run-odp/sra/${run_accession}/${run_accession} \\
```
If this is just doing `aws s3 cp`, why not use the file operator in native Nextflow, e.g. `file()`/`fromPath("s3://sra-pub-run-odp/sra/${run_accession}/${run_accession}")`?
Thinking about this, we can probably write a function that returns `meta, file("s3://etc")` and make that into a pseudo-process, then call it exactly the same way. This wouldn't actually copy the file, just a pointer to it, so we would only ever move the file once, which will be infinitely more efficient than using a process (literally!). It would still copy the file to the publishDir via the normal Nextflow mechanisms.
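A rough sketch of what that pseudo-process might look like (illustrative names throughout: `sraPointer`, `ch_accessions`, and the meta-map layout are assumptions; the bucket path is the one from the diff):

```nextflow
// Sketch of the "pseudo-process": a plain function returning the same
// [ meta, file ] tuple shape a download process would emit, but holding a
// pointer to the S3 object rather than a locally copied file.
def sraPointer(meta) {
    def acc = meta.run_accession
    return [ meta, file("s3://sra-pub-run-odp/sra/${acc}/${acc}") ]
}

workflow {
    // ch_accessions is a hypothetical channel of meta maps
    ch_sra = ch_accessions.map { meta -> sraPointer(meta) }
    // Downstream processes consume ch_sra exactly as before; the file is
    // only transferred when a task stages it or publishDir copies it.
}
```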
Per Issue #354, I've taken a stab at adapting my direct AWS download method for S3-mirrored SRA files. For ease of integration, I've so far just specified environment options with the AWS CLI installed, but perhaps it would be more appropriate to use the existing Nextflow AWS integrations instead?
I've tested this on a couple of datasets of various sizes that I'm working with right now, and it seems to be working quite reliably.
I did some very basic benchmarking of the performance differences.

On a study of 28 samples with around 10 Gbp of sequence, the AWS method was around 5 times faster for the download step than the SRA Prefetch approach. The bulk of the time was still taken up by unpacking, though.
PR checklist

Changes:
- Addition of an AWS CLI download method for mirrored SRA files

If you've fixed a bug or added code that should be tested, add tests!
- Tests added (plus nf-test snapshots) for the revised workflow, the new subworkflow, and the added module
- Uses existing test data and approaches from the sratools downloads

Make sure your code lints (`nf-core lint`). There are linting errors in the dev branch, but no new linting errors introduced.

Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`). Tests passing.

Check for unexpected warnings in debug mode (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).

- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).