Check Metadata ENA Program
The Check Metadata ENA program runs the main analyses and checks on the metadata tables generated at different points of the workflow. It is exclusive to the ENA Dataset Workflow. The program belongs to the Control Check Programs group, which means that it does not generate output files and only prints information to the command terminal (stdout).
The analyses are divided into four main parts:
- Runs’ Stats. Some relevant default statistics will be calculated for the run accessions: (1) Number of run accessions, (2) Appearances per scientific name and tax id, (3) Appearances per instrument model and instrument platform, (4) Appearances per library layout (PAIRED or SINGLE fastq files), (5) Appearances per library strategy and library source.
- Runs’ Checks. Some checks of interest will also be made for the run accessions: (1) Compare the library layout of each run with the number of fastq files observed in the provided ENA Download Column, (2) Check if the original uploaded fastq files are available, (3) Check for duplicated fastq file names in the original uploaded fastq files (if available).
- Sample Stats. Some relevant statistics will be calculated for the Sample Accession Column and the provided Sample Column (-s parameter): (1) Number of unique samples, (2) Samples per scientific name and tax id, (3) Samples per instrument model and instrument platform, (4) Samples per library layout (PAIRED or SINGLE fastq files), (5) Samples per library strategy and library source, (6) Groups of samples by number of run accessions.
- Sample Checks. Some checks of interest will also be made for the Sample Accession Column and the provided Sample Column (-s parameter): (1) Check if the number of samples equals the number of run accessions, (2) Check if both sample columns contain the same number of samples, (3) Check if there are multiple library strategy and library source combinations per sample, (4) Check if there are multiple scientific name and tax id combinations per sample, (5) Check if there are multiple instrument model and instrument platform combinations per sample, (6) Check if there are multiple library layouts per sample.
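To make the first of the Runs’ Checks more concrete, the following is a minimal Python sketch of how a library layout could be compared with the number of fastq files in the download column. It is not the program's actual implementation. It assumes that links in the download column are separated by `;` (as in ENA file reports), that a link counts as a fastq file when it contains the pattern, and that PAIRED runs are expected to have 2 files and SINGLE runs 1. The function names and the example rows are hypothetical.

```python
import csv
import io


def count_fastq_files(cell, pattern=".fastq.gz"):
    """Count the ';'-separated links in a download cell that contain the pattern."""
    return sum(1 for link in cell.split(";") if pattern in link)


def layout_mismatches(rows, download_column="fastq_ftp", pattern=".fastq.gz"):
    """Return (run_accession, library_layout, n_files) for every run whose file
    count differs from the assumed expectation (PAIRED -> 2, SINGLE -> 1)."""
    expected = {"PAIRED": 2, "SINGLE": 1}  # assumption, not taken from the program
    mismatches = []
    for row in rows:
        n_files = count_fastq_files(row[download_column], pattern)
        if expected.get(row["library_layout"]) != n_files:
            mismatches.append((row["run_accession"], row["library_layout"], n_files))
    return mismatches


# Made-up, tab-separated metadata table used only for illustration.
example_tsv = (
    "run_accession\tlibrary_layout\tfastq_ftp\n"
    "RUN1\tPAIRED\thost/RUN1_1.fastq.gz;host/RUN1_2.fastq.gz\n"
    "RUN2\tPAIRED\thost/RUN2.fastq.gz\n"
    "RUN3\tSINGLE\thost/RUN3.fastq.gz\n"
)
rows = list(csv.DictReader(io.StringIO(example_tsv), delimiter="\t"))
print(layout_mismatches(rows))  # [('RUN2', 'PAIRED', 1)]
```

In this sketch only RUN2 is reported, because it is declared as PAIRED but lists a single fastq file.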
If warnings are detected during the various checks, advice messages will be displayed indicating what could be the reasons for concern and what should be done.
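As an illustration of the "multiple combinations per sample" checks and the warnings they can trigger, here is a small Python sketch. It is an assumption about how such a check could work, not the program's own code: the helper name, the example rows, and the wording of the warning are all hypothetical.

```python
from collections import defaultdict


def samples_with_multiple(rows, sample_column, value_columns):
    """Return {sample: sorted combinations} for samples whose runs show more
    than one distinct combination of the given columns."""
    combos = defaultdict(set)
    for row in rows:
        combos[row[sample_column]].add(tuple(row[c] for c in value_columns))
    return {s: sorted(v) for s, v in combos.items() if len(v) > 1}


# Made-up rows: sample S1 has one PAIRED run and one SINGLE run.
rows = [
    {"run_accession": "RUN1", "sample_alias": "S1", "library_layout": "PAIRED"},
    {"run_accession": "RUN2", "sample_alias": "S1", "library_layout": "SINGLE"},
    {"run_accession": "RUN3", "sample_alias": "S2", "library_layout": "PAIRED"},
]

flagged = samples_with_multiple(rows, "sample_alias", ["library_layout"])
for sample, combos in flagged.items():
    # Hypothetical message; the real program prints its own advice text.
    print(f"WARNING: sample {sample} has multiple library layouts: {combos}")
```

The same helper covers the other per-sample checks by passing column pairs, for example `["library_strategy", "library_source"]`.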
Furthermore, there is an optional extra function:
- Extra Columns Stats. When used, extra appearance statistics will be calculated for each column name provided by the user (-e parameter): (1) Number of run accessions, (2) Number of samples in the Sample Accession Column, (3) Number of samples in the provided Sample Column (-s parameter). This function is useful for obtaining statistics on relevant variables that are not analyzed by default.
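The three counts described for Extra Columns Stats can be sketched in Python as follows. This is only an approximation under stated assumptions: the Sample Accession Column is assumed to be named `sample_accession`, the function name is hypothetical, and the example rows (including the `tissue` values) are made up.

```python
from collections import defaultdict


def extra_column_stats(rows, column, sample_column="sample_alias"):
    """For each value of `column`, return a tuple with (number of runs,
    unique sample accessions, unique values in the provided sample column)."""
    runs = defaultdict(int)
    accessions = defaultdict(set)
    samples = defaultdict(set)
    for row in rows:
        value = row[column]
        runs[value] += 1
        accessions[value].add(row["sample_accession"])
        samples[value].add(row[sample_column])
    return {v: (runs[v], len(accessions[v]), len(samples[v])) for v in runs}


# Made-up rows for illustration only.
rows = [
    {"run_accession": "RUN1", "sample_accession": "SAMP1", "sample_alias": "S1", "tissue": "liver"},
    {"run_accession": "RUN2", "sample_accession": "SAMP1", "sample_alias": "S1", "tissue": "liver"},
    {"run_accession": "RUN3", "sample_accession": "SAMP2", "sample_alias": "S2", "tissue": "liver"},
    {"run_accession": "RUN4", "sample_accession": "SAMP3", "sample_alias": "S3", "tissue": "brain"},
]

stats = extra_column_stats(rows, "tissue")
print(stats)  # {'liver': (3, 2, 2), 'brain': (1, 1, 1)}
```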
Input Elements:
| Input | Type | Description |
|---|---|---|
| PROJECT_metadata.tsv | File | Metadata Table. One of the Metadata Tables generated in the different steps of the workflow by Download Metadata ENA program (PROJECT_ENA_metadata.tsv), Merge Metadata program (PROJECT_merged_metadata.tsv) or Filter Metadata program (PROJECT_filtered_metadata.tsv) |
Output Elements:
| Output | Type | Description |
|---|---|---|
| Analysis and Checks by Runs and Samples | stdout | Results of the different analyses and checks |
Usage:
check_metadata_ENA [-h] -t METADATA_TABLE
[-c {fastq_ftp,fastq_aspera,fastq_galaxy,submitted_ftp,submitted_aspera,submitted_galaxy}]
[-p FASTQ_PATTERN] [-s SAMPLE_COLUMN]
[-e EXTRA_COLUMNS_STATS [EXTRA_COLUMNS_STATS ...]]
[-x] [-v]
Options:
| Parameter | Description |
|---|---|
| -h, --help | Show help message and exit. |
| -t, --metadata_table | Metadata Table [Expected sep=TABS]. Indicate the path to the Metadata Table file. |
| -c, --ena_download_column | ENA Download Column (Optional) [Default:fastq_ftp]. Indicate the ENA Metadata Table column with the download links. Permitted options are {fastq_ftp,fastq_aspera,fastq_galaxy,submitted_ftp,submitted_aspera,submitted_galaxy}. |
| -p, --fastq_pattern | Fastq File Pattern (Optional) [Default:".fastq.gz"]. Indicate the pattern to identify Fastq files. |
| -s, --sample_column | Sample Column (Optional) [Default:sample_alias]. Indicate the Metadata Table column to be used as samples. |
| -e, --extra_columns_stats | Extra Columns Stats (Optional). Indicate the names for the extra columns to check appearances separated by spaces (If a column name has spaces, quote it). |
| -x, --plain_text | Plain Text Mode (Optional). If indicated, it will enable Plain Text mode, and text will appear without colors. |
| -v, --version | Show program's version number and exit. |
Commands:
- Check metadata with colored text stdout:
check_metadata_ENA -t PRJEB10949_ENA_metadata.tsv
- Check metadata with plain text stdout:
check_metadata_ENA -t PRJEB10949_ENA_metadata.tsv --plain_text
- Check metadata using "sample_title" instead of "sample_alias" (-s parameter):
check_metadata_ENA -t PRJEB10949_ENA_metadata.tsv -s sample_title
- Check metadata and get extra stats for "tissue" and "common name" columns:
check_metadata_ENA -t PRJEB10949_ENA_metadata.tsv -e tissue "common name"
- Check metadata using "submitted_ftp" instead of the default "fastq_ftp" as ENA Download Column:
check_metadata_ENA -t PRJEB10949_ENA_metadata.tsv -c submitted_ftp
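Because the program reports only through stdout, keeping a copy of the results means capturing that stream. The Python sketch below shows one possible way to do so; it is not part of the program. The helper names `build_command` and `save_report` are hypothetical, only the options documented above are used, and `save_report` assumes that `check_metadata_ENA` is available on the PATH. Plain Text mode is always enabled here so that the saved report is free of color codes.

```python
import subprocess


def build_command(metadata_table, download_column=None, sample_column=None,
                  extra_columns=()):
    """Build the argument list for check_metadata_ENA in Plain Text mode."""
    cmd = ["check_metadata_ENA", "-t", metadata_table, "--plain_text"]
    if download_column:
        cmd += ["-c", download_column]
    if sample_column:
        cmd += ["-s", sample_column]
    if extra_columns:
        # Each column name is a separate list item, so names containing
        # spaces need no shell quoting. Kept last because -e takes many values.
        cmd += ["-e", *extra_columns]
    return cmd


def save_report(metadata_table, report_path, **options):
    """Run the checks and write the captured stdout to report_path."""
    result = subprocess.run(build_command(metadata_table, **options),
                            capture_output=True, text=True, check=True)
    with open(report_path, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)


print(build_command("PRJEB10949_ENA_metadata.tsv",
                    extra_columns=["tissue", "common name"]))
```

A call such as `save_report("PRJEB10949_ENA_metadata.tsv", "report.txt", sample_column="sample_title")` would then store the output in a text file, where `report.txt` is just an example name.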
To see a full and detailed example of dataset curation, see the Tutorial Full Example page.