Check Metadata Values Program
The Check Metadata Values program runs a series of analyses and checks on the values of the Curated Metadata Table, using the information in the provided Variables Dictionary file. This program belongs to the Check Programs group, which means that it does not generate output files; it only prints information to the command terminal (stdout).
The analyses are divided into five parts, each carried out for every provided variable:
- Requiredness Check. The program checks whether the variable is present in the Curated Metadata Table, according to whether it is indicated as a required or an optional variable. If the program does not find the variable in the metadata table, it will not perform the rest of the analysis.
- Class Type Check. The program checks the dtype of the variable in the Curated Metadata Table against its class type: the values must be strings or booleans if it is indicated as a character variable, or numerical values if it is indicated as a numeric variable.
- Uniqueness Within Variable Check. The program checks the variable in the Curated Metadata Table according to whether it is indicated as a unique or a nonunique variable. This is especially useful for finding duplicate values in variables that should have unique values, such as sample names.
- Uniqueness Between Variables Check. The program checks for multiple matches when comparing the variable against a set of variables of interest in the Curated Metadata Table. This is especially useful for detecting inconsistencies related to duplicates, for example identical individual identifiers in different datasets.
- Allowed Values Check. The program checks that the values of the variable in the Curated Metadata Table are within the allowed parameters provided in the Variables Dictionary. This is particularly useful for finding general inconsistencies, using any of the available analyses (any, subset, wholeset, range).
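As a rough illustration of three of the checks above, here is a minimal Python sketch. It is not the program's actual implementation: the function names (`find_duplicates`, `find_multiple_matches`, `check_allowed_values`), the example rows, and the reading of "multiple matches" as "one value paired with more than one combination of the other variables" are all assumptions made for illustration. NA handling is left out.

```python
from collections import Counter, defaultdict


def find_duplicates(values):
    """Uniqueness within variable: return the values that occur more than once."""
    counts = Counter(values)
    return sorted(value for value, n in counts.items() if n > 1)


def find_multiple_matches(rows, variable, other_variables):
    """Uniqueness between variables (assumed reading): report each value of
    `variable` that is paired with more than one combination of values
    in `other_variables`."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[variable]].add(tuple(row[name] for name in other_variables))
    return {value: sorted(combos) for value, combos in seen.items() if len(combos) > 1}


def check_allowed_values(values, treatment, allowed=None):
    """Allowed values: return unexpected and missing values for a treatment."""
    observed = set(values)
    if treatment == "any":
        return {"unexpected": [], "missing": []}
    if treatment == "subset":
        # Every observed value must be one of the allowed values.
        return {"unexpected": sorted(observed - set(allowed)), "missing": []}
    if treatment == "wholeset":
        # Observed and allowed values must match exactly.
        return {
            "unexpected": sorted(observed - set(allowed)),
            "missing": sorted(set(allowed) - observed),
        }
    if treatment == "range":
        # `allowed` is a [min, max] pair; both bounds are treated as inclusive here.
        low, high = allowed
        return {"unexpected": sorted(v for v in observed if not low <= v <= high), "missing": []}
    raise ValueError(f"unknown treatment: {treatment!r}")


# Hypothetical metadata rows, for illustration only.
rows = [
    {"sample_id": "S1", "individual_id": "I1", "dataset": "A", "sex": "female", "age": 34},
    {"sample_id": "S2", "individual_id": "I1", "dataset": "B", "sex": "male", "age": 130},
    {"sample_id": "S2", "individual_id": "I2", "dataset": "B", "sex": "unknown", "age": 51},
]

duplicates = find_duplicates([r["sample_id"] for r in rows])
matches = find_multiple_matches(rows, "individual_id", ["dataset"])
sex_report = check_allowed_values([r["sex"] for r in rows], "subset", ["female", "male"])
age_report = check_allowed_values([r["age"] for r in rows], "range", [0, 120])

print(duplicates)   # ['S2']
print(matches)      # {'I1': [('A',), ('B',)]}
print(sex_report)   # {'unexpected': ['unknown'], 'missing': []}
print(age_report)   # {'unexpected': [130], 'missing': []}
```

In this sketch the range treatment treats both bounds as inclusive; whether the program does the same is not stated here.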
The Variables Dictionary file must contain the following columns of interest for this particular program:
- Variable. Indicates the names of the final columns of the curated metadata table, which are used as the reference universe of possible variables. This column must be indicated as "variable" in the table header. The program will verify that all variables (column names) in the curated metadata tables are present in the Variables Dictionary.
- Requiredness. Indicates the requiredness nature of the provided variables. This column must be indicated as "requiredness" in the table header. Valid options are "required" (a variable that must always be present and cannot contain NAs) or "optional" (a variable that may be absent). The program will verify that all required variables are present in the curated metadata tables before performing any analysis.
- Class Type. Indicates the type of the provided variables. This column must be indicated as "class_type" in the table header. Valid options are "character" (a variable that contains string or boolean values) or "numeric" (a variable that contains numerical values).
- Uniqueness Within Variable. Indicates the uniqueness within variable nature of the provided variables. This column must be indicated as "uniqueness_within_variable" in the table header. Valid options are "unique" (a variable that must contain only unique values, without duplicates) or "nonunique" (a variable that may, but does not have to, contain duplicate values).
- Check Uniqueness Between Variables. Indicates whether the uniqueness between variables should be checked for the provided variables. This column must be indicated as "check_uniqueness_between_variables" in the table header. Valid options are "yes" or "no".
- Variables for Check Uniqueness Between. Indicates the set of variables of interest to be used to check the uniqueness between variables for the provided variables. This column must be indicated as "variables_for_uniqueness_between" in the table header. Valid options are the "none" string (indicating that no set of variables of interest is provided) or a non-empty list of variables formatted as a Python list (only variables present in the "variable" column, other than the variable being evaluated, are allowed).
- NAs Treatment. Indicates how not available (NA) values are treated when performing the Allowed Values Check for the provided variables. This column must be indicated as "NAs_allowed" in the table header. Valid options are "yes" (indicating that NAs are allowed values) or "no" (indicating that NAs are not allowed values).
- Allowed Values Treatment. Indicates the type of analysis to be carried out when performing the Allowed Values Check for the provided variables. This column must be indicated as "allowed_values_treatment" in the table header. Valid options are "any" (any value is allowed), "subset" (the program checks that the values of the variable in the metadata table are present in the allowed values provided in the Variables Dictionary), "wholeset" (the program checks that all the provided allowed values are present in the variable of the metadata table and that there are no unexpected extra elements), or "range" (for numerical variables only, the program checks that the values are within the provided [min,max] numerical range).
- Allowed Values. Indicates the set of allowed values for the provided variables. This column must be indicated as "allowed_values" in the table header. Valid options are the "any" string or a non-empty list formatted as a Python list. When the range treatment is used, the list must be a numerical Python list [min,max] of length 2.
For instance, see the variables_dictionary_example.tsv test file.
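To make the column rules concrete, the following Python sketch builds a small in-memory Variables Dictionary and validates it against the options listed above. The column names and valid options come from this page; the example rows (`sample_id`, `sex`, `age`), the helper names (`parse_list_field`, `load_dictionary`) and the error handling are hypothetical and do not reflect how the program itself reads the file.

```python
import ast
import csv
import io

# Valid options per column, as listed in the column descriptions.
VALID_OPTIONS = {
    "requiredness": {"required", "optional"},
    "class_type": {"character", "numeric"},
    "uniqueness_within_variable": {"unique", "nonunique"},
    "check_uniqueness_between_variables": {"yes", "no"},
    "NAs_allowed": {"yes", "no"},
    "allowed_values_treatment": {"any", "subset", "wholeset", "range"},
}

HEADER = [
    "variable", "requiredness", "class_type", "uniqueness_within_variable",
    "check_uniqueness_between_variables", "variables_for_uniqueness_between",
    "NAs_allowed", "allowed_values_treatment", "allowed_values",
]

# Hypothetical rows, for illustration only.
ROWS = [
    ["sample_id", "required", "character", "unique", "no", "none", "no", "any", "any"],
    ["sex", "optional", "character", "nonunique", "no", "none", "yes", "subset", "['female', 'male']"],
    ["age", "optional", "numeric", "nonunique", "no", "none", "yes", "range", "[0, 120]"],
]

EXAMPLE_TSV = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS)


def parse_list_field(text, keyword):
    """Return None for the keyword ("none"/"any"), else a non-empty Python list."""
    if text == keyword:
        return None
    value = ast.literal_eval(text)  # safely evaluates a Python literal
    if not isinstance(value, list) or not value:
        raise ValueError(f"expected {keyword!r} or a non-empty list, got {text!r}")
    return value


def load_dictionary(handle):
    """Read a tab-separated Variables Dictionary and validate each row."""
    entries = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        for column, options in VALID_OPTIONS.items():
            if row[column] not in options:
                raise ValueError(f"{row['variable']}: invalid {column} {row[column]!r}")
        row["variables_for_uniqueness_between"] = parse_list_field(
            row["variables_for_uniqueness_between"], "none")
        row["allowed_values"] = parse_list_field(row["allowed_values"], "any")
        if row["allowed_values_treatment"] == "range":
            allowed = row["allowed_values"]
            if allowed is None or len(allowed) != 2:
                raise ValueError(f"{row['variable']}: range needs a [min,max] list")
        entries[row["variable"]] = row
    return entries


dictionary = load_dictionary(io.StringIO(EXAMPLE_TSV))
print(dictionary["age"]["allowed_values"])  # [0, 120]
```

`ast.literal_eval` is used here because it parses Python literals without executing arbitrary code, which suits list-formatted cells in a TSV file.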
Input Elements:
| Input | Type | Description |
|---|---|---|
| curated_metadata_table.tsv | File | Curated Metadata Table |
| variables_dictionary_file.tsv | File | Variables Dictionary |
Output Elements:
| Output | Type | Description |
|---|---|---|
| Analysis and Checks | stdout | Results of the different analyses and checks |
Usage:
check_metadata_values [-h] -t METADATA_TABLE -d VARIABLES_DICTIONARY [-x] [-v]
Options:
| Parameter | Description |
|---|---|
| -h, --help | Show help message and exit. |
| -t, --metadata_table | Curated Metadata Table [Expected sep=TABS]. Indicate the path to the Curated Metadata Table file. |
| -d, --variables_dictionary | Variables Dictionary [Expected sep=TABS]. Indicate the path to the Variables Dictionary file. |
| -x, --plain_text | Plain Text Mode (Optional). If indicated, it enables Plain Text mode, and text appears without colors. |
| -v, --version | Show program's version number and exit. |
Commands:
- Check Metadata Values with colored text stdout:
check_metadata_values -t curated_metadata_table.tsv -d variables_dictionary_example.tsv
- Check Metadata Values with plain text stdout:
check_metadata_values -t curated_metadata_table.tsv -d variables_dictionary_example.tsv --plain_text
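Because the program reports only through stdout, a script that wants to keep the report has to capture that stream. The sketch below shows one way to do so from Python, using only the flags documented above. The helper names `build_command` and `run_check` are hypothetical, and the sketch assumes that `check_metadata_values` is available on the `PATH`. It makes no assumption about the program's exit code.

```python
import shlex
import subprocess


def build_command(metadata_table, variables_dictionary, plain_text=False):
    """Assemble the argument list from the documented options."""
    command = [
        "check_metadata_values",
        "-t", str(metadata_table),
        "-d", str(variables_dictionary),
    ]
    if plain_text:
        command.append("--plain_text")
    return command


def run_check(metadata_table, variables_dictionary):
    """Run the program in Plain Text mode and return whatever it printed to stdout."""
    completed = subprocess.run(
        build_command(metadata_table, variables_dictionary, plain_text=True),
        capture_output=True,
        text=True,
        check=False,  # the exit code is not interpreted here
    )
    return completed.stdout


command = build_command(
    "curated_metadata_table.tsv", "variables_dictionary_example.tsv", plain_text=True
)
print(shlex.join(command))
```

Calling `run_check("curated_metadata_table.tsv", "variables_dictionary_example.tsv")` would then return the report as a string. Plain Text mode is used so that the captured text appears without colors.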
To see a full and detailed example of dataset curation, see the Tutorial Full Example page.