refscan

refscan is a command-line tool people can use to scan the NMDC MongoDB database for referential integrity violations.

%% This is the source code of a Mermaid diagram, which GitHub will render as a diagram.
%% Note: PyPI does not render Mermaid diagrams, and instead displays their source code.
%%       Reference: https://github.com/pypi/warehouse/issues/13083
graph LR
    schema[LinkML<br>schema]
    database[(MongoDB<br>database)]
    script[["refscan"]]
    violations["List of<br>violations"]
    references["List of<br>references"]:::dashed_border
    schema --> script
    database --> script
    script -.-> references
    script --> violations
    
    classDef dashed_border stroke-dasharray: 5 5

Besides scanning for violations, refscan can generate graphs (diagrams) that show which collections' documents (or which classes' instances) can, according to the schema, contain references to which other collections' documents (or classes' instances).

How it works

Here is a summary of how each of refscan's main functions works under the hood.

Scan

refscan does this in two stages:

  1. It uses the LinkML schema to determine where references can exist in a MongoDB database that conforms to the schema.

    Example: The schema might say that, if a document in the biosample_set collection has a field named associated_studies, that field must contain a list of ids of documents in the study_set collection.

  2. It scans the MongoDB database to check the integrity of all the references that do exist.

    Example: For each document in the biosample_set collection that has a field named associated_studies, for each value in that field, confirm there is a document having that id in the study_set collection.
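The two stages above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not refscan's actual implementation: the `REFERENCE_RULES` table, the `find_violations` function, and the in-memory `database` dictionary are hypothetical stand-ins for what refscan derives from the LinkML schema and reads from MongoDB.

```python
# Hypothetical stand-in for stage 1: where the schema says references can exist.
# Each rule is (source collection, field name, allowed target collections).
REFERENCE_RULES = [
    ("biosample_set", "associated_studies", ["study_set"]),
]


def find_violations(database, rules):
    """Stage 2: return one (collection, document id, field, referenced id) tuple per dangling reference."""
    violations = []
    for source_collection, field, target_collections in rules:
        # Gather the ids of every document a reference in this field may point to.
        target_ids = {
            doc["id"]
            for name in target_collections
            for doc in database.get(name, [])
        }
        for doc in database.get(source_collection, []):
            if field not in doc:
                continue  # Documents lacking the field contain no references to check.
            values = doc[field]
            # A field may hold a single id or a list of ids.
            if not isinstance(values, list):
                values = [values]
            for referenced_id in values:
                if referenced_id not in target_ids:
                    violations.append((source_collection, doc["id"], field, referenced_id))
    return violations


# Toy in-memory "database" (an assumption for illustration): collection name -> documents.
database = {
    "study_set": [{"id": "study-1"}],
    "biosample_set": [
        {"id": "biosample-1", "associated_studies": ["study-1"]},
        {"id": "biosample-2", "associated_studies": ["study-1", "study-2"]},
    ],
}

violations = find_violations(database, REFERENCE_RULES)
print(violations)
```

In this toy data, only the reference from biosample-2 to the nonexistent study-2 is reported.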

Graph

refscan does this in three stages:

  1. It uses the LinkML schema to determine where references can exist in a MongoDB database that conforms to the schema.
  2. It formats that list of references into a data structure compatible with Cytoscape.js.
  3. It outputs an HTML document that uses Cytoscape.js to visualize that data structure as a graph.
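Stage 2 above can be sketched as follows. Cytoscape.js accepts a list of elements in which each node is an object with a `data.id` and each edge is an object with `data.source` and `data.target`. The `to_cytoscape_elements` function and the shape of the `references` input are hypothetical; refscan's own data structures may differ.

```python
import json


def to_cytoscape_elements(references):
    """Convert (source, target) name pairs into a list of Cytoscape.js element dicts."""
    # One node per distinct collection (or class) name.
    node_names = sorted({name for pair in references for name in pair})
    nodes = [{"data": {"id": name}} for name in node_names]
    # One edge per reference, pointing from the referring side to the referenced side.
    edges = [
        {"data": {"id": f"{source}->{target}", "source": source, "target": target}}
        for source, target in references
    ]
    return nodes + edges


# Hypothetical input: documents in biosample_set can reference documents in study_set.
references = [("biosample_set", "study_set")]
elements = to_cytoscape_elements(references)
print(json.dumps(elements, indent=2))
```

For stage 3, an HTML page could embed that JSON and pass it to Cytoscape.js as the `elements` option.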

Assumptions

refscan was designed under the assumption that every document in every collection described by the schema has a field named type, whose value is the class_uri of the schema class the document represents an instance of. refscan uses that class_uri value (in that type field) to determine the name of that schema class, whose definition refscan then uses to determine which fields of that document can contain references.
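The lookup described above might look like the following sketch. The mapping contents (including the `nmdc:Biosample` style of `class_uri` values) and the `get_class_name` function are assumptions made for illustration; they are not taken from refscan's source code.

```python
# Hypothetical excerpt of what a schema provides: class name -> class_uri.
CLASS_URI_BY_CLASS_NAME = {
    "Biosample": "nmdc:Biosample",
    "Study": "nmdc:Study",
}

# Invert it so that a document's `type` value can be looked up directly.
CLASS_NAME_BY_CLASS_URI = {uri: name for name, uri in CLASS_URI_BY_CLASS_NAME.items()}


def get_class_name(document):
    """Return the name of the schema class the document represents, or None if unknown."""
    return CLASS_NAME_BY_CLASS_URI.get(document.get("type"))


print(get_class_name({"id": "biosample-1", "type": "nmdc:Biosample"}))
print(get_class_name({"id": "mystery-1"}))  # No `type` field, so the class is unknown.
```

A document that lacks a `type` field violates the assumption, so its class (and therefore its reference-bearing fields) cannot be determined this way.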

Usage

Install

Assuming you have pipx installed, you can install the tool by running the following command:

pipx install refscan

pipx is a tool people can use to download and install Python scripts that are hosted on PyPI. You can install pipx by running $ python -m pip install pipx.

Run

Once installed, you can display the tool's --help snippet by running:

refscan --help

At the time of this writing, the tool's --help snippet is:

 Usage: refscan [OPTIONS] COMMAND [ARGS]...

╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                            │
╰────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────╮
│ version   Show version number and exit.                                                │
│ scan      Scan the NMDC MongoDB database for referential integrity violations.         │
│ graph     Generate an interactive graph of the references described by a schema.       │
╰────────────────────────────────────────────────────────────────────────────────────────╯

Each command has its own --help snippet.

The scan command

At the time of this writing, the --help snippet for the scan command is:

 Usage: refscan scan [OPTIONS]

 Scan the NMDC MongoDB database for referential integrity violations.

╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
│ *  --schema                               FILE  Filesystem path at which the YAML file │
│                                                 representing the schema is located.    │
│                                                 [default: None]                        │
│                                                 [required]                             │
│    --database-name                        TEXT  Name of the database.                  │
│                                                 [default: nmdc]                        │
│    --mongo-uri                            TEXT  Connection string for accessing the    │
│                                                 MongoDB server. If you have Docker     │
│                                                 installed, you can spin up a temporary │
│                                                 MongoDB server at the default URI by   │
│                                                 running: $ docker run --rm --detach -p │
│                                                 27017:27017 mongo                      │
│                                                 [env var: MONGO_URI]                   │
│                                                 [default: mongodb://localhost:27017]   │
│    --verbose                                    Show verbose output.                   │
│    --skip-source-collection,--skip        TEXT  Name of collection you do not want to  │
│                                                 search for referring documents. Option │
│                                                 can be used multiple times.            │
│                                                 [default: None]                        │
│    --reference-report                     FILE  Filesystem path at which you want the  │
│                                                 program to generate its reference      │
│                                                 report.                                │
│                                                 [default: references.tsv]              │
│    --violation-report                     FILE  Filesystem path at which you want the  │
│                                                 program to generate its violation      │
│                                                 report.                                │
│                                                 [default: violations.tsv]              │
│    --no-scan                                    Generate a reference report, but do    │
│                                                 not scan the database for violations.  │
│    --locate-misplaced-documents                 For each referenced document not found │
│                                                 in any of the collections the schema   │
│                                                 allows, also search for it in all      │
│                                                 other collections.                     │
│    --help                                       Show this message and exit.            │
╰────────────────────────────────────────────────────────────────────────────────────────╯

The MongoDB connection string (--mongo-uri)

As documented in the --help snippet above, you can provide the MongoDB connection string to the tool via either (a) the --mongo-uri option; or (b) an environment variable named MONGO_URI. The latter can come in handy when the MongoDB connection string contains information you don't want to appear in your shell history, such as a password.

Here's how you could create that environment variable:

export MONGO_URI='mongodb://username:password@localhost:27017'

The schema (--schema)

As documented in the --help snippet above, you can provide the path to a YAML-formatted LinkML schema file to the tool via the --schema option.

Tips for getting a schema file

If you have curl installed, you can download a YAML file from GitHub by running the following command (after replacing the {...} placeholders and customizing the path):

# Download the raw content of https://github.com/{user_or_org}/{repo}/blob/{branch}/path/to/schema.yaml
curl -o schema.yaml https://raw.githubusercontent.com/{user_or_org}/{repo}/{branch}/path/to/schema.yaml

For example:

# Download the raw content of https://github.com/microbiomedata/nmdc-schema/blob/main/nmdc_schema/nmdc_materialized_patterns.yaml
curl -o schema.yaml https://raw.githubusercontent.com/microbiomedata/nmdc-schema/main/nmdc_schema/nmdc_materialized_patterns.yaml

# Download the raw content of https://github.com/microbiomedata/nmdc-schema/blob/v11.2.1/nmdc_schema/nmdc_materialized_patterns.yaml
curl -o schema.yaml https://raw.githubusercontent.com/microbiomedata/nmdc-schema/v11.2.1/nmdc_schema/nmdc_materialized_patterns.yaml
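If you would rather do this from Python, the URL pattern above can be assembled with a small helper. This is only a sketch; `build_raw_github_url` is a hypothetical function written for this example, not part of refscan.

```python
def build_raw_github_url(user_or_org, repo, ref, path):
    """Build the raw.githubusercontent.com URL of a file in a GitHub repository.

    `ref` is a branch name or a tag, and `path` is the file's path within the repository.
    """
    return f"https://raw.githubusercontent.com/{user_or_org}/{repo}/{ref}/{path.lstrip('/')}"


url = build_raw_github_url(
    "microbiomedata",
    "nmdc-schema",
    "v11.2.1",
    "nmdc_schema/nmdc_materialized_patterns.yaml",
)
print(url)

# To download the file (requires network access), uncomment the following lines:
# from urllib.request import urlretrieve
# urlretrieve(url, "schema.yaml")
```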

Output

While refscan is running, it will display console output indicating what it's currently doing.

Screenshot of refscan console output

Once the scan is complete, the reference report and the violation report (both TSV files) will be in the current directory as references.tsv and violations.tsv, or at the paths you specified via the --reference-report and --violation-report options.
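Because the reports are TSV files, they can be read with Python's standard csv module. The sketch below assumes the report begins with a header row, and the column names in the sample data are invented for illustration; check a real report for its actual columns.

```python
import csv
import io

# Hypothetical report content. The real column names may differ.
sample_report = (
    "source_collection\tsource_field\ttarget_id\n"
    "biosample_set\tassociated_studies\tstudy-2\n"
)


def read_tsv_rows(file_obj):
    """Parse a tab-separated report that has a header row into a list of dicts."""
    return list(csv.DictReader(file_obj, delimiter="\t"))


rows = read_tsv_rows(io.StringIO(sample_report))
print(len(rows), "row(s)")
for row in rows:
    print(row)

# With a real report, you might instead write:
# with open("violations.tsv", newline="") as f:
#     rows = read_tsv_rows(f)
```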

The graph command

At the time of this writing, the --help snippet for the graph command is:

 Usage: refscan graph [OPTIONS]

 Generate an interactive graph of the references described by a schema.

╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
│ *  --schema         FILE                Filesystem path at which the YAML file         │
│                                         representing the schema is located.            │
│                                         [default: None]                                │
│                                         [required]                                     │
│    --graph          FILE                Filesystem path at which you want refscan to   │
│                                         generate the graph.                            │
│                                         [default: graph.html]                          │
│    --subject        [collection|class]  Whether you want each node of the graph to     │
│                                         represent a collection or a class.             │
│                                         [default: collection]                          │
│    --verbose                            Show verbose output.                           │
│    --help                               Show this message and exit.                    │
╰────────────────────────────────────────────────────────────────────────────────────────╯

Update

You can update the tool to the latest version available on PyPI by running:

pipx upgrade refscan

Uninstall

You can uninstall the tool from your computer by running:

pipx uninstall refscan

Container-based usage

You can also run refscan via a container image hosted by the GitHub Container Registry.

docker run --rm -it refscan --help

Note: When running refscan via a container image, you can reference your host machine via the special hostname, "host.docker.internal".

In other words, $ docker run refscan --mongo-uri mongodb://host.docker.internal:27017 does the same thing as $ refscan --mongo-uri mongodb://localhost:27017, except the first command runs refscan within a container while the second one runs it directly on your host machine.

Development

We use uv to both (a) manage dependencies and (b) build distributable packages that can be published to PyPI.

Note: We initialized this repository using Poetry. We switched from Poetry to uv at around commit #1449ceca.

Clone repository

git clone https://github.com/microbiomedata/refscan.git
cd refscan

Set up Python virtual environment

You can set up a Python virtual environment by issuing the following command from the root directory of the repository:

uv sync

That command will:

  1. Create a Python virtual environment at .venv (if one doesn't already exist there)
  2. Install all dependencies described in uv.lock into that Python virtual environment
  3. Uninstall all dependencies not described in uv.lock from that Python virtual environment

Activate Python virtual environment

Now that you have set up a Python virtual environment, you can activate it by issuing the following command:

source .venv/bin/activate

Note: Once you're ready to deactivate the Python virtual environment, you can do so by running $ deactivate.

Make changes

Edit the tool's source code and documentation however you want.

While editing the tool's source code, you can run the tool as you normally would in order to test things out.

uv run refscan --help

Check types

We use mypy as the static type checker for refscan.

You can perform static type checking by running the following command from the root directory of the repository:

uv run mypy

Run tests

We use pytest as the testing framework for refscan.

Tests are defined in the tests directory.

You can run the tests by running the following command from the root directory of the repository:

uv run pytest

Format code

We use ruff as the code formatter for refscan.

We mostly use it with its default rules. All of the ways we deviate from those are listed in the [tool.ruff] section of pyproject.toml.

You can check the code's compliance with the "formatter rules" by running this command from the root directory of the repository:

uv run ruff format --check

That will output a list of files that don't comply. To see the violations themselves, you can run:

uv run ruff format --diff

You can format the code by omitting the --check and --diff flags:

uv run ruff format

Lint code

We also use ruff as the code linter for refscan.

We use it with its default rules, plus some additional ones, all of which are listed in the [tool.ruff.lint] section of pyproject.toml.

You can check the code's compliance with the "linter rules" by running this command from the root directory of the repository:

uv run ruff check

Building and publishing

Build for production

Whenever someone publishes a GitHub Release in this repository, a GitHub Actions workflow will automatically build a package and publish it to PyPI. That package will have a version identifier that matches the name of the Git tag associated with the Release.

Test the build process locally

In case you want to test the build process locally, you can do so by running:

uv build

That will create both a source distribution file (whose name ends with .tar.gz) and a wheel file (whose name ends with .whl) in the dist directory.
