Tonkaz

Publication on GigaScience

Tonkaz is a CLI tool to verify workflow reproducibility. It compares the RO-Crates of two workflow executions and calculates a reproducibility level for each output file.

Reproducibility level is defined as follows:

- Level3 ⭐⭐⭐ : Files are identical with the same checksum
- Level2 ⭐⭐   : Files are different, but their features (file size, map rate, etc.) are similar (within threshold: 0.05)
- Level1 ⭐     : Files are different, and their features are different (beyond threshold)
- Level0        : File not found

Level3: "Fully Reproduced" <---> Level0: "Not Reproduced"
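The levels above can be read as a small decision procedure. The following Python sketch is illustrative only and is not Tonkaz's implementation: the dictionary layout (`checksum`, `features`), the `classify` function, and the use of a relative difference as the similarity measure are all assumptions made for this example.

```python
from enum import IntEnum
from typing import Mapping, Optional


class Level(IntEnum):
    NOT_FOUND = 0   # Level0: file missing from one of the crates
    DIFFERENT = 1   # Level1: features differ beyond the threshold
    SIMILAR = 2     # Level2: checksums differ, features within the threshold
    IDENTICAL = 3   # Level3: same checksum


def relative_difference(a: float, b: float) -> float:
    """Assumed metric: |a - b| scaled by the larger magnitude (0.0 if both are 0)."""
    largest = max(abs(a), abs(b))
    return 0.0 if largest == 0 else abs(a - b) / largest


def classify(file1: Optional[Mapping], file2: Optional[Mapping],
             threshold: float = 0.05) -> Level:
    """Assign a reproducibility level to one output file found in two crates."""
    if file1 is None or file2 is None:
        return Level.NOT_FOUND
    if file1["checksum"] == file2["checksum"]:
        return Level.IDENTICAL
    features1, features2 = file1["features"], file2["features"]
    shared = features1.keys() & features2.keys()
    # Hypothetical rule: every shared feature must be within the threshold.
    if shared and all(
        relative_difference(features1[k], features2[k]) <= threshold for k in shared
    ):
        return Level.SIMILAR
    return Level.DIFFERENT


first = {"checksum": "aaa", "features": {"size": 1000}}
second = {"checksum": "bbb", "features": {"size": 1020}}
print(classify(first, second).name)  # SIMILAR (sizes differ by about 2%)
```

With the default threshold of 0.05, two files whose sizes differ by about 2% fall into Level2 under this assumed metric, while a file that is missing from either crate is always Level0.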

To try it quickly, run the following command. It compares the results of two executions of nf-core/rnaseq v3.7 in the same Linux environment.

$ tonkaz ./tests/example_crate/rnaseq_1st.json ./tests/example_crate/rnaseq_2nd.json

# Example output:
$ cat ./tests/comparison_results/rnaseq_same_env.log

We provide various examples in the tests/README.md. Please check it out.

Installation

Tonkaz is distributed as a single binary that is built without any dependencies.

# for Linux x86_64
$ curl -fsSL -o ./tonkaz https://github.com/sapporo-wes/tonkaz/releases/latest/download/tonkaz_x86_64-unknown-linux-gnu

# for Mac x86_64
$ curl -fsSL -o ./tonkaz https://github.com/sapporo-wes/tonkaz/releases/latest/download/tonkaz_x86_64-apple-darwin

# for Mac Apple silicon
$ curl -fsSL -o ./tonkaz https://github.com/sapporo-wes/tonkaz/releases/latest/download/tonkaz_aarch64-apple-darwin

$ chmod +x ./tonkaz
$ ./tonkaz --help
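If the download needs to be scripted, the choice among the three binaries can be made from the platform name. This is a minimal sketch, not part of Tonkaz: the `release_url` helper is hypothetical, and it assumes that Python's `platform.system()` and `platform.machine()` report `Linux`/`Darwin` and `x86_64`/`arm64` on the supported machines. The asset names are the ones used in the commands above.

```python
import platform

BASE_URL = "https://github.com/sapporo-wes/tonkaz/releases/latest/download"

# Keys are (platform.system(), platform.machine()) pairs; values are the
# release asset names from the curl commands in this README.
ASSETS = {
    ("Linux", "x86_64"): "tonkaz_x86_64-unknown-linux-gnu",
    ("Darwin", "x86_64"): "tonkaz_x86_64-apple-darwin",
    ("Darwin", "arm64"): "tonkaz_aarch64-apple-darwin",
}


def release_url(system: str, machine: str) -> str:
    """Return the download URL for a platform, or raise if none is listed."""
    try:
        return f"{BASE_URL}/{ASSETS[(system, machine)]}"
    except KeyError:
        raise ValueError(f"no prebuilt binary listed for {system}/{machine}") from None


def current_release_url() -> str:
    """Look up the URL for the machine this script is running on."""
    return release_url(platform.system(), platform.machine())


print(release_url("Linux", "x86_64"))
```

On platforms that are not listed, the helper raises `ValueError`; the Docker image is then the alternative.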

Or, use the Docker environment:

docker run -it --rm ghcr.io/sapporo-wes/tonkaz:latest --help

Usage

Pass two crates (local file paths or URLs) as arguments to the tonkaz command.

tonkaz crate1.json crate2.json

For more details:

$ tonkaz -h
Tonkaz 0.1.0 by @suecharo

CLI tool to verify workflow reproducibility

Usage: tonkaz [options] crate1 crate2

Options:
  -a, --all                    Use all output files for comparison
  -t, --threshold <threshold>  Set threshold for comparison (default: 0.05)
  -h, --help                   Show this help message and exit
  -v, --version                Show version and exit

Examples:
  $ tonkaz crate1 crate2
  $ tonkaz crate1 https://example.com/crate2
  $ tonkaz https://example.com/crate1 https://example.com/crate2
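When Tonkaz is called from a script, the command line can be assembled from the options listed in the help message. The sketch below is one possible wrapper, not an official API: `build_tonkaz_command` and `run_tonkaz` are hypothetical helpers, and only the `--all` and `--threshold` flags shown above are used. It assumes a `tonkaz` executable is available on `PATH`.

```python
import subprocess
from typing import List, Optional


def build_tonkaz_command(crate1: str, crate2: str,
                         threshold: Optional[float] = None,
                         use_all: bool = False,
                         executable: str = "tonkaz") -> List[str]:
    """Build the argument list for `tonkaz [options] crate1 crate2`."""
    command = [executable]
    if use_all:
        command.append("--all")
    if threshold is not None:
        command += ["--threshold", str(threshold)]
    command += [crate1, crate2]
    return command


def run_tonkaz(crate1: str, crate2: str, **options) -> str:
    """Run Tonkaz and return its standard output (raises on a non-zero exit)."""
    result = subprocess.run(
        build_tonkaz_command(crate1, crate2, **options),
        check=True, capture_output=True, text=True,
    )
    return result.stdout


print(build_tonkaz_command("crate1.json", "https://example.com/crate2", threshold=0.1))
```

Passing the arguments as a list avoids shell quoting issues when a crate is given as a URL.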

How to prepare an RO-Crate for Tonkaz?

Tonkaz supports ONLY RO-Crates generated by Sapporo-service (version 1.6.0 or newer) or Yevis-cli. For more information about Sapporo and Yevis, please see their repositories.

The RO-Crate can be generated by passing the --fetch-ro-crate option to Yevis-cli's test command as follows:

# Execute the workflow
$ yevis test --fetch-ro-crate https://example.com/path/to/yevis-metadata-file

# The RO-Crate is generated in the `test-logs` directory
$ ls test-logs/
ro-crate-metadata_c13b6e27-a4ee-426f-8bdb-8cf5c4310bad_1.0.0_test_1.json
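To collect the generated crates for a later comparison, the `test-logs` directory can be scanned for files that follow the naming shown above. This is a small sketch under that assumption: the `find_ro_crates` helper is hypothetical, and the `ro-crate-metadata_*.json` pattern is inferred from the single example filename.

```python
from pathlib import Path
from typing import List


def find_ro_crates(log_dir: str = "test-logs") -> List[Path]:
    """Return RO-Crate files in log_dir, sorted by name.

    The glob pattern is an assumption based on the example filename
    `ro-crate-metadata_<id>_<version>_test_<n>.json`.
    """
    return sorted(Path(log_dir).glob("ro-crate-metadata_*.json"))
```

The returned paths can then be passed to tonkaz as the two crate arguments, for example after running the same test in two environments and copying both crates into one directory.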

Or, the RO-Crate can be generated from Sapporo's run_dir.

# At Sapporo run_dir
$ ls
cmd.txt                     run.sh                      state.txt
exe/                        run_request.json            stderr.log
executable_workflows.json   sapporo_config.json         stdout.log
outputs/                    service_info.json           workflow_engine_params.txt
run.pid                     start_time.txt              yevis-metadata.yml

# Execute sapporo/ro_crate.py script
$ docker run --rm -v /var/run/docker.sock:/var/run/docker.sock -v $PWD:$PWD -w $PWD ghcr.io/sapporo-wes/sapporo-service:latest python3 /app/sapporo/ro_crate.py $PWD

Comparison of MultiQC Statistics (New Feature in 0.3.0)

Since version 1.6.0, Sapporo-service executes MultiQC after workflow completion and adds its results to the RO-Crate. Tonkaz can compare these MultiQC results. MultiQC checks the workflow output, extracts units per sample, and aggregates statistical data from each tool for individual samples.

Previously, Tonkaz compared results only at the file level. With the MultiQC comparison, it can now also compare them at the sample level. Because file-level and sample-level comparisons are difficult to handle in parallel, the MultiQC comparison is provided as an additional feature.

The sample-level comparison is nevertheless useful, because it allows a more detailed comparison of workflow results.

An example of the comparison results is as follows:

- Salmon Num_mapped
  .--------------------------------------------------------------------------.
  |             Sample             |  in Crate1   |  in Crate2   |   Level   |
  |--------------------------------|--------------|--------------|-----------|
  | RAP1_IAA_30M_REP1              | 38268        | 40165        | ⭐⭐      |
  | RAP1_UNINDUCED_REP1            | 39317        | 39317        | ⭐⭐      |
  | RAP1_UNINDUCED_REP2            | 78884        | 81361        | ⭐⭐      |
  | WT_REP1                        | 74109        | 74109        | ⭐⭐      |
  | WT_REP2                        | 37368        | 37368        | ⭐⭐      |
  '--------------------------------------------------------------------------'

- Samtools Flagstat_total
  .--------------------------------------------------------------------------.
  |             Sample             |  in Crate1   |  in Crate2   |   Level   |
  |--------------------------------|--------------|--------------|-----------|
  | RAP1_IAA_30M_REP1              | 94912        | 94912        | ⭐⭐      |
  | RAP1_UNINDUCED_REP1            | 49040        | 49040        | ⭐⭐      |
  | RAP1_UNINDUCED_REP2            | 98338        | 98338        | ⭐⭐      |
  | WT_REP1                        | 188243       | 188241       | ⭐⭐      |
  | WT_REP2                        | 94419        | 94419        | ⭐⭐      |
  '--------------------------------------------------------------------------'
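To see why a pair such as 38268 and 40165 still counts as similar, the per-sample values can be checked against the default threshold of 0.05. The sketch below is illustrative and not Tonkaz's implementation: `compare_samples` is a hypothetical helper, and scaling the difference by the larger value is an assumed definition of "within threshold". The input numbers are taken from the Salmon Num_mapped table above.

```python
from typing import Dict, Tuple


def relative_difference(a: float, b: float) -> float:
    """Assumed metric: |a - b| scaled by the larger magnitude (0.0 if both are 0)."""
    largest = max(abs(a), abs(b))
    return 0.0 if largest == 0 else abs(a - b) / largest


def compare_samples(crate1: Dict[str, float], crate2: Dict[str, float],
                    threshold: float = 0.05) -> Dict[str, Tuple[float, bool]]:
    """For each sample present in both crates, return (difference, within threshold)."""
    report = {}
    for sample in sorted(crate1.keys() & crate2.keys()):
        diff = relative_difference(crate1[sample], crate2[sample])
        report[sample] = (diff, diff <= threshold)
    return report


# Values copied from the Salmon Num_mapped table.
salmon_crate1 = {"RAP1_IAA_30M_REP1": 38268, "RAP1_UNINDUCED_REP2": 78884, "WT_REP1": 74109}
salmon_crate2 = {"RAP1_IAA_30M_REP1": 40165, "RAP1_UNINDUCED_REP2": 81361, "WT_REP1": 74109}

for sample, (diff, within) in compare_samples(salmon_crate1, salmon_crate2).items():
    print(f"{sample}: relative difference {diff:.4f}, within threshold: {within}")
```

Under this assumed metric, all three samples stay below 0.05, including the two whose mapped-read counts differ between the crates.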

## Development

We use [Deno](https://deno.land/) `v1.40.2`.

If you want to use the Docker environment, please run the following command:

```bash
$ docker run -it --rm -v $PWD:$PWD -w $PWD denoland/deno:1.40.2 deno --version
deno 1.40.2 (release, x86_64-unknown-linux-gnu)
v8 12.1.285.6
typescript 5.3.3
```

Testing

Please see the ./tests directory.

License

Apache-2.0. See the LICENSE.