Skip to content

Reproducible and comprehensive sequence analyses for single-species bacterial dataset

License

Notifications You must be signed in to change notification settings

hzi-bifo/seq2geno

Repository files navigation

Seq2Geno

This package is an integrated tool for microbial sequence analyses. We refactored and packed the methods of the published research and evaluated the reproducibility with the same raw data.

Repository structure

This repository includes:

  • install: information and scripts for installation
  • examples: the example data (the tutorial )
  • main: the scripts for user interface and the calling methods of core workflows
  • snps: the scripts for generating SNPs table
  • denovo: the scripts for computing de novo assemblies and the gene presence/absence table
  • expr: the scripts for calculating expression levels
  • phylo: the scripts for phylogenetic tree inference
  • difexpr: the methods for identifying differentially expressed genes with the expression levels matrix
  • cont_anc_rcn: ancestral reconstruction for continuous data such as expression levels

Available functions

  • detect single nucleotide variants
  • create de novo assemblies
  • compute gene presence/absence and the indels
  • count gene expression levels
  • infer the phylogenetic tree
  • find differentially expressed genes (additional data that won't be used by Geno2Pheno)
  • reconstruct ancestral values of expression level (additional data that won't be used by Geno2Pheno)

Download Seq2Geno and ensure the environment

  1. Check the prerequisites

    • conda (tested version: 4.10.0)
    • file .condarc that includes these channels and is detectable by your conda
      • hzi-bifo
      • conda-forge/label/broken
      • bioconda
      • conda-forge
      • defaults
    • python (tested verson: 3.7)
    • Linux (tested version: Debian GNU/Linux 8.8 jessie)
    • git (tested version: 2.21)
  2. Download this package

git clone --recurse-submodules https://github.com/hzi-bifo/seq2geno.git
cd seq2geno
git submodule update --init --recursive

The command uses --recurse-submodules to download the submodules. The flag is available only in git version >2.13. Earlier git versions might have the substitute. After the package is downloaded, main/seq2geno and main/seq2geno_gui are the executable scripts for Seq2Geno.

  1. To ensure the environment and test the package, please either follow the steps in install/README.md or use the automatic tools:
cd install/
./SETENV.sh snakemake_env
conda activate snakemake_env
./TESTING.sh

Usage and Input

Once the environment is properly set up, Seq2Geno can be launched using the graphical user interface (GUI) or command line

  • GUI

The commands

S2G

or

seq2geno_gui

will launch the graphic user interface. Use the tool to read, edit, or save the arguments in a yaml file. Once the arguments are ready, the analyses can be launched with this interface; for large-scale researches, however, generating the yaml file and launching the analyses with the command line method (described below) might be more convenient, as having processes running in the background should be more convenient. To learn more, please read the manual doc/GUI_manual.pdf.

  • command line
S2G -d -f [options_yaml] -z [zip_input] -l [log_file] --outzip [output_zip_type] --to_gp

Both options_yaml and zip_input specify the materials to use (see the examples/). At least one of them should be used. When options_yaml is properly set, zip_input will be neglected. The options_yaml describes all the options and paths to input data for Seq2Geno. The zip_input packs all the materials and has a structure that Seq2Geno can recognize (see input_zip_structure.md for more details).

The log_file should be a non-existing filename to store the log information; if not set, the messages will be directed to stdout and stderr.

The output_zip_type should be one of 'none' (default), 'all', 'main', or 'g2p'. The choice specifies whether or how the output results should be packed into an zip file.

The flag --to_gp specifies whether to submit the results to the Geno2Pheno server.

  • arguments

The input file is an yaml file where all options are described (a template in examples/). The file includes two parts:

  1. features:
option action values ([default])
dryrun display the processes and exit [Y]/N
snps SNPs calling Y/[N]
denovo creating de novo assemblies Y/[N]
expr counting expression levels Y/[N]
phylo inferring the phylogeny Y/[N]
de differential expression Y/[N]
ar ancestral reconstruction of expression levels Y/[N]

To only create the folder and config files, please turn off the last six options.

  1. general (* mandatory):

    • cores: number of cpus (integer; automatically adjusted if larger than the available cpu number)

    • mem_mb: memory size to use (integer in mb; automatically adjusted if larger than the free memory). Note: some processes may crush because of insufficiently allocated memory

    • *wd: the working directory. The intermediate and final files will be stored under this folder. The final outcomes will be symlinked to the sub-directory RESULTS/.

    • *dna_reads: the list of DNA-seq data

    It should be a two-column list, where the first column includes all samples and the second column lists the paired-end reads files. The two reads file are separated by a comma. The first line is the first sample.

    sample01	/paired/end/reads/sample01_1.fastq.gz,/paired/end/reads/sample01_2.fastq.gz
    sample02	/paired/end/reads/sample02_1.fastq.gz,/paired/end/reads/sample02_2.fastq.gz
    sample03	/paired/end/reads/sample03_1.fastq.gz,/paired/end/reads/sample03_2.fastq.gz
    
    • *ref_fa, ref_gff, ref_gbk: the data of reference genome

    The fasta, gff, and genbank files of a reference genome. They should have same sequence ids.

    • old_config: if recognizable, the config files that were previously stored in the working directory will be reused. ('Y': on; 'N': off)

    • rna_reads: the list of RNA-seq data. (string of filename)

    It should be a two-column list, where the first column includes all samples and the second column lists the short reads files. The first line is the first sample.

    sample01	/transcription/reads/sample01.rna.fastq.gz
    sample02	/transcription/reads/sample02.rna.fastq.gz
    sample03	/transcription/reads/sample03.rna.fastq.gz
    
    • phe_table: the phenotype table (string of filename)

    The table is tab-separated. For n samples with m phenotypes, the table is (n+1)-by-(m+1) as shown below. The first column should be sample names. The header line should includes names of phenotypes. Missing values are acceptable.

    strains	virulence
    sample01	high
    sample02	mediate
    sample03	low
    
    • adaptor: the adaptor file (string of filename)

    The fasta file of adaptors of DNA-seq. It is used to process the DNA-seq reads.

Example data and usages

The folder examples/ includes a structured zip file and a yaml file--the two input formats that Seq2Geno can recognize. The zip file can be used as the input with this command:

S2G -z examples/example_input/seq2geno_input.zip\
 -l examples/example_input_zip.log\
 --outzip g2p

To use the configuration yaml file, please ensure unpacked example data (that is, the zip file) and edit the yaml file to ensure the right paths to those example data. After they are ready, please run with this command:

S2G -f examples/example_input/seq2geno_input.yml\
 -l exapmles/seq2geno_input_yml.log\
 --outzip g2p

Train the phenotypic predictor with the Seq2Geno results

  • Start from Seq2Geno

To include automatic submission to the Geno2Pheno server, just use the flag --to_gp:

S2G -f examples/example_input/seq2geno_input.yml\
 -l exapmles/seq2geno_input_yml.log\
 --outzip g2p --to_gp
  • With precomputed data

Please directly visit Geno2Pheno

FAQ

Why the analyses crushed?

Please check the log file or STDOUT and STDERR and determine the exact error.

Will every procedure be rerun if I want to add one sample?

No, you just need to add one more line in your reads list (i.e., the dna or the rna reads. See section arguments for more details.) and then run the same workflow again. Seq2Geno uses Snakemake to determine whether certain intermediate data need to be recomputed or not.

Will every procedure be rerun if I want to exclude one sample?

No; however, besides excluding that sample from the reads list, you will need to remove all the subsequent results that were previously computed. That could be risky.

Will every procedure be rerun if I accidentally delete some data?

No, only the deleted one and the subsequent data will be recomputed.

Where is the final data?

In the working directory, the main results are collected in the subfolder RESULTS/. You can also find the other intermediate data in the corresponding folders (e.g., mapping results)

What is the current status?

If the log file was specified, you could check the log file to determine the current status. Otherwise, the status should be directed to your STDOUT or STDERR.

License

GPLv3 (please refer to LICENSE)

Contact

Please contact Tzu-Hao Kuo (Tzu-Hao.Kuo@helmholtz-hzi.de) and specify:

  • the error message or unexpected situation
  • how to reproduce the problem

Citation

We will be publishing the paper for the joint work of Seq2Geno and Geno2Pheno. Before that, please use

Asgari*, E., Kuo*, T.-H., Bremges, A., Robertson, G., Weimann, A. & McHardy, A. C. (2021). Seq2Geno2Pheno [A Computational Workflow for Phenotype Predictive-Modeling and Biomarker Detection from Microbial Sequence Data]

or

@software{seq2geno2pheno2021,
  author = {Ehsaneddin Asgari*, Tzu-Hao Kuo*, Andreas Bremges, Gary Robertson, Aaron Weimann, Alice C. McHardy},
  title = {Seq2Geno2Pheno: A Computational Workflow for Phenotype Predictive-Modeling and Biomarker Detection from Microbial Sequence Data},
  version = {v2.00001},
  date = {2021-07-07},
}