Seq2Geno

This package is an integrated tool for microbial sequence analyses. We refactored and packed the methods of the published research and evaluated the reproducibility with the same raw data.

Repository structure
Available functions
Download Seq2Geno and ensure the environment
Usage and Input
Example data and usages
Train the phenotypic predictor with the Seq2Geno results
- Start from Seq2Geno
- With precomputed data
FAQ
License
Contact
Citation

Repository structure

This repository includes:

install: information and scripts for installation
examples: the example data (the tutorial )
main: the scripts for user interface and the calling methods of core workflows
snps: the scripts for generating SNPs table
denovo: the scripts for computing de novo assemblies and the gene presence/absence table
expr: the scripts for calculating expression levels
phylo: the scripts for phylogenetic tree inference
difexpr: the methods for identifying differentially expressed genes with the expression levels matrix
cont_anc_rcn: ancestral reconstruction for continuous data such as expression levels

Available functions

detect single nucleotide variants
create de novo assemblies
compute gene presence/absence and the indels
count gene expression levels
infer the phylogenetic tree
find differentially expressed genes (additional data that won't be used by Geno2Pheno)
reconstruct ancestral values of expression level (additional data that won't be used by Geno2Pheno)

Download Seq2Geno and ensure the environment

Check the prerequisites
- conda (tested version: 4.10.0)
- file .condarc that includes these channels and is detectable by your conda
  - hzi-bifo
  - conda-forge/label/broken
  - bioconda
  - conda-forge
  - defaults
- python (tested verson: 3.7)
- Linux (tested version: Debian GNU/Linux 8.8 jessie)
- git (tested version: 2.21)
Download this package

git clone --recurse-submodules https://github.com/hzi-bifo/seq2geno.git
cd seq2geno
git submodule update --init --recursive

The command uses --recurse-submodules to download the submodules. The flag is available only in git version >2.13. Earlier git versions might have the substitute. After the package is downloaded, main/seq2geno and main/seq2geno_gui are the executable scripts for Seq2Geno.

To ensure the environment and test the package, please either follow the steps in install/README.md or use the automatic tools:

cd install/
./SETENV.sh snakemake_env
conda activate snakemake_env
./TESTING.sh

Usage and Input

Once the environment is properly set up, Seq2Geno can be launched using the graphical user interface (GUI) or command line

GUI

The commands

S2G

or

seq2geno_gui

will launch the graphic user interface. Use the tool to read, edit, or save the arguments in a yaml file. Once the arguments are ready, the analyses can be launched with this interface; for large-scale researches, however, generating the yaml file and launching the analyses with the command line method (described below) might be more convenient, as having processes running in the background should be more convenient. To learn more, please read the manual doc/GUI_manual.pdf.

command line

S2G -d -f [options_yaml] -z [zip_input] -l [log_file] --outzip [output_zip_type] --to_gp

Both options_yaml and zip_input specify the materials to use (see the examples/). At least one of them should be used. When options_yaml is properly set, zip_input will be neglected. The options_yaml describes all the options and paths to input data for Seq2Geno. The zip_input packs all the materials and has a structure that Seq2Geno can recognize (see input_zip_structure.md for more details).

The log_file should be a non-existing filename to store the log information; if not set, the messages will be directed to stdout and stderr.

The output_zip_type should be one of 'none' (default), 'all', 'main', or 'g2p'. The choice specifies whether or how the output results should be packed into an zip file.

The flag --to_gp specifies whether to submit the results to the Geno2Pheno server.

arguments

The input file is an yaml file where all options are described (a template in examples/). The file includes two parts:

features:

option	action	values ([default])
dryrun	display the processes and exit	[Y]/N
snps	SNPs calling	Y/[N]
denovo	creating de novo assemblies	Y/[N]
expr	counting expression levels	Y/[N]
phylo	inferring the phylogeny	Y/[N]
de	differential expression	Y/[N]
ar	ancestral reconstruction of expression levels	Y/[N]

To only create the folder and config files, please turn off the last six options.

general (* mandatory):
- cores: number of cpus (integer; automatically adjusted if larger than the available cpu number)
- mem_mb: memory size to use (integer in mb; automatically adjusted if larger than the free memory). Note: some processes may crush because of insufficiently allocated memory
- *wd: the working directory. The intermediate and final files will be stored under this folder. The final outcomes will be symlinked to the sub-directory RESULTS/.
- *dna_reads: the list of DNA-seq data
It should be a two-column list, where the first column includes all samples and the second column lists the paired-end reads files. The two reads file are separated by a comma. The first line is the first sample.
```
sample01	/paired/end/reads/sample01_1.fastq.gz,/paired/end/reads/sample01_2.fastq.gz
sample02	/paired/end/reads/sample02_1.fastq.gz,/paired/end/reads/sample02_2.fastq.gz
sample03	/paired/end/reads/sample03_1.fastq.gz,/paired/end/reads/sample03_2.fastq.gz
```
- *ref_fa, ref_gff, ref_gbk: the data of reference genome
The fasta, gff, and genbank files of a reference genome. They should have same sequence ids.
- old_config: if recognizable, the config files that were previously stored in the working directory will be reused. ('Y': on; 'N': off)
- rna_reads: the list of RNA-seq data. (string of filename)
It should be a two-column list, where the first column includes all samples and the second column lists the short reads files. The first line is the first sample.
```
sample01	/transcription/reads/sample01.rna.fastq.gz
sample02	/transcription/reads/sample02.rna.fastq.gz
sample03	/transcription/reads/sample03.rna.fastq.gz
```
- phe_table: the phenotype table (string of filename)
The table is tab-separated. For n samples with m phenotypes, the table is (n+1)-by-(m+1) as shown below. The first column should be sample names. The header line should includes names of phenotypes. Missing values are acceptable.
```
strains	virulence
sample01	high
sample02	mediate
sample03	low
```
- adaptor: the adaptor file (string of filename)
The fasta file of adaptors of DNA-seq. It is used to process the DNA-seq reads.

Example data and usages

The folder examples/ includes a structured zip file and a yaml file--the two input formats that Seq2Geno can recognize. The zip file can be used as the input with this command:

S2G -z examples/example_input/seq2geno_input.zip\
 -l examples/example_input_zip.log\
 --outzip g2p

To use the configuration yaml file, please ensure unpacked example data (that is, the zip file) and edit the yaml file to ensure the right paths to those example data. After they are ready, please run with this command:

S2G -f examples/example_input/seq2geno_input.yml\
 -l exapmles/seq2geno_input_yml.log\
 --outzip g2p

Train the phenotypic predictor with the Seq2Geno results

Start from Seq2Geno

To include automatic submission to the Geno2Pheno server, just use the flag --to_gp:

S2G -f examples/example_input/seq2geno_input.yml\
 -l exapmles/seq2geno_input_yml.log\
 --outzip g2p --to_gp

With precomputed data

Please directly visit Geno2Pheno

FAQ

Why the analyses crushed?

Please check the log file or STDOUT and STDERR and determine the exact error.

Will every procedure be rerun if I want to add one sample?

No, you just need to add one more line in your reads list (i.e., the dna or the rna reads. See section arguments for more details.) and then run the same workflow again. Seq2Geno uses Snakemake to determine whether certain intermediate data need to be recomputed or not.

Will every procedure be rerun if I want to exclude one sample?

No; however, besides excluding that sample from the reads list, you will need to remove all the subsequent results that were previously computed. That could be risky.

Will every procedure be rerun if I accidentally delete some data?

No, only the deleted one and the subsequent data will be recomputed.

Where is the final data?

In the working directory, the main results are collected in the subfolder RESULTS/. You can also find the other intermediate data in the corresponding folders (e.g., mapping results)

What is the current status?

If the log file was specified, you could check the log file to determine the current status. Otherwise, the status should be directed to your STDOUT or STDERR.

License

GPLv3 (please refer to LICENSE)

Contact

Please contact Tzu-Hao Kuo (Tzu-Hao.Kuo@helmholtz-hzi.de) and specify:

the error message or unexpected situation
how to reproduce the problem

Citation

We will be publishing the paper for the joint work of Seq2Geno and Geno2Pheno. Before that, please use

Asgari*, E., Kuo*, T.-H., Bremges, A., Robertson, G., Weimann, A. & McHardy, A. C. (2021). Seq2Geno2Pheno [A Computational Workflow for Phenotype Predictive-Modeling and Biomarker Detection from Microbial Sequence Data]

or

@software{seq2geno2pheno2021,
  author = {Ehsaneddin Asgari*, Tzu-Hao Kuo*, Andreas Bremges, Gary Robertson, Aaron Weimann, Alice C. McHardy},
  title = {Seq2Geno2Pheno: A Computational Workflow for Phenotype Predictive-Modeling and Biomarker Detection from Microbial Sequence Data},
  version = {v2.00001},
  date = {2021-07-07},
}

Name		Name	Last commit message	Last commit date
Latest commit History 492 Commits
cmpr		cmpr
cont_anc_rcn		cont_anc_rcn
denovo		denovo
difexpr		difexpr
doc		doc
examples		examples
expr		expr
install		install
lib		lib
main		main
phylo		phylo
snps		snps
.gitignore		.gitignore
.gitmodules		.gitmodules
CHANGELOG		CHANGELOG
LICENSE		LICENSE
README.md		README.md
input_zip_structure.md		input_zip_structure.md

License

hzi-bifo/seq2geno

Folders and files

Latest commit

History

Repository files navigation

Seq2Geno

Repository structure

Available functions

Download Seq2Geno and ensure the environment

Usage and Input

Example data and usages

Train the phenotypic predictor with the Seq2Geno results

FAQ

License

Contact

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Languages