Skip to content

gersteinlab/MOBI

Repository files navigation

MOBI

MOtif inference with advanced BInding site selection.

Methods overview

fig1

As shown in this figure, we postulate there are several TF binding modes. In the result of a typical ChIP-seq experiment targeting a certain TF (here shown in red), the identified binding sites could either contain the target motif (fig. A) or not (fig. B). The ideal binding sites for motif inference should be those from scenario one.

Notice that among all these cases:

  • Binding site in scenario one is least "crowded", i.e. there are fewest possibly interacting proteins in the binding site window.
  • The target binding motif should locate in or close to the peak summit (in this figure, it is the center).

Therefore, we select binding sites with low "crowdness" score and trim the binding sites to a shorter length. Using such binding sites result in more accuate motif inference.

For more details, see our paper:
Xu, J., Gao, J., Ni, P., & Gerstein, M. (2024). Less-is-more: selecting transcription factor binding regions informative for motif inference. Nucleic Acids Research, gkad1240.

Download

We applied MOBI to the 4 samples from ENCODE respectively with their best parameters and infer the motifs. You could download all our predicted motifs here.

Sample TFs DREME MEME STREME HOMER DESSO
Drosophila melanogaster 454 Download Download Download Download 52 TFs
Caenorhabditis elegans 283 Download Download Download Download 33 TFs
GM12878 136 Download Download Download Download 56 TFs
K562 336 Download Download Download Download 89 TFs

(Note that files will be in MEME format. File name is the TF name but any "(", ")" and ":" characters in the TF name will be omitted, e.g. TF name A(B)C will have motif file named ABC.meme)

Examples output motifs

See here for a few examples of the inferred motifs.

Download C-score files

We provide the C-score we used in the paper in bigWig format for users to download. (If the download didn't start, right click to copy the link and paste into the browser manually).
Fly
Worm
GM12878
K562

Requirement of scripts

  • python3
  • pybedtools
  • numpy
  • pandas

To run the inference step, you need to have DREME/MEME/STREME/HOMER properly installed. See here for meme-suite installation. You can test by meme -version or other equivalent commands.

Run the example

  1. Download the github and unzip into MOBI/, go to the folder cd MOBI/.
  2. Download the example ChIP-seq data: bash DownloadExampleData.sh.
  3. Download the genome https://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz into example_data/, then uncompress and index the fasta (require samtools):
cd example_data/  
gunzip dm6.fa.gz  
samtools faidx dm6.fa  
cd ..
  1. Run the scripts python MOBI.py. This will generate a file called joblist_inference.sh. All intermediate files are in example_result/.
  2. Make sure you have the inference tool (e.g. DREME) installed. Run the scripts bash joblist_inference.sh. All predicted motifs are now under example_result/inference/.

Infer motifs for your own data

Modify line 7-14 in file MOBI.py accordingly (will be updated to argument). Then run step 4 and 5 in the previous section.

  • data_chip(str): Path to the folder containing all the ENCODE ChIP-seq files. All these files are used to calculated the C-score (see below)
  • data_meta(str): Path to a tab-seperated file. The first column is the basename of the file and the second column is the TF name. Notice the first column should be a subset of the basenames of files in data_chip
  • genome_fasta(str): Path to the genome fasta file. The index file should be generated beforehand and located in the same folder
  • result_main(str): Path to the main result folder.
  • width_list(list of int): A list of binding regions length to try. 100 indicates a binding regions of +-100bp around ChIP-seq peak summit (a total of 200bp)
  • rank_list(list of str): Choice of RankSPP, RankCrowdness or RankLinear0.1 where 0.1 could be changed to other float. For detail of the ranking method, see the paper
  • tool_list(list of str): Choice of DREME, MEME, STREME and HOMER.

Finding optimal parameters for your own data

In order to find the best ranking method (weights) and binding sites width, we simply do a brute force search for the parameters that give the best result:

  • Run the above section with width_list and rank_list cover all the parameters you want to try.
  • Having a list of "known" motifs in the MEME format. You could download this from Cis-BP or other database.
  • Modify line 5-12 in MOBI_tomtom.py, run the scripts with python MOBI_tomtom.py. This will generate a file called joblist_tomtom.sh. Make sure you have meme-suite install by verifing tomtom --help. Run the script with bash joblist_tomtom.sh. This is to compare the inferred motifs to the known motifs.
  • Modify line 5-12 in MOBI_stats.py, run the scripts with python MOBI_stats.py. The best parameters will be shown in example_result/stats/DREME_idx.txt if you are using DREME. You can find the result for this optimal paremeters by the file names in example_result/inference/DREME/.

Contacts

For any questions, please contact Jiahao Gao(jiahao.gao@yale.edu)
Gerstein Lab 2024
MIT License