govinda-kamath/clustering_on_transcript_compatibility_counts

Clustering of transcript compatibility counts

This repository is a companion to the paper "Fast and accurate single-cell RNA-Seq analysis by clustering of transcript compatibility counts" by Ntranos*, Kamath*, Zhang*, Pachter, and Tse (*equal contributors). It contains the scripts and software necessary for reproducing the results in the paper.

Overview

Rather than clustering cells based on their transcript abundances or gene expression, we determine the transcript compatibility counts (TCCs) for each cell and cluster cells using distances between TCC distributions. While there are multiple ways to compute transcript compatibility counts, this implementation of the method uses kallisto.
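
As a rough illustration of "distances between TCC distributions", the sketch below normalizes each cell's TCC vector into a probability distribution and computes pairwise distances between cells. It assumes the Jensen-Shannon distance as the metric and a small in-memory list of counts as input; the function names (`normalize`, `js_divergence`, `pairwise_distances`) are hypothetical and are not taken from this repository. The notebooks remain the reference for what was actually run.

```python
import math


def normalize(counts):
    """Turn one cell's raw TCC vector into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cell has no reads in any equivalence class")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def pairwise_distances(tcc_matrix):
    """Symmetric matrix of Jensen-Shannon distances between cells.

    tcc_matrix: one row per cell, one column per equivalence class
    (an assumed layout chosen for this sketch).
    """
    dists = [normalize(row) for row in tcc_matrix]
    n = len(dists)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Clamp at 0 to guard against tiny negative rounding errors.
            d = math.sqrt(max(0.0, js_divergence(dists[i], dists[j])))
            D[i][j] = D[j][i] = d
    return D


# Three toy cells over three equivalence classes.
D = pairwise_distances([[10, 0, 0], [5, 0, 0], [0, 0, 7]])
```

Because the counts are normalized per cell, the first two toy cells are at distance 0 despite their different read depths. A matrix like `D` can be passed to any clustering method that accepts precomputed distances.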

In our paper we re-analyzed two recently published datasets:

  1. The 271 primary human myoblasts by Trapnell et al.
  2. The 3005 mouse brain cells by Zeisel et al.

We obtained the raw read files from NCBI's Gene Expression Omnibus. The Trapnell_pipeline and Zeisel_pipeline folders contain scripts for automatically downloading the SRR files corresponding to each dataset. We recommend starting with the documentation in the IPython notebooks Trapnell_Analysis.ipynb, Zeisel_Analysis.ipynb, and Timing_Analysis.ipynb. The notebooks contain the code needed to generate the figures in our paper.

Preliminaries

The following programs are required to run the scripts in this repository.

To run the scripts in the IPython notebooks, the following Python modules are required.

Instructions

To run the analysis of the Trapnell et al. data (Figures 4 and 5 in the main paper), follow these instructions:

  • Build the modified version of kallisto for paired-end reads. It is in modified-kallisto-paired.
  • Download the human transcriptome from here.
  • Download the dataset of Trapnell et al. from here, placing all the .sra files in a single directory. We've provided a sample script that can do this in get_files.py.
  • Pass the directory of the SRA files, the path to the human transcriptome, and the path to the modified version of kallisto for paired-end reads to Trapnell_Analysis.ipynb to verify all results on the dataset of Trapnell et al. Note that most of the compute-intensive code is run from Trapnell_wrapper.py, which is called from within the Jupyter notebook.
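
Before launching a notebook, it may help to confirm that the three inputs listed above exist. The following is a minimal sketch of such a check; `check_inputs` is a hypothetical helper that is not part of this repository, and it assumes only that the SRA directory should contain at least one .sra file.

```python
from pathlib import Path


def check_inputs(sra_dir, transcriptome_path, kallisto_path):
    """Return a list of human-readable problems; an empty list means all inputs were found."""
    problems = []

    sra_dir = Path(sra_dir)
    if not sra_dir.is_dir():
        problems.append(f"SRA directory not found: {sra_dir}")
    elif not any(sra_dir.glob("*.sra")):
        problems.append(f"no .sra files in: {sra_dir}")

    if not Path(transcriptome_path).is_file():
        problems.append(f"transcriptome file not found: {transcriptome_path}")

    # Only existence is checked here, since the path could point to a
    # build directory or to a binary (an assumption of this sketch).
    if not Path(kallisto_path).exists():
        problems.append(f"modified kallisto not found: {kallisto_path}")

    return problems
```

The same check applies to the Zeisel et al. analysis with the mouse transcriptome and the single-end build of kallisto.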

To run the analysis of the Zeisel et al. data (Figures 6, 7, and 8 in the main paper), follow these instructions:

  • Build the modified version of kallisto for single-end reads. It is in modified-kallisto-single and is also used in the timing analysis.
  • Download the mouse transcriptome from here. This is also used in the timing analysis.
  • Download the dataset of Zeisel et al. from here, placing all the .sra files in a single directory. We've provided a sample script that can do this in get_files.py.
  • Pass the directory of the SRA files, the path to the mouse transcriptome, and the path to the modified version of kallisto for single-end reads to Zeisel_Analysis.ipynb to verify all results on the dataset of Zeisel et al. Note that most of the compute-intensive code is run from Zeisel_wrapper.py, which is called from within the Jupyter notebook.

To run the timing analysis (Figure 3 in the main paper), follow these instructions:

  • Build the modified version of kallisto for single-end reads. It is in modified-kallisto-single and is also used in the analysis of the dataset of Zeisel et al.
  • Download the mouse transcriptome from here. This is also used in the analysis of the dataset of Zeisel et al.
  • Download the mouse genome from here and gunzip all the fa.gz files.
  • Pass the path to the mouse transcriptome, the path to the mouse genome, and the path to the modified version of kallisto for single-end reads to Timing_Analysis.ipynb to verify all timing results. Note that most of the compute-intensive code is run from time_test.py, which is called from within the Jupyter notebook. To run the timing analysis on the same 10 cells used in the paper, set the -p option in the notebook's call to time_test.py. Otherwise the analysis is run on 10 cells randomly selected from the dataset of Zeisel et al.
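
The "gunzip all the fa.gz files" step can also be done from Python. The sketch below is one way to do it using only the standard library; `gunzip_fasta_files` is a hypothetical helper, not a script shipped with this repository. Unlike the gunzip command, it leaves the compressed originals in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_fasta_files(genome_dir):
    """Decompress every *.fa.gz file in genome_dir, writing each *.fa next to it."""
    written = []
    for gz_path in sorted(Path(genome_dir).glob("*.fa.gz")):
        out_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)  # streams, so large files are not loaded whole
        written.append(out_path)
    return written
```

Calling `gunzip_fasta_files` with the genome download directory returns the list of decompressed paths, which makes it easy to confirm that every file was expanded before starting the timing analysis.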

The Method

The figure below explains the TCC-based clustering pipeline and contrasts it with conventional approaches.

[Figure: pipeline]