
In this project at the Rostlab (TUM), I search for the best way to encode small molecules for deep learning applications. The starting point is the SMILES notation, so the task amounts to featurizing SMILES.

veren4/SMILES_featurization


Featurization of SMILES for Deep Learning

This is my project for a research internship at the Rostlab at TUM, supervised by Prof. Dr. Burkhard Rost and Christian Dallago. My task is to find a representation of small molecules (a class of drugs) for deep learning applications such as binding affinity prediction. The starting point is the SMILES representation; fingerprints serve as the baseline.

Files and their function:

  • 0_Data_Loading.ipynb
    General data loading; the data is saved as CSV.

  • 0_Data_Loading_h5.ipynb
    I look at the data from ChemicalChecker, which is in h5 format. It turns out they only offer their model and not their raw data for download, so that exploration ends in this notebook.

  • 1_Finding_the_alphabet.ipynb
    My own approach to tokenizing canonical SMILES.

  • 1_Finding_the_alphabet_SmilesPE
    I use the SmilesPE package to tokenize SMILES.

  • 1_Molecular_Notation_Transformation.ipynb
    I transform SMILES to Fingerprints using the RDKit package.

  • 2_Umap.ipynb
    I generate 3-dimensional UMAPs.

  • 3_Alphabet_comparison_Lense_PubChem.ipynb
    In this notebook, I load and tokenize samples drawn from the Lenselink dataset and from the PubChem dataset.

  • 4_Cleaning_the_Lense_dataset.ipynb
    I filter the Lenselink dataset for entries whose token alphabet is a subset of the alphabet shared with PubChem.

  • 4_Cleaning_the_PubChem_dataset.ipynb
    I filter the PubChem dataset for entries whose token alphabet is a subset of the alphabet shared with Lenselink.

  • alphabet_finder_0.py
    A script that uses my own tokenization method to extract the token alphabet from a file of SMILES and saves it to a txt file.

  • Lenselink_0_Mapping_SMILES_to_Lenselink.ipynb

  • Lenselink_1_Molecular_Notation_Transformation.ipynb

  • Lenselink_2_UMAP_2D.ipynb

  • Lenselink_2_UMAP_3D.ipynb

  • Lenselink_3_UMAP_2D_binarized_properties.ipynb

  • Lenselink_4_Clustering_embeddings.ipynb

  • Sample_Generator_BackToBasics.py
    This script draws x random samples of size y from a dataset without loading the dataset into memory.

  • Subset_Generator_linecounter.py
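The alphabet extraction and the subset filtering described for the `1_Finding_the_alphabet` and `4_Cleaning_*` notebooks can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the regular expression is a common atom-level SMILES pattern assumed here, and the function names `tokenize`, `alphabet` and `filter_by_alphabet` as well as the toy SMILES lists are hypothetical.

```python
import re

# Assumed atom-level token pattern; the notebooks' own tokenizer may split
# SMILES differently. Bracket atoms and two-letter halogens are matched first.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFIbcnosp]|[=#$:/\\\-+().~*@]|\d)"
)


def tokenize(smiles):
    """Split one SMILES string into tokens; fail if any character is unmatched."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not fully tokenize: {smiles!r}")
    return tokens


def alphabet(smiles_iter):
    """Collect the set of distinct tokens over an iterable of SMILES."""
    found = set()
    for smiles in smiles_iter:
        found.update(tokenize(smiles))
    return found


def filter_by_alphabet(smiles_iter, allowed):
    """Keep only the SMILES whose tokens all lie in the allowed alphabet."""
    return [s for s in smiles_iter if set(tokenize(s)) <= allowed]


# Toy stand-ins for samples of the two datasets (hypothetical values).
lenselink_sample = ["CCO", "c1ccccc1", "CC[Se]C"]
pubchem_sample = ["CCN", "c1ccncc1", "OCC"]

shared = alphabet(lenselink_sample) & alphabet(pubchem_sample)
lenselink_clean = filter_by_alphabet(lenselink_sample, shared)
pubchem_clean = filter_by_alphabet(pubchem_sample, shared)
```

Intersecting the two alphabets first and then filtering each dataset against the intersection keeps both cleaned datasets within the same vocabulary, which is what the two cleaning notebooks describe.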
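One way to draw x random samples of size y without loading the dataset into memory, as `Sample_Generator_BackToBasics.py` is described as doing, is reservoir sampling. The sketch below assumes one SMILES per line and uses that technique for illustration; the actual script may work differently (for example by counting lines first), and the names `reservoir_sample` and `draw_samples` are hypothetical.

```python
import random


def reservoir_sample(lines, size, rng):
    """Uniformly sample `size` items from an iterable in a single pass (Algorithm R)."""
    reservoir = []
    for i, line in enumerate(lines):
        if i < size:
            # Fill the reservoir with the first `size` items.
            reservoir.append(line)
        else:
            # Replace a random slot with probability size / (i + 1).
            j = rng.randint(0, i)
            if j < size:
                reservoir[j] = line
    return reservoir


def draw_samples(path, n_samples, size, seed=0):
    """Draw `n_samples` independent samples of `size` lines each from a text file."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # The file is streamed line by line; only `size` lines are held at once.
        with open(path) as handle:
            picked = reservoir_sample(handle, size, rng)
        samples.append([line.rstrip("\n") for line in picked])
    return samples
```

Each sample costs one full pass over the file, but memory use stays proportional to the sample size rather than to the dataset size.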
