veren4/SMILES_featurization: In this project at the Rostlab (TUM), I search for the best way to encode small molecules for deep learning applications. The encoding starts from SMILES strings, so this is featurization of SMILES.

Featurization of SMILES for Deep Learning

This is my project for a research internship at the Rostlab at TUM, supervised by Prof. Dr. Burkhard Rost and Christian Dallago. My task is to find a representation of small molecules (a class of drugs) for deep learning applications such as binding affinity prediction. The starting point is the SMILES representation. As a baseline, I use molecular fingerprints.

Files and their function:

0_Data_Loading.ipynb
General data loading and saving as csv.
0_Data_Loading_h5.ipynb
I look at the data from ChemicalChecker, which is in h5 format. It turns out they only offer their model and not their raw data for download, so that exploration ends in this notebook.
1_Finding_the_alphabet.ipynb
My own approach to tokenizing canonical SMILES.
1_Finding_the_alphabet_SmilesPE.ipynb
I use the SmilesPE package to tokenize SMILES.
1_Molecular_Notation_Transformation.ipynb
I transform SMILES to Fingerprints using the RDKit package.
2_Umap.ipynb
I generate 3-dimensional UMAP projections.
3_Alphabet_comparison_Lense_PubChem.ipynb
In this notebook, I load and tokenize samples drawn from the Lenselink and PubChem datasets.
4_Cleaning_the_Lense_dataset.ipynb
I filter the Lenselink dataset, keeping only entries whose token alphabet is a subset of the alphabet shared with PubChem.
4_Cleaning_the_PubChem_dataset.ipynb
I filter the PubChem dataset, keeping only entries whose token alphabet is a subset of the alphabet shared with Lenselink.
alphabet_finder_0.py
A script that extracts the token alphabet from a file of SMILES, using my own tokenization method, and saves it to a txt file.
Lenselink_0_Mapping_SMILES_to_Lenselink.ipynb
Lenselink_1_Molecular_Notation_Transformation.ipynb
Lenselink_2_UMAP_2D.ipynb
Lenselink_2_UMAP_3D.ipynb
Lenselink_3_UMAP_2D_binarized_properties.ipynb
Lenselink_4_Clustering_embeddings.ipynb
Sample_Generator_BackToBasics.py
This script draws x random samples of size y from a dataset without loading the dataset into memory.
Subset_Generator_linecounter.py
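
The tokenization, alphabet extraction, and alphabet-based filtering steps described in the file list above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks or from alphabet_finder_0.py: the regular expression is a commonly used atom-level SMILES pattern rather than my own tokenization rules, and the function names `tokenize`, `alphabet`, and `filter_by_alphabet` are hypothetical.

```python
import re

# Atom-level token pattern (an assumption, not the repository's rule set):
# bracket atoms, two-letter halogens, organic-subset atoms, aromatic atoms,
# two-digit ring closures, single digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/:~@?>*$.])"
)


def tokenize(smiles):
    """Split a SMILES string into tokens; fail if any character is left over."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


def alphabet(smiles_iter):
    """Collect the set of distinct tokens over an iterable of SMILES strings."""
    tokens = set()
    for smiles in smiles_iter:
        tokens.update(tokenize(smiles))
    return tokens


def filter_by_alphabet(smiles_iter, allowed):
    """Keep only SMILES whose token set is a subset of the allowed alphabet."""
    return [s for s in smiles_iter if set(tokenize(s)) <= allowed]


# Toy stand-ins for the two datasets (illustrative molecules only).
lenselink_like = ["CCO", "c1ccccc1Br"]
pubchem_like = ["CCN", "c1ccccc1", "CCO"]

# The shared alphabet is the intersection of both token alphabets.
shared = alphabet(lenselink_like) & alphabet(pubchem_like)
cleaned = filter_by_alphabet(lenselink_like, shared)
```

Here `cleaned` keeps only the entries that use no token outside the shared alphabet, which mirrors the idea behind the two cleaning notebooks.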

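Drawing x random samples of size y without loading the dataset into memory, as Sample_Generator_BackToBasics.py does, can be approached with reservoir sampling. The sketch below is one possible way to do it and is not taken from the script itself, which may work differently (for example by counting lines first). The function name `draw_samples` and its parameters are hypothetical.

```python
import random


def draw_samples(lines, n_samples, sample_size, seed=None):
    """Draw n_samples independent uniform samples of sample_size lines each.

    `lines` can be any iterable, such as an open file handle, so only the
    reservoirs are held in memory. Uses reservoir sampling (Algorithm R),
    with one reservoir per requested sample, in a single pass.
    """
    rng = random.Random(seed)
    reservoirs = [[] for _ in range(n_samples)]
    for i, line in enumerate(lines):
        for reservoir in reservoirs:
            if i < sample_size:
                # Fill phase: the first sample_size lines go straight in.
                reservoir.append(line)
            else:
                # Replace phase: line i survives with probability
                # sample_size / (i + 1).
                j = rng.randint(0, i)
                if j < sample_size:
                    reservoir[j] = line
    return reservoirs


# Usage with a file of SMILES, one per line (the path is a placeholder):
# with open("smiles.txt") as handle:
#     samples = draw_samples((l.rstrip("\n") for l in handle), 3, 1000)
```

If the input has fewer lines than `sample_size`, each sample simply contains all lines. Each sample is drawn without replacement, but the same line may appear in more than one sample.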
