Large-scale comparison of machine learning methods for drug target prediction on ChEMBL
Directory structure
genDirStructure.cpp creates the following directory structure:
SampleIdTable.txt | sample names corresponding to the binary data stored in the subdirectories |
chemFeatures | |
chemFeatures/cl | contains the clustering file that was used, cl1.info; further results from clusterMinFull.cpp are stored in the subdirectory clusterMinFull |
chemFeatures/d | one subdirectory for each dense, real-valued data matrix (csv) |
chemFeatures/s | one subdirectory for each sparse matrix (fpf) |
train | contains the file tocompute.info, which describes the targets (assays) to consider, and the file train.info, which describes the compound-assay relations (duplicate entries may exist) |
run | results from the C/C++ pipeline are stored in subdirectories of this directory |
Furthermore, the Python pipeline assumes the following directories:
dataPython | all data, stored in Python format |
dataPythonReduced | only the compounds considered, stored in Python format (reduces main memory consumption) |
resPython | deep learning results, stored in subdirectories |
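Before starting either pipeline, it can help to check that the layout described above is in place. The following is a minimal sketch, not part of the repository: the function name `missing_entries` is hypothetical, and it assumes that the C/C++ and Python directories all live under one common base directory. genDirStructure.cpp remains the reference for creating the C/C++ part of the structure.

```python
import os

# Names taken from the directory tables above.
# Assumption: all of them live under one common base directory.
EXPECTED_FILES = ["SampleIdTable.txt"]
EXPECTED_DIRS = [
    "chemFeatures/cl",
    "chemFeatures/d",
    "chemFeatures/s",
    "train",
    "run",
    "dataPython",
    "dataPythonReduced",
    "resPython",
]


def missing_entries(base_dir):
    """Return the expected files and directories not found under base_dir."""
    missing = []
    for rel in EXPECTED_FILES:
        if not os.path.isfile(os.path.join(base_dir, rel)):
            missing.append(rel)
    for rel in EXPECTED_DIRS:
        # Split on "/" so the check also works on non-POSIX path separators.
        if not os.path.isdir(os.path.join(base_dir, *rel.split("/"))):
            missing.append(rel)
    return missing
```

Calling `missing_entries(os.path.expanduser("~/jkuLSCData"))` would then return an empty list if everything is present, or the relative names that still need to be created or downloaded.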
The data listed below can be downloaded with the following commands:
wget https://raw.githubusercontent.com/ml-jku/lsc/master/download.sh
chmod u+x download.sh
./download.sh ~/jkuLSCData
Data SampleIdTable.txt
- SampleIdTable.txt (MD5: df7e773de4ce0272d8ed8207c4ef6a6f)
Data chemFeatures/cl
- clusterMinFull.zip (MD5: 039c96f160dc66b326c9bd220327528f)
- cl1.info (MD5: 7187541a8706bdfbc32d53c6f936266c)
Data chemFeatures/d
- dense.zip (MD5: 947a5ecf3aef70a4a20ae7925e9dd4d5)
- semisparse.zip (MD5: b3e8068e3c20d4a5dde6cdb1e09159dd)
- toxicophores.zip (MD5: 19b5a2879ba601e0fba7db237606f537)
Data chemFeatures/s
- ECFC4.zip (MD5: 7f2d170469eb8cb09bc65d17349e8ad7)
- ECFC6_ES.zip (MD5: b13e686c6c01357ab371e47a8c8835ff)
- DFS8_ES.zip (MD5: fbd79243d2774d1d2f7ccc9ffad266c3)
- semisparse.zip (MD5: af609242aacb3f1ed364e907d9167ccb)
- toxicophores.zip (MD5: 673acdc5628898ef340fac35f3495adb)
Data train
- train.zip (MD5: 5cb9ccd3486651a4bcd90a7c583dcbc8)
- tocompute.info (MD5: e5ee2c1c729f1af7190e1d220241a780)
Data dataPython
- dataPython.zip (MD5: 1e3166eba4406b00f490a0a7c021bd3e, does not include data for GC, Weave and SmilesLSTM)
Data dataPythonReduced
- dataPythonReduced.zip (MD5: c665bb9f4acbfce94adc6c4ab240bac0, does not include data for GC, Weave and SmilesLSTM)
- chembl20LSTM.pckl (MD5: 357f80595f6b081daef839c7f642fb30)
- chembl20Smiles.pckl (MD5: bb3027dbc41163fc55cdff9b5fedbb50)
- chembl20MACCS.pckl (MD5: 3a79efc3230cb4eed512d6795b124fd0)
- chembl20Deepchem.pckl (MD5: 255684411c7ff4aea899fa0db294869e)
- chembl20Conv.pckl (MD5: 4ddc2d1f49b379c63e518996fe1b7377)
- chembl20Weave.pckl (MD5: e9394029b2a697d0757adff3fadfb6f0)
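The MD5 values listed above can be used to check that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `verify` are hypothetical, and where each file ends up on disk depends on how it was downloaded.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reading keeps memory use low for the larger archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `verify("cl1.info", "7187541a8706bdfbc32d53c6f936266c")` should return `True` for an intact copy of cl1.info, assuming the file is in the current working directory.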
Contact: Andreas Mayr (mayr@bioinf.jku.at)