PrOCoil Data Repository (V1)
Contents
- Introduction
- All Data at a Glance
- PDB Data Set
- Clustering
- BLAST Augmentation
- Model Selection and Training
- Script for Computing Kernel Matrices
Introduction
The PrOCoil Web service and the R package procoil are based on the same support vector machine model, which was trained according to the model selection procedure described in the following paper:
C. C. Mahrenholz, I. G. Abfalter, U. Bodenhofer, R. Volkmer, and S. Hochreiter. Complex networks govern coiled coil oligomerization - predicting and profiling by means of a machine learning approach. Mol. Cell. Proteomics 10(5):M110.004994, 2011. DOI: 10.1074/mcp.M110.004994
The purpose of this page is to make available all data that were used for evaluating the computational approach and for training the PrOCoil model. The data on this page were used to train the original PrOCoil models, which were used in versions 1.x.y of the R package and by the PrOCoil Web service before May 7, 2016. The data that were used to train the updated models are available here.
All Data at a Glance
- PrOCoil_PDB.csv (PDB data set; CSV; 26KB; 477 samples)
- PrOCoil_PDB_Matching.csv (PDB back-mapping; CSV; 46KB; 1587 lines)
- PrOCoil_PDB_60.txt (clustered PDB data set; plain text; 3.2KB; 374 lines)
- PrOCoil_PDB_60_Alignment.txt (multiple alignments of PDB data set clusters; plain text; 24KB; 851 lines)
- PrOCoil_BLAST.csv (BLAST data set; CSV; 736KB; 2357 samples)
- PrOCoil_Augmentation.txt (mapping of PDB samples to BLAST samples; plain text; 68KB; 477 lines)
- PrOCoil.csv (PDB + BLAST data set; CSV; 776KB; 2834 samples)
- PrOCoil_PDB_Folds.zip (PDB data set folds/test sets; ZIP; 9.1KB; 10 files)
- PrOCoil_CV_TrainingSets.zip (cross validation training sets; ZIP; 824KB; 20 files)
- PrOCoil_NestedCV_TrainingSets.zip (training sets for nested cross validation; ZIP; 3.4MB; 90 files)
- CoiledCoilKernel.pl (Perl script for computing kernel matrices; 7.4KB)
PDB Data Set
The following file contains dimeric and trimeric coiled coil segments that we extracted from the PDB (the RCSB Protein Data Bank; version as of April 2007):
- PrOCoil_PDB.csv (CSV; 26KB; 477 samples)
Format:
- Column 1: unique identifier
- Column 2: amino acid sequence
- Column 3: heptad register
- Column 4: oligomerization state as determined by SOCKET ("DIMER" or "TRIMER")
Notes:
- We scanned the whole PDB for coiled coils using SOCKET and selected only parallel dimeric and trimeric coiled coils.
- We filtered out duplicates and exact sub-sequences.
- In case a coiled coil segment has heptad irregularities, we only included the longest sub-sequence whose heptad register is regular.
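The four-column layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes that the file has no quoting peculiarities, that the heptad register is a string with one letter per residue, and that any row whose fourth column is neither "DIMER" nor "TRIMER" (for example a header line, if present) can be skipped. The names CoiledCoil and read_pdb_samples and the toy rows are hypothetical and not taken from the data set.

```python
import csv
import io
from typing import NamedTuple


class CoiledCoil(NamedTuple):
    identifier: str
    sequence: str
    register: str
    state: str


def read_pdb_samples(handle):
    """Parse rows of the form identifier,sequence,register,state."""
    samples = []
    for row in csv.reader(handle):
        fields = [field.strip() for field in row]
        # Skip blank lines and anything that is not a data row (e.g. a header).
        if len(fields) != 4 or fields[3] not in ("DIMER", "TRIMER"):
            continue
        sample = CoiledCoil(*fields)
        # Assumption: the register string has one letter per residue.
        if len(sample.sequence) != len(sample.register):
            raise ValueError("register length mismatch in " + sample.identifier)
        samples.append(sample)
    return samples


# Hypothetical rows for illustration only; not taken from PrOCoil_PDB.csv.
toy_file = io.StringIO(
    "TOY001,LKELEDK,abcdefg,DIMER\n"
    "TOY002,IEELLKKAEE,defgabcdef,TRIMER\n"
)
toy_samples = read_pdb_samples(toy_file)
print(len(toy_samples), toy_samples[0].state)
```

To read the actual file, the same function can be called on a handle opened with open("PrOCoil_PDB.csv", newline="").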
The following file provides a back-mapping of the above sequences to coiled coil segments in the PDB as identified by SOCKET:
- PrOCoil_PDB_Matching.csv (CSV; 46KB; 1587 lines)
Format:
- Column 1: unique identifier (according to PrOCoil_PDB.csv)
- Column 2: identifies whether the segment matches a coiled coil segment exactly or partly; "MATCHES" means that the sequence matches a coiled coil determined by SOCKET exactly; "CONTAINEDIN" means that the sequence matches a sub-sequence of a coiled coil determined by SOCKET; "CONTAINS" means that the sequence contains a coiled coil determined by SOCKET as sub-sequence; matching is supposed to be exact both in terms of amino acid sequence and heptad register
- Columns 3-6: these columns identify to which coiled coil segment the sequence fits (exactly or partly), where Column 3 is the PDB identifier, Column 4 identifies the chain, and Columns 5 and 6 correspond to the start and end positions of the coiled coil segments in the chain (as determined by SOCKET)
Example:
PDB357,MATCHES,1FN9,A,100,110
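The three relation keywords can be illustrated with a short Python sketch. It parses one line of the back-mapping file and classifies how a data set sequence relates to a SOCKET segment, requiring agreement in both amino acid sequence and heptad register as stated above. The function and class names are hypothetical, the toy sequences are invented, and the sketch assumes that the start and end positions are plain integers.

```python
from typing import NamedTuple, Optional


class Match(NamedTuple):
    identifier: str
    relation: str
    pdb_id: str
    chain: str
    start: int
    end: int


def parse_matching_line(line):
    """Split one back-mapping line into its six columns."""
    ident, rel, pdb_id, chain, start, end = line.strip().split(",")
    # Assumption: positions are plain integers (no insertion codes).
    return Match(ident, rel, pdb_id, chain, int(start), int(end))


def occurs_in(seq, reg, other_seq, other_reg):
    """True if (seq, reg) occurs in (other_seq, other_reg) at the same offset."""
    n = len(seq)
    return any(
        other_seq[i:i + n] == seq and other_reg[i:i + n] == reg
        for i in range(len(other_seq) - n + 1)
    )


def relation(seq, reg, socket_seq, socket_reg) -> Optional[str]:
    """Classify a data set sequence against a SOCKET coiled coil segment."""
    if seq == socket_seq and reg == socket_reg:
        return "MATCHES"
    if occurs_in(seq, reg, socket_seq, socket_reg):
        return "CONTAINEDIN"
    if occurs_in(socket_seq, socket_reg, seq, reg):
        return "CONTAINS"
    return None


record = parse_matching_line("PDB357,MATCHES,1FN9,A,100,110")
print(record.pdb_id, record.chain, record.start, record.end)
# Invented toy segment: same residues, but the register must agree as well.
print(relation("KELED", "bcdef", "LKELEDK", "abcdefg"))
```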
Clustering
As described in the paper (see reference above), sequence clustering was performed to correct for sequence clusters that are over-represented in the PDB (e.g. GCN4 mutants). This was done by single linkage clustering according to a gap-free heptad-specific alignment, which ensures that no two sequences from different clusters match to a degree of 60% or higher (percentage computed as the number of matching positions relative to the length of the shorter sequence). The following files provide the final grouping of samples in our PDB data set and the corresponding alignments:
- PrOCoil_PDB_60.txt (plain text; 3.2KB; 374 lines)
Format: every line corresponds to one cluster of sequences; each cluster is a comma-separated list of identifiers according to PrOCoil_PDB.csv
- PrOCoil_PDB_60_Alignment.txt (plain text; 24KB; 851 lines)
Format: every cluster is shown as a gap-free multiple alignment with the heptad register on top
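The clustering criterion can be sketched in Python as follows. This is not the original implementation, and several details are assumptions: the sketch tries every gap-free shift of the shorter sequence against the longer one, accepts a shift only if the heptad registers agree on all overlapping positions, takes the best count of identical residues, and divides by the length of the shorter sequence. Single linkage at the 60% threshold then amounts to taking connected components of the "similarity of 60% or higher" graph. All names and toy sequences are hypothetical.

```python
def similarity(seq_a, reg_a, seq_b, reg_b):
    """Best gap-free, register-consistent match fraction (assumed definition)."""
    if len(seq_a) > len(seq_b):
        seq_a, reg_a, seq_b, reg_b = seq_b, reg_b, seq_a, reg_a
    best = 0
    # Shift the shorter sequence along the longer one; overhangs are allowed.
    for offset in range(-(len(seq_a) - 1), len(seq_b)):
        matches = 0
        consistent = True
        for i in range(len(seq_a)):
            j = i + offset
            if 0 <= j < len(seq_b):
                if reg_a[i] != reg_b[j]:
                    consistent = False
                    break
                if seq_a[i] == seq_b[j]:
                    matches += 1
        if consistent:
            best = max(best, matches)
    return best / len(seq_a)


def single_linkage(samples, threshold=0.6):
    """Group (identifier, sequence, register) triples by single linkage."""
    parent = list(range(len(samples)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            _, seq_i, reg_i = samples[i]
            _, seq_j, reg_j = samples[j]
            if similarity(seq_i, reg_i, seq_j, reg_j) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, (identifier, _, _) in enumerate(samples):
        clusters.setdefault(find(i), []).append(identifier)
    return list(clusters.values())


# Invented toy sequences, not taken from the data set.
toy = [
    ("A", "LKELEDK", "abcdefg"),
    ("B", "LKELEDR", "abcdefg"),
    ("C", "AAAAAAA", "abcdefg"),
]
toy_clusters = single_linkage(toy)
print(toy_clusters)
```

The pairwise loop is quadratic in the number of samples, which is unproblematic for a data set of this size (477 samples).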
BLAST Augmentation
As described in the paper (see reference above), the PrOCoil PDB Data Set was augmented by putative coiled coils that were determined by masked BLAST searches and Marcoil (the exact procedure is described in the paper). The following file provides the set of additional coiled coil sequences (the "BLAST data set"):
- PrOCoil_BLAST.csv (CSV; 736KB; 2357 samples)
Format:
- Column 1: unique identifier
- Column 2: amino acid sequence
- Column 3: heptad register
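The following file maps the samples of the PDB Data Set to the samples of the BLAST Data Set:
- PrOCoil_Augmentation.txt (plain text; 68KB; 477 lines)
Format: every line starts with the identifier of a sequence of the PDB Data Set, followed by a comma-separated list of BLAST Data Set identifiers
The following file contains the combined PDB + BLAST data set:
- PrOCoil.csv (CSV; 776KB; 2834 samples)
Format:
- Column 1: unique identifier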
- Column 2: amino acid sequence
- Column 3: heptad register
- Column 4: oligomerization state
Model Selection and Training
For model selection, the Clustered PDB Data Set was randomly split into 10 folds. More specifically, every fold is the union of a random selection of clusters, so no cluster was split over different folds. Thereby, we ensured that no two coiled coil sequences belonging to different folds have a sequence similarity of 60% or more (according to the definition above). The following archive contains all folds:
- PrOCoil_PDB_Folds.zip (ZIP; 9.1KB; 10 files)
This archive contains 10 files PrOCoil_PDB_F??.csv (with ?? being the two-digit number identifying the fold).
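The idea of building folds from whole clusters can be sketched as follows. The page only states that each fold is a union of randomly selected clusters; the shuffle-then-round-robin scheme used here is an assumption made for illustration, as are the function name, the seed, and the toy clusters.

```python
import random


def assign_clusters_to_folds(clusters, n_folds=10, seed=0):
    """Distribute whole clusters over folds so that no cluster is split."""
    rng = random.Random(seed)
    shuffled = list(clusters)
    rng.shuffle(shuffled)
    folds = [[] for _ in range(n_folds)]
    # Assumed scheme: deal the shuffled clusters out in round-robin order.
    for index, cluster in enumerate(shuffled):
        folds[index % n_folds].extend(cluster)
    return folds


# Hypothetical clusters of one to three identifiers each.
toy_fold_clusters = [
    ["S%02d_%d" % (c, k) for k in range(1 + c % 3)] for c in range(25)
]
toy_folds = assign_clusters_to_folds(toy_fold_clusters)
print([len(fold) for fold in toy_folds])
```

Because clusters differ in size, folds built this way generally do not contain exactly the same number of sequences.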
As described in the supplement of the paper (see reference above), we first used nested cross validation to validate our model selection procedure. In the outer cross validation loop, we withheld one fold as test set and applied 9-fold cross validation on the remaining 9 folds in the inner cross validation loop. The following archive contains all 45 + 45 training sets for the inner cross validation loops, each of which, as a consequence, contains 8 folds:
- PrOCoil_NestedCV_TrainingSets.zip (ZIP; 3.4MB; 90 files)
This archive contains 45 training sets PrOCoil_PDB_X??_X??.csv (with ?? being two-digit numbers identifying the folds that are left out from the training set). Each of those files corresponds to one training set without BLAST augmentation. This archive further contains 45 BLAST-augmented training sets PrOCoil_PDB_BLAST_X??_X??.csv (naming conventions analogously).
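The count of 45 follows from the fact that a training set is determined by the unordered pair of folds that are left out: 10 outer folds times 9 inner folds give 90 ordered pairs, and leaving out folds i and j yields the same training set regardless of which one is the outer test fold, so 90 / 2 = 45 distinct sets remain. The following Python sketch enumerates them. It assumes that folds are numbered 01 to 10 and that the two numbers appear in ascending order in the file name; neither detail is confirmed on this page.

```python
from itertools import combinations

FOLDS = list(range(1, 11))  # assumption: folds are numbered 01..10


def nested_cv_training_sets(folds=FOLDS, prefix="PrOCoil_PDB"):
    """Map each (assumed) file name to the list of folds it contains."""
    training_sets = {}
    for left_out in combinations(folds, 2):
        kept = [fold for fold in folds if fold not in left_out]
        # Assumed naming: the two left-out folds in ascending order.
        name = "%s_X%02d_X%02d.csv" % ((prefix,) + left_out)
        training_sets[name] = kept
    return training_sets


plain_sets = nested_cv_training_sets()
blast_sets = nested_cv_training_sets(prefix="PrOCoil_PDB_BLAST")
print(len(plain_sets), len(blast_sets))
```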
In the outer cross validation loop, models were trained on all 9 training folds. For the final parameter selection, we used regular 10-fold cross validation. The following archive provides 10 + 10 training sets with only one test fold left out:
- PrOCoil_CV_TrainingSets.zip (ZIP; 824KB; 20 files)
This archive contains 10 training sets PrOCoil_PDB_X??.csv (with ?? being the two-digit numbers identifying the fold that is left out from the training set). Each of those files corresponds to one training set without BLAST augmentation. This archive further contains 10 BLAST-augmented training sets PrOCoil_PDB_BLAST_X??.csv (naming conventions analogously).
Format of the CSV files in the archives above:
- Column 1: unique identifier
- Column 2: amino acid sequence
- Column 3: heptad register
- Column 4: oligomerization state
Script for Computing Kernel Matrices
The following Perl script can be used to compute coiled coil kernel matrices that can be supplied to support vector machine software:
- CoiledCoilKernel.pl (Perl script; 7.4KB)
To obtain information on how to use the script, execute it without arguments.
Example: The first command takes the non-augmented cross validation training set without fold no. 7 and computes the normalized coiled coil kernel matrix with m=7. The second command trains a support vector machine with cost parameter C=8 on this matrix (the LIBSVM option -t 4 selects a precomputed kernel). The third command computes the kernel matrix of fold no. 7 as test set versus the training set. The fourth command performs SVM prediction. Note that you need Perl and LIBSVM to execute these commands.
    perl CoiledCoilKernel.pl -m=7 -norm=1 -output=LIBSVM PrOCoil_PDB_X07.csv > PrOCoil_PDB_X07.train
    svm-train -t 4 -c 8 PrOCoil_PDB_X07.train PrOCoil_PDB_X07.model
    perl CoiledCoilKernel.pl -m=7 -norm=1 -output=LIBSVM PrOCoil_PDB_X07.csv PrOCoil_PDB_F07.csv > PrOCoil_PDB_F07.test
    svm-predict PrOCoil_PDB_F07.test PrOCoil_PDB_X07.model PrOCoil_PDB_F07.out
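After the four commands above have run, the predictions can be compared with the true labels. The following Python sketch relies on two properties of LIBSVM's file formats: the first token of every line in the test file is the label, and the output file of svm-predict (without probability estimates) contains one predicted label per line. How the Perl script encodes "DIMER" and "TRIMER" as numeric labels is not stated on this page, so the sketch treats labels as opaque numbers; the function names and toy lines are hypothetical.

```python
from collections import Counter


def first_token_labels(lines):
    """Read the leading label of each non-empty line as a float."""
    return [float(line.split()[0]) for line in lines if line.strip()]


def confusion_counts(test_lines, predicted_lines):
    """Count (true label, predicted label) pairs."""
    truth = first_token_labels(test_lines)
    predicted = first_token_labels(predicted_lines)
    if len(truth) != len(predicted):
        raise ValueError("test file and prediction file differ in length")
    return Counter(zip(truth, predicted))


def accuracy(counts):
    """Fraction of samples whose predicted label equals the true label."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples")
    correct = sum(n for (t, p), n in counts.items() if t == p)
    return correct / total


# Invented lines in LIBSVM's precomputed-kernel layout (label, 0:serial, kernel values).
toy_test = ["1 0:1 1:0.9 2:0.1", "-1 0:2 1:0.2 2:0.8", "-1 0:3 1:0.7 2:0.3"]
toy_out = ["1", "-1", "1"]
toy_counts = confusion_counts(toy_test, toy_out)
print(toy_counts, accuracy(toy_counts))
```

For the real files, the two arguments would be the lines of PrOCoil_PDB_F07.test and PrOCoil_PDB_F07.out, for example obtained via open(...).read().splitlines().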