Institute for Machine Learning

PrOCoil Data Repository (V2)

Introduction
All Data at a Glance
PDB Data Set
BLAST Augmentation
R Example Code

Introduction

The PrOCoil Web service and the R package procoil are based on the same support vector machine model that was trained according to the model selection procedure described in the following paper:

C. C. Mahrenholz, I. G. Abfalter, U. Bodenhofer, R. Volkmer, and S. Hochreiter. Complex networks govern coiled coil oligomerization - predicting and profiling by means of a machine learning approach. Mol. Cell. Proteomics 10(5):M110.004994, 2011. DOI: 10.1074/mcp.M110.004994

This page provides the data sets that were used for training the PrOCoil models. These are the updated data sets used to train the models that have been included since version 2.0.0 of the PrOCoil R package. In sync with the release of the package, the models in the PrOCoil Web application were also updated on May 7, 2016. The models used before May 7, 2016, or in earlier versions of the R package, are still available here.

All Data at a Glance

PrOCoil_PDB_V2.csv (PDB data set; CSV file; 149KB; 1764 samples)
PrOCoil_PDB_V2.RData (PDB data set; R workspace; 51KB; 1764 samples)
PrOCoil_BLAST_V2.csv (BLAST data set; CSV file; 216KB; 1880 samples)
PrOCoil_BLAST_V2.RData (BLAST data set; R workspace; 40KB; 1880 samples)
PrOCoil_Augmentation_V2.txt (mapping of PDB samples to BLAST samples; plain text; 42KB)
PrOCoil_Augmentation_V2.RData (mapping of PDB samples to BLAST samples; R workspace; 6KB)

More details on these files can be found in the sections below.

PDB Data Set

The following files contain dimeric and trimeric coiled coil segments we extracted from the PDB - The RCSB Protein Data Bank (version as of January 2015):

PrOCoil_PDB_V2.csv (CSV; 149KB; 1764 samples)
Format:
- Column 1 (labeled "ID"): unique identifier "PDB???", where "???" is a number ranging from 1 to 1764
- Column 2 (labeled "PDB_IDs"): PDB back-mapping - IDs of PDB structures in which the sequence is contained
- Column 3 (labeled "Seq"): amino acid sequence
- Column 4 (labeled "Reg"): heptad register
- Column 5 (labeled "Class"): oligomerization state as determined by SOCKET ("DIMER" or "TRIMER")
- Column 6 (labeled "Fold"): assignment of sequence to one of the 10 folds in the model selection procedure (see below)
PrOCoil_PDB_V2.RData (R workspace; 51KB; 1764 samples)
This R workspace contains an R object PDBdataX that holds the same data as PrOCoil_PDB_V2.csv, but in a convenient format that can be processed in R directly with the KeBABS package, without any further preprocessing:
- Object of class AAStringSet containing the 1764 sequences
- The sequences are named according to the unique identifiers.
- PDB back-mapping IDs, oligomerization state, and fold numbers are contained in metadata columns "PDB_IDs", "Class", and "Fold", respectively.
- The heptad registers are contained in a metadata column "annotation". Furthermore, the object contains a metadata component "annotationCharset". This special structure is required to process the data with the coiled coil kernel in the KeBABS package.
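
As a rough illustration of the six-column CSV layout listed above, the following Python sketch reads such a file and runs a few sanity checks. Python is used here only for illustration. The two sample rows (sequences, registers, PDB IDs, fold numbers) are invented placeholders, and the comma delimiter, the lowercase register letters, and the requirement that "Seq" and "Reg" have equal length are assumptions rather than documented properties of the file.

```python
import csv
import io

# Hypothetical two-row excerpt in the documented column layout.
# All values are invented placeholders; the comma delimiter is an assumption.
SAMPLE_CSV = """ID,PDB_IDs,Seq,Reg,Class,Fold
PDB1,1ABC,LKQLEDKVEELLSKNYH,abcdefgabcdefgabc,DIMER,3
PDB2,2XYZ,IEDKIEEILSKIYHI,defgabcdefgabcd,TRIMER,7
"""


def read_pdb_records(handle):
    """Read rows from a file-like object and validate each one."""
    records = []
    for row in csv.DictReader(handle):
        # Assumption: the register annotates the sequence residue by residue.
        if len(row["Seq"]) != len(row["Reg"]):
            raise ValueError("register length differs for " + row["ID"])
        if row["Class"] not in ("DIMER", "TRIMER"):
            raise ValueError("unexpected class for " + row["ID"])
        row["Fold"] = int(row["Fold"])  # store the fold number as an integer
        records.append(row)
    return records


records = read_pdb_records(io.StringIO(SAMPLE_CSV))
for record in records:
    print(record["ID"], record["Class"], record["Fold"], len(record["Seq"]))
```

To apply the sketch to the real file, one would pass an open file handle for PrOCoil_PDB_V2.csv instead of the in-memory sample, adjusting the delimiter if the file turns out to use a different one.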

This data set was created as follows:

We scanned the whole PDB for coiled coils using SOCKET. Of the coiled coils found, only parallel dimeric and trimeric ones with a minimum length of 11 amino acids were selected.
We filtered out exact duplicates. In contrast to the previous version of the data set, exact sub-sequences and heptad irregularities were not filtered out.
The assignment of sequences to one of the ten folds was done randomly. However, the sequences were clustered first to correct for sequence clusters that are over-represented in the PDB (e.g. GCN4 mutants). This was done by single linkage clustering according to a gap-free heptad-specific alignment to ensure that no pair of sequences from two different clusters match to a degree of 64% or higher (percentage computed as the number of matching positions relative to the length of the shorter sequence). Finally, whole clusters were merged into folds, i.e. two samples from the same cluster are always assigned to the same fold. As a result, in the cross validation procedure, samples from one cluster are always entirely in the training set or in the test set.
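
The clustering and fold assignment described above can be sketched roughly as follows. This is a minimal illustration and not the code that was used to build the data set. The similarity function reflects one possible reading of "gap-free heptad-specific alignment" (a position counts as a match only if both the amino acid and the heptad position agree), the function names are hypothetical, the toy sequences are invented, and balancing the folds by size is an added assumption.

```python
import random


def heptad_similarity(seq_a, reg_a, seq_b, reg_b):
    """Best gap-free match fraction, relative to the shorter sequence."""
    if len(seq_a) > len(seq_b):
        seq_a, reg_a, seq_b, reg_b = seq_b, reg_b, seq_a, reg_a
    shorter = len(seq_a)
    best = 0
    # Slide the shorter sequence along the longer one without gaps.
    for offset in range(-(shorter - 1), len(seq_b)):
        matches = 0
        for i in range(shorter):
            j = i + offset
            # Assumed reading: residue and heptad position must both agree.
            if 0 <= j < len(seq_b) and seq_a[i] == seq_b[j] and reg_a[i] == reg_b[j]:
                matches += 1
        best = max(best, matches)
    return best / shorter


def single_linkage_clusters(samples, threshold=0.64):
    """Group samples so that any pair at or above the threshold shares a cluster."""
    parent = list(range(len(samples)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            if heptad_similarity(*samples[i], *samples[j]) >= threshold:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(samples)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def assign_folds(clusters, n_folds=10, seed=0):
    """Place whole clusters into folds in random order (size balancing is an assumption)."""
    rng = random.Random(seed)
    order = list(clusters)
    rng.shuffle(order)
    fold_sizes = [0] * n_folds
    fold_of = {}
    for cluster in order:
        fold = fold_sizes.index(min(fold_sizes))  # currently smallest fold
        for index in cluster:
            fold_of[index] = fold + 1  # folds numbered from 1
        fold_sizes[fold] += len(cluster)
    return fold_of


# Invented toy data: the first two sequences differ in one residue only.
samples = [
    ("LKQLEDKVEEL", "abcdefgabcd"),
    ("LKQLEDKVEEI", "abcdefgabcd"),
    ("AAAAAAAAAAAA", "abcdefgabcde"),
]
clusters = single_linkage_clusters(samples)
fold_of = assign_folds(clusters)
print(clusters, fold_of)
```

Because folds are filled cluster by cluster, two samples from the same cluster can never end up on opposite sides of a training/test split, which is the property the procedure above is designed to guarantee.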

BLAST Augmentation

As described in the PrOCoil paper (see reference above), the PrOCoil PDB Data Set was augmented by putative coiled coils that were determined by masked BLAST searches and Marcoil (the exact procedure, which differs slightly from the one described in the PrOCoil paper, will be described in an upcoming paper). The following files provide the data set of additional putative coiled coil sequences (the "BLAST data set"):

PrOCoil_BLAST_V2.csv (CSV; 216KB; 1880 samples)
Format:
- Column 1 (labeled "ID"): unique identifier "BLAST???", where "???" is a number ranging from 1 to 1880
- Column 2 (labeled "Seq"): amino acid sequence
- Column 3 (labeled "Reg"): heptad register
- Column 4 (labeled "Class"): oligomerization state as determined by SOCKET ("DIMER" or "TRIMER")
PrOCoil_BLAST_V2.RData (R workspace; 40KB; 1880 samples)
This R workspace contains an R object PDBdataX that holds the same data as PrOCoil_BLAST_V2.csv, but in a convenient format that can be processed in R directly with the KeBABS package, without any further preprocessing:
- Object of class AAStringSet containing the 1880 sequences
- The sequences are named according to the unique identifiers.
- The oligomerization state is contained in the metadata column "Class".
- The heptad registers are contained in a metadata column "annotation". Furthermore, the object contains a metadata component "annotationCharset". This special structure is required to process the data with the coiled coil kernel in the KeBABS package.
- There are two additional metadata columns "PDB_IDs" and "Fold", which are filled with NA values for compatibility with the PDB Data Set, i.e. to facilitate easy merging of PDB with BLAST data.
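
The idea of padding the BLAST data with empty "PDB_IDs" and "Fold" entries so that both data sets share one layout can be sketched as follows. This is only an illustration outside of R: records are plain dictionaries, None stands in for NA, the function name merge_datasets is hypothetical, and the sample records are invented.

```python
# Common column layout shared by the merged records.
COLUMNS = ["ID", "PDB_IDs", "Seq", "Reg", "Class", "Fold"]


def merge_datasets(pdb_records, blast_records):
    """Concatenate both record lists, filling missing columns with None."""
    merged = []
    for record in list(pdb_records) + list(blast_records):
        # dict.get returns None for columns a BLAST record does not have.
        merged.append({column: record.get(column) for column in COLUMNS})
    return merged


# Invented placeholder records in the two documented layouts.
pdb_records = [
    {"ID": "PDB1", "PDB_IDs": "1ABC", "Seq": "LKQLEDKVEEL",
     "Reg": "abcdefgabcd", "Class": "DIMER", "Fold": 3},
]
blast_records = [
    {"ID": "BLAST1", "Seq": "LKQLEDKVEEI", "Reg": "abcdefgabcd", "Class": "DIMER"},
    {"ID": "BLAST2", "Seq": "IEDKIEEILSK", "Reg": "defgabcdefg", "Class": "TRIMER"},
]

merged = merge_datasets(pdb_records, blast_records)
print(len(merged), merged[1]["Fold"])
```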

The following files provide the mapping between PDB Data Set and BLAST Data Set, i.e. by which coiled coil sequences from the BLAST Data Set each coiled coil sequence from the PDB Data Set can be augmented:

PrOCoil_Augmentation_V2.txt (plain text; 42KB)
Format: every line starts with an identifier of a sequence of the PDB Data Set, followed by a comma-separated list of BLAST Data Set identifiers.
PrOCoil_Augmentation_V2.RData (R workspace; 6KB)
This R workspace contains an R object PDB2BLASTmapping. This is a list with 100 components. Each component corresponds to one coiled coil sequence from the PDB Data Set that is augmented with at least one sequence from the BLAST Data Set. The names of the components are the unique identifiers of these PDB sequences. The components are character vectors containing the unique identifiers of the BLAST sequences with which the PDB sequences are augmented.
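
A small Python sketch of how the plain-text mapping could be parsed into a structure similar to the list described above (PDB identifier as key, BLAST identifiers as values). The separator between the leading PDB identifier and the list is not specified here, so the sketch accepts whitespace, commas, colons, or semicolons; the sample lines and the function name parse_mapping are invented for illustration.

```python
import re

# Invented sample lines; the space after the PDB identifier is an assumption.
SAMPLE_MAPPING = """PDB1 BLAST1,BLAST2,BLAST3
PDB7 BLAST4
"""


def parse_mapping(text):
    """Map each PDB identifier to the list of BLAST identifiers on its line."""
    mapping = {}
    for line in text.splitlines():
        # Split on any run of whitespace, commas, colons, or semicolons.
        tokens = [token for token in re.split(r"[\s,:;]+", line.strip()) if token]
        if not tokens:
            continue  # skip blank lines
        mapping[tokens[0]] = tokens[1:]
    return mapping


mapping = parse_mapping(SAMPLE_MAPPING)
for pdb_id, blast_ids in mapping.items():
    print(pdb_id, "is augmented by", len(blast_ids), "BLAST sequence(s)")
```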

The total number of coiled coil sequences (PDB + BLAST data set) is 1764 + 1880 = 3644.

R Example Code

The following R script contains example code showing how to process the data described above:

PrOCoil_Example_Code_V2.R (R script; 5KB)
PrOCoil 2.0 R example code
PrOCoil_Example_Code_V2.pdf (PDF file; 126KB)
PrOCoil 2.0 R example code executed and formatted using the knitr package

To execute the PrOCoil 2.0 R example code, the KeBABS package and the PrOCoil package (version at least 2.0.0) are required. Both packages are available from Bioconductor. For PrOCoil 2.0, at least Bioconductor 3.3 is required (which, as a consequence, requires at least R 3.3.0).