PrOCoil Data Repository (V2)

Contents

  1. Introduction
  2. All Data at a Glance
  3. PDB Data Set
  4. BLAST Augmentation
  5. R Example Code

Introduction

The PrOCoil Web service and the R package procoil are based on the same support vector machine model that was trained according to the model selection procedure described in the following paper:
C. C. Mahrenholz, I. G. Abfalter, U. Bodenhofer, R. Volkmer, and S. Hochreiter. Complex networks govern coiled coil oligomerization - predicting and profiling by means of a machine learning approach. Mol. Cell. Proteomics 10(5):M110.004994, 2011. DOI: 10.1074/mcp.M110.004994
This page provides the data sets that were used to train the PrOCoil models. These are the updated data sets used to train the models that have been included since version 2.0.0 of the PrOCoil R package. In sync with the release of the package, the models in the PrOCoil Web application were also updated on May 7, 2016. The models used before May 7, 2016, or in earlier versions of the R package, are still available here.

All Data at a Glance

More details on these files can be found in the sections below.

PDB Data Set

The following files contain dimeric and trimeric coiled coil segments that we extracted from the PDB - The RCSB Protein Data Bank (version as of January 2015):

This data set was created as follows:
  1. We scanned the whole PDB for coiled coils using SOCKET and selected only parallel dimeric and trimeric coiled coils with a minimum length of 11 amino acids.
  2. We filtered out exact duplicates. In contrast to the previous version of the data set, exact sub-sequences and heptad irregularities were not filtered out.
  3. The assignment of sequences to one of the ten folds was done randomly. However, the sequences were clustered first to correct for sequence clusters that are over-represented in the PDB (e.g. GCN4 mutants). This was done by single linkage clustering according to a gap-free heptad-specific alignment to ensure that no pair of sequences from two different clusters match to a degree of 64% or higher (percentage computed as the number of matching positions relative to the length of the shorter sequence). Finally, whole clusters were merged into folds, i.e. two samples from the same cluster are always assigned to the same fold. As a result, in the cross validation procedure, samples from one cluster are always entirely in the training set or in the test set.
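The clustering and fold assignment in step 3 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure's actual implementation: the function names, the toy segments, and the way the "gap-free heptad-specific alignment" is scored (sliding one segment along the other without gaps and counting positions where both residue and heptad register agree) are our interpretation. Assigning each shuffled cluster to the currently smallest fold is likewise an assumption made here to keep folds balanced.

```python
import random
from itertools import combinations


def match_fraction(seq_a, reg_a, seq_b, reg_b):
    """Best gap-free match, relative to the length of the shorter segment.

    A position counts as matching only if residue AND heptad register agree
    (an assumed reading of "heptad-specific").
    """
    shorter = min(len(seq_a), len(seq_b))
    best = 0
    # offset = index in seq_a that is aligned with index 0 of seq_b
    for offset in range(-(len(seq_b) - 1), len(seq_a)):
        start_a, start_b = max(offset, 0), max(-offset, 0)
        overlap = min(len(seq_a) - start_a, len(seq_b) - start_b)
        matches = sum(
            seq_a[start_a + i] == seq_b[start_b + i]
            and reg_a[start_a + i] == reg_b[start_b + i]
            for i in range(overlap)
        )
        best = max(best, matches)
    return best / shorter


def cluster_single_linkage(segments, threshold=0.64):
    """Single linkage at a fixed cut-off: connected components of the graph
    that links every pair matching to a degree of `threshold` or higher."""
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(segments)), 2):
        if match_fraction(*segments[i], *segments[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(segments)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def assign_folds(clusters, n_folds=10, seed=0):
    """Randomly distribute whole clusters over folds (never split a cluster)."""
    rng = random.Random(seed)
    order = list(clusters)
    rng.shuffle(order)
    folds = [[] for _ in range(n_folds)]
    for cluster in order:
        # Assumption: put each cluster into the currently smallest fold.
        min(folds, key=len).extend(cluster)
    return folds


# Hypothetical toy segments: (sequence, heptad register), not real PDB data.
segments = [
    ("LKQLEDKVEEL", "defgabcdefg"),
    ("LKQLEDKVEEA", "defgabcdefg"),  # near-identical "mutant" of the first
    ("MKQIEDKLEEI", "abcdefgabcd"),
    ("AAAAAAAAAAA", "abcdefgabcd"),
]
clusters = cluster_single_linkage(segments)
folds = assign_folds(clusters, n_folds=3)
```

Because folds are built from whole clusters, the two near-identical toy segments always end up in the same fold, so they are never split between training and test set during cross validation.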

BLAST Augmentation

As described in the PrOCoil paper (see reference above), the PrOCoil PDB Data Set was augmented by putative coiled coils that were determined by masked BLAST searches and Marcoil. The exact procedure, which differs slightly from the one described in the PrOCoil paper, will be described in an upcoming paper.

The following files provide the data set of additional putative coiled coil sequences (the "BLAST Data Set"):

The following files provide the mapping between the PDB Data Set and the BLAST Data Set, i.e. they specify by which coiled coil sequences from the BLAST Data Set each coiled coil sequence from the PDB Data Set can be augmented:

The total number of coiled coil sequences (PDB + BLAST Data Set) is 1764 + 1880 = 3644.
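As a rough illustration of how such a mapping might be applied, the following Python sketch extends a list of PDB entries by the BLAST entries mapped to them. The identifiers, the dictionary layout, and the function name `augment` are hypothetical; the actual file format of the mapping is not described here, and the duplicate check only covers the assumed case that one BLAST sequence is mapped to several PDB sequences.

```python
def augment(pdb_ids, mapping):
    """Return the given PDB entries followed by their mapped BLAST entries."""
    augmented = list(pdb_ids)
    seen = set(augmented)
    for pdb_id in pdb_ids:
        for blast_id in mapping.get(pdb_id, []):
            if blast_id not in seen:  # add each BLAST entry at most once
                seen.add(blast_id)
                augmented.append(blast_id)
    return augmented


# Hypothetical identifiers, for illustration only.
mapping = {
    "pdb_1": ["blast_1", "blast_2"],
    "pdb_2": ["blast_2"],
    "pdb_3": [],
}
train = augment(["pdb_1", "pdb_2"], mapping)
print(train)  # ['pdb_1', 'pdb_2', 'blast_1', 'blast_2']

# Size of the combined data set as stated above.
total = 1764 + 1880
print(total)  # 3644
```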

R Example Code

The following R script contains example code that shows how to process the data described above:

To execute the PrOCoil 2.0 R example code, the KeBABS package and the PrOCoil package (version 2.0.0 or later) are required. Both packages are available from Bioconductor. PrOCoil 2.0 requires at least Bioconductor 3.3, which in turn requires at least R 3.3.0.