PrOCoil Data Repository (V2)
Contents
- Introduction
- All Data at a Glance
- PDB Data Set
- BLAST Augmentation
- R Example Code
The PrOCoil Web service and the R package procoil are based on the same support vector machine model,
which was trained according to the model selection procedure described in the following paper:
C. C. Mahrenholz, I. G. Abfalter, U. Bodenhofer, R. Volkmer, and S. Hochreiter.
Complex networks govern coiled coil oligomerization - predicting
and profiling by means of a machine learning approach.
Mol. Cell. Proteomics 10(5):M110.004994, 2011.
DOI: 10.1074/mcp.M110.004994
This page provides the data sets that were used to train the PrOCoil models.
These are the updated data sets used to train the models that have been included
since version 2.0.0 of the PrOCoil R package. In sync with the release of the
package, the models in the PrOCoil Web application were also updated on May 7, 2016.
The models used before
May 7, 2016, or in earlier versions of the R package, are still available
here.
More details on these files can be found in the sections below.
The following files contain dimeric and trimeric coiled coil segments we extracted from the
PDB - The RCSB Protein Data Bank (version as of January 2015):
- PrOCoil_PDB_V2.csv (CSV; 149KB; 1764 samples)
Format:
- Column 1 (labeled "ID"): unique identifier "PDB???", where "???" is a
number ranging from 1 to 1764
- Column 2 (labeled "PDB_IDs"): PDB back-mapping - IDs of PDB structures in which the sequence is contained
- Column 3 (labeled "Seq"): amino acid sequence
- Column 4 (labeled "Reg"): heptad register
- Column 5 (labeled "Class"): oligomerization state as determined by SOCKET
("DIMER" or "TRIMER")
- Column 6 (labeled "Fold"): assignment of the sequence to one of the 10 folds in the model selection procedure
(see below)
- PrOCoil_PDB_V2.RData (R workspace; 51KB; 1764 samples)
This R workspace contains an R object PDBdataX that holds the same data as PrOCoil_PDB_V2.csv,
but in a convenient format that can be processed in R with the KeBABS package
directly, without any further preprocessing:
- Object of class AAStringSet containing the 1764 sequences
- The sequences are named according to the unique identifiers.
- PDB back-mapping IDs, oligomerization state, and fold numbers are contained in metadata columns
"PDB_IDs", "Class", and "Fold", respectively.
- The heptad registers are contained in a metadata column "annotation". Furthermore,
the object contains a metadata component "annotationCharset". This special structure is
required to process the data with the coiled coil kernel in the KeBABS package.
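For readers who work outside R, the six-column CSV layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the PrOCoil distribution. It assumes a comma delimiter, a header row carrying the column labels, and that "Seq" and "Reg" have equal length. The two rows are invented placeholders, not real entries from PrOCoil_PDB_V2.csv.

```python
import csv
import io

# Two made-up rows in the documented six-column layout (NOT real data).
SAMPLE_CSV = """ID,PDB_IDs,Seq,Reg,Class,Fold
PDB1,1ABC,MKQLEDKVEEL,abcdefgabcd,DIMER,3
PDB2,2XYZ,IEALKQEIAALKK,defgabcdefgab,TRIMER,7
"""


def read_pdb_table(handle):
    """Parse rows of the PDB data set into dictionaries keyed by column label."""
    records = []
    for row in csv.DictReader(handle):
        # Assumption: every residue carries exactly one heptad register letter.
        if len(row["Seq"]) != len(row["Reg"]):
            raise ValueError(f"{row['ID']}: Seq and Reg differ in length")
        if row["Class"] not in ("DIMER", "TRIMER"):
            raise ValueError(f"{row['ID']}: unexpected class {row['Class']!r}")
        row["Fold"] = int(row["Fold"])  # fold number used in cross validation
        records.append(row)
    return records


records = read_pdb_table(io.StringIO(SAMPLE_CSV))

# Group the sample identifiers by fold.
by_fold = {}
for rec in records:
    by_fold.setdefault(rec["Fold"], []).append(rec["ID"])
print(by_fold)
```

To apply the sketch to the real file, replace the `io.StringIO` handle with `open("PrOCoil_PDB_V2.csv", newline="")`, provided the assumptions about delimiter and header hold.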
This data set was created as follows:
- We scanned the whole PDB for coiled coils using SOCKET and selected only parallel dimeric and trimeric
coiled coils with a minimum length of 11 amino acids.
- We filtered out exact duplicates. In contrast to the previous version of the data set, exact sub-sequences
and heptad irregularities were not filtered out.
- The assignment of sequences to one of the ten folds was done randomly. However, the sequences were clustered first
to correct for sequence clusters that are over-represented in the PDB (e.g. GCN4 mutants). This was done by single linkage
clustering according to a gap-free heptad-specific alignment to ensure that no pair of sequences from two
different clusters match to a degree of 64% or higher (percentage computed as the number of matching positions
relative to the length of the shorter sequence). Finally, whole clusters were merged into folds, i.e.
two samples from the same cluster are always assigned to the same fold. As a result, in the cross validation
procedure, samples from one cluster are always entirely in the training set or in the test set.
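The cluster-aware fold assignment described above can be illustrated with a small Python sketch. It is not the authors' implementation and rests on several assumptions: the "gap-free heptad-specific alignment" is approximated by sliding the shorter sequence along the longer one and counting positions where both residue and register letter agree; folds are numbered from 1; and shuffled clusters are placed into the currently smallest fold to keep fold sizes similar. All function names are hypothetical and the sequences are invented.

```python
import random


def match_fraction(seq_a, reg_a, seq_b, reg_b):
    """Best gap-free match fraction, relative to the shorter sequence.

    Assumption: a position matches only if residue AND heptad register agree.
    """
    if len(seq_a) > len(seq_b):  # make (seq_a, reg_a) the shorter one
        seq_a, reg_a, seq_b, reg_b = seq_b, reg_b, seq_a, reg_a
    best = 0
    for offset in range(len(seq_b) - len(seq_a) + 1):
        matches = sum(
            1
            for i in range(len(seq_a))
            if seq_a[i] == seq_b[offset + i] and reg_a[i] == reg_b[offset + i]
        )
        best = max(best, matches)
    return best / len(seq_a)


def single_linkage_clusters(samples, threshold=0.64):
    """Connected components of the graph linking pairs that match >= threshold."""
    parent = list(range(len(samples)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            if match_fraction(*samples[i], *samples[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(samples)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def assign_folds(clusters, n_folds=10, seed=0):
    """Randomly order the clusters, then place each one whole into a fold."""
    rng = random.Random(seed)
    shuffled = list(clusters)
    rng.shuffle(shuffled)
    sizes = [0] * n_folds
    fold_of = {}
    for cluster in shuffled:
        fold = sizes.index(min(sizes))  # assumption: balance fold sizes
        for idx in cluster:
            fold_of[idx] = fold + 1
        sizes[fold] += len(cluster)
    return fold_of


# Invented (sequence, register) pairs: the first two differ in one residue.
samples = [
    ("MKQLEDKVEEL", "abcdefgabcd"),
    ("MKQLEDKVEEI", "abcdefgabcd"),
    ("AAAAAAAAAAA", "abcdefgabcd"),
]
clusters = single_linkage_clusters(samples)
folds = assign_folds(clusters, n_folds=2)
print(clusters, folds)
```

Because folds are filled cluster by cluster, two samples from the same cluster can never be split between training and test set in any cross validation round.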
As described in the PrOCoil paper (see reference above), the PrOCoil PDB Data Set was augmented by
putative coiled coils that were determined by masked BLAST searches and Marcoil (the exact procedure,
which differs slightly from the one described in the PrOCoil paper, will be described in an
upcoming paper).
The following files provide the data set of additional putative coiled coil sequences (the "BLAST data set"):
- PrOCoil_BLAST_V2.csv (CSV; 216KB; 1880 samples)
Format:
- Column 1 (labeled "ID"): unique identifier "BLAST???", where "???" is a
number ranging from 1 to 1880
- Column 2 (labeled "Seq"): amino acid sequence
- Column 3 (labeled "Reg"): heptad register
- Column 4 (labeled "Class"): oligomerization state as determined by SOCKET
("DIMER" or "TRIMER")
- PrOCoil_BLAST_V2.RData (R workspace; 40KB; 1880 samples)
This R workspace contains an R object PDBdataX that holds the same data as PrOCoil_BLAST_V2.csv,
but in a convenient format that can be processed in R with the KeBABS package
directly, without any further preprocessing:
- Object of class AAStringSet containing the 1880 sequences
- The sequences are named according to the unique identifiers.
- The oligomerization state is contained in the metadata column
"Class".
- The heptad registers are contained in a metadata column "annotation". Furthermore,
the object contains a metadata component "annotationCharset". This special structure is
required to process the data with the coiled coil kernel in the KeBABS package.
- There are two additional metadata columns "PDB_IDs" and "Fold" which are
filled with NA values for compatibility with the PDB Data Set, i.e. to facilitate easy merging
of PDB with BLAST data.
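The same padding idea carries over to other environments: before concatenating the two tables, give the BLAST records the columns they lack and fill them with a missing-value marker. The Python sketch below uses `None` in the role of NA. The helper name and the sample records are hypothetical, not taken from the actual files.

```python
# Column labels of the PDB Data Set as documented above.
PDB_COLUMNS = ["ID", "PDB_IDs", "Seq", "Reg", "Class", "Fold"]


def pad_to_pdb_layout(record):
    """Return a copy with every PDB column present; missing ones become None."""
    return {col: record.get(col) for col in PDB_COLUMNS}


# Invented example records (NOT real data).
pdb_rows = [
    {"ID": "PDB1", "PDB_IDs": "1ABC", "Seq": "MKQLEDKVEEL",
     "Reg": "abcdefgabcd", "Class": "DIMER", "Fold": 3},
]
blast_rows = [
    {"ID": "BLAST1", "Seq": "MKQLEDKVEEI",
     "Reg": "abcdefgabcd", "Class": "DIMER"},
]

# After padding, both record types share one layout and can be concatenated.
merged = pdb_rows + [pad_to_pdb_layout(row) for row in blast_rows]
for row in merged:
    print(row["ID"], row["Fold"])
```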
The following files provide the mapping between the PDB Data Set and the BLAST Data Set, i.e. they specify
the coiled coil sequences from the BLAST Data Set with which each coiled coil sequence from the
PDB Data Set can be augmented:
- PrOCoil_Augmentation_V2.txt (plain text; 42KB)
Format: every line starts with the identifier of a sequence from the PDB Data Set, followed
by a comma-separated list of BLAST Data Set identifiers.
- PrOCoil_Augmentation_V2.RData (R workspace; 42KB)
This R workspace contains an R object PDB2BLASTmapping. This is a list with 100 components. Each component corresponds
to one coiled coil sequence from the PDB Data Set that is augmented with at least one sequence from the
BLAST Data Set. The names of the components are the unique identifiers of these PDB sequences.
The components are character vectors containing the unique identifiers of the BLAST sequences with which the PDB sequences
are augmented.
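As a rough illustration of how the plain-text mapping could be consumed, the Python sketch below reads such lines into a dictionary and expands a list of PDB identifiers with their associated BLAST identifiers. The exact separator between the leading PDB identifier and the list is not specified above, so the parser simply extracts every token of the form "PDB<number>" or "BLAST<number>". The sample lines, the function names, and the way the mapping is applied are assumptions made for illustration.

```python
import io
import re

# Invented sample lines; the separator after the PDB identifier is a guess.
SAMPLE_MAPPING = """PDB1: BLAST1, BLAST2
PDB7: BLAST3
"""

ID_PATTERN = re.compile(r"(?:PDB|BLAST)\d+")


def read_mapping(handle):
    """Map each PDB identifier to the list of BLAST identifiers on its line."""
    mapping = {}
    for line in handle:
        ids = ID_PATTERN.findall(line)
        if not ids:
            continue  # skip blank or unrelated lines
        pdb_id, blast_ids = ids[0], ids[1:]
        if not pdb_id.startswith("PDB"):
            raise ValueError(f"line does not start with a PDB identifier: {line!r}")
        mapping[pdb_id] = blast_ids
    return mapping


def augment(pdb_ids, mapping):
    """Append the BLAST identifiers mapped to the given PDB identifiers."""
    augmented = list(pdb_ids)
    for pdb_id in pdb_ids:
        augmented.extend(mapping.get(pdb_id, []))
    return augmented


mapping = read_mapping(io.StringIO(SAMPLE_MAPPING))
print(augment(["PDB1", "PDB2"], mapping))
```

PDB sequences without an entry in the mapping, such as the hypothetical "PDB2" here, are simply passed through unchanged.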
The total number of coiled coil sequences (PDB + BLAST data set) is 1764 + 1880 = 3644.
The following R script contains example code that shows how to process the data described above:
To execute the PrOCoil 2.0 R example code, the KeBABS package and the PrOCoil package (version 2.0.0 or later)
are required. Both packages are available from Bioconductor.
PrOCoil 2.0 requires at least Bioconductor 3.3, which in turn requires at least R 3.3.0.