Interpreting Protein Structure Prediction Data
Ginzu is a protocol that attempts to determine the regions of a protein chain that will fold into globular units, called "domains". It scans the protein chain sequence with successively less confident detection methods to find homologs with experimentally determined structures, starting with PDB-Blast (PSI-BLAST against the PDB) and followed by the more remote fold-detection method FFAS03 (previously also ORFeus and Pcons). After any homologs are identified, the remaining regions are searched with HMMER against the Pfam-A protein family database. Next, the PSI-BLAST multiple sequence alignment (MSA) is used to assign regions with an increased likelihood of containing a contiguous domain, based on sequence clusters. The final step uses the PSI-BLAST MSA to select cut points between the domains, and possibly to define new domains from the strongest cut points in any remaining long stretches of sequence that have not already matched a homolog with a structure or a Pfam-A family. Each domain method produces a score that reflects both the confidence in the individual prediction and the confidence in the method's ability to produce accurate domain predictions.
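The ordering of methods described above can be illustrated with a small sketch. This is not the Ginzu implementation: it assumes each method has already produced candidate hits as residue ranges, and it simply keeps a hit only if no higher-confidence hit already covers that region. The names Hit, METHOD_PRIORITY and assign_domains are hypothetical, and cut-point selection is left out entirely.

```python
from typing import List, NamedTuple

# Methods in decreasing order of confidence, as listed in the text.
METHOD_PRIORITY = ["PDB-Blast", "FFAS03", "Pfam", "MSA"]


class Hit(NamedTuple):
    method: str
    start: int  # first residue of the hit, inclusive
    end: int    # last residue of the hit, inclusive


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two residue ranges share at least one position."""
    return a.start <= b.end and b.start <= a.end


def assign_domains(hits: List[Hit]) -> List[Hit]:
    """Keep hits method by method, most confident method first.

    Simplifying assumption: a hit that overlaps an already accepted
    hit is dropped whole rather than trimmed to the uncovered region.
    """
    accepted: List[Hit] = []
    for method in METHOD_PRIORITY:
        for hit in hits:
            if hit.method != method:
                continue
            if not any(overlaps(hit, kept) for kept in accepted):
                accepted.append(hit)
    return sorted(accepted, key=lambda h: h.start)
```

With this scheme a Pfam hit is only reported where neither PDB-Blast nor FFAS03 has already claimed the region, which mirrors the "successively less confident" ordering in the text.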
The PSI-BLAST method is used for detecting sequence homologs of a given protein in the Protein Data Bank (PDB). It searches with a position-specific residue substitution profile appropriate to the family to which the query belongs, which allows more sensitive detection of remote homologous sequences. The confidence displayed is -log10(e-value), where the e-value is the value returned by a PSI-BLAST search against the PDB. A confidence of 3.0 (e = 0.001) is considered a strong detection threshold, and PDB-Blast is considered the highest-confidence method.
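As a worked illustration of the confidence value, the sketch below converts an e-value to a confidence. It assumes the logarithm is base 10, which is what the pairing of 3.0 with e = 0.001 implies. The function and constant names are hypothetical.

```python
import math

# Threshold quoted in the text: confidence 3.0 corresponds to e = 0.001.
STRONG_CONFIDENCE = 3.0


def evalue_confidence(e_value: float) -> float:
    """Return -log10(e_value); larger values mean stronger hits."""
    if e_value <= 0.0:
        raise ValueError("e-value must be positive")
    return -math.log10(e_value)


def is_strong_hit(e_value: float) -> bool:
    """True if the hit reaches the strong detection threshold."""
    return evalue_confidence(e_value) >= STRONG_CONFIDENCE
```

An e-value of 1e-5 therefore gives a confidence of about 5, and an e-value of 0.01 gives about 2, which falls below the threshold.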
FFAS03 is a fold recognition method based on profile-profile comparisons. During Ginzu execution, profiles of regions of the protein not annotated by PDB-Blast are built and compared to profiles built for sequences in the PDB. Scores are divided by -10 to allow comparison with other scaled e-value based scores (e.g. PDB-Blast). Scaled scores >= 0.95 have been shown to have less than 3% false positives, so hits are thresholded at this level.
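The scaling step can be sketched as follows. The sketch assumes that raw FFAS03 scores are negative, with more negative values indicating stronger matches, so that dividing by -10 yields a positive confidence. The function names are hypothetical.

```python
# Threshold on the scaled score quoted in the text.
FFAS_THRESHOLD = 0.95


def ffas_confidence(raw_score: float) -> float:
    """Scale a raw FFAS03 score by dividing by -10."""
    return raw_score / -10.0


def passes_ffas_threshold(raw_score: float) -> bool:
    """True if the scaled score is at or above the threshold."""
    return ffas_confidence(raw_score) >= FFAS_THRESHOLD
```

Under that assumption, a raw score of -20 scales to 2.0 and is kept, while a raw score of -5 scales to 0.5 and is discarded.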
(Jaroszewski, L., Rychlewski, L., Li, Z., Li, W. & Godzik, A. (2005) FFAS03: a server for profile-profile sequence alignments. Nucl. Acids Res. 33, W284-W288)
Pfam is a set of protein sequence families that are represented as hidden Markov models and may be searched with HMMER. The confidence of Pfam-matched domains is given by -log10(e-value), where the e-value is the value returned by the HMMER search of the Pfam database. Values of 3.0 (e = 0.001) or higher are considered significant.
Multiple Sequence Alignment (MSA) is used in the final and lowest-confidence sequence-based step of the Ginzu domain prediction algorithm. Analysis of the PSI-BLAST MSA is employed to predict domain cut points, based on the density of regions in the sequence alignments. Only confident cut predictions are shown. Confidence scores represent the number of sequence clusters in the MSA that overlap the domain.
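The MSA confidence score can be pictured as a simple overlap count. The sketch below assumes that each sequence cluster is summarized by the span of query positions it covers and that any shared position counts as an overlap. Both are illustrative assumptions, and msa_confidence is a hypothetical name.

```python
from typing import List, Tuple

# A span of query positions, both ends inclusive.
Interval = Tuple[int, int]


def msa_confidence(domain: Interval, clusters: List[Interval]) -> int:
    """Count the sequence clusters that overlap the domain."""
    d_start, d_end = domain
    return sum(
        1
        for c_start, c_end in clusters
        if c_start <= d_end and d_start <= c_end
    )
```

For a domain spanning positions 100 to 200, a cluster covering 90 to 110 counts toward the score, while a cluster covering 1 to 99 does not.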
After the Ginzu domain prediction algorithm has exhausted its analysis of the protein sequence to predict protein domains, remaining stretches of the sequence may be designated as individual domains, with longer stretches being cut into separate domains based on length. This is the least confident of the domain prediction steps.
Mammoth Confidence Metric (MCM) is the probability that the structure of the domain is classified in a specific SCOP superfamily. Rosetta structures are predicted for the domain on IBM's World Community Grid and compared to PDB representatives classified in SCOP using the Mammoth structure comparison method. Structures are predicted and classified only for domains assigned by Pfam or MSA, or deduced from the remaining sequence (PDB-Blast and FFAS03 are run first and produce more confident fold predictions). Additionally, structures are only predicted for domains that meet thresholds of length (roughly under 150 residues), low predicted disorder (<25%) and no predicted transmembrane helices. The MCM score is a logistic regression of the Mammoth z-score, Rosetta convergence score, contact order, and length ratio of the domain sequence to the matched structure sequence. Scores >= 0.9 are correct more than 75% of the time, and scores >= 0.8 are correct more than 66% of the time.
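The eligibility rules and the general form of a logistic regression score can be sketched as below. The fitted MCM coefficients are not given here, so the weights and intercept are left to the caller and any values passed in are placeholders. The length cutoff is treated as a strict "under 150" limit, which is an assumption, and all names are hypothetical.

```python
import math
from typing import Sequence

# Domain sources for which Rosetta structures are predicted.
ROSETTA_METHODS = {"Pfam", "MSA", "deduced"}
MAX_LENGTH = 150              # approximate limit quoted in the text
MAX_DISORDER_FRACTION = 0.25  # predicted disorder must be below 25%


def eligible_for_rosetta(method: str, length: int,
                         disorder_fraction: float,
                         tm_helices: int) -> bool:
    """Apply the method, length, disorder and transmembrane filters."""
    return (
        method in ROSETTA_METHODS
        and length < MAX_LENGTH
        and disorder_fraction < MAX_DISORDER_FRACTION
        and tm_helices == 0
    )


def logistic_score(features: Sequence[float],
                   weights: Sequence[float],
                   intercept: float) -> float:
    """Generic logistic regression: 1 / (1 + exp(-(b + w . x))).

    features would hold the Mammoth z-score, Rosetta convergence,
    contact order and length ratio; the weights are NOT the MCM values.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have equal length")
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

The logistic form maps any combination of the four inputs to a value between 0 and 1, which is why the result can be read as a probability.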
Gene Ontology (GO) Prediction
The Gene Ontology (GO) is a collection of terms that allows for the annotation of proteins across many genomes. GO function terms are predicted for domains with predicted structures and integrated with known GO biological process and cellular component terms. Function predictions are meant to be hypotheses testable in the lab. Each prediction is given a log-likelihood ratio (LLR) score to estimate its confidence; a prediction with a score greater than zero is more likely to be true than false. Structure evidence used to predict functions is scaled based on the confidence in the structure classification. For example, function predictions using structure evidence from Rosetta de novo models (i.e. MCM) are scaled by the MCM score. Finally, only predictions with an LLR greater than -3.0 are shown.
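The two LLR cutoffs mentioned above can be read as a simple filter. The sketch below is illustrative only: the record type, the function names and the term labels used with it are hypothetical, and the scaling of structure evidence is not modeled.

```python
from typing import List, NamedTuple

# Predictions must have an LLR strictly greater than this to be shown.
DISPLAY_CUTOFF = -3.0


class GoPrediction(NamedTuple):
    term: str   # placeholder label, not a real GO identifier
    llr: float  # log-likelihood ratio score


def shown_predictions(preds: List[GoPrediction]) -> List[GoPrediction]:
    """Keep only predictions above the display cutoff."""
    return [p for p in preds if p.llr > DISPLAY_CUTOFF]


def more_likely_true(pred: GoPrediction) -> bool:
    """An LLR above zero means more likely true than false."""
    return pred.llr > 0.0
```

A prediction with an LLR of -1.0 would therefore be displayed but still counts as more likely false than true.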
The Protein Data Bank (PDB) is the world's protein structure data repository. We also use "PDB" to refer to the Protein Data Bank file format used to describe a protein structure.
ORFeus (previous method)
ORFeus is a method for matching protein sequences to likely protein folds based on very remote sequence similarities, and is employed in the fold recognition step of the Ginzu domain prediction algorithm. The sequence profile and predicted secondary structures are searched against a database of sequence profiles and predicted secondary structures for proteins of known structure.
Results in the LiveBench test of fold recognition methods suggest that scores of 7.5 or greater are almost always correct matches.
Pcons (previous method)
Pcons was the first consensus server for fold recognition and is used in the fold recognition step by the Ginzu domain prediction algorithm. It selects the best prediction out of several. For each query sequence, predictions from several fold recognition servers are collected. For each of these models, a measure related to the quality of the model is calculated. This measure is predicted from structural comparisons between the models and from the server score for each model. Pcons makes at least 10% more correct predictions than the best single method, and its specificity is significantly better.
Any Pcons score higher than 1.5 should be significant.