Interpreting Protein Structure Prediction Data
Deduced Domains
After the Ginzu domain prediction algorithm has exhausted its analysis of the protein
sequence to predict protein domains, remaining stretches of the sequence may be designated as individual domains, with
longer stretches being cut into separate domains based on length. This is the least confident of the domain prediction
steps.
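As a rough illustration of cutting a leftover stretch "based on length", the sketch below splits an unassigned residue range into near-equal pieces. The function name `split_by_length`, the `max_len` value of 200 residues, and the equal-size splitting rule are all assumptions made for this example; the actual Ginzu length limits and cut rules are not stated here.

```python
def split_by_length(start, end, max_len=200):
    """Split the inclusive residue range [start, end] into near-equal
    pieces, none longer than max_len.

    max_len=200 is a hypothetical limit chosen for illustration only.
    """
    length = end - start + 1
    if length <= 0:
        raise ValueError("end must not be smaller than start")

    n_pieces = -(-length // max_len)        # ceiling division
    base, extra = divmod(length, n_pieces)  # spread the remainder evenly

    pieces = []
    pos = start
    for i in range(n_pieces):
        size = base + (1 if i < extra else 0)
        pieces.append((pos, pos + size - 1))
        pos += size
    return pieces


# A 450-residue leftover stretch becomes three deduced domains.
print(split_by_length(301, 750))
```

A stretch shorter than `max_len` comes back as a single range, which corresponds to the case where the whole remainder is designated as one domain.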
Ginzu
Ginzu is a protocol that attempts to determine the regions of a protein chain that will fold into globular units,
called "domains". It scans the protein chain sequence with successively less confident detection methods to
find homologs with experimentally determined structures, starting with
PDB-BLAST (PSI-BLAST against the PDB), followed by the more
remote fold-detection methods ORFeus and Pcons. After any homologs
are identified, the remaining regions are searched with HMMER against the Pfam-A protein family database.
Next, the PSI-BLAST multiple sequence alignment is used to assign regions with an increased likelihood of containing
a contiguous domain, based on sequence clusters. The final step uses the PSI-BLAST MSA to select cut points between the
domains. For any remaining long stretches of the sequence that have matched neither a homolog with a structure nor
a Pfam-A family, new domains may also be defined at the strongest cut points.
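The "most confident method first, then search only what is left" pattern described above can be sketched as follows. This is a minimal illustration, not the Ginzu implementation: `run_cascade`, `uncovered`, and the toy detector functions are hypothetical names, and the toy detectors simply return fixed hits in place of real PDB-BLAST or HMMER searches.

```python
def uncovered(region, hits):
    """Return the parts of an inclusive (start, end) region not covered by any hit."""
    start, end = region
    gaps = []
    pos = start
    for h_start, h_end in sorted(hits):
        # Clip each hit to the region being searched.
        h_start, h_end = max(h_start, start), min(h_end, end)
        if h_start > h_end:
            continue
        if h_start > pos:
            gaps.append((pos, h_start - 1))
        pos = max(pos, h_end + 1)
    if pos <= end:
        gaps.append((pos, end))
    return gaps


def run_cascade(sequence, detectors):
    """Apply detectors in order of decreasing confidence.

    Each detector is a (name, function) pair; the function takes the
    sequence and a region and returns a list of matched (start, end) ranges.
    Later detectors only see regions that earlier ones left unassigned.
    """
    remaining = [(1, len(sequence))]
    assigned = []
    for name, detect in detectors:
        next_remaining = []
        for region in remaining:
            hits = detect(sequence, region)
            assigned.extend((name, hit) for hit in hits)
            next_remaining.extend(uncovered(region, hits))
        remaining = next_remaining
    return assigned, remaining


# Toy stand-ins for real searches (hypothetical, fixed results).
def toy_pdb_blast(sequence, region):
    return [(1, 100)] if region[0] <= 1 and region[1] >= 100 else []


def toy_pfam(sequence, region):
    return [(151, 250)] if region[0] <= 151 and region[1] >= 250 else []


assigned, remaining = run_cascade(
    "A" * 300, [("PDB-BLAST", toy_pdb_blast), ("Pfam-A", toy_pfam)]
)
print(assigned)
print(remaining)
```

Whatever ends up in `remaining` corresponds to the stretches that would be handed to the MSA-based cut-point step.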
MSA
Multiple Sequence Alignment (MSA) is used in the final and lowest-confidence sequence-based step of the Ginzu domain prediction algorithm.
Analysis of the PSI-BLAST MSA is employed to predict domain cut points, based on the density of regions in the sequence alignments. Only confident
cut predictions are shown.
ORFEUS
ORFeus is a method for matching protein sequences to likely protein folds based on very remote sequence similarities, and is
employed in the fold recognition step of the Ginzu domain prediction algorithm. The sequence profile
and predicted secondary structures are searched against a database of sequence profiles and predicted secondary
structures for proteins of known structure.
Results in the LiveBench test of fold recognition methods suggest that scores of 7.5 or greater are almost always correct matches.
Pcons
Pcons was the first consensus server for fold recognition and is used
in the fold recognition step of the Ginzu domain prediction algorithm. It selects the best prediction out
of several predictions. For each query sequence, predictions from several fold recognition servers are collected. For each of
these models, a measure related to the quality of the model is calculated. This measure is predicted
from structural comparisons between the models and from the server score for each model. Pcons makes at
least 10% more correct predictions than the best single method, and its specificity is significantly better.
Any Pcons score higher than 1.5 should be significant.
PDB
The Protein Data Bank (PDB) is the world's protein structure data repository. We also use "PDB" to refer to the Protein Data Bank file format that describes a protein structure.
Pfam
Pfam is a set of families of protein sequences
that are represented as hidden Markov models, and may be searched with HMMER.
The confidence of Pfam-matched domains is given by -log(e-val), where e-val is the E-value returned by the HMMER search of the Pfam database and the logarithm is base 10. Values of 3.0 (e = 0.001) or higher are considered significant.
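The confidence conversion can be sketched in a few lines. The logarithm is taken to be base 10, which follows from the stated correspondence between 3.0 and e = 0.001. The function names and the handling of an E-value of exactly zero are assumptions made for this example; the same conversion is described for the PSI-BLAST confidence.

```python
import math


def confidence_from_evalue(e_value):
    """Convert an E-value into a confidence score, -log10(E-value)."""
    if e_value < 0:
        raise ValueError("E-values cannot be negative")
    if e_value == 0:
        # Assumption: treat a reported E-value of 0 as unbounded confidence.
        return math.inf
    return -math.log10(e_value)


def is_significant(e_value, threshold=3.0):
    """True if the confidence meets the 3.0 threshold quoted in the text."""
    return confidence_from_evalue(e_value) >= threshold


for e in (1e-10, 1e-5, 0.01, 0.5):
    print(e, round(confidence_from_evalue(e), 2), is_significant(e))
```

Because the scale is logarithmic, each additional unit of confidence corresponds to a tenfold smaller E-value.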
PSI-BLAST
PSI-BLAST is a method for detecting sequence
homologs of a given protein. It uses the concept of searching with a position-specific residue substitution profile
appropriate to the family in which the query belongs. This allows for more sensitive detection of remote homologous
sequences.
Our domain prediction protocol, called Ginzu, scans the protein chain sequence with successively less confident methods of detection to determine any homologs with experimentally determined structures, starting with PSI-BLAST search against the PDB.
The confidence displayed is the -log(e-val), where the e-val is the value returned by a PSI-BLAST search against the PDB. A confidence of 3.0 (e = 0.001) is considered to be a strong detection threshold.