[<< wikibooks] Proteomics/Protein - Protein Interactions/Prediction
Page Edited and Updated by: Dan Surdyk 
E-mail: dfs6389@rit.eduThis Section:

== Prediction Methods for Interactions and Docking ==

=== Monte Carlo ===
The Monte Carlo approach to protein/protein interaction is modeled after the well-known random sampling algorithm used in computer science.  The theory is summarized as follows: Given a sufficiently large number of initial configurations, one will either emerge as the best configuration, or eventually lead to it.  The process begins with the docking of one protein to another.  A score is then computed based on things like energy, amount of exposed hydrophobic surface, number of interacting amino acids, among other parameters.  Random changes are made to the interaction (on the level of whole proteins, side chains, and even individual atoms!), and the step is either approved or rejected based on whether it caused the score to improve. [2]

=== Accessible Surface Area (ASA) ===
Each atom in a protein has a van der Waal’s radius, and this is used to create a van der Waal’s surface of the protein.  The ASA describes all the places that solvent could potentially make contact with this surface.  It is calculated by computationally “rolling” a sphere around on the van der Waal’s surface, and calculating what parts of the surface it makes contact with.  This surface is incorporated in many docking and surface patch programs. [3]

=== Critical Assessment of PRediction of Interactions (CAPRI) ===
The CAPRI challenge the European Bioinformatics Institute’s (EBI) creative way of evaluating protein docking programs.  Groups receive protein data from EBI and attempt to dock the proteins.  Their results are then tested and compared, not only giving recognition to deserving researchers, but also providing scientists with a comparison of the advantages of each program or protocol. [4]

=== Geometric Hashing ===
Geometric Hashing is a molecular modeling technique that matches a target molecule to a protein of interest.

TESSTESS is an algorithm that uses X-ray and NMR data to determine 3D templates of proteins from the structures deposited in PDB. New structures can be scanned against these 3D structures to identify functional sites. Thus TESS is used to identify function of new protein structures and design proteins with specific function with the development of a 3D template database. [5]

=== Weighted Delaunay Triangulation ===
Delaunay triangulation is one way of turning the surface of a protein into triangles.  The method involves taking a set of points in three dimensions, and forming triangles so that if a circle were drawn that contains all three points (a circumcircle), no other points in the set would fall within that circle.  In the case of protein surfaces, the points would be locations of atoms.  This creates a surface without little “sliver” triangles that take up more time and memory in the computer without really being meaningful to the calculations.[6]

=== FADE and PADRE ===
The Fast Atomic Density Evaluator (FADE) and Pair-wise Atomic Density Reverse Engineering (PADRE) programs are used to determine molecular modeling of proteins. FADE and PADRE identify most prominent features in a protein such as regions which are involved in interaction with the other molecules by elucidating features of interest like crevices, grooves and protrusions. FADE has a special feature of evaluating shape complementarity for docked protein protein complexes.[7] FADE and PADRE can be downloaded at http://www.mitchell-lab.org/mitchell-lab/FADE.php 

=== CAST ===
Computed Atlas of Surface Topography measures the shape of protein pockets with  weighted Delauney triangulation. Proteins and certain aminoacids have specific structures known as cavities  which creates the physiochemical properties needed for the function of the protein. CASTp is an online tool which estimates the function of the protein by locating and measuring the pockets and voids on 3D protein structures. The new version of CASTp includes the annotated versions from PDB,Swiss-prot and SNPs. The annotated residues from these sources are mapped to surface pockets, interior voids or other regions of the PDB structures. CASTp is used to study surface features, functional regions and specific roles of key residues of proteins.[8][9] CASTp can be downloaded at http://cast.engr.uic.edu

= Computational Methods for Predicting Protein and Domain Interaction Partners =

== Methods for Predicting Protein Interaction Partners ==

=== Gene neighbor and gene cluster method ===
The hypothesis of the gene neighbor method is genes with similar functions are transcribed together as a single unit known as operon. If the genes that encode two proteins are neighbors on the chromosomes of several genomes then those proteins will have similar function. These genes which transcribe together inspite of the genetic distance between them are known as co-regulated genes. Conservation in the co-regulated genes is observed even in the distantly related organisms. The operons reveal the functional linkage between the constituent genes. Gene neighbor method is the most widely used method compared to other genomic inference methods. This method is used to detect the previously unknown interactions. Examples of gene neighbor prediction are the prediction of archeal exosome by comparing with the eukaryotic genome.   [10] [11]

=== Phylogenetic profile methods ===
Phylogenetic profile describes the presence or absence of a protein in a particular organism whose genome has been sequenced. Phylogenetic  profiles consists of n entries where n indicates the number of genomes sequenced. Presence of a homolog to a given protein in the nth genome is denoted by one and absence of the protein is denoted by zero. The proteins are then clustered based on the similarity of their phylogenetic profiles. Functionally related proteins are found in the same cluster. Both proteins and protein domains can be detected by using this method. The main drawbacks of this method are high cost, dependence on high information profiles and false detection of homology between distant organisms. The most striking example of the false correlation is to detect correlation between ubiquitous unlinked proteins present in all genomes. [12] [13] [14]

=== Rosetta Stone method ===
Rosetta Stone method is also known as gene fusion method. Some pairs of interacting proteins have homologs in other organisms which are fused to a single protein chain. This fused protein is called a Rosetta Stone protein. This method detects protein interactions in different genomes. For example Gyr A and Gyr B are separate protein subunits in Escherichia coli where as the homolog of these proteins  found in Saccharomyces  cerevisiae is a single protein Topoisomerase-2. Rosetta Stone method is used for predicting about 6.4% of all experimental interactions. In Escherichia coli this method has found about 6,809 pairs of interacting non homologous proteins. [15] [16] [17]

=== Sequence-based co-evolution methods ===
This method is based on measuring the evolutionary distance between two different organisms. Co-evolution is the similarity in phylogenic trees of non-homologous proteins. The extent of similarity is quantified by calculating the correlation coefficient between the distance matrices indicating co-evolution of two different protein families. To calculate the correlation coefficient correspondence between the two elements is required which is not always available. Hence several algorithms are developed to identify specific interaction patterns. In co-evolution the changes in one protein which leads to the loss of its function is compensated by the correlated changes in the other gene. There is always some similarity between trees of any proteins due to the speciation process. This is known as "background similarity". It is constructed from the 16S rRNA sequences and the final distance matrix is calculated by subtracting the rRNA based distances from the evolutionary distance obtained from original phylogenetic trees. Widely studied example of co-evolution studies is to predict interaction partners in DNA colicin and their immunity proteins. Co-evolution method is also used for predicting domain interactions. [18]

=== Classification methods ===
Classification methods are used for the prediction of both protein and domain interactions. These methods distinguish true interacting protein pairs from the false non-interacting protein pairs by training a classifier with various data sources. There are many classifiers used in the classification methods the most popular classifier is Random Forest Decision followed by Support Vector Machines. Most popular classification method used is Kernel method which provides vectorial representation of the data in the feature space with a set of pair wise comparisons. The feature vectors represent the particular information of a protein such as protein interactions, domain compositions of a particular protein.  [19]

== Predicting Domain Interactions from Protein Interactions ==

=== Association methods ===
Association methods are generally assigned to predict domain interactions. Association methods are used to distinguish interacting proteins from non-interacting. Different classifiers such as correlated sequence signatures are used to distinguish this. Correlated sequence signatures considers each pair of interacting domains separately, ignoring other domains present in the protein. These methods distinguish interacting from non-interacting proteins by calculating the log odds score. The log odds score (log2(Pij/PiPj)) is calculated by taking the ratio of observed frequency domains in one protein pair to the background frequencies in the data. The domain interactions are predicted based on the positive log odds score. Based on the occurrence of domains in protein interacting pairs protein prediction can also be done. [20]

=== Bayesian network models and maximum likelihood methods ===
This method is used to predict the domain interactions. This is advantageous when compared to association methods as bayesian network models considers missing and incorrect interaction data. Bayesian parameters are estimated by using Maximum Likelihood Estimation method. Maximum Likelihood Estimation method calculates experimental errors in the scoring scheme by maximizing the probability of interactions of the putative domain pairs. The likelihood function(θ) is a function of many parameters such as λij (probability that domains i and j interact), fp (false postive rate), fn (false negative rate), hence it is very difficult to maximize the likelihood function directly. Therefore Expectation Maximization algorithm is used to find the maximum likelihood estimate of unknown parameters by calculating the complete data in two iterative sets
1) observed data: consisting of protein-protein interactions
2) unobserved data: consisting of non protein-protein interactions. [21]

=== Domain pair exclusion analysis ===
This method is advanced to maximum likelihood estimation method which predicts non specific interactions. Domain pair exclusion analysis detects specific and rare interactions between the proteins which have a lower θ value. These interactions are estimated by calculating the Eij score which is the logarithmic ratio of the probability that two proteins interact given the domains i and j interact to the probability that the two proteins interact given i and j do not interact. The probability in the numerator is calculated by using Expectation Maximization procedure and the probability in the denominator is calculated by repeating the procedure where the probability of domain pair interactions is zero. Higher E-scores indicate the high possibility of domain interaction while the lower values indicate that the competing domains are responsible for interactions. Thus domain pair exclusion analysis detects the interactions with low θ and high Eij values. Though it does not detect the false positives and false negatives it detects true positives to a considerable extent. [22]

=== p-Value method ===
The null hypothesis of this method tests whether the presence of a domain in a protein pair effects the protein interactions. The statistic in p-Value method is calculated by considering two factors. They are the fraction of false positives known as experimental error and the fraction of false negatives known as incompleteness of the dataset. An ideal reference distribution for this method is obtained by shuffling the domains in proteins such that the protein interactions remain stable. The p-statistics then obtained considering the reference distribution indicate the reliability of domain interactions given that the two proteins interact. There is a inverse relationship between the value of p-statistics and the interaction of the proteins. Given that two proteins interact the domain pair with the lowest p-value is most likely to interact. The p-value method works well if there are nine or more domains in a given protein pair.  [23]

== Available Software Programs for Determining Protein-Protein Interactions ==

=== Protein Interactions Calculator ===
Protein Interactions Calculator (PMID:17584791)is a server which recognizes various kinds of interactions; such as disulphide bonds, hydrophobic interactions, ionic interactions, hydrogen bonds, aromatic- aromatic interactions, aromatic-sulphur interactions and cation - π interactions within a protein or between proteins in a complex. It also determines the accessible surface area as well as the distance of a residue from the surface of the protein. 
The input should be in the Protein data bank(.pdb) format. 
Interactions are calculated based on empirical or semi-empirical set of rules. 
All the interactions and bonds can be seen in a single site and also the identified interactions/bonds between residues can be visualized using a RasMol and Jmol interface. [24]
URL: http://crick.mbu.iisc.ernet.in/~PIC/

=== DOCK ===
DOCK software is compatible with UNIX based platforms. DOCK generates many possible orientations of a putative ligand in a user selected region of a receptor structure. These orientations are scored using several scoring schemes which are designed to measure steric and chemical complementarity of the receptor-ligand complex. DOCK has many applications such as detecting the binding orientation of protein-protein complexes,  protein-DNA complexes, evaluating likely orientations of a single ligand, ranking molecules from a database. It also searches different databases for the compounds which acts as enzyme inhibitors, DNA binding compounds, compounds binding to receptor etc., DOCK uses a geometric matching algorithm to assess rigid receptor docking by superimposing the negative image of the ligand onto the binding pocket. [25][26]
The official website of DOCK is http://dock.compbio.ucsf.edu/

=== autoDOCK ===
AUTODOCK is composed of many automatic docking tools. It predicts how substrates and drug candidates bind to a receptor or protein of known 3D structure. It has two main programs. The first program performs the docking of the ligand to a set of grids describing the target protein. Second program is the AutoGrid which precalculates these grids. AUTODOCK is also compatible with the UNIX based platform. AUTODOCK has applications in many fields such as X-ray crystallography, structure-based drug design, protein-protein docking, lead optimization, combinatorial library design etc., [27] [28] [29]

=== ICM (Internal Coordinates Mechanism) ===
ICM uses bond length, torsion angles, bond angles to determine structural predictions. It does fast and accurate docking simulations. It has a unique set of tools for accurate ligand-protein docking, peptide-protein docking, and protein-protein docking including interactive graphic tools. [30] Visit the Abagyan lab at:  http://abagyan.scripps.edu/lab/web/man/frames.htm or www.molsoft.com to learn more.

=== FleXX ===
FleXX is a high speed computer program used to determine binding mode. It detects the protein-ligand complex within seconds if the three dimensional structure of the protein is known. FLEXX  estimates  the binding affinity of protein-ligand complex by predicting the geometry of protein-ligand complex from the known 3D structure of the protein. FLEXX has 2 main applications in protein predictions. They are
1) Complex prediction: It is used when a protein and a small molecule is present without the knowledge of the structure of the protein ligand complex. It creates and ranks a series of possible protein-ligand complexes.
2) Virtual Screening: It is used for prioritizing the compounds for experimental testing from the set of compounds and a given protein.  
Placement algorithm in FLEXX is based on interactions between two molecules and Boehm function is applied for scoring. Two databases are widely used in FLEXX to predict protein-ligand interactions. They are 
1) MIMUMBA torsion angle database: It is used for the creation of the conformers
2) Interaction geometry database: It is used to describe exactly intermolecular interaction patterns. The advantages of this software are it is best enrichment tool in structure based drug design,generating poses of lead structures in a protein binding site and docking gigantic libraries by using high speed docking. [31][32]
Download the program at: http://www.biosolveit.de/download/

=== GRAMM (Global RAnge Molecular Matching) ===
GRAMM is an empirical approach for smoothing intermolecular energy function by changing the range of the atom-atom potentials. It is a free program for protein docking used for high resolution and inaccurate studies. It predicts structure of the complex based on the atomic coordinates of two molecules and by doing exhaustive 6D search through relative translations and rotations of the molecules. The complex may consist of two proteins, a protein and small molecule, two transmembrane helices etc., GRAMM is compiled on SGI series, SUN SPARC, IBM RS6000, DEC Alpha and PC. [33][34]
To view the GRAMM website:  http://reco3.musc.edu/gramm/

=== GRAMM X ===
GRAMMX is the extended version of GRAMM. It is done by updating smoothed potentials, refinement stage and knowledge based scoring. Docking problems are processed by a 320 processor Linux cluster. GRAMMX can be implemented in Python and C++.
Perform a GRAMMX simulation at:  http://vakser.bioinformatics.ku.edu/resources/gramm/grammx/

=== FTDOCK ===
FTDock is a free program which performs rigid-body docking on two biomolecules in order to predict their correct binding geometry. FTDock outputs multiple predictions that can be screened using biochemical information. This software runs on RedHat Linux and pentium based platforms. FTDock used Fourier Transforms to determine docking.[35]  
Download the program at: http://www.bmm.icnet.uk/docking/download.html

=== ZDOCK ===
It is a computational technique where the structure of a complex between two proteins is predicted based on the independently crystallized structures of the components. ZDOCK uses a fourier transform to search all possible binding modes for the proteins, evaluating based on shape complementarity, desolvation energy and electro statistics.[36]
Visit the Boston University site at :   http://zdock.bu.edu/software.php

== Further Information ==
Monte Carlo method: http://www.desy.de/~heramc/mclist.html
Accessible Surface Area:
CAPRI home page: http://capri.ebi.ac.uk/
TESS at: http://www.biochem.ucl.ac.uk/bsm/PROCAT/manual/man1.html
FADE and PADRE: the San Diego Supercomputer Center http://www.sdsc.edu/CCMS/FP/
CAST: http://sts.bioengr.uic.edu/castp/

== References (Open Access) ==
^   Wikipedia page on Protein-protein docking (Monte Carlo section): http://en.wikipedia.org/wiki/Protein-protein_docking#Monte_Carlo_methods^  NACCESS S.Hubbard and J.Thornton. 1992. http://wolf.bms.umist.ac.uk/naccess/naccess.html^  CAPRI  community wide experiment on the comparative evaluation of protein-protein docking for structure prediction http://capri.ebi.ac.uk/^  Wallace AC, Borkakoti N, Thornton JM. 1997. TESS: a geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases.^  Wikipedia page on Delaunay triangulation: http://en.wikipedia.org/wiki/Delaunay_triangulation^  Mitchell, J.C., Kerr, R. and Ten Eyck, L.F. 2001. Rapid atomic density measures for molecular shape characterization. J.Mol. Graph. Model. 19(3): 324-329,2001.^  Joe Dundas, Zheng Ouyang, Jeffery Tseng, Andrew Binkowski, Yaron Turpaz, and Jie Liang. 2006. CASTp: computed atas of surface topography of proteins with structural and topographical mapping of functionally annotated residues. Nucl. Acids Res., 34:W116-W118.^  Dundas J, Ouyang Z, Tseng J, Binkowski A, Turpaz Y and Liang J. 2006. CASTp:computed atlas of surface topography of proteins with structural and topographical mapping of functionally annotated residues.^  Marcotte, E.M., Pellegrini, M., Ho-Leung, N., Rice, D.W., Yeates, T.O., Eisenberg, D.  1999.  Detecting protein function and protein-protein interactions from genome sequences. Science 285:751–753.^  Shoemaker, B.A., Panchenko, A.R. 2007 Computational Methods to Predict Protein and Domain Interaction Partners. PLoS Comput Biol 3(4): e43.^  Timothy Palzkill, Proteomics, Kluwer Academic Publishers. http://books.google.com/books?id=JoEgI1a3yrAC&pg=PA78&lpg=PA78&dq=Gene+neighbor+method+in+proteomics&source=bl&ots=k7ByV84JZf&sig=wSdaO0l91A9BX-cFt9xwjQil3o4&hl=en&ei=034BSvLoBomeM4PclOcH&sa=X&oi=book_result&ct=result&resnum=1#PPA76,M1^  K. G. Tina, R. Bhadra and N. Srinivasan. 2007. PIC: Protein Interactions Calculator, Nucleic Acids Research,  Vol. 35, Web Server issue W473–W476.^  BioSolveIT http://www.biosolveit.de/FlexX/^  University of Illinois at Urbana. http://www.ks.uiuc.edu/Development/biosoftdb/biosoft.cgi?&category=7^  Structure based drug design and molecular modelling. http://www.imb-jena.de/~rake/Bioinformatics_WEB/dd_tools.html^  Boston University Bioinformatics. http://zlab.bu.edu/zdock/^  Delaunay triangulation. http://en.wikipedia.org/wiki/Delaunay_triangulation^  The Scripps Research Institute. http://www.scripps.edu/mb/olson/doc/autodock/