Homology recognition and comparative modelling

Kenji Mizuguchi (km227@cam.ac.uk)


This is a quick introduction to some tools for recognizing distant homologues and predicting the 3D structure of the protein of interest.

Notes and optional steps are for advanced users only.

To complete all the steps below, you will need to have VITO and MODELLER installed on your local machine. (These steps are optional and alternative programs exist.) See below for more details.

Reading material

Mizuguchi, K. 2004. Fold recognition for drug discovery. Drug Discovery Today: Targets 3: 18-23.

(downloadable from http://www-cryst.bioc.cam.ac.uk/fugue/documentation.html)

Mizuguchi, K. 2005 (in press). Modelling by homology. In Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics. John Wiley & Sons, Ltd.

(preprint available from http://www-cryst.bioc.cam.ac.uk/~kenji/local/g406210_revised.pdf within the cam domain.)

Predicting the structure of AatA


Nishi, J. et al. (2003). The Export of Coat Protein from Enteroaggregative Escherichia coli by a Specific ATP-binding Cassette Transporter System. J. Biol. Chem., 278: 45680-45689. (http://www.jbc.org/)

We are looking at a protein called AatA from Enteroaggregative Escherichia coli (EAEC), a pathogen associated with endemic and epidemic diarrheal illness in both developing and industrialized countries. The AatA protein has been recently identified and is likely to play a role in the pathogenesis of EAEC.

1. Go to http://www.ebi.uniprot.org/index.shtml and type 'aatA coli' in the 'Text search UniProt Knowledgebase' box at the top of the page. This will return the UniProt entry Q6V4K6. When you come back to this exercise later, you can type in this accession code to retrieve the information about this protein.

(Q) How many amino acid residues does this protein consist of?

2. Locate the box with the first line 'Basic / Extended' then 'Viewers: ...' on the right. Click 'Fasta' next to 'Viewers'. This is a convenient way of retrieving only the amino acid sequence of the protein.
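If you save the FASTA output to a file, you can check the residue count asked for above with a few lines of code. The following is a minimal Python sketch; the function name fasta_length and the short example record are hypothetical placeholders, not the real AatA sequence.

```python
def fasta_length(fasta_text):
    """Return the number of residues in the first record of a FASTA string."""
    residues = []
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                break  # a second header marks the start of another record
            seen_header = True
            continue
        if seen_header:
            residues.append(line)
    return len("".join(residues))


# Hypothetical record for illustration only (NOT the AatA sequence).
example = ">example|placeholder record\nMKTAYIAKQR\nQISFVKSHFS\n"
print(fasta_length(example))
```

To use it on the real entry, read the saved FASTA file into a string and pass it to fasta_length.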

3. Search for homologues in the Protein Data Bank using the BLAST program.

(a) First display the amino acid sequence of AatA as above and select it using the left-mouse button. For convenience, the amino acid sequence is copied below:


(b) Go to http://www.ncbi.nih.gov/BLAST/ and click 'Protein-protein BLAST (blastp)' at the top right corner.

Paste the amino acid sequence in the 'Search' box, select 'pdb' in the 'Choose database' menu and click the 'BLAST!' icon.

Click the 'Format!' icon and wait until the result appears.

(Q) Are there any statistically significant hits? Note that the e-value of a hit is the number of hits with a score at least as high that you would expect to obtain purely by chance when searching a database of this size. The lower the e-value, the more significant the hit. Look for anything smaller than 0.01.
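As a rough illustration of how an e-value depends on the score and on the size of the search, the sketch below uses the standard relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total length of the database. BLAST itself works with effective (edge-corrected) lengths, so these numbers will not exactly match its output. The function names and the hit list are hypothetical and serve only to show the 0.01 cut-off in use.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate e-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def significant_hits(hits, threshold=0.01):
    """Keep only (name, e_value) pairs whose e-value is below the threshold."""
    return [(name, e_value) for name, e_value in hits if e_value < threshold]


# Hypothetical hits for illustration only; these are not real BLAST results.
example_hits = [("hit_1", 3e-12), ("hit_2", 0.004), ("hit_3", 0.7), ("hit_4", 12.0)]
print(significant_hits(example_hits))
```

Note that the same bit score gives a larger e-value in a larger database, which is why e-values from searches against different databases are not directly comparable.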

4. Search for more distant homologues of known structure using the FUGUE program.

(a) Go to http://www-cryst.bioc.cam.ac.uk/fugue/ and click the link 'SEARCH STRUCTURAL DATABASE'.

(b) Type your e-mail address and 'AatA' as the name of your sequence, paste the amino acid sequence then click the 'Search!' button.

You will have to wait for a while. For convenience the result has been saved in


(Q) Are there any significant hits? What is the function of the homologous proteins identified?

5. Go back to http://www-cryst.bioc.cam.ac.uk/fugue/ and click the 'Documentation' link in the menu on the left-hand side.

First download the book chapter there to learn how to interpret the result page and more about the methodology.

Click the ‘How to interpret the output’ link.

(Q) Are there any sequences homologous to the query protein?

Note 1: The FUGUE server runs PSI-BLAST to collect homologous sequences. Due to the particular version of the sequence database (the nr database from NCBI) and the parameter setting, the output page may not contain the homologous sequences that you expect. You can create your own multiple sequence alignment and submit it to FUGUE. If you have delineated proper domain boundaries and created a good-quality multiple alignment using programs like MAFFT and T-COFFEE, this may produce better homology recognition results.

Note 2: In general, it is informative to examine the sequence conservation within the family to which the query protein belongs. This can be achieved by displaying the annotated alignments of the ‘aa’ type (at the bottom of the result page).

6 (optional). If there are no significant hits or you want to confirm your observations, you can try other fold recognition servers. A good starting point is a meta server, a few of which are shown below:




Many of the individual servers are linked from these sites.

7. Examine the top two hits of the result page (http://www-cryst.bioc.cam.ac.uk/~kenji/Tutorial/10126/fugue.html). Both are outer membrane channels in Gram-negative bacteria (OprM, the first hit, and TolC, the second hit).

Click the name of the second hit ‘hs1ek9a’ and you will get to an entry in the HOMSTRAD database. (To learn about HOMSTRAD, start from ‘Beginner’s guide to HOMSTRAD’ at http://www-cryst.bioc.cam.ac.uk/homstrad/Doc/Info.html.)

Visualize the 3D structure of the TolC protein.

(Q) This is not the biologically relevant form. Why?

8. Scroll down to the ‘View Alignments’ section. Click the ‘PDB’ link in the column second from the far right. You can visualize a ‘rough model’ for the 3D structure of the query protein. To understand what ‘a rough model’ means, see section 3 ‘How to examine rough models’ at http://www-cryst.bioc.cam.ac.uk/fugue/help.html.

9. Build a full-atom model in the following way.

(a) Download the text alignment file of the ‘ma’ type for the top hit ‘hs1wp1a’. It should look like this:

>P1;QUERY aatA

Save it as aatA_hs1wp1a_ma.ali in your local directory.

(b) Download the atomic coordinates of the OprM protein (the HOMSTRAD entry hs1wp1a). First, click the far left link ‘hs1wp1a’ to go to the HOMSTRAD page again, click ‘1wp1’ in the ‘PDB code’ column, right-click the ‘ATM’ link then save the target as ‘hs1wp1a.atm’ in the place where you saved the alignment file.

(c) Examine the alignment. This is the most crucial step for successful comparative modelling. The alignment file you have downloaded can be directly visualized with the CLUSTALX program and, once saved in a different format, it can be manipulated with alignment editors such as SeaView and Pfaat. However, a better way is to use the alignment editor VITO, which can visualize both the sequence and the structure.

The following is optional and assumes a UNIX environment.

To run VITO (and MODELLER later), we need to modify the second line of the alignment file (aatA_hs1wp1a_ma.ali) as follows:

First go back to the top page of the HOMSTRAD entry hs1wp1a and click the ‘ali’ link (second from the left). It should look like this:

C; family: crystal structure of the drug-discharge outer membrane protein, oprm
C; class: unassigned
C; domain:
C; 1wp1   a:     hs1wp1a
C; end:
C; last update 3/11/2004
structureX:1wp1:   1 :A: 456 :A:outer membrane protein oprm:Pseudomonas aeruginosa:2.56:25.5

Now, select the line starting from ‘structureX:’ and copy it.

Open the alignment file that has been saved on your local machine (aatA_hs1wp1a_ma.ali) using a text editor (e.g. emacs). Replace the second line with the one copied from the database .ali file:

-> structureX:1wp1:1 :A: 456 :A:outer membrane protein oprm:Pseudomonas aeruginosa:2.56:25.5

and further replace ‘1wp1’ with ‘hs1wp1a.atm’. The final line looks like:

structureX:hs1wp1a.atm:   1 :A: 456 :A:outer membrane protein oprm:Pseudomonas aeruginosa:2.56:25.5
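The same edit can be made programmatically. The header line is a colon-separated record whose second field holds the structure code, so replacing that field is enough. The following is a minimal Python sketch; the function name set_pir_code is hypothetical, and the sketch assumes that no field before the code contains a colon.

```python
def set_pir_code(header_line, new_code):
    """Replace the second colon-separated field (the structure code) of a
    PIR 'structure...' header line and return the modified line."""
    fields = header_line.split(":")
    if len(fields) < 2 or not fields[0].startswith("structure"):
        raise ValueError("not a PIR structure header line")
    fields[1] = new_code
    return ":".join(fields)


# The header line copied from the HOMSTRAD .ali file, as shown above.
original = ("structureX:1wp1:   1 :A: 456 :A:outer membrane protein oprm:"
            "Pseudomonas aeruginosa:2.56:25.5")
print(set_pir_code(original, "hs1wp1a.atm"))
```

All other fields, including their spacing, are left untouched.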

Type ‘vito -ali aatA_hs1wp1a_ma.ali’. You will see both the sequence alignment and the 3D structure of OprM. Left-click ‘QUERY_aatA’ (it is now underlined), then choose Tools -> Set colorization list. You may have to adjust the view by holding down the Ctrl key and dragging with the left mouse button. (You can use the left button to rotate, the middle button to translate and the right button to zoom.) See http://bioserv.cbs.cnrs.fr/VITO/DOC/vito.html for more information.

(Q) Where are the insertions and deletions? In what colours are they shown?
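If you do not have VITO installed, you can still locate the insertions and deletions directly from the two aligned sequences. The sketch below is a minimal Python illustration: a gap ('-') in the template row marks an insertion in the query, and a gap in the query row marks a deletion. The function names and the two short aligned strings are hypothetical; paste in the aligned rows from your own .ali file to use it.

```python
def gap_runs(aligned_seq):
    """Return (start, end) alignment columns, 1-based inclusive, of each gap run."""
    runs = []
    start = None
    for column, symbol in enumerate(aligned_seq, start=1):
        if symbol == "-":
            if start is None:
                start = column
        elif start is not None:
            runs.append((start, column - 1))
            start = None
    if start is not None:
        runs.append((start, len(aligned_seq)))
    return runs


def indels(query, template):
    """Locate insertions and deletions in the query relative to the template."""
    if len(query) != len(template):
        raise ValueError("aligned sequences must have the same length")
    return {"insertions": gap_runs(template), "deletions": gap_runs(query)}


# Hypothetical aligned rows for illustration only.
query_row = "MKT--AYIAK"
template_row = "MKTLLAY--K"
print(indels(query_row, template_row))
```

The reported positions are alignment columns, not residue numbers in either sequence.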

(d) Building atomic coordinates.

Given a sequence-structure alignment, you can use MODELLER or other programs for building atomic coordinates.

To run MODELLER, download a template input file from here. (It can be used without any modifications if you have followed the filename convention so far.) Make two more minor modifications to the aatA_hs1wp1a_ma.ali file: change ‘hs1wp1a.atm’ to ‘hs1wp1a’ and change ‘QUERY aatA’ to ‘aatA’. (You can download the finished version from here.) Then type

‘mod8v1 modeller.top’

(Change mod8v1 to an appropriate command name for the version of MODELLER installed on your system.)

Note 1: MODELLER fails if it detects different numbers of amino acid residues in the alignment (.ali) and PDB (.atm) files. This does not happen in our example but may be observed in other cases. The most common reason for this is modified amino acids. If you encounter this error, look at the .atm file. You may find a line like this:

HETATM   86  N  MSE A  15    -35.228  28.556  12.852  1.00 12.26

This is a seleno-methionine residue artificially introduced for solving the structure using the MAD method. MODELLER appears to convert this into MET, while some other programs (including FUGUE, which we used above) simply ignore all HETATM records, hence the discrepancies in the sequence.

One quick solution is to delete all the HETATM lines from the PDB file. After step 9(b) above, you can type

sed '/^HETATM/d' hs1wp1a.atm > tmp.atm


mv tmp.atm hs1wp1a.atm

You can continue with the modified .atm file.

This solution is not entirely satisfactory, because your final model will have missing residues corresponding to the seleno-methionine residues in the template, but it is perfectly all right as an exercise.
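Before deleting anything, it can help to confirm that modified residues really are the cause of the mismatch. The following minimal Python sketch counts the distinct residues found in ATOM and HETATM records and reports how many of the HETATM residues are MSE. It assumes the standard fixed-column PDB layout (residue name in columns 18-20, chain in column 22, residue number and insertion code in columns 23-27). The function name and the sample lines are hypothetical.

```python
def count_residues(pdb_text):
    """Return (ATOM residues, HETATM residues, MSE residues) for PDB-format text."""
    residues = {"ATOM": set(), "HETATM": set()}
    mse = set()
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record not in residues:
            continue
        res_name = line[17:20]
        # Chain ID plus residue number and insertion code identify a residue.
        key = (line[21:22], line[22:27])
        residues[record].add(key)
        if record == "HETATM" and res_name == "MSE":
            mse.add(key)
    return len(residues["ATOM"]), len(residues["HETATM"]), len(mse)


# Hypothetical records for illustration only (coordinates are arbitrary).
example = "\n".join([
    "ATOM      1  N   ALA A  14     -36.000  27.000  12.000  1.00 12.00",
    "ATOM      2  CA  ALA A  14     -36.500  27.500  12.500  1.00 12.00",
    "HETATM   86  N   MSE A  15     -35.228  28.556  12.852  1.00 12.26",
    "HETATM   87  CA  MSE A  15     -34.900  29.100  13.200  1.00 12.26",
    "ATOM     94  N   GLY A  16     -33.000  30.000  14.000  1.00 12.00",
    "HETATM  900  O   HOH A 201     -10.000  10.000  10.000  1.00 30.00",
])
print(count_residues(example))
```

If the third number is greater than zero, the template contains seleno-methionine residues, and the residue count seen by a program that ignores HETATM records will be lower by that amount.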

Note 2: If the SEQUENCE name includes a space character (like 'QUERY aatA'), it may cause problems in creating new files on a UNIX system. You should change the 'QUERY xxx' to 'xxx', as explained above.

(Q) Visualize the model aatA.B99990001.atm.

After the program has finished, delete aatA.rsr, aatA.sch, aatA.ini, aatA.V99990001, aatA.D00000001. These files are big and unnecessary.
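If you build models repeatedly, the clean-up can be scripted. The following is a minimal Python sketch that removes only the five files named above and leaves everything else in the directory alone; the function name remove_intermediates is hypothetical.

```python
from pathlib import Path

# The intermediate files listed in the text above.
INTERMEDIATE_FILES = [
    "aatA.rsr",
    "aatA.sch",
    "aatA.ini",
    "aatA.V99990001",
    "aatA.D00000001",
]


def remove_intermediates(directory="."):
    """Delete the listed intermediate files if present; return the names removed."""
    removed = []
    for name in INTERMEDIATE_FILES:
        path = Path(directory) / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed
```

Call remove_intermediates() from the directory in which you ran MODELLER, or pass that directory as the argument. Files that do not exist are skipped silently.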

(e) Model evaluation

Submit the model to the Verify3D server.

(Q) Which regions of the model are reliable and which regions are of low quality and appear to have some problems? How are these regions related to the alignment?

Copyright © 2004-2005 Kenji Mizuguchi

Last modified: Mon Nov 21 16:42:27 GMT 2005