Protein Data Bank (PDB) and bioinformatics tools for protein structure analysis
Authors: Dr. Szymon Krzywda and Stanisław Wosicki, MSc
Aims
The purpose of this exercise is to:
- Give you an idea of the existing global databases storing information about protein sequences
- Give you some preliminary experience with the use of computer tools for protein sequence analysis
- Familiarize you with the basic biochemical facts about the HIV retrovirus
- Analyze the sequence of the protease from HIV-1 and other retroviruses
- Learn how to use a highly professional molecular graphics program called PyMOL
- Learn how to view and make stereographic figures
Required reading
Before you start this exercise, you should be familiar with the following articles:
- Wlodawer, A., Minor, W., Dauter, Z., Jaskolski, M. 2008. Protein crystallography for non-crystallographers, or how to get the best (but not more) from published macromolecular structures. FEBS Journal 275: 1-21. (PDF)
- Learn about retroviruses and their proteases by reading the short article HIV Protease - function and structure. It is worthwhile, but not necessary, to answer the questions posed in this article.
- Wlodawer, A., Miller, M., Jaskolski, M., Sathyanarayana, B.K., Baldwin, E., Weber, I.T., Selk, L.M., Clawson, L., Schneider, J., Kent, S.B. 1989. Conserved folding in retroviral proteases: crystal structure of a synthetic HIV-1 protease. Science 245: 616-621. (PDF)
- Miller, M., Schneider, J., Sathyanarayana, B.K., Toth, M.V., Marshall, G.R., Clawson, L., Selk, L., Kent, S.B.H., Wlodawer, A. 1989. Structure of complex of synthetic HIV-1 protease with a substrate-based inhibitor at 2.3 Å resolution. Science 246: 1149-1152. (PDF)
Getting started
The website of the Center for Biocrystallographic Research (CBB) in Poznan (www.man.poznan.pl/CBB) is a convenient launchpad for a number of biocrystallography-related databases and servers. Familiarize yourself with the CBB. When was the Center created? What are the research projects pursued in the Center? In your opinion, which one is the most interesting? Why?
The global bioinformatics servers and databases used in this exercise
You will need to access via the Internet the following global bioinformatics resources:
- The Entrez server at NCBI (National Center for Biotechnology Information) of the NIH (National Institutes of Health) in the USA. It allows you to launch searches within a number of databases.
- One of them is PubMed, a database of biomedical literature citations and abstracts. You will be able to search for articles of interest using, for example, the author [au] and date (year) of publication [dp] field tags.
- The ExPASy (Expert Protein Analysis System) server of the Swiss Institute of Bioinformatics (SIB). From ExPASy you will be able to search
- the UniProt Knowledgebase, which contains
- the Swiss-Prot (curated) and TrEMBL (computer annotated) databases of protein sequences developed by SIB and EBI (European Bioinformatics Institute) of the EMBL (European Molecular Biology Laboratory)
- The RCSB Protein Data Bank (PDB) containing experimentally determined atomic coordinates of all publicly accessible macromolecular structures
- A server for translating amino acid codes from 3-letter to 1-letter form (3-to-1)
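The PubMed field tags mentioned above ([au] for author, [dp] for date of publication) can also be combined programmatically. The following is a minimal sketch, not part of the original exercise: it only builds a query string and the corresponding URL for the NCBI E-utilities esearch service, and sends nothing over the network. The helper names pubmed_query and esearch_url, and the example author and year, are chosen here for illustration.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (this sketch builds a URL only; no request is sent).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_query(author, year):
    """Combine an author [au] term and a publication-date [dp] term."""
    return f"{author}[au] AND {year}[dp]"


def esearch_url(term, retmax=20):
    """Return the URL of a PubMed search for the given term (illustrative helper)."""
    return ESEARCH + "?" + urlencode({"db": "pubmed", "term": term, "retmax": retmax})


term = pubmed_query("Wlodawer A", 1989)
print(term)
print(esearch_url(term))
```

The same query string can simply be typed into the PubMed search box in your browser.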
Stereographics
How to view stereo figures
To fully appreciate and understand complicated macromolecular structures in three dimensions, scientists often use stereoscopic figures. They are fairly easy to generate on a computer, which stores the full three-dimensional information about the macromolecule of interest (i.e. the XYZ coordinates of each atom) and displays it in projections that simulate the view of the left eye ("left panel") and of the right eye ("right panel"). The projections are generated assuming that the eyes are separated by 60 mm and are viewing the object from a distance of about 30 cm. This corresponds to normal human eyes and the distance of good vision. When two such panels are generated and printed (or displayed on a computer screen), they are put side by side with a separation (of their centers) of 60 mm and viewed from 30 cm.
To achieve a three-dimensional impression, you must train each eye on its respective panel and force your sight to focus on infinity, i.e. pretend that you are looking "through" the picture. The two images, originally out of focus, should then start drifting towards each other. When they "meet" in the center, you will instantaneously experience an impression of a three-dimensional object before your eyes. It is important to have both panels equally well illuminated and to concentrate on the figure without distraction. Because we are viewing the figure looking "into infinity", this method is known as "wall viewing".
Some people cannot achieve this experience without long training. It may be easier for them to try "cross-eye" viewing, where the panels are swapped (i.e. left panel on the right, right panel on the left) and are viewed with crossed eyes. In this method you try to focus your eyes on the tip of your nose, which is close enough to crossing your eyes. If you use the cross-eye method without swapping the panels, you will have an unpleasant sensation that something is wrong: things that should be in the back (e.g. drawn with a thin line and faded) will be popping up front.
If you are unable to achieve the stereo effect with naked eyes, you will be able to do so using special viewers. Stereoviewers have magnifying lenses that help you see the usually small stereo figure better, and a wall that separates the two optical paths so that there is no interference between the panels. A stereoviewer should be placed exactly between the two panels of a well and uniformly illuminated stereo figure, with its base exactly parallel to the base line of the drawing. Try to see this image in stereo. The panel separation is about 100 mm, so you will have to view it from a distance of about 50 cm.
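The numbers quoted above (60 mm at 30 cm, 100 mm at about 50 cm) follow from keeping the ratio of panel separation to viewing distance constant. Here is a small sketch of that arithmetic. It assumes simple proportional scaling and a point straight ahead of the viewer; the function names are made up for this illustration.

```python
import math


def convergence_angle_deg(separation_mm, distance_mm):
    """Angle between the two lines of sight that meet at a point straight ahead."""
    return math.degrees(2 * math.atan((separation_mm / 2) / distance_mm))


def viewing_distance_mm(panel_separation_mm,
                        eye_separation_mm=60.0,
                        reference_distance_mm=300.0):
    """Scale the viewing distance so that separation/distance stays constant."""
    return panel_separation_mm * reference_distance_mm / eye_separation_mm


print(round(convergence_angle_deg(60, 300), 2))  # angle for 60 mm viewed from 30 cm
print(viewing_distance_mm(100))                  # distance for panels 100 mm apart
```

The second value reproduces the 50 cm quoted for the 100 mm figure.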
HIV Protease
In the next part of the exercise you will be working with the amino acid sequence of a protein from the HIV retrovirus. This protein is an enzyme called protease. Proteases are hydrolytic enzymes that digest (cleave) other proteins. The cleavage takes place at a peptide bond, which is why proteases belong to a larger family of enzymes called peptidases. Proteases that cleave peptide bonds far from the termini of the substrate are called endopeptidases. If they remove one amino acid from the end, they are exopeptidases. Exopeptidases are divided into aminopeptidases and carboxypeptidases. The HIV protease is an endopeptidase. From the point of view of the catalytic mechanism, proteases are divided into aspartic proteases (the active site contains two aspartic acid residues), also known as acid proteases because they usually work at very low pH (for instance, pepsin works best in hydrochloric acid of pH 2 in your stomach), serine proteases (the active-site nucleophile is a serine residue), cysteine proteases (the active-site nucleophile is a cysteine residue) and metalloproteases (they have a metal-cation-assisted mechanism). The HIV protease is an unusual aspartic protease.
The exercise step-by-step
Step 1
Reference entry from PDB
In this exercise you will be analyzing the sequences (primary structure) of retroviral proteases. However, since retroviruses encode their proteins in large genes, each coding for many proteins, it is not straightforward to find the isolated amino acid sequences corresponding to retroviral proteases in protein (or DNA) sequence databases. To circumvent this difficulty, we will first consult the PDB, looking for entries reporting the structure of retroviral proteases. In those studies, the sequences of the investigated proteins have usually been precisely tailored to correspond strictly to the mature protease. Thus, although the use of full structural information is not strictly the purpose of this exercise, we will start by searching the PDB for the first determined structure of a retroviral protease.
- Using the Advanced Search of the PDB website, search for the first RSV protease structure. You may use Author=Wlodawer and Deposit Date during 1989.
- You may Sort the Results by Release Date
- Note the PDB accession code of the structure of interest
- Click on the "document" icon to get the original file with the deposit. At the end it contains a large block of data with the atomic coordinates corresponding to this structure
- However, you are interested in the records marked SEQRES (sequence)
- Copy the amino acid sequence (e.g. using the right mouse button); please note that the sequence may be repeated in the SEQRES records several times (attributed to different chains identified as A, B, etc.) if there are several protein chains in the crystallographic Asymmetric Unit (ASU), for instance because the protein forms oligomers (has quaternary structure)
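If you prefer not to copy the SEQRES records by hand, the extraction can be sketched in a few lines. The example below is only an illustration: the records in SAMPLE are made up and truncated to ten residues per chain, and read_seqres is a hypothetical helper. It relies on the fixed-column layout of PDB files, where the chain identifier sits in column 12 and the residue names start in column 20.

```python
# Made-up, truncated records for illustration only (not a real PDB entry).
SAMPLE = """\
HEADER    HYPOTHETICAL EXAMPLE RECORDS
SEQRES   1 A   10  PRO GLN ILE THR LEU TRP GLN ARG PRO LEU
SEQRES   1 B   10  PRO GLN ILE THR LEU TRP GLN ARG PRO LEU
ATOM      1  N   PRO A   1       0.000   0.000   0.000
"""


def read_seqres(pdb_text):
    """Collect the 3-letter residue names of every chain from SEQRES records."""
    chains = {}
    for line in pdb_text.splitlines():
        if line.startswith("SEQRES"):
            chain_id = line[11]            # column 12: chain identifier
            residues = line[19:].split()   # columns 20 onwards: residue names
            chains.setdefault(chain_id, []).extend(residues)
    return chains


for chain_id, residues in read_seqres(SAMPLE).items():
    print(chain_id, len(residues), " ".join(residues))
```

Two chains with the same sequence, as in this sample, are what you would expect for a homodimer in the asymmetric unit.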
Step 2
Sequence editing
- In a new browser window open the 3-to-1 amino acid code translator. Paste your 3-letter coded amino acid sequence in the program's window. Additionally you may also want to paste it in your notepad.
- Edit the pasted sequence to remove any unwanted text, e.g. the "SEQRES" labels. Only the sequence and blank spaces can remain.
- Click the Three...->One... button.
- Copy the 1-letter sequence (do NOT copy any asterisks, etc.)
- This will be your reference sequence in case of any doubts in the subsequent search of the UniProt database
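What the 3-to-1 translator does can be sketched with a lookup table for the twenty standard amino acids. This is a minimal illustration, not the code of the web server. The function name three_to_one is made up, as is the choice to map unrecognized residue names to "X".

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def three_to_one(text, unknown="X"):
    """Translate whitespace-separated 3-letter codes into a 1-letter string.

    Names that are not in the table (e.g. nonstandard residues) become `unknown`.
    """
    codes = [word.upper() for word in text.split()]
    return "".join(THREE_TO_ONE.get(code, unknown) for code in codes)


print(three_to_one("PRO GLN ILE THR LEU TRP GLN ARG PRO LEU"))
```

As with the web tool, remove labels such as "SEQRES" before translating; here they would simply show up as "X".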
Step 3
Sequence database text search
- Go to ExPASy and select UniProtKB from resources.
- In the UniProtKB search window type organism:"rous sarcoma" AND gene:pol (the pol gene contains the sequence of protease).
- Click the Accession number of the first entry that appears to contain the protease sequence
- Read carefully the contents of this entry; if you find in the Sequence annotation section a subentry for a protease Chain, that's great! You've found a retroviral protease amino acid sequence in the sequence databases that contain the information about many millions of genes, and store the information about many billions of amino acids in their sequences
- The identified protease sequence should be highlighted. If the search has been really successful, this sequence should be about 100 residues long. Remember this criterion: if at any point you find a "retroviral protease" with a number of residues that is significantly different, something is amiss!
- Copy to your notepad the 1-letter code from the Sequence window, together with the important ">xxx..." line, which is a simple annotation. This style uses the so-called Fasta format. You can put any other/additional comment on the ">xxx..." line, for instance "RSV PR".
- In the Hits pull-down menu, ask for 500 Hits.
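The Fasta format used above is simple enough to write and read yourself: a ">" line with a free-text name, followed by the sequence. The sketch below shows one way to do it. The helper names to_fasta and parse_fasta and the 60-character line width are choices made for this illustration, and the short sequences are just the first ten residues used as placeholders.

```python
def to_fasta(name, sequence, width=60):
    """Format one sequence as a Fasta record, wrapped at `width` characters."""
    seq = "".join(sequence.split())  # drop blanks and line breaks
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


def parse_fasta(text):
    """Read Fasta text into a dict mapping each '>' name to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


notepad = to_fasta("RSV PR", "LAMTMEHKDR") + to_fasta("HIV-1 PR", "PQITLWQRPL")
print(notepad)
print(parse_fasta(notepad))
```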
Step 4
Sequence database Blast search
Using your Sequence as a template, you will now search all the available genetic and protein sequence information, looking for similar sequences. This is a formidable task because the global sequence databases store sequence data amounting to many billions of residues. Yet this search can be performed very efficiently using the program Blast.
- Make sure there are 500 Hits selected.
- Press Blast
- You can Customize display asking to display 100 hits on one page.
- You will now have a difficult (tedious) task of selecting retroviral protease sequences from among the hits. Many entries will repeat essentially the same sequence (for instance different HIV-1 strains isolated from different AIDS patients). You want to select only one HIV-1 hit, but one that has the complete protease sequence included. Likewise with other retroviruses. The annotated sequences (gold star) are to be preferred.
- As a suggestion, look for Avian leukosis virus (ALV), Avian myeloblastosis associated virus (MAV), Lymphoproliferative disease virus (LDV), Simian retrovirus (SRV), Mason-Pfizer monkey virus (M-PMV), Feline immunodeficiency virus (FIV), Mouse mammary tumor virus (MMTV), Simian immunodeficiency virus (SIV), HIV-1, Human T-cell leukemia virus (HTLV-1). For faster searching, use the taxonomy filter. Some proteases are not similar enough to show up in just 500 hits. In that case you can search for them separately (see Step 3). Remember that the pol gene contains the sequence of the protease.
- Most of the hits correspond to (slightly) different genetic sequences of the HIV-1 virus. You can have an idea what a tremendous amount of sequence data is being fed to the database on an almost hourly basis!
- Some of our hits look strange (too long!), for instance from SRV or M-PMV. We will see what's going on later.
- In all cases that look suspicious (e.g. SIV, FIV) try to verify the protease sequence through reference to PDB (see Step 1).
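The "about 100 residues" criterion from Step 3 can be turned into a quick sanity check on the sequences you have collected. The sketch below is only an illustration: the length window of 90 to 130 residues is an assumption made here, not a rule from the exercise, and the sequences are dummy strings of the stated lengths.

```python
def flag_by_length(records, low=90, high=130):
    """Split {name: sequence} into plausible and suspicious hits by length.

    The default window (90-130 residues) is an assumed tolerance around
    "about 100 residues"; adjust it to your own judgement.
    """
    plausible, suspicious = {}, {}
    for name, seq in records.items():
        target = plausible if low <= len(seq) <= high else suspicious
        target[name] = len(seq)
    return plausible, suspicious


# Dummy sequences: only their lengths matter for this check.
hits = {"HIV-1 PR": "A" * 99, "too long hit": "A" * 314}
plausible, suspicious = flag_by_length(hits)
print("plausible:", plausible)
print("check against the PDB:", suspicious)
```

Anything that lands in the second group is a candidate for verification against a PDB entry, as described in Step 1.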
Step 5
Sequence alignment using ClustalW
In your notepad, you have now many sequences in Fasta format. They should look something like this:
>RSV PR
LAMTMEHKDR.....
>ALV PR
LAMTMEHKDR.....
>MAV PR
LAMTMEHKDR.....
>HIV-1 PR
PQITLWQRPL.....
>M-PMV PR
VTYNLEKRPT.....
>SRV PR
SRKSLTTPSG.....
>SIV PR
PQFHLWKRPV.....
.....
- You will now align those sequences using the program ClustalW available from the EMBL site.
- Copy the sequences and press Submit
Step 6
Analysis of results
In the output, the sequences will be aligned according to the conservation of their corresponding residues, with the most similar sequences at the top and the least similar at the bottom. It is easy to detect identical residues; the assignment of "similarity" is more problematic. Evidently, aspartic acid (D) and glutamic acid (E) are similar, and so are aspartic acid (D) and asparagine (N). Residues with aliphatic side chains are also similar, for instance leucine (L), isoleucine (I), valine (V), alanine (A), but of course the degree of similarity decreases from L to A as the chain gets shorter and shorter. Analogously, residues with aromatic side chains are certainly similar. There is now a lot of knowledge and experience in this field, and this knowledge is usually expressed in the form of a matrix, in which the pairwise similarities are described numerically.
Below the aligned sequences, you will find characters like * (identity) and : or . highlighting the degree of similarity of all the compared sequences at each position. If there is no mark under a given column, no reliable homology could be detected at this position; even though the residues are typed in one column in such a place, this may be simply a consequence of a better agreement down- and upstream in the sequence.
Dashes within a sequence indicate a gap. Sequence alignment programs are not very happy about gaps and usually assign a large penalty to each such break. This is because gaps imply insertions or deletions of whole chunks of sequence, which are much less likely than a simple point mutation changing the identity of a given residue. (Sometimes such point mutations in the genomic sequence do not even show up in the protein sequence because of the degeneracy of the genetic code.) You can easily explore the Amino acid properties in the ClustalW results.
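To see how the marks under the alignment arise, here is a deliberately simplified sketch. It uses only the similarity groups named in the text above (D/E, D/N, L/I/V/A and the aromatic residues F/Y/W) and a single ":" mark for similarity; ClustalW itself uses its own, more detailed groups and a scoring matrix. The three short aligned sequences are made up for the example.

```python
# Simplified similarity groups taken from the text; not ClustalW's own tables.
SIMILAR_GROUPS = [set("DE"), set("DN"), set("LIVA"), set("FYW")]


def column_mark(column):
    """Return '*' for identity, ':' for similarity, ' ' otherwise (or for gaps)."""
    residues = set(column)
    if "-" in residues:
        return " "
    if len(residues) == 1:
        return "*"
    if any(residues <= group for group in SIMILAR_GROUPS):
        return ":"
    return " "


def conservation_line(aligned):
    """Build the line of marks for a list of equally long aligned sequences."""
    return "".join(column_mark(col) for col in zip(*aligned))


aligned = ["LDTGA-", "IDTGAD", "VETGSD"]  # made-up fragment with one gap
for seq in aligned:
    print(seq)
print(conservation_line(aligned))
```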
- Looking at the alignment, decide which sequence is the least homologous. Note also the number of identities (*) and similarities (* plus : plus .).
- Now, eliminate this black sheep from the lot and run ClustalW again.
- Analyze the results as before. Perhaps you would like to do more pruning? But remember, you want to leave at least two sequences for alignment ;-).
- In a satisfactory final alignment note: (1) the most similar region(s), (2) the least similar regions. What is your interpretation of these regions?
- Identify the active site of the enzymes. What can you say about its conservation?
- Now select only two sequences for comparison, for instance RSV/HIV or HIV/HTLV.
- Run a pairwise sequence alignment, which is conceptually somewhat different from the multiple sequence alignment above.
- In the pairwise alignment output, count the number of identities nI (*) and the number of similarities nS (* plus : plus .). If you express these numbers as fractions relative to the number of residues in the shorter sequence (n1), you will have the percentage indicators of identity (100*nI/n1 %) and similarity (100*nS/n1 %).
- In your particular example, do you think the compared sequences are very similar, not particularly similar, not similar at all? Explain!
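The counting described above is easy to automate once you have the two aligned sequences and the line of marks printed under them. The sketch below follows the formulas 100*nI/n1 and 100*nS/n1 literally. The function name percent_scores is made up, the two ten-residue fragments are the beginnings of the HIV-1 and SIV examples from Step 5, and the line of marks was written by hand for illustration, so it is not real program output.

```python
def percent_scores(seq_a, seq_b, marks):
    """Return (percent identity, percent similarity) from an alignment's mark line.

    n1 is the number of residues (gaps excluded) in the shorter sequence,
    nI counts '*' and nS counts '*', ':' and '.'.
    """
    n1 = min(len(seq_a.replace("-", "")), len(seq_b.replace("-", "")))
    n_i = marks.count("*")
    n_s = n_i + marks.count(":") + marks.count(".")
    return 100.0 * n_i / n1, 100.0 * n_s / n1


seq_a = "PQITLWQRPL"
seq_b = "PQFHLWKRPV"
marks = "**: **:**:"   # hand-written marks, for illustration only
identity, similarity = percent_scores(seq_a, seq_b, marks)
print(f"identity {identity:.1f} %, similarity {similarity:.1f} %")
```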
Summary
In this part of the exercise you have learned:
- About retroviruses, such as HIV-1, HIV-2, SIV, M-PMV, RSV (ASV), FIV, EIAV, HTLV
- About aspartic proteases, and about retroviral proteases in particular
- About global bioinformatics-related databases and servers, such as ExPASy, UniProtKB, Swiss-Prot, PDB, Entrez, PubMed
- About some important bioinformatics-related institutions, such as NIH, NCBI, SIB, EBI, EMBL, RCSB
- About some programs for protein sequence analysis, such as Blast, Fasta, and ClustalW
Can you spell-out and explain the abbreviations in the above list?
PyMOL
Get to know PyMOL
- PyMOL is a highly advanced molecular graphics program originally developed by DeLano Scientific; an open-source version is available. PyMOL has its own Wiki resource, and you should first visit this site at www.pymolwiki.org.
- PyMOL manuals are available on-line at pymol.org.
- An excellent site showing advanced uses of PyMOL is maintained by Daan Van Aalten at his website.
- When you launch PyMOL, you get an External GUI (Graphical User Interface) (gray window) and a graphics viewer with an Internal GUI on the right. You can control PyMOL by selecting commands from the External GUI Menu Bar, by typing commands in the command line, or by executing (Run) a script with PyMOL commands (for an example click here). The latter mode is very useful; it lets you know exactly what has been done to get a given effect. Alternatively, a lot of options are available via the Internal GUI, which can be used to get your picture quickly. In the upper part of the Internal GUI, you always see the different "selections" defined during your session. On those selections, you can perform Actions (A), Show (S) them in different styles, Hide (H) those displayed styles, work with Labels (L), or with Colors (C). To work with a given structure, you must always Open it first, e.g. via the File menu.
Step 1
Have a look at the PDB
- In a browser window, open the PDB site.
- Search for the first "correct" structure of HIV-1 protease 3HVP.
- Note how many molecules are in the asymmetric unit and what is the biological assembly of the protein, the resolution, the space group, the size of the protein, who first determined the structure, and when.
- Launch PyMOL and type in the External GUI:
PyMOL>fetch 3hvp
- If you are going to copy/paste the command, check that the spaces (especially after commas) are preserved. Personally, I advise you to type the commands manually.
Step 2
Let's have a look at the structure
- Note, there are no water molecules in the structure. Why is that?
- The displayed line representation of the structure tells little about the protein architecture. Let's show the structure as cartoon by typing in the PyMOL GUI:
PyMOL>show cartoon, (chain A)
PyMOL>hide lines
While some graphics programs use built-in secondary structure libraries, PyMOL derives the information from the PDB file, if available. The well-ordered beta strands at both the amino and carboxyl termini are not defined in the 3HVP PDB file. We have to tell the program to draw them by typing:
PyMOL>alter A/1:4/,ss='S'
PyMOL>alter A/96:99/,ss='S'
After clicking Show/cartoon you should see the two beta strands. You can highlight the side chain of the active-site Asp25 from the catalytic triad Asp25-Thr26-Gly27 by typing:
PyMOL>show sticks, resi 25 and not (name c,n)
You have probably noticed that only one half of the active protease is displayed. Now we have to create the dimer, which is the biologically functional form of the protein. To do this, type in the command line:
PyMOL>symexp sym,3HVP,(3HVP),2.5
Now you can center the picture by clicking the right mouse button on the atom close to the centre of the homodimer and go to atom/center. You can also change the name of the second molecule by clicking Action/rename object (sym for example).
Play with the excellent select/display options of PyMOL. For instance, using the right mouse button, you can select a chain to be shown in a preset stick representation, by just one click!
- Color the chains differently. For example, make one monomer blue and the other orange
- Now, display the homodimer as cartoon, Hide/everything and Show/cartoon and color it yellow.
For the next part of the exercise we need to group the two monomers, i.e. 3HVP and its symmetry-related copy, under one name. This will make it much easier to manipulate the homodimer. Just type:
PyMOL>select dimer, (3HVP,sym)
All atoms of the new selection are now highlighted. Click with the left mouse button somewhere close to the structure and the selection markers will disappear.
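The commands typed so far can be collected into a script file, which PyMOL can then execute in one go, so that you always know exactly how a picture was made. The sketch below writes such a script as plain text. The function name build_script is made up for this illustration, and it assumes that the object created by fetch carries the same name as the PDB code you pass in. The commands themselves are copied from Steps 1 and 2 above.

```python
def build_script(pdb_id="3hvp"):
    """Return the text of a PyMOL command script for the steps done so far.

    Assumption: the fetched object is named exactly like `pdb_id`.
    """
    commands = [
        "# commands from Steps 1-2 of this exercise",
        f"fetch {pdb_id}",
        "show cartoon, (chain A)",
        "hide lines",
        "alter A/1:4/,ss='S'",
        "alter A/96:99/,ss='S'",
        "show sticks, resi 25 and not (name c,n)",
        f"symexp sym,{pdb_id},({pdb_id}),2.5",
    ]
    return "\n".join(commands) + "\n"


script = build_script()
print(script)
```

Saving the printed text under a name of your choice with the extension .pml lets you execute it from the File menu (Run).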
- Can you identify the symmetry of the homodimer?
- In what secondary structure element is the catalytic triad located?
- Which residues make flaps?
- What is the main secondary structure element gluing together the monomers?
- Find a nice view. You can write out the orientation matrix by the following commands:
log_open logfile.log
(opens a logfile)
get_view
(writes out current orientation matrix)
log_close
(closes the logfile)
- You can give the selected residues (Asp25) a nice ball-and-stick representation, for instance, using the Wizard/Appearance, which will turn atoms into spheres by simple clicks. Activating Appearance opens a new submenu window in the lower right-hand corner of the graphics window. You can control the spheres (like anything else in PyMOL) by the Setting menu Edit All... option. Editing a given parameter, e.g. sphere_scale or sphere_transparency, will take immediate effect. Try it!
- When you are done with the Appearance option, remember to click Done to return to the normal mode of operation.
- In the Mouse menu pick Selection mode as Atoms; this way you will be selecting only those atoms, on which you click.
- To get a beautiful appearance of the drawing, run the ray tracing procedure by clicking Ray in the External GUI. (Ray tracing is a special artistic procedure, which handles a three-dimensional object in the computer by shining light on it and tracking, or tracing, the reflections of all possible rays from all possible points of the object; in this way one gets shiny spots, as well as shadows, much as with natural illumination).
- If you want to print or paste your figure (Yyyyyyyeeeesssss!!!), it will look better on white background, which you can select from the Display menu.
- The figure can also be printed in stereo. For this purpose, you have to prepare two drawings (*.png files), one for the left eye (-left) and one for the right eye (-right). (This is for "wall" or parallel viewing; if you prefer to cross your eyes, as I do, simply swap the left/right drawings.)
ray angle=+3
(renders the scene from the left eye's point of view)
png left.png
(writes a file (left.png) with the current scene)
ray angle=-3
(renders the scene from the right eye's point of view)
png right.png
(writes a file (right.png) with the current scene)
- You can now paste the left.png and right.png files into your Word or PowerPoint document, or preferably into a graphics program such as Corel, for printing. For proper viewing (from a distance of 30-40 cm with normal pupil separation), the components of the stereo pair should be 60 mm apart (distance between equivalent points).
- At the end of this part color the homodimer in blue and change the background to black.
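When you assemble the stereo pair in a graphics program, the 60 mm separation mentioned above has to be converted into pixels, which depends on the print resolution. A minimal sketch of that conversion follows; the function name and the example resolutions are assumptions made here.

```python
MM_PER_INCH = 25.4


def mm_to_pixels(mm, dpi):
    """Convert a length on paper (mm) to pixels at a given resolution (dots per inch)."""
    return round(mm / MM_PER_INCH * dpi)


# Distance between equivalent points of the left and right panels.
print(mm_to_pixels(60, 300))  # at 300 dpi
print(mm_to_pixels(60, 96))   # at 96 dpi
```

Since equivalent points must be 60 mm apart, each panel can be at most that wide if the two panels are not to overlap.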
Step 3
- In the PDB site search for the 4HVP HIV-1 protease.
- Note how many molecules are in the asymmetric unit, the resolution, the space group, and when the structure was determined.
- Now, in PyMOL:
PyMOL>fetch 4HVP
- You will immediately notice water molecules shown as small red crosses. Why do you think they are present in this structure?
- Create a cartoon representation of the homodimer:
PyMOL>show cartoon, (chain A)
PyMOL>show cartoon, (chain B)
a sticks representation of an inhibitor and active Asp25:
PyMOL>select ligand,resn 2NC
PyMOL>show sticks, (ligand)
PyMOL>show sticks, resi 25 and not (name c,n)
and finally hiding all lines and waters:
PyMOL>hide lines
PyMOL>hide nonbonded
- Color the molecule blue.
- Is the displayed inhibitor symmetric? What side chains can you identify? Are there any noncanonical side chains present? Why is the inhibitor not cleaved by the enzyme?
Step 4
Superimposing two structures
- In order to be able to see regions in which two related structures agree or disagree, it is convenient to superimpose the structures. The simplest way to do this is with the align command. The align command works fine if the two structures have similar sequences. Luckily both structures have identical sequence.
- Superimpose the 4HVP onto the 3HVP structure by typing:
PyMOL>align 4HVP,(dimer)
- Zoom the picture out to find the superimposed structures and center the picture.
- What are the main differences between the aligned structures? Are the tips of the flaps in the same arrangement?
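Differences between two superimposed structures are commonly summarized by a single number, the root-mean-square deviation (RMSD) between equivalent atoms, and PyMOL prints such a value when align finishes. The sketch below only illustrates the formula on coordinates that are assumed to be already superimposed and paired one-to-one. It does not perform any fitting, and the coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points.

    The points are assumed to be paired and already superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Made-up coordinates: the second atom differs by 2 units along z.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(round(rmsd(a, b), 3))
```

A small overall RMSD can still hide large local differences, which is why you should inspect regions such as the flaps by eye.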
Fun
If you have finished your assignments quickly, take a few moments and test your knowledge of molecular biology by taking Swiss-Quiz from the ExPASy website. Alternatively, or in addition, you may try my PX (Protein Crystallography) Quiz. Unfortunately, it is a mixture of English and Polish questions.
Last modification: Mar 21, 2014 (SW)