Protein Data Bank (PDB) and bioinformatics tools for protein structure analysis

Authors: Dr. Szymon Krzywda and Stanisław Wosicki, MSc

Aims

The purpose of this exercise is to:

Required reading

Before you start this exercise, you should be familiar with the following articles:

Getting started

The website of the Center for Biocrystallographic Research (CBB) in Poznan (www.man.poznan.pl/CBB) is a convenient launchpad for a number of biocrystallography-related databases and servers. Familiarize yourself with the CBB. When was the Center created? What are the research projects pursued in the Center? In your opinion, which one is the most interesting? Why?

The global bioinformatics servers and databases used in this exercise

You will need to access via the Internet the following global bioinformatics resources:

Stereographics

How to view stereo figures
To fully appreciate and understand complicated macromolecular structures in three dimensions, scientists often use stereoscopic figures. They are fairly easy to generate on a computer, which stores the full three-dimensional information about the macromolecule of interest (i.e. the XYZ coordinates of each atom) and then displays it in two projections, one simulating the view of the left eye ("left panel") and one that of the right eye ("right panel"). The projections are generated assuming that the eyes are separated by 60 mm and are viewing the object from a distance of about 30 cm. This corresponds to normal human eyes and the distance of good vision. When two such panels are printed (or displayed on a computer screen), they are placed side by side with their centers 60 mm apart and viewed from 30 cm.

To achieve a three-dimensional impression, you must train each eye on its respective panel and force your sight to focus on infinity, i.e. pretend that you are looking "through" the picture. The two images, initially out of focus, should then start drifting towards each other. When they "meet" in the center, you will instantly experience the impression of a three-dimensional object before your eyes. It is important to have both panels equally well illuminated and to concentrate on the figure without distraction. Because we are viewing the figure looking "into infinity", this method is known as "wall viewing".

Some people cannot achieve this experience without long training. It may be easier for them to try "cross-eye" viewing, where the panels are swapped (i.e. left panel on the right, right panel on the left) and viewed with crossed eyes. In this method you try to focus your eyes on the tip of your nose, which is close enough to crossing your eyes. If you use the cross-eye method without swapping the panels, you will have an unpleasant sensation that something is wrong: things that should be in the back (e.g. drawn with a thin line and faded) will pop up in front.

If you are unable to achieve the stereo effect with the naked eye, you will be able to do so using a special viewer. Stereoviewers have magnifying lenses that help you see the usually small stereo figure better, and a wall that separates the two optical paths so that there is no interference between the panels. A stereoviewer should be placed exactly between the two panels of a well and uniformly illuminated stereo figure, with its base exactly parallel to the base line of the drawing.

Try to see this image in stereo. The panel separation is about 100 mm, so you will have to view it from a distance of about 50 cm.
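
The viewing geometry described above comes down to simple proportions. The following is a minimal sketch, not part of the original exercise: the function names are illustrative, and the linear scaling of viewing distance with panel separation is an assumption that happens to agree with both pairs of numbers quoted in the text (60 mm at 30 cm, 100 mm at 50 cm).

```python
import math

# Values taken from the text: eyes 60 mm apart, object viewed from about 30 cm.
EYE_SEPARATION_MM = 60.0
VIEW_DISTANCE_MM = 300.0


def viewing_distance_mm(panel_separation_mm):
    """Viewing distance that keeps the separation/distance ratio of the text.

    Assumption: the distance scales linearly with the panel separation.
    """
    return panel_separation_mm * VIEW_DISTANCE_MM / EYE_SEPARATION_MM


def stereo_angle_deg(separation_mm, distance_mm):
    """Angle between the two lines of sight meeting at one point of the object."""
    return math.degrees(2.0 * math.atan((separation_mm / 2.0) / distance_mm))


if __name__ == "__main__":
    print(viewing_distance_mm(100.0))                         # distance for 100 mm panels
    print(round(stereo_angle_deg(60.0, 300.0), 2), "degrees")  # angle for the standard setup
```

Under this assumption, panels separated by 100 mm are viewed under the same angle as the standard 60 mm panels, only from proportionally further away.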

HIV Protease

In the next part of the exercise you will be working with the amino acid sequence of a protein from the HIV retrovirus. This protein is an enzyme called protease. Proteases are hydrolytic enzymes that digest (cleave) other proteins. The cleavage takes place at a peptide bond, which is why proteases belong to a larger family of enzymes called peptidases. Proteases that cleave peptide bonds far from the termini of the substrate are called endopeptidases. If they remove one amino acid from the end, they are exopeptidases. Exopeptidases are divided into aminopeptidases and carboxypeptidases. The HIV protease is an endopeptidase. From the point of view of the catalytic mechanism, proteases are divided into aspartic proteases (the active site contains two aspartic acid residues), also known as acid proteases because they usually work at very low pH (for instance, pepsin works best in the hydrochloric acid of your stomach, at about pH 2); serine proteases (the active-site nucleophile is a serine residue); cysteine proteases (the active-site nucleophile is a cysteine residue); and metalloproteases (they have a metal-cation-assisted mechanism). The HIV protease is an unusual aspartic protease.

The exercise step-by-step

Step 1

Reference entry from PDB

In this exercise you will be analyzing the sequences (primary structure) of retroviral proteases. However, since retroviruses encode their proteins in large genes, each coding for many proteins, it is not straightforward to find the isolated amino acid sequences corresponding to retroviral proteases in protein (or DNA) sequence databases. To circumvent this difficulty, we will first consult the PDB, looking for entries reporting the structure of retroviral proteases. In those studies, the sequences of the investigated proteins have usually been precisely tailored to correspond strictly to the mature protease. Thus, although the use of full structural information is not strictly the purpose of this exercise, we will start by searching the PDB for the first determined structure of a retroviral protease.
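
A PDB-format file lists the sequence of the crystallized protein in its SEQRES records, using three-letter residue codes. As a minimal sketch of how such records could be turned into a one-letter sequence (the function name and the sample record are illustrative assumptions; the sample is a hand-built ten-residue line in SEQRES layout, not a copy of any PDB entry):

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_sequences(pdb_text):
    """Return {chain_id: one-letter sequence} from the SEQRES records of a PDB file."""
    chains = {}
    for line in pdb_text.splitlines():
        if not line.startswith("SEQRES"):
            continue
        chain_id = line[11]              # column 12 holds the chain identifier
        residues = line[19:70].split()   # residue names start at column 20
        codes = [THREE_TO_ONE.get(name, "X") for name in residues]  # X = unknown
        chains.setdefault(chain_id, []).extend(codes)
    return {chain: "".join(codes) for chain, codes in chains.items()}


if __name__ == "__main__":
    # Hand-built sample record (illustrative only): serial 1, chain A, 10 residues.
    names = "PRO GLN ILE THR LEU TRP GLN ARG PRO LEU"
    sample = f"SEQRES {1:>3} A {10:>4}  {names}"
    print(seqres_to_sequences(sample))
```

For real work the sequence can also be downloaded directly in Fasta format from the PDB entry page, which avoids parsing altogether.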

Step 2

Sequence editing

Step 3

Sequence database text search

Step 4

Sequence database Blast search

Using your sequence as a query, you will now search all the available genetic and protein sequence information, looking for similar sequences. This is a formidable task because the global sequence databases store an enormous volume of sequence data. Yet this search can be performed very efficiently using the program Blast.

Step 5

Sequence alignment using ClustalW

In your notepad, you now have many sequences in Fasta format. They should look something like this:

>RSV PR
LAMTMEHKDR.....
>ALV PR
LAMTMEHKDR.....
>MAV PR
LAMTMEHKDR.....
>HIV-1 PR
PQITLWQRPL.....
>M-PMCV PR
VTYNLEKRPT.....
>SRV PR
SRKSLTTPSG.....
>SIV PR
PQFHLWKRPV.....
.....
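
If you want to check your collection before submitting it, a Fasta file is simple enough to read with a few lines of code. The sketch below is an illustration only (the function name is an assumption, and the sample uses just the ten-residue fragments shown above, not full sequences): each line starting with ">" opens a new record, and all following lines up to the next ">" are joined into one sequence.

```python
def parse_fasta(text):
    """Return {header: sequence} for Fasta-formatted text."""
    records = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            current = line[1:].strip()    # header without the leading '>'
            records[current] = []
        elif current is not None:
            records[current].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    # Truncated fragments copied from the listing above (illustrative only).
    sample = ">RSV PR\nLAMTMEHKDR\n>HIV-1 PR\nPQITLWQRPL\n"
    for name, seq in parse_fasta(sample).items():
        print(name, len(seq), seq)
```

A quick look at the names and lengths printed this way helps to spot a missing ">" line or a sequence that was pasted twice.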

Step 6

Analysis of results

In the output, the sequences will be aligned according to the conservation of their corresponding residues, with the most similar sequences at the top and the least similar at the bottom. It is easy to detect identical residues; the assignment of "similarity" is more problematic. Evidently, aspartic acid (D) and glutamic acid (E) are similar, and so are aspartic acid (D) and asparagine (N). Residues with aliphatic side chains are also similar, for instance leucine (L), isoleucine (I), valine (V) and alanine (A), but of course the degree of similarity decreases from L to A as the chain gets shorter and shorter. Analogously, residues with aromatic side chains are certainly similar. There is now a lot of knowledge and experience in this field, and this knowledge is usually expressed in the form of a matrix in which the pairwise similarities are described numerically.

Below the aligned sequences, you will find characters like * (identity) and : or . highlighting the degree of similarity of all the compared sequences at each position. If there is no mark under a given column, no reliable homology could be detected at this position; even though the residues are typed in one column in such a place, this may simply be a consequence of better agreement downstream and upstream in the sequence. Dashes within a sequence indicate a gap. Sequence alignment programs are not very happy about gaps and usually assign a large penalty for introducing such a break. This is because gaps correspond to insertions or deletions of long chunks of sequence, which are much less likely than a simple point mutation changing the identity of a given residue. (Sometimes such point mutations in the genomic sequence do not even show up in the protein sequence because of the degeneracy of the genetic code.) You can easily explore the Amino acid properties in the ClustalW results.
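
The idea behind the marks under an alignment can be sketched in a few lines. This is a simplified illustration, not the scheme ClustalW actually uses: the similarity groups are only the examples named in the text (D/E, D/N, the aliphatic residues L, I, V, A, and the aromatic residues F, Y, W), the weaker "." mark is not modelled, any column containing a gap is left unmarked, and the function names are assumptions.

```python
# Toy similarity groups taken from the examples in the text (assumption:
# a real program uses a full substitution matrix instead).
SIMILAR_GROUPS = [set("DE"), set("DN"), set("LIVA"), set("FYW")]


def column_mark(column):
    """Return '*' for identity, ':' for toy similarity, ' ' otherwise."""
    residues = set(column)
    if "-" in residues:
        return " "                       # column contains a gap
    if len(residues) == 1:
        return "*"                       # all sequences identical here
    if any(residues <= group for group in SIMILAR_GROUPS):
        return ":"                       # all residues fall in one group
    return " "


def consensus_line(aligned):
    """Build the line of marks for a list of equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    return "".join(column_mark(col) for col in zip(*aligned))


if __name__ == "__main__":
    # First six residues of the HIV-1 and SIV fragments shown earlier,
    # assumed here to align without gaps.
    rows = ["PQITLW", "PQFHLW"]
    for row in rows:
        print(row)
    print(consensus_line(rows))
```

Replacing the toy groups with scores from a substitution matrix, and subtracting a penalty for every gap, leads to the kind of numerical scoring that alignment programs optimize.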

Summary

In this part of the exercise you have learned:

Can you spell out and explain the abbreviations in the above list?

PyMOL

Get to know PyMOL

Step 1

Have a look at the PDB

Step 2

Let's have a look at the structure

Step 3

Step 4

Superimposing two structures

Fun

If you have finished your assignments quickly, take a few moments and test your knowledge of molecular biology by taking Swiss-Quiz from the ExPASy website. Alternatively, or in addition, you may try my PX (Protein Crystallography) Quiz. Unfortunately, it is a mixture of English and Polish questions.



Last modification: Mar 21, 2014 (SW)