Monday, March 5, 2012

Predicting active site residue annotations in the Pfam database.(Database)(Report)

Authors: Jaina Mistry (corresponding author) [1]; Alex Bateman [1]; Robert D Finn [1]

Background

Enzymes play a considerable role in controlling the flow of metabolites within a cell; they catalyze virtually all of the reactions that make and modify the molecules required in biological pathways. Only a small number of residues within an enzyme are directly involved in catalysis and the structure and chemical properties of these residues (termed the active site) determine the chemistry of the enzyme. For this reason active site residues are highly conserved.

Pfam [1] is a database of 8296 protein families (as of Pfam release 20.0). Only ~0.4% of the sequences contained within the enzymatic Pfam families (i.e. those families that contain at least one characterized catalytic site) have experimentally determined active site residues. There are families within Pfam that we know are catalytic, yet the residues that perform catalysis have not been characterized for any of the sequences within them, for example family YgbB (PF02542). Even where a structure is known, there are cases where the catalytic residues have not been identified (e.g. Swiss-Prot:P30085). Although the proportion of sequences with characterized catalytic residues is low, many enzymatic sequences within a Pfam alignment are homologous to a protein whose catalytic residues have been characterized.

The fraction of characterized sequences continues to diminish as high throughput genome sequencing projects generate more and more data. To overcome the lack of experimental data we can use computational methods to predict functional residues on new protein sequences.

A range of approaches has been applied to the task of predicting active sites in protein sequences computationally. These can be split into two broad categories: those that transfer experimentally characterized active site data by similarity and those that predict active site residues ab initio.

The ab initio methods for catalytic site prediction exploit some of the known properties of active sites: active sites are usually found buried within a cleft of a protein, mutations in them can often increase the stability of an enzyme, and they are highly conserved. This has led to the use of geometry data [2, 3, 4, 5], stability profiles [4, 6, 7] and sequence conservation [8, 9, 10, 11] in active site prediction. In addition, the different approaches can be used in combination. Evolutionary trace (ET) is one such method: it first identifies the most highly conserved residues in related sequences, maps them onto the structure of the protein, and then examines the structure for clusters of residues that could correspond to active sites or other functional sites [12]. ET has been applied in automated approaches that have been reported to predict active sites successfully for structures in 60-80% of test cases [13, 14, 15]. There has been some work on developing motif-based methods to predict functional sites; however, these have generally shown a high rate of false positives (FPs) [16, 17, 18]. Neural networks [19] and support vector machines [20, 21] are other computational approaches that use structure and sequence information to predict active site residues. The different methods are hard to compare in terms of accuracy, since a range of tests has been used and in each case the tests were performed on a relatively small set of different enzymes (<200 structures in the case of the structural methods). However, it is clear that they all have a relatively high rate of FPs.
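As a rough illustration of the sequence conservation signal that several of these ab initio methods draw on, the sketch below scores each column of a small alignment by Shannon entropy, where a score of zero means the column is fully conserved. The toy alignment, the function name and the choice to ignore gaps are assumptions made for illustration; this is not the scoring scheme of any of the cited methods.

```python
import math
from collections import Counter

def column_entropy(alignment, column):
    """Shannon entropy (bits) of one alignment column, ignoring gaps."""
    residues = [seq[column] for seq in alignment if seq[column] not in "-."]
    if not residues:
        return None  # column holds only gaps, so it gets no score
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical toy alignment (not real Pfam data)
toy_alignment = [
    "GDSGG",
    "GDSAG",
    "GDSGA",
    "ADS-G",
]

scores = [column_entropy(toy_alignment, i) for i in range(len(toy_alignment[0]))]
# Columns with zero entropy carry a single residue type in every sequence
conserved = [i for i, s in enumerate(scores) if s == 0]
```

Conservation alone cannot distinguish catalytic residues from residues conserved for structural reasons, which is one reason such scores are usually combined with structural information.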

Similarity transfer based methods use tools such as BLAST searches, hidden Markov models (HMMs), pattern matching and structural templates to first identify sequences homologous to those with known active site residues, and then transfer active site residues from the characterized sequences to the uncharacterized sequences. The Catalytic Site Atlas (CSA) [22] is a database that collates active site residues from the literature for proteins with a known structure. It also provides active site residue predictions for proteins with a known structure, which it infers on the basis of PSI-BLAST hits, and it is one of the largest resources for catalytic sites. Another database containing literature-collated and predicted active site residues is UniProtKB [23], the central repository for protein sequences. UniProtKB is composed of two sections, the hand-annotated 'UniProtKB/Swiss-Prot' section and the automatically generated 'UniProtKB/TrEMBL' section. However, UniProtKB currently predicts active site residues by similarity only for sequences in UniProtKB/Swiss-Prot, and not for the sequences in the automatically generated UniProtKB/TrEMBL entries, which form ~94% of this database. Additionally, it can sometimes be difficult to trace the evidence for a particular active site prediction in UniProtKB. PROSITE [24] is a database that contains a collection of regular expressions (patterns) against which sequences can be searched. Each regular expression represents a conserved motif such as an active site region. Each PROSITE pattern is searched against UniProtKB/Swiss-Prot and the resulting matches are manually annotated by curators as true positives (TP), false positives (FP), false negatives (FN) or potential (P). PROSITE matches to UniProtKB/TrEMBL sequences are available via InterPro [25]. These matches are verified using a set of secondary patterns derived from the PROSITE pattern, which are computed with the eMotif algorithm [26]. A stringent threshold of E = 10^-9 is used so that each eMotif pattern is expected to produce a random false positive hit in 1 in 10^9 matches. Based on the results of eMotif, UniProtKB/TrEMBL matches are annotated as 'true' or 'unknown'. Although not specifically designed for active site predictions, large scale PROSITE matches are available for UniProtKB sequences, making them a useful resource against which to compare our predicted data.
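To make the idea of searching sequences against a PROSITE-style pattern concrete, the following sketch translates a small subset of the PROSITE pattern syntax (`x` for any residue, `[..]` for alternatives, `{..}` for exclusions, `(n)` or `(n,m)` for repeats, `<` and `>` for terminal anchors) into a Python regular expression. The function name, the example pattern and the test sequence are illustrative assumptions; this is not PROSITE's own scanning software and it does not cover every feature of the syntax.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (2) or (2,4)
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":               # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)

# Illustrative pattern and made-up sequence
regex = prosite_to_regex("[ST]-x(2)-[DE]")
hits = [m.start() for m in re.finditer(regex, "MKSAAELTPPD")]
```

Short patterns like this one match many unrelated sequences by chance, which is why the matches need curation or a secondary verification step as described above.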

Protein domain databases such as SMART [27] and MEROPS [28] also collate active site data from the literature and use sequence similarity based transfer to annotate active site residues onto the sequences in their protein families.

Pfam contains a large collection of protein alignments and is one of the leading protein domain databases in terms of sequence coverage; 74% of the sequences in UniProtKB have at least one match to a Pfam domain (statistics taken from Pfam 20.0). Pfam contains the experimental active site annotations present in UniProtKB. To enrich the sequence annotations in Pfam, we have taken known active site residues defined by UniProtKB that occur within a Pfam alignment and used them to predict active site residues on other sequences within the same alignment. Using this methodology we have created one of the largest databases of active site predictions. Here we outline our methodology for active site residue transfer and compare our prediction data to four other databases. We also estimate the specificity and sensitivity of our methodology.
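The general idea of transferring annotations within an alignment can be sketched as follows: each known active site position is mapped to its alignment column, and any other sequence carrying the same residue in every one of those columns receives a prediction at its own residue positions. The acceptance rule used here (identical residues in all annotated columns), the function names and the toy alignment are assumptions for illustration only; the exact rules used for Pfam are not given in this section.

```python
GAPS = "-."

def seq_pos_to_column(aligned, pos):
    """Map a 1-based ungapped residue position to a 0-based alignment column."""
    seen = 0
    for col, ch in enumerate(aligned):
        if ch not in GAPS:
            seen += 1
            if seen == pos:
                return col
    raise ValueError("position beyond end of sequence")

def column_to_seq_pos(aligned, col):
    """Map a 0-based alignment column back to a 1-based residue position."""
    if aligned[col] in GAPS:
        return None
    return sum(1 for ch in aligned[: col + 1] if ch not in GAPS)

def transfer_sites(alignment, source_id, source_sites):
    """Predict sites on the other sequences from one annotated sequence."""
    source = alignment[source_id]
    columns = [seq_pos_to_column(source, p) for p in source_sites]
    predictions = {}
    for seq_id, aligned in alignment.items():
        if seq_id == source_id:
            continue
        # Assumed rule: every annotated column must hold the same residue
        if all(aligned[c].upper() == source[c].upper() for c in columns):
            predictions[seq_id] = [column_to_seq_pos(aligned, c) for c in columns]
    return predictions

# Hypothetical toy alignment; "known" has annotated sites at residues 3 and 5
toy = {
    "known": "MK-DHSAG",
    "seq_a": "-KADHSGG",
    "seq_b": "MK-NHSAG",
}
predicted = transfer_sites(toy, "known", [3, 5])
```

In this toy case only `seq_a` receives a prediction, because `seq_b` has a different residue in one of the annotated columns. Positions are counted from the start of each aligned fragment, so real alignments would also need the start offset of each domain match within its full-length sequence.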

Construction and content

The manually curated thresholds for each Pfam family are chosen such that the family contains no known FPs; therefore, all sequences within a family can be considered homologous [1]. The active site Pfam families can contain both active and inactive homologues. This gives us a starting point: an alignment of sequences that share a particular domain.

The Pfam flatfiles originally contained the active site residue annotations present in UniProtKB/Swiss-Prot. As authors of the Pfam database we noticed that within the catalytic Pfam families, very few sequences had active site residue annotations and within the large alignments, the known active …
