Authors: Jaina Mistry (corresponding author) [1]; Alex Bateman [1]; Robert D Finn [1]
Background
Enzymes play a considerable role in controlling the flow of metabolites within a cell; they catalyze virtually all of the reactions that make and modify the molecules required in biological pathways. Only a small number of residues within an enzyme are directly involved in catalysis, and the structure and chemical properties of these residues (termed the active site) determine the chemistry of the enzyme. For this reason, active site residues are highly conserved.
Pfam [1] is a database of 8296 protein families (as of Pfam release 20.0). Only ~0.4% of the sequences contained within the enzymatic Pfam families (i.e. those families that contain at least one characterized catalytic site) have experimentally determined active site residues. Some Pfam families are known to be catalytic, yet the residues that perform catalysis have not been characterized for any of the sequences within them, for example family YgbB (PF02542). Even where a structure is known, there are cases where the catalytic residues have not been identified (e.g. Swiss-Prot:P30085). Although the proportion of sequences with characterized catalytic residues is low, many enzymatic sequences within a Pfam alignment are homologous to a protein whose catalytic residues have been characterized.
The fraction of characterized sequences continues to diminish as high throughput genome sequencing projects generate more and more data. To overcome the lack of experimental data we can use computational methods to predict functional residues on new protein sequences.
A range of approaches has been applied to the task of predicting active sites in protein sequences computationally. These can be split into two broad categories: those that transfer experimentally characterized active site data by similarity and those that predict active site residues
The
Similarity-transfer-based methods use tools such as BLAST searches, hidden Markov models (HMMs), pattern matching and structural templates to first identify sequences homologous to those with known active site residues, and then transfer active site residues from the characterized sequences to the uncharacterized sequences. The Catalytic Site Atlas (CSA) [22] is a database that collates active site residues from the literature for proteins with a known structure. It also provides active site residue predictions for proteins with a known structure, which it infers on the basis of PSI-BLAST hits, and it is one of the largest resources for catalytic sites. Another database containing literature-collated active site residues and predicted active site residues is UniProtKB [23], the central repository for protein sequences. UniProtKB is composed of two sections, the hand-annotated 'UniProtKB/Swiss-Prot' section and the automatically generated 'UniProtKB/TrEMBL' section. UniProtKB, however, currently predicts active site residues by similarity only for sequences in UniProtKB/Swiss-Prot, and not for the sequences in the automatically generated UniProtKB/TrEMBL entries, which form ~94% of this database. Additionally, it can sometimes be difficult to trace the evidence for a particular active site prediction in UniProtKB. PROSITE [24] is a database that contains a collection of regular expressions (patterns) against which sequences can be searched. Each regular expression represents a conserved motif such as an active site region. Each PROSITE pattern is searched against UniProtKB/Swiss-Prot and the resulting matches are manually annotated by curators as true positives (TP), false positives (FP), false negatives (FN) or potential (P). PROSITE matches to UniProtKB/TrEMBL sequences are available via InterPro [25]. These matches are verified using a set of secondary patterns derived from the PROSITE pattern, which are computed with the eMotif algorithm [26].
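To illustrate the pattern-matching approach, the following is a minimal sketch of how a PROSITE-style pattern could be translated into a regular expression and scanned against a sequence. It is not PROSITE's own software: the function names `prosite_to_regex` and `find_matches`, the motif `G-D-S-x(2)-[ST]` and the example sequence are all hypothetical, and only the basic pattern elements (`x`, `[..]`, `{..}`, repeat counts and the `<`/`>` anchors) are handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles only the basic syntax: 'x' (any residue), [ABC] (any of),
    {ABC} (none of), (n) or (n,m) repeats, and the '<' / '>' anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (2) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        # Single letters and [ABC] classes are already valid regex syntax.
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def find_matches(pattern: str, sequence: str):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


# Hypothetical motif and sequence, for illustration only.
print(prosite_to_regex("G-D-S-x(2)-[ST]."))
print(find_matches("G-D-S-x(2)-[ST].", "MKGDSAATLL"))
```

A match only indicates that the motif is present; as described above, the real databases still rely on curation or secondary patterns to separate true from false positives.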
A stringent threshold of E = 10
Protein domain databases such as SMART [27] and
Pfam contains a large collection of protein alignments and is one of the leading protein domain databases in terms of sequence coverage; 74% of the sequences in UniProtKB have at least one match to a Pfam domain (statistics taken from Pfam 20.0). Pfam contains the experimental active site annotations present in UniProtKB. To enrich the sequence annotations in Pfam, we have taken known active site residues defined by UniProtKB that occur within a Pfam alignment and used them to predict active site residues on other sequences within the same alignment. Using this methodology we have created one of the largest databases of active site predictions. Here we outline our methodology for active site residue transfer and compare our prediction data to four other databases. We also estimate the specificity and sensitivity of our methodology.
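The idea of transferring known active site residues to other sequences within the same alignment can be sketched as follows. This is an illustrative simplification under stated assumptions, not the Pfam implementation: the function names, the toy alignment and the rule that every annotated column must hold an identical residue are assumptions made for the example, and positions are counted from the start of each aligned fragment rather than from the full-length protein.

```python
GAPS = "-."  # characters treated as alignment gaps (assumption)


def column_of(aligned: str, position: int) -> int:
    """Return the 0-based alignment column of a 1-based residue position."""
    count = 0
    for col, ch in enumerate(aligned):
        if ch not in GAPS:
            count += 1
            if count == position:
                return col
    raise ValueError(f"position {position} is outside the sequence")


def position_at(aligned: str, col: int) -> int:
    """Return the 1-based residue position found at an alignment column."""
    return sum(1 for ch in aligned[: col + 1] if ch not in GAPS)


def transfer_active_sites(alignment, source_id, source_positions):
    """Predict active site positions on the other aligned sequences.

    Assumed rule: a sequence receives a prediction only if it carries the
    same residue as the annotated sequence in every active site column.
    """
    source = alignment[source_id]
    columns = [column_of(source, p) for p in source_positions]
    predictions = {}
    for seq_id, aligned in alignment.items():
        if seq_id == source_id:
            continue
        if all(aligned[c] == source[c] for c in columns):
            predictions[seq_id] = [position_at(aligned, c) for c in columns]
    return predictions


# Toy alignment: 'known' has annotated active site residues H3 and D4.
alignment = {
    "known": "MK-HDSAC",
    "seqA": "MKTHDS-C",
    "seqB": "-KTHNSAC",
}
print(transfer_active_sites(alignment, "known", [3, 4]))
```

In this toy case only `seqA` receives a prediction, because `seqB` has a different residue in one of the annotated columns and could represent an inactive homologue.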
Construction and content
The manually curated thresholds for each Pfam family are chosen such that the family contains no known FPs; therefore, all sequences within a family can be considered homologous [1]. The active site Pfam families can contain both active and inactive homologues. This gives us an initial starting point of an alignment of sequences that share a particular domain.
The Pfam flatfiles originally contained the active site residue annotations present in UniProtKB/Swiss-Prot. As authors of the Pfam database, we noticed that within the catalytic Pfam families very few sequences had active site residue annotations, and within the large alignments, the known active …
