Secondary Databases in Bioinformatics

Secondary databases are so called because they contain the results of analysing the sequences held in the primary sources. SWISS-PROT has emerged as the most popular primary source, and many secondary databases are built on it because of its versatility.

Need for Secondary Databases

Put simply, a secondary database contains information derived from primary sequence data. This information takes the form of regular expressions (patterns), fingerprints, blocks, profiles or Hidden Markov Models, and the type of information stored differs from one secondary database to another. What they share is a starting point: homologous sequences are gathered together in multiple alignments. Within a multiple alignment there are conserved regions that show little or no variation between the constituent sequences. These conserved regions are called motifs. Motifs usually reflect some vital biological role and are crucial to the structure or function of the protein. This is why secondary databases matter: by concentrating on motifs, we can find the regions that are conserved across sequences and study the functional and evolutionary relationships of the proteins and of the organisms they come from. Some of the common secondary databases are discussed below.
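
The idea of a motif as a run of conserved columns can be sketched in a few lines of Python. This is only an illustration: the alignment is made up, and the names conserved_columns, group_runs, min_fraction and min_width are hypothetical helpers and parameters, not part of any database tool.

```python
from collections import Counter

def conserved_columns(alignment, min_fraction=1.0):
    """Return indices of columns whose most common residue reaches min_fraction."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        residue, count = Counter(column).most_common(1)[0]
        # A column dominated by gaps is not treated as conserved.
        if residue != "-" and count / len(column) >= min_fraction:
            conserved.append(i)
    return conserved

def group_runs(indices, min_width=3):
    """Group consecutive column indices into (start, end) runs, end inclusive."""
    runs = []
    start = prev = None
    for i in indices:
        if start is None:
            start = prev = i
        elif i == prev + 1:
            prev = i
        else:
            if prev - start + 1 >= min_width:
                runs.append((start, prev))
            start = prev = i
    if start is not None and prev - start + 1 >= min_width:
        runs.append((start, prev))
    return runs

# Toy alignment (invented sequences, not real database entries).
toy_alignment = [
    "MKTAGHCWLE-PA",
    "MRTAGHCWIEQPA",
    "MKSAGHCWLDQPS",
]
cols = conserved_columns(toy_alignment)
motifs = group_runs(cols)
for start, end in motifs:
    print(start, end, toy_alignment[0][start:end + 1])
```

With the default settings only fully conserved columns count, and isolated conserved columns are ignored because a motif is assumed here to be at least three residues wide. Lowering min_fraction would tolerate some variation within a column.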

a) PROSITE

PROSITE was the first secondary database to be developed. Protein families usually contain a few highly conserved motifs, and these can be encoded so that they point to particular biological functions. When a new sequence is searched against such a database, the family to which the protein belongs can often be identified quickly. This is the importance of PROSITE. Within PROSITE, motifs are encoded as regular expressions (called patterns). Each entry is deposited in two distinct files. The first file gives the pattern and lists all matches to it, whereas the second gives details of the family, a description of its biological role and so on. Patterns are derived by constructing a multiple alignment and inspecting it manually. PROSITE therefore contains documentation entries describing protein domains, families and functional sites, together with the associated patterns and profiles used to identify them.
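
Because PROSITE patterns are regular expressions in all but notation, they translate almost directly into Python's re syntax. The sketch below assumes the common pattern conventions (x for any residue, square brackets for allowed residues, curly brackets for forbidden residues, parentheses for repeat counts, < and > for the sequence ends) and does not try to cover every corner of the syntax. The function names prosite_to_regex and find_matches are hypothetical, and the test sequences are invented.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Simplified sketch: anchors written inside square brackets are not handled.
    """
    regex = pattern.rstrip(".")          # patterns are often written with a final period
    regex = regex.replace("-", "")       # '-' only separates pattern elements
    regex = regex.replace("x", ".")      # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4) = 2 to 4 repeats
    regex = regex.replace("<", "^").replace(">", "$")   # N- and C-terminal anchors
    return regex

def find_matches(pattern, sequence):
    """Return (start, matched_text) for every match, including overlapping ones."""
    # A lookahead lets the scan report matches that overlap each other.
    scanner = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in scanner.finditer(sequence)]

# N-glycosylation site pattern, a classic short PROSITE pattern.
glyco = "N-{P}-[ST]-{P}"
print(prosite_to_regex(glyco))
print(find_matches(glyco, "AANGSAA"))
print(find_matches(glyco, "MNNTSA"))
print(find_matches(glyco, "AANPSAA"))
```

Short patterns like this one match many unrelated sequences by chance, which is one reason the manual inspection step described above matters.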

b) PRINTS - fingerprint database

PRINTS is another secondary database. Most protein families are characterized by several conserved motifs rather than one, and all of these motifs together can be used to construct the 'signature' of a family. This is the principle behind the PRINTS database. Within PRINTS, motifs are encoded as unweighted local alignments. A small initial multiple alignment is taken and its conserved motifs are identified. These regions are then searched against the sequence database to find similar sequences, and the results are analysed to find the sequences that match all the motifs within the fingerprint. PROSITE and PRINTS are the only manually annotated secondary databases. PRINTS is thus a diagnostic collection of protein fingerprints.
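
The idea of matching every motif in a fingerprint can be sketched as follows. This is a simplified illustration, not the actual PRINTS scanning procedure: each motif is scored by plain residue counts from its unweighted alignment, motifs are required to appear in order, and the 0.8 threshold is an arbitrary assumption. The names motif_counts, best_hit and matches_fingerprint, and all the sequences, are invented for the example.

```python
from collections import Counter

def motif_counts(motif):
    """Per-position residue counts for an unweighted, ungapped motif alignment."""
    return [Counter(column) for column in zip(*motif)]

def best_hit(counts, sequence, start=0):
    """Slide the motif along the sequence and return (score, position) of the best window."""
    width = len(counts)
    best = (-1, -1)
    for pos in range(start, len(sequence) - width + 1):
        # Counter returns 0 for residues never seen at that motif position.
        score = sum(counts[i][sequence[pos + i]] for i in range(width))
        if score > best[0]:
            best = (score, pos)
    return best

def matches_fingerprint(fingerprint, sequence, min_fraction=0.8):
    """Return the hit positions if every motif matches in order, otherwise None."""
    hits = []
    start = 0
    for motif in fingerprint:
        counts = motif_counts(motif)
        max_score = sum(c.most_common(1)[0][1] for c in counts)
        score, pos = best_hit(counts, sequence, start)
        if pos < 0 or score < min_fraction * max_score:
            return None          # one missing motif breaks the fingerprint
        hits.append(pos)
        start = pos + len(counts)  # the next motif must lie downstream
    return hits

# Toy fingerprint with two motifs (invented data).
toy_fingerprint = [
    ["GHCW", "GHCW", "GHSW"],
    ["PALK", "PAIK", "PALR"],
]
print(matches_fingerprint(toy_fingerprint, "MKGHCWTTTPALKE"))
print(matches_fingerprint(toy_fingerprint, "MKGHCWTTTQQQQE"))
```

The second sequence carries the first motif but not the second, so it is rejected. Requiring several motifs at once is what gives a fingerprint more diagnostic power than a single pattern.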

c) Blocks

The limitations of the above two databases led to the creation of the Blocks database. In this database the motifs (here called blocks) are created automatically by detecting and highlighting the most conserved regions of each family of proteins. Blocks is therefore a fully automated database. Keyword searching and sequence searching are its two most important features. Blocks are ungapped multiple sequence alignments representing conserved protein regions.
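
What "ungapped" means in practice can be shown with a small sketch that cuts a gapped alignment into gap-free segments. This only illustrates the shape of a block; it is not the automated procedure used to build the Blocks database. The function name ungapped_blocks, the min_width parameter and the alignment are all assumptions made for the example.

```python
def ungapped_blocks(alignment, min_width=4):
    """Split a gapped alignment into gap-free segments of at least min_width columns.

    Returns a list of (start_column, [segment for each sequence]).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    blocks = []
    start = None
    # Iterate one step past the end so a block that reaches the last column is closed.
    for i in range(length + 1):
        gap_free = i < length and all(seq[i] != "-" for seq in alignment)
        if gap_free and start is None:
            start = i
        elif not gap_free and start is not None:
            if i - start >= min_width:
                blocks.append((start, [seq[start:i] for seq in alignment]))
            start = None
    return blocks

# Toy gapped alignment (invented sequences).
toy_gapped = [
    "MK-TAGHCWLE--PALKV",
    "MRQTAGHCWIEQ-PAIKV",
    "MK-SAGHCWLDQSPALRV",
]
for start, segments in ungapped_blocks(toy_gapped):
    print(start, segments)
```

Each block here is a rectangle of residues with no gap characters, which is the property that lets blocks be scored position by position.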

d) Profiles

A profile database is used to find the most conserved regions in a sequence alignment. A profile is weighted position by position, and it also indicates where insertions and deletions (indels, in bioinformatics wording) are allowed in the sequence. An indel is either the insertion of residues into a sequence or the deletion of residues from it. Profiles are also known as 'weight matrices', and they provide a means of detecting distant sequence relationships.
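
A weight matrix can be sketched as a table of log-odds scores, one column per alignment position. The version below is deliberately minimal and rests on several assumptions: a uniform background frequency for the 20 standard amino acids, a pseudocount of 1, and no gap handling at all, so the indel weighting of a real profile is left out. The names build_pssm, score_window and best_window are hypothetical, and the block is invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(block, pseudocount=1.0):
    """Build log-odds scores per position from an ungapped alignment block."""
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    pssm = []
    for column in zip(*block):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score_window(pssm, window):
    """Sum the position-specific scores for a window as wide as the matrix."""
    # Only the 20 standard residues are supported in this sketch.
    return sum(column[aa] for column, aa in zip(pssm, window))

def best_window(pssm, sequence):
    """Return (score, position) of the highest-scoring window, or None if too short."""
    width = len(pssm)
    scored = [
        (score_window(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored) if scored else None

toy_block = ["GHCW", "GHCW", "GHSW"]   # invented block
pssm = build_pssm(toy_block)
print(best_window(pssm, "MKGHCWTT"))
print(score_window(pssm, "GHSW"), score_window(pssm, "AAAA"))
```

Unlike a pattern, the matrix gives a graded score: a window that differs from the consensus at one position still scores well, which is what makes this kind of representation useful for distant relationships.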
