Rfam is a database containing information about non-coding RNA (ncRNA) families and other structured RNA elements. It is an annotated, open access database hosted by the Wellcome Trust Sanger Institute in collaboration with Janelia Farm. Rfam is designed to be similar to the Pfam database for annotating protein families.
Unlike proteins, ncRNAs often have similar secondary structure without sharing much similarity in the primary sequence. Rfam divides ncRNAs into families based on evolution from a common ancestor. Producing multiple sequence alignments (MSAs) of these families can provide insight into their structure and function, as is the case for protein families. These MSAs become more useful with the addition of secondary structure information. Rfam researchers also contribute to Wikipedia. The INFERNAL package can also be used with Rfam to annotate sequences (including complete genomes) for homologues to known ncRNAs.
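The point about conserved structure with divergent sequence can be illustrated with a small Python sketch. The two sequences and the hairpin structure below are invented for illustration and are not taken from Rfam; the helper names are likewise hypothetical. The sketch compares plain sequence identity with the fraction of annotated base pairs that remain complementary.

```python
# Toy illustration with invented sequences (not real Rfam data).
# Watson-Crick pairs plus the G-U wobble pair.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def pair_table(structure):
    """Return sorted (i, j) index pairs for matching brackets in dot-bracket notation."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return sorted(pairs)


def identity(a, b):
    """Fraction of aligned positions with identical residues."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def fraction_paired(seq, pairs):
    """Fraction of annotated pairs that form a canonical base pair in seq."""
    return sum((seq[i], seq[j]) in CANONICAL for i, j in pairs) / len(pairs)


structure = "((((....))))"   # a four-pair stem closing a four-base loop
seq_a = "GGCAGAAAUGCC"
seq_b = "AUGCGAAAGCAU"

pairs = pair_table(structure)
print(f"sequence identity: {identity(seq_a, seq_b):.2f}")
print(f"pairs kept in a:   {fraction_paired(seq_a, pairs):.2f}")
print(f"pairs kept in b:   {fraction_paired(seq_b, pairs):.2f}")
```

In this toy case the two sequences agree only in the loop, yet every stem position still pairs in both, which is the kind of signal a purely sequence-based comparison would miss.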
Methods
In the database, the secondary structure information and the primary sequence, represented by the MSA, are combined in statistical models called profile stochastic context-free grammars (SCFGs), also known as covariance models. These are analogous to the hidden Markov models used for protein family annotation in the Pfam database. Each family in the database is represented by two multiple sequence alignments in Stockholm format and an SCFG.
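As a rough illustration of how a Stockholm alignment carries both kinds of information, the following Python sketch reads the aligned sequences and the consensus structure line (`#=GC SS_cons`) from a tiny invented alignment and lists the paired columns. It handles only a minimal subset of the format, and the family name, sequences, and function names are hypothetical; it is not a substitute for a full Stockholm parser.

```python
# Minimal sketch: a tiny, invented Stockholm alignment (not real Rfam data).
EXAMPLE = """# STOCKHOLM 1.0
#=GF ID   toy_hairpin
seq1         GGCAGAAAUGCC
seq2         AUGCGAAAGCAU
#=GC SS_cons <<<<....>>>>
//
"""

# Bracket types used for nested base pairs in the structure line.
OPEN_TO_CLOSE = {"<": ">", "(": ")", "[": "]", "{": "}"}


def parse_stockholm(text):
    """Return ({name: aligned_sequence}, ss_cons) for one alignment.

    Only sequence lines and '#=GC SS_cons' are read; all other
    annotation lines are skipped. Interleaved blocks are concatenated.
    """
    seqs, ss_cons = {}, ""
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "//":
            continue
        if line.startswith("#=GC SS_cons"):
            ss_cons += line.split()[2]
        elif line.startswith("#"):
            continue  # header and other annotation lines
        else:
            name, chunk = line.split()[:2]
            seqs[name] = seqs.get(name, "") + chunk
    return seqs, ss_cons


def base_pairs(ss_cons):
    """Return sorted (i, j) column pairs; pseudoknot letters are ignored."""
    stacks = {close: [] for close in OPEN_TO_CLOSE.values()}
    pairs = []
    for i, ch in enumerate(ss_cons):
        if ch in OPEN_TO_CLOSE:
            stacks[OPEN_TO_CLOSE[ch]].append(i)
        elif ch in stacks:
            pairs.append((stacks[ch].pop(), i))
    return sorted(pairs)


seqs, ss_cons = parse_stockholm(EXAMPLE)
for i, j in base_pairs(ss_cons):
    # Show which residues each sequence places in the paired columns.
    print(i, j, [(s[i], s[j]) for s in seqs.values()])
```

A covariance model is built from exactly this combination: the columns of the alignment together with the knowledge of which columns pair with each other.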
The first MSA is the "seed" alignment. It is a hand-curated alignment that contains representative members of the ncRNA family and is annotated with structural information. This seed alignment is used to create the SCFG, which is used with the Rfam software INFERNAL to identify additional family members and add them to the alignment. A family-specific threshold value is chosen to avoid false positives.
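The role of a family-specific threshold can be sketched as follows. The example assumes that each hit carries a bit score and that a hit is accepted when its score is at least the cutoff for its family; the family names, cutoffs, and hits are invented for illustration and do not come from Rfam.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    family: str       # family whose model produced the hit
    seq_id: str       # location of the hit (hypothetical identifiers)
    bit_score: float  # model score; assumed to be a bit score


# Hypothetical per-family cutoffs (invented values).
THRESHOLDS = {"FamA": 40.0, "FamB": 55.0}


def passes_threshold(hit, thresholds):
    """Accept a hit only if it reaches the cutoff of its own family."""
    return hit.bit_score >= thresholds[hit.family]


hits = [
    Hit("FamA", "chr1:100-180", 47.2),
    Hit("FamA", "chr2:5-90", 31.0),
    Hit("FamB", "chr1:900-1010", 50.1),
    Hit("FamB", "chr3:40-150", 61.8),
]

accepted = [h for h in hits if passes_threshold(h, THRESHOLDS)]
for h in accepted:
    print(h.family, h.seq_id, h.bit_score)
```

Note that the FamB hit scoring 50.1 is rejected even though it outscores the accepted FamA hit at 47.2: a single global cutoff would either admit false positives for one family or discard true members of another.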
Searching with profile SCFGs is computationally very expensive: even for a small ncRNA family, a full search takes an unreasonable amount of time. To reduce the search time, an initial BLAST search is used to reduce the search space to a manageable size.
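The effect of such a prefilter on the search space can be sketched in Python. The example assumes that the fast search returns hit intervals, that each interval is extended by a fixed flank, and that overlapping intervals are merged before the expensive model is run on them; the coordinates, the flank size, and the function name are all hypothetical, and the actual Rfam pipeline may differ in its details.

```python
def candidate_regions(hits, flank, seq_len):
    """Pad each half-open (start, end) hit by `flank` and merge overlaps."""
    padded = sorted(
        (max(0, start - flank), min(seq_len, end + flank)) for start, end in hits
    )
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


seq_len = 1_000_000  # length of the sequence being searched (invented)
prefilter_hits = [(1200, 1290), (1350, 1420), (500_000, 500_075)]

regions = candidate_regions(prefilter_hits, flank=100, seq_len=seq_len)
retained = sum(end - start for start, end in regions) / seq_len

print(regions)
print(f"fraction passed to the expensive search: {retained:.6f}")
```

Only the merged regions are scored by the second, slower stage, which is where the speed-up comes from. The trade-off is that any true homolog the fast filter misses never reaches the second stage at all.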
The second MSA is the "full" alignment. It is created by searching the sequence database with the covariance model. All detected homologs are aligned to the model, giving the automatically produced full alignment.
History
Version 1.0 of Rfam was launched in 2003; it contained 25 ncRNA families and annotated about 50,000 ncRNA genes. In 2005, version 6.1 was released, containing 379 families annotating over 280,000 genes. As of January 2010, the current version, 10.0, contains 1,446 RNA families annotating over 3,192,596 genes.
Problems
1. Use of a BLAST search to reduce the ncRNA search space to a computationally manageable size causes reduced sensitivity in finding true homologs of the ncRNA family.
2. The genomes of higher eukaryotes contain many ncRNA-derived pseudogenes and repeats. Distinguishing these non-functional copies from functional ncRNA is a formidable challenge.
3. Introns are not modeled by covariance models.
External links
Rfam Web site at the Sanger Institute
INFERNAL software package
miRBase