UniGene

UniGene Build Procedure

Clustering is the process of finding subsets of sequences that belong together within a larger set. This is done by converting discrete similarity scores into boolean links between sequences: two sequences are considered linked if their similarity exceeds a threshold. UniGene clustering proceeds in several stages, with each stage adding less reliable data to the results of the preceding stage. This staged clustering affords greater control than treating all links between sequences equally.
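The idea of turning similarity scores into boolean links and then reading off clusters can be sketched as follows. This is a minimal illustration, not the UniGene implementation: the function name, the score scale, and the threshold value are all assumptions, and the grouping is done with a simple union-find over the linked pairs.

```python
from collections import defaultdict

def cluster_by_threshold(scores, threshold):
    """Group sequence IDs into clusters from pairwise similarity scores.

    scores: dict mapping (id_a, id_b) -> similarity score (hypothetical scale).
    Two sequences are linked when their score exceeds the threshold.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        ra, rb = find(a), find(b)  # registers both IDs even if they stay unlinked
        if score > threshold and ra != rb:
            parent[ra] = rb  # boolean link: merge the two groups

    clusters = defaultdict(set)
    for seq_id in list(parent):
        clusters[find(seq_id)].add(seq_id)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical scores: A-B and B-C exceed the threshold, C-D does not.
example = {("A", "B"): 0.95, ("B", "C"): 0.97, ("C", "D"): 0.40}
print(cluster_by_threshold(example, threshold=0.90))
```

Note that A and C end up in the same cluster without being compared directly, because links are transitive once scores have been reduced to booleans.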

The Stages

Sequences are screened for contaminants, repeats, and low-complexity regions. Low-complexity screening uses NCBI's Dust. The screen also covers mitochondrial and ribosomal sequences, vector contaminants, and repetitive elements. After screening, a sequence must contain at least 100 informative bp to be a candidate for entry into UniGene.
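The 100 bp rule can be illustrated with a short sketch. It assumes that the screening tools have already replaced every masked base with a non-ACGT character such as N or X, and that "informative" simply means an unmasked A, C, G, or T; both are assumptions made for illustration, as are the function names.

```python
MIN_INFORMATIVE_BP = 100  # threshold stated in the text

def informative_length(masked_seq):
    """Count bases that survived screening.

    Assumption: masked positions were replaced by 'N' or 'X', so only
    A, C, G, and T count as informative.
    """
    return sum(1 for base in masked_seq.upper() if base in "ACGT")

def is_candidate(masked_seq):
    """True if the screened sequence keeps at least 100 informative bp."""
    return informative_length(masked_seq) >= MIN_INFORMATIVE_BP


print(is_candidate("ACGT" * 25))            # exactly 100 informative bp
print(is_candidate("ACGT" * 24 + "NNNN"))   # 96 informative bp, 4 masked
```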

Gene links are found. The set of gene sequences [mRNA or genomic sequences, many of which are complete CDSs] is compared with itself. Sequence pairs which are sufficiently similar are linked together to form initial clusters.

EST-to-gene links and EST-to-EST links are added to these clusters. The set of ESTs is compared with the set of genes using megablast, and sufficiently similar sequence pairs are added to the clusters. Any link which would join two distinct clusters from the preceding stage (that is, join two sets of genes that were not linked into one cluster before the ESTs were added) is discarded. Any resulting cluster that contains neither a sequence with a polyadenylation signal nor two labelled 3' ESTs is discarded. Clusters that pass this test are called anchored clusters, since their 3' end is presumed to be known.
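The two rules in this stage can be sketched as below. The sketch rests on several assumptions that the text does not settle: links are processed in the order given, so the later of two conflicting links is the one discarded; the anchoring test is read as "a polyadenylation signal, or at least two labelled 3' ESTs"; and the names `add_est_links`, `is_anchored`, `has_polya_signal`, and `is_3prime_est` are hypothetical.

```python
def add_est_links(gene_cluster, links):
    """Attach ESTs to gene-stage clusters without merging those clusters.

    gene_cluster: dict gene ID -> cluster label from the gene-link stage.
    links: iterable of (id_a, id_b) pairs that passed the similarity test.
    Returns (membership, discarded): EST ID -> cluster label (or None),
    and the list of links dropped because they would merge two clusters.
    """
    parent, label = {}, {}

    def node(seq_id):
        # All genes of one gene-stage cluster share a single node.
        if seq_id in gene_cluster:
            return ("cluster", gene_cluster[seq_id])
        return ("est", seq_id)

    def find(n):
        if n not in parent:
            parent[n] = n
            label[n] = n[1] if n[0] == "cluster" else None
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    discarded = []
    for a, b in links:
        ra, rb = find(node(a)), find(node(b))
        if ra == rb:
            continue
        la, lb = label[ra], label[rb]
        if la is not None and lb is not None and la != lb:
            discarded.append((a, b))  # would join two distinct gene clusters
            continue
        parent[ra] = rb
        label[rb] = lb if lb is not None else la

    membership = {n[1]: label[find(n)] for n in list(parent) if n[0] == "est"}
    return membership, discarded


def is_anchored(members):
    """members: list of dicts with hypothetical boolean keys
    'has_polya_signal' and 'is_3prime_est'."""
    if any(m["has_polya_signal"] for m in members):
        return True
    return sum(1 for m in members if m["is_3prime_est"]) >= 2


genes = {"g1": 1, "g2": 1, "g3": 2}
links = [("e1", "g1"), ("e2", "e1"), ("e2", "g3"), ("e3", "g3")]
print(add_est_links(genes, links))
```

In the example, the link between e2 and g3 is dropped because e2 already belongs (through e1) to the cluster containing g1, and keeping it would merge gene clusters 1 and 2.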

Clone-based edges are added; these ensure that nonoverlapping 5' and 3' ESTs from the same clone belong to the same cluster. First, clone-based edges are found that link at least two 5' ends to a single cluster containing at least two 3' ends from the same clones. Clone-ID-based edges that are duplicated in this way within a cluster are retained even if they cause clusters from the preceding stage to be merged. Because clone labelling is imperfect, a single clone-ID-based edge is insufficient to merge two clusters.
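One way to read the duplication requirement is that a merge between two clusters needs support from at least two distinct clones, each contributing a 5' EST to one cluster and a 3' EST to the other. The sketch below implements that reading; the input layout and the function name `clone_supported_merges` are assumptions made for illustration.

```python
from collections import defaultdict

def clone_supported_merges(ests, min_clones=2):
    """Find cluster pairs joined by clone-ID edges from enough clones.

    ests: iterable of (clone_id, end, cluster_id), with end "5'" or "3'".
    Returns sorted (five_prime_cluster, three_prime_cluster) pairs that are
    supported by at least min_clones distinct clones.
    """
    five = defaultdict(set)   # clone ID -> clusters holding its 5' ESTs
    three = defaultdict(set)  # clone ID -> clusters holding its 3' ESTs
    for clone_id, end, cluster_id in ests:
        (five if end == "5'" else three)[clone_id].add(cluster_id)

    support = defaultdict(set)  # (5' cluster, 3' cluster) -> supporting clones
    for clone_id in five.keys() & three.keys():
        for c5 in five[clone_id]:
            for c3 in three[clone_id]:
                if c5 != c3:
                    support[(c5, c3)].add(clone_id)

    return sorted(pair for pair, clones in support.items()
                  if len(clones) >= min_clones)


ests = [
    ("c1", "5'", "X"), ("c1", "3'", "Y"),
    ("c2", "5'", "X"), ("c2", "3'", "Y"),
    ("c3", "5'", "Z"), ("c3", "3'", "Y"),  # only one clone links Z to Y
]
print(clone_supported_merges(ests))
```

Clusters X and Y are proposed for merging because two clones agree, while the single edge from Z to Y is ignored as a possible labelling error.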

ESTs which do not belong to an anchored cluster are rechecked at a lower level of stringency than in the preceding passes. An EST which passes this less stringent test is then added to the cluster containing the sequence that best matches the EST; such an EST is called a guest member.
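Guest assignment amounts to a best-hit lookup under a relaxed threshold. The following sketch assumes that similarity hits for one unplaced EST are available as a dictionary of scores; the function name, the score scale, and the threshold value are hypothetical.

```python
def assign_guest(hits, cluster_of, relaxed_threshold):
    """Pick the cluster of the best-matching clustered sequence.

    hits: dict sequence ID -> similarity score for one unplaced EST.
    cluster_of: dict sequence ID -> cluster ID for already clustered sequences.
    Returns the chosen cluster ID, or None if no hit passes the relaxed test.
    """
    candidates = {
        seq_id: score
        for seq_id, score in hits.items()
        if seq_id in cluster_of and score > relaxed_threshold
    }
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    return cluster_of[best]


cluster_of = {"s1": "X", "s2": "Y"}
hits = {"s1": 0.82, "s2": 0.88, "s9": 0.99}  # s9 is not in any cluster
print(assign_guest(hits, cluster_of, relaxed_threshold=0.80))
```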

Clusters of size 1 (that is, clusters which seem to identify infrequently expressed genes) are compared against the rest of the sequences in UniGene at a lower level of stringency, and merged with the cluster containing the most similar sequence.

The resulting clusters are compared with the preceding week's build and renumbered in an attempt to maintain continuity. Since the sequences which make up a cluster may change from week to week, and since a cluster identifier may disappear (typically when two clusters merge), using the cluster identifier as a reference is ill-advised. Using the GenBank (GB) accession numbers of the sequences which comprise the cluster is a safe alternative.
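The advice to reference clusters by accession number can be illustrated with a small lookup. The cluster identifiers and accession numbers below are placeholders, and the in-memory dictionary layout is an assumption rather than the format of the UniGene download files.

```python
def build_accession_index(clusters):
    """Map each accession to the cluster that currently contains it.

    clusters: dict cluster ID -> iterable of accession numbers.
    """
    return {acc: cid for cid, accs in clusters.items() for acc in accs}


# Placeholder data: two clusters in one build merge in the next build.
last_week = {"Hs.1": ["A00001", "A00002"], "Hs.2": ["A00003"]}
this_week = {"Hs.1": ["A00001", "A00002", "A00003"]}

old_index = build_accession_index(last_week)
new_index = build_accession_index(this_week)

# A stored cluster ID "Hs.2" no longer resolves, but the accession does.
print("Hs.2" in this_week)
print(old_index["A00003"], "->", new_index["A00003"])
```

Storing the accession and resolving it against each new build keeps a reference valid even after the cluster it pointed to has been merged or renumbered.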
