First BLAST Search

Bioinformatics Tutorial

Preview

In this section, you will use a FASTA sequence as an input (query) to BLAST, a program that searches a genomic database for similar sequences (hits). You will also learn how to judge whether a hit arises by chance or by common ancestry.

What proteins in humans are similar to the red opsin?

Now return to the NCBI Map Viewer. You will search the human genome for sequences similar to that of the red opsin.

Click the BLAST symbol (circled B) next to Homo sapiens (human).

This is the NCBI's BLAST search tool. BLAST is a widely used program for finding sequences similar to a "query" sequence that you're interested in. Pick these options from the various menus:

Database: Build Protein for PREVIOUS build (look at bottom of the Database menu). This means that you will search the protein sequences in the previous build of the database. (Sometimes not all tools needed later are available in the latest build, which is currently under construction.)
Program: BLASTP (Use the version of BLAST that compares protein sequences, unlike BLASTN, which compares nucleotide sequences.)
Other Parameters: Make no changes.

Next, copy the FASTA data from your file protred.txt to your clipboard, and paste it into the BLAST search box, above which it says, "Enter an accession..." Check to be sure that the first character in the box is the ">" at the beginning of the FASTA data. Then click Begin Search.
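The check described above (the pasted text must begin with ">") can be sketched in a few lines of Python. This is only an illustration, assuming a single-record FASTA string; the function name parse_fasta and the example record are hypothetical, and the sequence shown is made up rather than taken from protred.txt.

```python
def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    # Drop blank lines and surrounding whitespace.
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    # The first character of FASTA data must be '>'.
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA data must begin with '>'")
    header = lines[0][1:]            # description line without the '>'
    sequence = "".join(lines[1:])    # sequence lines joined into one string
    return header, sequence


# Made-up record for illustration only; not the real red opsin sequence.
example = ">example_protein hypothetical record\nMKTAYIAKQR\nQISFVKSHFS\n"
header, sequence = parse_fasta(example)
print(header)
print(len(sequence), "residues")
```

If the ">" is lost during copying, the function raises an error instead of silently treating the description line as sequence.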

The next page is for formatting your search results. Accept all default settings, and just click the View Report button. When your results are ready, the BLAST results page appears. Look down the page to the Graphic Summary, a box containing lots of colored lines. Each line represents a hit from your BLAST search. If you pass your mouse cursor over a red line, the narrow box just above the Graphic Summary gives a brief description of the hit.

You'll find that the first hit is your red opsin. That's encouraging, because the best match should be to the query sequence itself, and you got this sequence from that gene entry. The second hit is the green opsin; remember that the PubMed entry reported that the red and green pigments are the most similar. The third and fourth hits are the blue opsin and the rod-cell pigment rhodopsin. Other hits have fewer matching residues, and are color-coded according to their alignment scores.

If you click on any of the colored lines, you'll skip down to more information about that hit, and you can see how much similarity each one has to the red opsin, your original query sequence. As you go down the list, each succeeding sequence has less in common with red opsin. Each sequence is shown in comparison with red opsin in what is called a pairwise sequence alignment. Later, you'll make multiple sequence alignments from which you can discern relationships among genes.

See what you can work out about what the scores mean. Identities are residues that are identical in the hit and the query (red opsin), when the two are optimally aligned. Positives are residues that are identical or very similar to each other (see residue number 1 in the blue opsin: it's threonine in red opsin, and the very similar serine in the blue). Gaps are sometimes introduced into a hit or the query to improve their alignment. The more identities and positives, and the fewer gaps, the higher the score. Note that blue opsin and rhodopsin are only about 45% identical to the red opsin. Other proteins, which are apparently not visual pigments, have even lower scores.
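To make the three counts concrete, here is a minimal Python sketch that tallies identities, positives, and gaps for two already-aligned strings. It rests on assumptions: the set of "similar" pairs is a small hand-picked subset of the residue pairs that score above zero in BLOSUM62, not the full matrix, and the two short sequences are invented for illustration. BLAST counts identities among the positives, and the sketch does the same.

```python
# Illustrative subset of residue pairs that score above zero in BLOSUM62.
# A real tool would consult the full matrix; this set is an assumption.
SIMILAR_PAIRS = {frozenset(p) for p in ("ST", "IV", "IL", "LM", "DE", "KR", "FY")}


def alignment_stats(query, hit):
    """Count identities, positives, and gaps in a pairwise alignment.

    Both strings must already be aligned (same length, '-' marks a gap).
    """
    if len(query) != len(hit):
        raise ValueError("aligned strings must have equal length")
    identities = positives = gaps = 0
    for q, h in zip(query, hit):
        if q == "-" or h == "-":
            gaps += 1
        elif q == h:
            identities += 1
            positives += 1          # identical residues also count as positives
        elif frozenset((q, h)) in SIMILAR_PAIRS:
            positives += 1          # similar but not identical
    return identities, positives, gaps, len(query)


# Invented mini-alignment: T/S, L/I and E/D are similar pairs, '-' is a gap.
ident, pos, gaps, length = alignment_stats("TKLV-AE", "SKIVGAD")
print(f"Identities = {ident}/{length}, Positives = {pos}/{length}, Gaps = {gaps}/{length}")
```

The printed line mimics the layout of the summary BLAST shows above each pairwise alignment.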

Interlude: Expectation Values and BLAST Scores

The displays contain two prominent measures of the significance of the hit: 1) the BLAST Score [labeled Score (bits)], and 2) the Expectation Value (labeled Expect or E).

The BLAST Score indicates the quality of the best alignment between the query sequence and the found sequence (hit). The higher the score, the better the alignment. Scores are reduced by mismatches and gaps in the best alignment. Calculation of the score is complex, involving a substitution matrix, which is a table that assigns a score to each pair of residues aligned. The most widely used matrix for protein alignment is known as BLOSUM62.
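The scoring idea can be sketched as follows: add up the substitution-matrix score for each aligned pair, subtract a penalty for each gap, then rescale the raw total into bits. This is a simplified illustration, not BLAST's actual code. The matrix is a six-entry excerpt of BLOSUM62, the gap costs (11 to open, 1 per gapped position) are the usual BLASTP defaults, and the constants LAMBDA and K are values commonly quoted for BLOSUM62 with those gap costs; treat all of them as assumptions. The alignment itself is invented.

```python
import math

# Small excerpt of BLOSUM62; keys are alphabetically sorted residue pairs.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("K", "K"): 5, ("V", "V"): 4,
    ("S", "T"): 1, ("I", "L"): 2, ("D", "E"): 2,
}
GAP_OPEN = 11     # assumed cost charged once per gap
GAP_EXTEND = 1    # assumed cost charged for every gapped position
LAMBDA = 0.267    # assumed statistical parameters for gapped BLOSUM62
K = 0.041


def raw_score(query, hit):
    """Sum pair scores and gap penalties over an existing alignment."""
    if len(query) != len(hit):
        raise ValueError("aligned strings must have equal length")
    score = 0
    gap_side = None   # which sequence the currently open gap is in, if any
    for q, h in zip(query, hit):
        if q == "-" or h == "-":
            side = "query" if q == "-" else "hit"
            if side != gap_side:
                score -= GAP_OPEN      # a new gap starts here
            score -= GAP_EXTEND
            gap_side = side
        else:
            score += BLOSUM62_EXCERPT[tuple(sorted((q, h)))]
            gap_side = None
    return score


def bit_score(raw):
    """Rescale a raw score to bits: S' = (lambda * S - ln K) / ln 2."""
    return (LAMBDA * raw - math.log(K)) / math.log(2)


raw = raw_score("TKLV-AE", "SKIVGAD")
print("raw score:", raw)
print("bit score:", round(bit_score(raw), 2))
```

Note how a single one-residue gap costs 12 points here, more than the three identities in this toy alignment earn together; that is why gaps lower the score so sharply.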

The expectation value E of a hit tells whether the hit is likely to result from chance likeness between hit and query, or from common ancestry of hit and query. (If E is smaller than 10^-100, it is sometimes given as 0.0.) The expectation value is the number of hits with a score at least this good that you would expect to occur purely by chance if you searched for your sequence in a random genome the size of the human genome. E = 25 means that you could expect to find 25 such matches in a genome of this size, purely by chance. So a hit with E = 25 is probably a chance match, and does not imply that the hit sequence shares common ancestry with your search sequence. Expectation values of around 0.1 may or may not be biologically significant (other tests would be needed to decide). But very small values of E mean that the hit is biologically significant; that is, the correspondence between your search sequence and this hit must arise from common ancestry of the sequences, because the odds are simply too low that the match could arise by chance. For example, E = 10^-18 for a hit in the human genome means that you would expect only one chance match in one billion billion different genomes the same size as the human genome.
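The link between the bit score and E can be sketched with the standard approximation E = m × n × 2^(-S'), where S' is the bit score, m is the query length, and n is the total length of the database. The Python below is a rough illustration under stated assumptions: it ignores the length corrections that BLAST applies, and the query length (364) and database size (10 million residues) are hypothetical numbers chosen only to show the arithmetic.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """Approximate E = m * n * 2**(-S') for a hit with bit score S'."""
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_probability(e_value):
    """Probability of at least one chance hit this good: 1 - exp(-E)."""
    return -math.expm1(-e_value)


def bits_needed(e_value, query_length, database_length):
    """Bit score a hit must reach to have the given E in this search."""
    return math.log2(query_length * database_length / e_value)


QUERY_LENGTH = 364            # hypothetical query length in residues
DATABASE_LENGTH = 10_000_000  # hypothetical database size in residues

for bits in (20, 40, 100):
    e = expect_value(bits, QUERY_LENGTH, DATABASE_LENGTH)
    print(f"{bits:>3} bits -> E = {e:.3g}, P(chance hit) = {chance_probability(e):.3g}")

needed = bits_needed(1e-18, QUERY_LENGTH, DATABASE_LENGTH)
print(f"E = 1e-18 would need about {needed:.0f} bits in this hypothetical search")
```

Each additional bit of score halves E, which is why strong hits reach the vanishingly small expectation values described above, and why E also depends on how large the searched database is.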

The reason we believe that we all come from common ancestors is that massive sequence similarity in all organisms is simply too unlikely to be a chance occurrence. Any family of similar sequences across many organisms must have evolved from a common sequence in a remote ancestor.

One place to find out more about BLAST searches and statistics is The BLAST Sequence Analysis Tool in the NCBI Handbook.

Now you will see where all these hits are found on human chromosomes.

Where (in the human genome) are all the genes for these other proteins?

Just above the Graphic Summary, click Human Genome View.

You have come full circle. You are back at the human chromosome diagram, and you see all the hits of your search, in the colors that signify their BLAST scores as they were shown in the Graphic Summary. Notice that there are about 100 proteins that have 40% or more positives in alignment with red opsin. The opsins are members of the much larger family of G protein-coupled receptors, key players in signal transduction.
