Bioinformatics Tutorial

Summary

You have used these categories of tools in this tutorial:

  1. Databases like GenBank, UniProt, and PDB store sequence and structural data in the form of entries (each with a unique code), each of which corresponds to a single gene or its protein product. The databases provide extensive information about each entry, ranging from brief pop-up information, to links that submit the entry to various search and analysis tools (below), to encyclopedias of information about the entry, or to the results of automated searches in PubMed for publications related to the entry. Databases also provide sequences in formats (like FASTA) that serve as search queries in the same or other databases.
  2. Search Tools can be integral parts of databases, or stand-alone programs. Integral search tools allow you to search with keywords, with FASTA sequences, or with entry numbers from other databases. Stand-alone search tools like BLAST allow you to find sequences (hits) similar to sequences of interest to you (queries).
  3. Analysis Tools (example: PROSITE) use single sequences to determine properties or identify functions of genes and their products. Sequence comparison tools like ClustalW and T-Coffee perform multiple sequence alignments and produce phylogenetic trees, showing vividly how genes are related to each other. Consensus tree-building tools like Phylip and PhyML build trees based on many iterations of random sampling and alignment of the sequences being compared, thus reducing the possibility of bias from a single sequence alignment. Phylodendron lets you print trees to your liking, using tree data in Newick format from any tree-building tool.
  4. Modeling Tools like SWISS-MODEL provide, or assist you in building, homology models of proteins of unknown structure. The modeling program DeepView (also known as Swiss-PdbViewer) helps you to build homology models, as well as to study and judge the quality of all types of models (homology, X-ray, NMR). DeepView and SWISS-MODEL are integrated, so you can move back and forth between them at any point in a modeling project.
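
If you are curious what "tree data in Newick format" looks like, here is a minimal Python sketch. It is only an illustration: the tree, the labels (seqA, seqB, seqC), and the branch lengths are made up, and the helper name newick_leaf_names is hypothetical, not part of any tool named above. It handles only simple, unquoted labels.

```python
import re


def newick_leaf_names(newick):
    """Return the leaf labels of a simple Newick string, in order of appearance.

    A leaf label directly follows "(" or ","; labels of internal nodes
    follow ")" and are therefore skipped. Quoted labels are not handled.
    """
    candidates = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return [name.strip() for name in candidates if name.strip()]


# Made-up tree with made-up branch lengths, for illustration only.
# Parentheses group related sequences; the number after each ":" is a branch length.
example_tree = "((seqA:0.10,seqB:0.20):0.05,seqC:0.30);"

print(newick_leaf_names(example_tree))
```

In this made-up tree, seqA and seqB are grouped together inside the inner parentheses, and seqC joins them at the root. A tree-drawing tool reads exactly this kind of text to draw the branches.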

More Than Meets The Eye

All of the tools you have used here are much more complex and powerful, and require more judgement to use properly, than you might think from your use of them so far. You have only scratched their surfaces. For example, programs like BLAST and ClustalW have many settings that allow the user to control many aspects of the analysis. When you click a link to ClustalW and get a multiple-sequence alignment with no fuss, you have used default settings that might not be the best for your task. For serious scientific work, you need to visit sites that provide full implementations of search, alignment, and analysis tools, giving you full control of the task, but also requiring deeper understanding of the kind of analysis you are doing. This kind of knowledge is crucial to judging the quality of your results (an aspect in which this tutorial is very weak).

To learn more about specific tools, go directly to any network service, such as ExPASy or NCBI, that provides the tool you want to use. First, you will find links to extensive user manuals that tell how the analysis tools work. You might also find lists of frequently asked questions (FAQs) about the tool. Finally, you will find a direct link to a form for running the tool, in which you can make all settings, put in a query, and run the tool. The only trouble is that, as a beginner, you often do not know which settings to choose.

In my opinion, the best services for beginners are those that provide settings in pull-down menus that show you all of the allowed settings. As an example, go to EMBL-EBI, another great online service, and click Sequence Similarity and Analysis. In the left-hand column, under Sequence Analysis, click ClustalW2. The resulting form shows all of the ClustalW settings in the form of pull-down menus, so you don't have to know the possible settings and type them in. All allowed settings are displayed in the menus, so you can't go wrong. The settings shown when you arrive (called the defaults) are probably the same settings applied to your analysis when you clicked the quick link from your table of opsin entries at UniProt to get your ClustalW multiple-sequence alignment. In fact, if you go back to that page, you will see that the box at the top contains all of the FASTA files in sequence. If you want to see how other settings affect the analysis, you can paste this set of files, as one block of text, into the EMBL-EBI ClustalW form, play with settings, and get multiple-sequence alignments to your heart's delight. This is a great way to learn more about a tool that you want to use wisely. EMBL-EBI provides most of the common bioinformatics tools in this beginner-friendly kind of environment.
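
To make "one block of text" concrete, here is a minimal Python sketch that joins several FASTA entries into a single block you could paste into a form. The sequences and headers are made up (they are not real opsins), the function name combine_fasta is hypothetical, and wrapping lines at 60 characters is a common convention that I am assuming here, not a requirement stated by any of the tools.

```python
def combine_fasta(records, width=60):
    """Join (header, sequence) pairs into one multi-FASTA text block.

    Each entry starts with a ">" header line, followed by the sequence
    wrapped at `width` characters per line.
    """
    lines = []
    for header, sequence in records:
        lines.append(">" + header)
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Made-up entries, for illustration only.
example_records = [
    ("example_1 made-up sequence", "ACDEFGHIKLMNPQRSTVWY" * 4),
    ("example_2 made-up sequence", "MKTAYIAKQR"),
]

fasta_block = combine_fasta(example_records)
print(fasta_block)
```

Each ">" line marks the start of a new entry, which is how an alignment tool can tell where one sequence in the block ends and the next begins.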

Where Do You Go From Here?

Now you have had a very basic introduction to bioinformatics. With the tools you've tried out, you can explore the vast stores of genetic and structural information available on the Internet. Every page you have visited has many more links to other tools. You can figure out a lot just by visiting them and playing around, and there is usually plenty of built-in help. I hope this tutorial spurs you to learn more about how to use bioinformatics in your work.

For a more rigorous and systematic, yet readable and clear, survey of the full range of bioinformatics, get the latest edition of Bioinformatics for Dummies, by Claverie and Notredame, Wiley Publishing, Inc. It will help you learn to use the tools wisely, and to judge the reliability of your results. I recently bought the 2007 edition, and I have learned a lot of cool new stuff. The new edition helped immensely in updating this tutorial. It's the best thing I know of to take you further.
