It used to be that the only "database and access system" upon which biologists had to rely was the information published in printed journals. While this information resource is still of the utmost value, genome researchers must now be able to utilize electronic data stored in vast repositories spread out all over the world. Selected information retrieval systems will be discussed at length in the Integrated Information Retrieval section, while the submission of new or updated data to public data repositories will be covered in Submitting Data to Public Databases. In this section, we provide an introduction to basic Internet concepts and navigation tools, along with some pointers regarding where to look for data and services relevant to readers of this book. Be advised, though, that as the Internet expands, it also changes; in fact, the average half-life of an address on the Internet is about four years, meaning that some of the resources and addresses cited in this chapter may no longer be available or may no longer be located where they were at the time of this writing. All of the software mentioned in this chapter can be found on the Internet free of charge, and most of it is available for the Macintosh, PC, and UNIX platforms.
More detailed descriptions of and tutorials on the Internet may be found in a wide variety of publications available in many bookstores. Two useful, general reference books are The Whole Internet User's Guide & Catalog, 2nd ed. by Krol and Ferguson, and The Internet Roadmap by Falk. These and other similar titles will provide a more in-depth treatment of Internet issues such as communications protocols, networked applications, and connectivity.
The most familiar form of electronic communication to biologists is undoubtedly electronic mail, or E-mail. The popularity of E-mail lies in its convenience for sending, receiving, and replying to messages. Communication tends to be direct and to-the-point, and the recipient has the ability to assess whether a message requires an immediate reply (or any reply at all). Additional advantages of E-mail are its obvious speed over traditional postal mail, or "snail-mail", the ability to save and forward copies of messages in an orderly way, and the fact that it is a low-cost or no-cost alternative to postal mail. The major disadvantage of E-mail lies in security: as a message passes from node to node on its way to the intended recipient, there is the possibility that it could be read or intercepted by a systems administrator or someone else with similar access. In an academic environment, this is more than likely a non-issue, but in a corporate environment, E-mail systems may be treated as an "asset of the company" and be subject to monitoring. The advantages of E-mail far outweigh the disadvantages, though, and these advantages are what have made E-mail one of the primary forms of communication within the academic community.
In addition to its usefulness in sending messages to a single individual, E-mail can be used to communicate with large numbers of people all at once through what are termed newsgroups. The members of these newsgroups are able to obtain information and exchange ideas on an extremely wide range of topics of interest. Subscribing to these groups is as simple as sending an E-mail message containing the word subscribe to a given E-mail address. The BIOSCI newsgroups are among the most heavily subscribed forums for biologists; information on subscribing to individual BIOSCI newsgroups can be obtained by sending an E-mail message to biosci-server@net.bio.net, leaving the subject blank and placing the words info faq in the body of the message. A copy of the Frequently Asked Questions (FAQ) will then be returned in response.
Finally, E-mail can be used to perform computations, make predictions, or do database searches. By sending a formatted E-mail message to a remote computer (a server), the user can ask the remote computer to perform a defined operation and return the result, again via E-mail. One advantage of this type of system is that it frees the user from both developing and maintaining software, as those functions are performed by the persons maintaining the server. Disadvantages include the lack of real-time interactivity and the limitation to strictly text-based output. A list of E-mail servers of particular value to molecular biologists is presented in Table 1. In addition, an up-to-date list of E-mail servers is maintained by Amos Bairoch of the University of Geneva; this list can be obtained via anonymous file transfer protocol (described below) at expasy.hcuge.ch (directory /databases/info, file serv_ema.txt). With few exceptions, sending the message "help" (without quotation marks) to any E-mail server will return a detailed set of instructions for using that server. Practical examples on how to use E-mail to perform homology searches will be provided later in this chapter.
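Most mail clients can send such a query directly, but the exchange is simple enough to script as well. The following is a minimal, illustrative sketch using Python's standard library; the recipient address, sender address, and SMTP host shown are placeholders rather than servers named in this chapter, and a real server from Table 1 (or the BIOSCI address given above, with the words info faq as the body) would be substituted.

```python
# A minimal sketch of querying an E-mail server.
# The addresses and SMTP host below are placeholders for illustration only;
# substitute a real server from Table 1.
import smtplib
from email.message import EmailMessage

def send_server_command(recipient, command, sender, smtp_host="localhost"):
    """Send a one-line command (e.g. 'help') in the body of an E-mail message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = ""            # most servers ignore the subject line
    msg.set_content(command)       # the command goes in the message body
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)

# Ask a (placeholder) server for its instructions; the reply arrives by E-mail.
send_server_command("server@example.org", "help", "researcher@university.edu")
```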
While E-mail provides an excellent mechanism for transmitting messages, it is limited in its ability to transfer files. Even though most commercial mail packages have the ability to send files as an attachment to a message, it is not uncommon for the attached file to be unusable by the recipient: the file may be too big to be transferred, or it may be in a format that is unrecognizable by the recipient's computer.
A simpler and more efficient method of transferring files is through a file transfer protocol, or ftp. When using ftp, a connection is made between the user's computer and the remote computer, and this connection stays in place for the duration of the file transfer session. Making such a connection requires that the user have both an account and a password on the remote computer.
Alternatively, a user can perform what is termed anonymous ftp, which requires neither an account nor a password. This method is most often used in making public domain software freely available and accessible. Often, announcements of newly available software distributed through this mechanism are made over E-mail newsgroups, as discussed above. Under this method, when prompted for a username, the user replies with the word anonymous to signify that an anonymous ftp session is being requested; when prompted for a password, the user then provides their E-mail address. Supplying the E-mail address allows the systems administrator at the remote site to maintain access statistics of use to those providing the public domain software. Once granted access, a user can navigate through the public directories and download any software of interest.
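The same sequence of steps can be scripted. The sketch below, intended only as an illustration, uses Python's standard ftplib and follows the convention just described; the host name, directory, and file name are placeholders (real sites are listed in Table 2).

```python
# Anonymous ftp in outline: log in as "anonymous", give an E-mail address as
# the password, move to a public directory, and download a file.
# Host, directory, and file names are placeholders for illustration only.
from ftplib import FTP

with FTP("ftp.example.org") as ftp:
    ftp.login(user="anonymous", passwd="researcher@university.edu")
    ftp.cwd("/pub/software")                  # navigate the public directories
    with open("program.tar.Z", "wb") as out:  # save the file locally
        ftp.retrbinary("RETR program.tar.Z", out.write)
```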
While ftp takes place in the context of a UNIX environment, programs are available that provide a graphical front-end, allowing users to point-and-click their way through directories rather than issue UNIX commands. The most popular ftp program with a graphical user interface (GUI) for the Macintosh is called Fetch, and similar programs are also available for the PC environment. If a user knows the name of some public domain software of interest but does not know where it can be downloaded from, an E-mail-based search engine named ARCHIE (archie@archie.rutgers.edu) can be used to locate ftp sites containing that file. Related programs (XARCHIE for UNIX and ANARCHIE for the Mac) can both perform the search and download the file in a single operation. A selected list of relevant ftp sites is given in Table 2. An example of how to perform an ftp download can be obtained by sending an E-mail message to info@sunsite.unc.edu, with the word help in the body of the message.
While ftp allows for documents or programs to be easily disseminated, it requires that a user actually download a file to examine its contents. One of the first-generation Internet tools, Gopher, addressed the need to distribute text documents to users without requiring an actual download. Gopher was developed at the University of Minnesota (hence its name, after the school mascot) and is a good example of what is termed a "distributed document delivery system". Gopher also falls into the client-server class of applications, as its use requires connection to a remote computer and is interactive.
One of the key features of Gopher is that it provides for relatively effortless travel around the Internet. The information stored at Gopher sites is organized in a series of hierarchical menus, and movement through these menus is accomplished either through the use of arrow keys or by clicking a mouse. Movement through this hierarchy is not restricted to a single site; users can traverse the Internet, visiting other Gopher sites that are interconnected through "Gopher holes". This is one of Gopher's strengths, as it does not require a user to know the exact location of the information being sought; users simply follow the hierarchy until the desired information is found.
As with E-mail and ftp, a short list of relevant Gopher sites is provided in Table 3. For an extensive manual on Gopher developed by the University of Minnesota, ftp to boombox.micro.umn.edu (directory /pub/gopher/docs/).
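Underlying this menu system is a very simple protocol: the client opens a connection to the Gopher server (conventionally on port 70), sends the selector string for the desired menu or document followed by a carriage return and line feed, and reads back either the document itself or a tab-delimited menu. The sketch below, offered purely as an illustration, performs this exchange with Python's socket module; the host name is a placeholder.

```python
# A bare-bones Gopher request: send a selector string, read back the
# tab-delimited menu or the document text. The host name is a placeholder.
import socket

def gopher_fetch(host, selector="", port=70):
    """Retrieve one Gopher menu or document as raw text."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(selector.encode("ascii") + b"\r\n")   # request the item
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:                                    # server closes when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("latin-1")

# An empty selector requests the top-level menu of the (placeholder) site.
print(gopher_fetch("gopher.example.org"))
```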
The logical next step in the development of Internet tools was to provide an interface through which information could be accessed directly, along with the ability to present non-text-based media such as images, sounds, and video. This need led to the development of the World Wide Web, a concept originated at CERN, the European Organization for Nuclear Research.
While the programs used to traverse "the Web" are client-server applications, as are Gopher programs, the similarities in the overall presentation of information between Gopher and the Web end soon thereafter. As already alluded to, the information distributed on the Web is not strictly text-based; it can include images, produce sounds, and play back video and other types of animation. Navigation on the Web is accomplished by clicking on specific text, buttons, or pictures within a document: hyperlinks. When clicked, these hyperlinks transport the user to another Web location, whether at the same site or across the globe. Locations on the Web are called Web sites, and the individual files that are stored and displayed at these sites are called Web pages. Through a process that has been nicknamed "Web-surfing", users can follow hyperlinks from page to page until information of interest is found; the process here differs from Gopher in that the links are not organized in a rigid hierarchy.
In addition, users can access a specific site directly by typing in its address. Web addresses are also called uniform resource locators, or URLs. The "uniform" part of URL refers to the fact that the software used to look at documents on the Web (the browser) is capable of visiting not only Web sites but also Gopher and ftp sites, provided the appropriate protocol is specified. To accomplish this, a uniform method for specifying sites was introduced (the URL) that indicates to the Web browser both the address of the remote site and what type of site it is. URLs take the general form protocol://somewhere.domain, where protocol specifies the type of site and somewhere.domain specifies the remote location. Examples of URLs are as follows:
| Site type | Example URL |
|---|---|
| ftp site | ftp://ftp.bio.indiana.edu |
| Gopher site | gopher://hobbes.jax.org |
| Web site | http://www.ncbi.nlm.nih.gov |
The http:// in the Web address stands for hypertext transfer protocol, the method used to transfer Web files from the server to the client.
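These components can also be picked apart programmatically. The short Python sketch below, intended only as an illustration, splits each of the example URLs above into its protocol and location using the standard urllib module.

```python
# Splitting URLs into protocol and location, using the examples above.
from urllib.parse import urlparse

for url in ("ftp://ftp.bio.indiana.edu",
            "gopher://hobbes.jax.org",
            "http://www.ncbi.nlm.nih.gov"):
    parts = urlparse(url)
    # parts.scheme is the protocol; parts.netloc is somewhere.domain
    print(f"{parts.scheme:>8} -> {parts.netloc}")
```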
As mentioned above, browsers are used to look at documents on the Web. These browsers interpret the code underlying a Web page (the hypertext markup language, or HTML) and display it in the correct format, regardless of whether the user is on a Macintosh, PC, or UNIX system. By far, the most widely-used browser software is Netscape Navigator, with estimates by two different market research firms placing Netscape's share of the Web browser market at 75-85%. Netscape became the de facto standard by offering users an interface that displayed Web pages much faster than previously available products, an important factor for users connecting through dial-up lines or commercial Internet providers. Netscape also provided special features enabling Web developers to take advantage of HTML enhancements not available in other browsers. Other browsers available include the America Online browser and Microsoft's Internet Explorer, both of which are estimated to have about a 10% market share based on the same surveys.
While the most prevalent way of finding information on the Web is by word-of-mouth, including compilations such as the tables that accompany this text, users may also consult organized lists of Web sites, known as virtual libraries, in order to locate sites most likely to have the information that is sought. Three popular virtual libraries of interest to biologists are the WWW Virtual Library, maintained by Keith Robison at Harvard; Pedro's Biomolecular Research Tools, compiled by Pedro Coutinho at Iowa State; and the EBI BioCatalog, a collaborative project based at the European Bioinformatics Institute. The addresses for these and other useful Web sites are given in Table 4.
A Web surfer can also find Web sites of interest by using special programs called search engines. These search engines use a variety of methods to perform either keyword or full-text searches across the Web, returning a hyperlinked list of results which the user can then scan and click on to visit any or all of the found sites. Since each search engine uses a different method to search the Web, and in some cases searches only a subset of the Web, the resulting hit lists can vary tremendously. Consider two searches which we performed using three different search engines:
| Search Engine | "human genome" | "positional cloning" |
|---|---|---|
| Web Crawler | 752 | 16 |
| Infoseek | 1252 | 44 |
| Inktomi | 18713 | 794 |
These results should not be interpreted as "more is better", since a single page could conceivably produce multiple hits if the phrase appears more than once, or the search parameters may be loose in the sense that the algorithm allows the words to be slightly separated within the same document. Rather than perform search after search using each different engine, meta-search engines have been developed that automatically poll several search engines, collect the results, filter out any duplicates, and return a single hit list to the user. While these searches necessarily take longer to perform, a user can have more confidence in having found most, if not all, of the sites that fit a given search query. Two such meta-search engines are SavvySearch (http://guaraldi.cs.colostate.edu:2000/form) and MetaCrawler (http://www.metacrawler.com/).
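The logic of a meta-search can be sketched in a few lines of code. The outline below is illustrative only: the two query functions stand in for calls to real search engines (each of which has its own query interface), and the merging step simply removes duplicate URLs before returning a single hit list.

```python
# An illustrative outline of a meta-search: poll several engines, pool the
# results, and drop duplicate URLs. The engine functions are stand-ins that
# return made-up (title, url) pairs rather than real search results.
def query_engine_a(terms):
    return [("Human Genome Project", "http://site-one.example/genome"),
            ("Positional cloning primer", "http://site-two.example/cloning")]

def query_engine_b(terms):
    return [("Human Genome Project", "http://site-one.example/genome"),
            ("Genome centre listing", "http://site-three.example/centres")]

def meta_search(terms, engines):
    seen = set()
    merged = []
    for engine in engines:                 # poll each engine in turn
        for title, url in engine(terms):
            if url not in seen:            # filter out duplicate hits
                seen.add(url)
                merged.append((title, url))
    return merged                          # a single, de-duplicated hit list

for title, url in meta_search("human genome", [query_engine_a, query_engine_b]):
    print(title, "->", url)
```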
Given the increasing reliance of scientists on the Internet for communication and the change in the nature of experimental data, many scientific journals are now establishing a World Wide Web presence. The advantages of having traditional print journals available on the Web are many.
Amongst the journals that have established Web sites are Science, Nature, Cell, and The Journal of Biological Chemistry. The content of these journal Web sites varies: Cell, for example, presents tables of contents and abstracts for all articles, while The Journal of Biological Chemistry presents full text and figures for all published papers, as well as links to the relevant sequence and bibliographic databases. The URLs for these and other journal sites are provided in Table 4.
Specialized client-server applications
While Gopher and Netscape are excellent examples of Internet navigation tools that operate as client-server programs, their universality may also be a limitation under certain conditions. As such, more powerful client-server systems have been developed to take advantage of scientific knowledge and data interrelationships in specialized fields. One such system is Entrez, an integrated information retrieval system designed to seamlessly traverse sequence, structure, genomic, and bibliographic databases. (Entrez will be discussed in depth in the following section on Integrated Information Retrieval.)