"This article is reprinted with permission from the InterNIC News, published by the InterNIC. This newsletter and its contents may not be sold for profit or incorporated in commercial documents without the written permission of the copyright holder. This material is based on work sponsored by the National Science Foundation under Cooperative Agreement #NCR-9218742. The Government has certain rights in this material."
These powerful search capabilities are necessary in order to retrieve just the requested items. The learning curve for these tools can be steep. Since users must pay for the use of these indexes, the librarian cannot just sit down at the terminal and "play around" with different searches. He or she must rigorously construct a search strategy before ever going online. Inefficiency is far too costly when using these services.
Web searching is becoming similar to proprietary data retrieval services in that users are trying to filter through terabytes of data in order to find just what they want. However, because Internet search indexes are free, users tend to take a fairly cavalier attitude about using them, seldom taking the time to learn their features. This kind of searching may return useful results, but it may also return a frustrating mass of irrelevant information.
In this article we will define the components of automatic search indexes, discuss procedures for making the most effective use of them, explain some basic search features that all search indexes should (but do not) explicitly contain, and identify which indexes are the best from the point of view of those search features. These features are summarized in a table at the end of this column.
As we discussed in an earlier column (June 1996), automated search indexes aren't necessarily the most effective way to find useful information. Someone who has already sifted through that information can offer the most precise searching pointers. But search indexes are among the most popular sites on the net, indicating that users have a need to seek out information on their own. So let's try to make some sense of how users might best use these indexes.
First, it's important to make the distinction between an automated search index and a web directory. Automated search indexes consist of three components: a "robot" of some sort that automatically collects links, titles, and text from Internet sites; a database where the resource information is stored; and a search engine which allows the user to query the database for sites. Most search indexes have added a browsable subject directory of some sort, but these sites are still primarily used to search, not browse, the net. All of the indexes collect large numbers of links, and this can be both an advantage and a disadvantage. The advantage is that everything on the Web is waiting for you to find it. The disadvantage is that you have to know how to find it.
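A minimal Python sketch of these three components may make the division of labor concrete. The page data here is invented for illustration; a real robot would gather it by crawling:

```python
# Toy illustration of a search index's three parts: a "robot" that gathers
# pages, a database (here a dict-based inverted index), and a search engine
# that queries it. The pages below are invented sample data.
pages = {
    "http://example.org/teens": "teenage drinking and alcohol abuse",
    "http://example.org/library": "searching a library card catalog",
}

def build_index(collected_pages):
    """Database component: map each word to the set of URLs containing it."""
    index = {}
    for url, text in collected_pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, term):
    """Search-engine component: return every URL whose text has the term."""
    return index.get(term.lower(), set())

index = build_index(pages)        # the "robot" would normally fill `pages`
print(search(index, "alcohol"))   # -> {'http://example.org/teens'}
```

The sketch also shows why such indexes collect everything: every word of every page is indexed, with no human judgment about which pages matter.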
Subject directories like Yahoo, Magellan, Galaxy, or Point, although they can be searched, are primarily categorizations of Internet resources. They are meant to be browsed through, just as you would browse the shelves of a library. Subject directories will be the focus of next month's column.
Here we will discuss basic search index features. Eight of the more popular and powerful search indexes will be compared in terms of their support of these features. Seven have been in existence for some time, and the eighth is a new product that is in beta testing (that is, it is available to the public but is not yet in final form). They are:
Alta Vista, OpenText, WebCrawler, Infoseek, Excite, Lycos, HotBot, and the new Infoseek Ultra.
Proprietary search engine workbooks suggest making a worksheet that connects the concepts you want to use before you start. For example, for information on teenage alcoholism, the two concepts to examine are:
   teenage AND alcoholism

However, there are more than just two terms for these concepts. Think about what they might be.

   teenage                        alcoholism
   OR adolescents            AND  OR alcohol abuse
   OR secondary school            OR alcoholic beverages
      students                    OR drinking
   OR youth

Then combine the queries:

   (teenage OR adolescents OR secondary school students OR youth) AND (alcoholism OR alcohol abuse OR alcoholic beverages OR drinking)
This (or a variation of it) allows you to use as many terms as possible to search for your concepts. Once you have done this, return to the computer. Now you'll want to know which search indexes can handle your query most effectively.
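As a sketch, the worksheet step can even be automated. The `combine` function below is our own illustration, not a feature of any search index; it joins each concept's synonyms with OR and the concept groups with AND:

```python
def combine(*concepts):
    """Join each concept's synonym list with OR, then the groups with AND."""
    groups = ["(" + " OR ".join(terms) + ")" for terms in concepts]
    return " AND ".join(groups)

# The two concepts from the teenage-alcoholism worksheet above.
query = combine(
    ["teenage", "adolescents", "secondary school students", "youth"],
    ["alcoholism", "alcohol abuse", "alcoholic beverages", "drinking"],
)
print(query)
```

The printed query is exactly the combined form shown above, ready to be adapted to whatever syntax a particular index requires.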
Since automated search indexes cover so many sites, they must contain query features that allow you to retrieve exactly the information you need. If an index contains 100 items about teenage alcoholism, ideally your query should retrieve those 100 items. You would then have everything in the database that relates to your query. A query's effectiveness in this regard is its "recall." While you want a search to deliver high recall, you also want all retrieved items to be specifically about teenage alcoholism; you don't want 100 retrievals, of which only 15 are about teenage alcoholism. A query's success in this sense is its "precision." Ideally, a query should return high recall and high precision. However, it is less frustrating to achieve high precision with less recall than to receive hundreds (or thousands) of sites, many of which may be only loosely (or not at all) connected to your query.
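The two measures can be made concrete with a short calculation, using the hypothetical numbers from the example above:

```python
def precision(retrieved_relevant, retrieved_total):
    """Fraction of the retrieved items that are actually relevant."""
    return retrieved_relevant / retrieved_total

def recall(retrieved_relevant, relevant_total):
    """Fraction of all relevant items in the database that were retrieved."""
    return retrieved_relevant / relevant_total

# Hypothetical figures: the database holds 100 relevant items; a poor query
# returns 100 results, of which only 15 are relevant.
print(precision(15, 100))  # 0.15 -- low precision
print(recall(15, 100))     # 0.15 -- low recall as well
```

A query that instead returned 20 results, all relevant, would have precision 1.0 and recall 0.2; by the argument above, many searchers would prefer that trade.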
Here is where the syntactical tools of the trade, the search features that each index allows you to use, become crucial. There are several of these features, of which we will discuss only a few. Judging the indexes by their provision of these features is one of the most important ways to analyze which index is the best for you. Remember, whether there are 250,000, 30 million, or 50 million items behind the curtain of the search index, you need to be able to retrieve just the items of use to you.
What really sets Alta Vista, Open Text, and possibly Ultra apart are their field search capabilities. The producers of these indexes realize that it is crucially important not only to provide millions of web pages, but also to give the end user the tools to achieve precise retrieval.
Of course, you might not agree with these "best" picks; no one search index is right for every user. The point is to find an index that is comfortable for you and that provides you with the best results. Hopefully, Table 1 will help you do that.
There are inherent problems with all these search indexes. Because they cannot discriminate between pages at the same site, using them can become the equivalent of searching a card catalog for "George Washington" and retrieving citations for every page of every book on which that name appears. Sites are often mirrored, and the index can return numerous duplicates. There is also the ever-present problem of quality. Even when you have found the indexes of choice for you, spent time learning their syntax, and sent queries that return a manageable number of results that appear relevant to your query, how do you know if the sites retrieved are good sites? Information quality in the Internet environment will be discussed in a later column.
For more information on automated search indexes, along with other ways to search the Internet, see the Scout Toolkit.
Form Based:
The feature is present but only through menu picks on a form-based interface. For field searches that are form based, the searchable fields are listed below the words "Form Based."
"...":
Type terms between double quotation marks.
NEAR/N:
This is a proximity operator in which the user specifies the maximum number of words that may separate one term from another. NEAR/10 would mean within 10 words.
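As an illustration, the check a NEAR/N operator performs can be sketched in a few lines of Python. Real engines' tokenization and indexing are more involved; this function and its sample text are ours:

```python
def near(text, term_a, term_b, n):
    """True if term_a and term_b occur within n words of each other."""
    words = text.lower().split()
    positions_a = [i for i, w in enumerate(words) if w == term_a]
    positions_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(i - j) <= n for i in positions_a for j in positions_b)

# Invented sample text: "teenage" and "drinking" are three words apart.
text = "studies of teenage patterns of drinking in secondary schools"
print(near(text, "teenage", "drinking", 10))  # True at NEAR/10
print(near(text, "teenage", "drinking", 2))   # False at NEAR/2
```

Smaller values of N thus trade recall for precision, just as the worksheet strategy above does with Boolean operators.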
Table 1. Search feature support in eight search indexes.

The Best

Feature        Alta Vista         OpenText           WebCrawler
------------   ----------------   ----------------   --------------------
Boolean AND    AND                Form Based         AND or ampersand
Boolean OR     OR or |            Form Based         OR
Boolean NOT    AND NOT or !       Form Based         NOT
Phrase         "..."              Form Based         "..."
Proximity      NEAR (10 words)    Form Based         ADJ; NEAR/N
                                                     (user specifies N)
Truncation     *                  *                  NA
Field Search   anchor:terms       Form Based         NA
               applet:terms       (Summary, Title,
               host:terms         1st Heading, URL)
               image:terms
               link:terms
               text:terms
               title:terms
               url:terms

-------------------------------------------------

The Rest (in no particular order)

Feature        Infoseek           Excite     Lycos        HotBot
------------   ----------------   --------   ----------   ------------
Boolean AND    +                  AND        Form Based   Form Based
Boolean OR     Type terms         OR         Form Based   Form Based
Boolean NOT    -                  AND NOT    Form Based   Form Based
Phrase         "..."              NA         NA           Form Based
Proximity      term-term          NA         NA           NA
               (adjacent);
               [term-term]
               (100 words)
Truncation     NA                 NA         NA           NA
Field Search   NA                 NA         NA           Form Based
                                                          (Media Type,
                                                          Location, URL)

-------------------------------------------------

And a new one that shows promise.

Feature        Infoseek Ultra
------------   ----------------
Boolean AND    +
Boolean OR     Type terms
Boolean NOT    -
Phrase         "..."
Proximity      NA
Truncation     NA
Field Search   link:terms
               site:terms
               url:terms
               title:terms
Ironically, one of the most difficult things about using the Internet for research is finding the information you need. In Part I we discussed automated search indexes, one way of finding Internet information. However, as we will see, there are also Internet search guides that are maintained by hand. The subject directories and hierarchies people maintain are, for all their shortcomings, more useful to users (especially new users) who are asking the question: "What can I find on the Internet about history, or economics, or women's studies, or medicine?" or any of hundreds of other subjects.
Automated search indexes are poor at answering these questions because they provide little organization or structure in the results they spit back in response to a query. Each search result receives a relevance score, but that is no substitute for organization and structure. The structure and organization of resources which have always helped traditional library users are available in other kinds of Internet search tools, which we call subject directories.
This month's column will be devoted to a discussion of subject directories. Subject directories are categorizations of Internet resources, and are meant to be browsed, although most can also be searched. As we discussed last month, search indexes are collections of Internet links, built by "spider" programs that automatically deposit links in a searchable database. Subject directories, on the other hand, are produced and maintained by people, and resources are collected by either resource-owner submission or selection by librarians, editors, or subject specialists.
Most of these directories contain search interfaces, but they are often more rudimentary than the ones discussed in last month's column, serving instead as a gateway to a subject hierarchy which the user can browse for information about a topic.
The main difference between subject directories and search indexes is the level of human intervention in the creation of the directory. It is this human intervention that filters and classifies resources so that busy researchers can quickly find what is of use to them, rather than searching every page of hundreds of thousands of sites. These directories (except for the very largest ones) contain far fewer resources than search indexes. However, this can actually be advantageous to the user. There is much less "chaff" to cut through to obtain the "wheat."
As with all things human, each directory is unique, with its own set of advantages and disadvantages. Which one is best for you is a personal preference, but we will point out some of the better ones.
We will categorize subject directories by the amount of human intervention. The categories are subject catalogs, annotated directories, and subject guides.
A "subject catalog" is very much like a library subject card catalog. Users look in the catalog under the subject heading that they are interested in and find resources.
An "annotated directory" has resources listed in a subject hierarchy, but each resource is further analyzed by an editor, librarian, or subject specialist. It is then annotated to give the user a more detailed idea of what the resource is, and, in some cases, rated based on an established set of criteria.
A "subject guide" contains a still deeper level of human analysis, in that a person or persons (editors, librarians, or subject experts) have filtered resources in a single subject and created a guide (sometimes annotated) to that subject. Implicit in the notion of a guide is that its resources will be of high quality because of the amount of filtering and the level of expertise of its author. Having a set of these guides at one site would give users the highest level of filtering and analysis, and thus the highest quality resources.
Eight directories that fall into these categories are old Internet veterans, well established and respected. The directories and their features are presented in Table 2 below for your convenience. We will not discuss the intricacies of using their search engines. Interested users should use last month's column as a guide. We will discuss certain features of the directories to help users analyze which ones are most applicable to them.
These features should help you to determine the amount of filtering and quality analysis that has taken place in each directory. Some of the features you should look for in a directory are:
   - whether the directory can be searched
   - whether sites are selectively chosen for inclusion (site discrimination)
   - whether sites are rated, and if so, under what rating system and criteria
   - whether sites are annotated, and who writes the annotations

These are the features compared for each directory in Table 2 below.
McKinley's Magellan is the best annotated directory because of both the number of annotated sites, and the level of annotation of each site.
While Yahoo is the most comprehensive subject catalog, it takes almost anything submitted and puts it into a hierarchy that is difficult to navigate without prior searching. It straddles the line between subject directory and search index, and many people use it both ways. A better, although much less comprehensive subject catalog is the Bulletin Board for Libraries (BUBL). Its producers provide the catalog in both Universal Decimal Classification and alphabetic subject format. Its selectors are librarians, and while this does not guarantee excellence, it does guarantee that people whose job it is to select and categorize information are doing that job.
You may not agree with these picks, or may feel there are better subject directories on the Internet than the ones discussed here. The point is to find the directory that is best for you, that consistently provides you with the best resources, and then use it. This quick comparison will show you that these directories, because they have different strengths, can be used in combination to provide better results. Yahoo, Galaxy, and the Internet Directory of Directories contain lots of resources but little filtering. Magellan and the Lycos Top 5% give high ratings to very different kinds of resources. Argus Clearinghouse and W3C Virtual Library produce entire guides on single subjects. The important thing is to know what you're looking at when you look at a subject directory.
As with search indexes, subject directories have inherent problems. The above-mentioned problem of arbitrary and uncontrolled hierarchies is the biggest. It is sometimes difficult to determine who puts resources where in the subject hierarchy--the resource submitters or the owners of the directory.
Selecting or not selecting a resource, rating it, and annotating it are very subjective processes. The fact that Magellan gives a site 28 points out of 30 ("four stars") does not guarantee the site is a quality site for every user. That determination must be made by the user.
However, the fact that resources have been categorized, and in some cases selected, rated, and annotated, means that users are likely to find more quality resources in these directories than by searching an automated index. Which directory contains the most quality resources? Which contains the highest quality resources? That is for the user to determine. Users must determine quality much more on the Internet than in other avenues of publication because the filters that have long existed in those avenues do not exist at this time on the Internet. This is, of course, good and bad. It is good in the sense that the Internet can be a publishing avenue for information that doesn't make it through publishing filters. It is bad in the sense that those publishing filters have long been perceived as quality filters as well. The Internet has been criticized for having a low quality of information. How does the user determine the quality of an information resource in a networked environment? We turn next to that question.
For more information on subject directories, see the Scout Toolkit.
Table 2. A comparison of filtering features for eight Internet subject directories.

SUBJECT CATALOGS

Yahoo!
   Search: Y    Site Discrimination: N    Site Ratings: Y
   Rating System: Glasses icon
   Rating Criteria: Presentation/Content
   Site Annotation: Brief    Who Annotates: Submitters

BUBL
   Search: Y    Site Discrimination: Y    Site Ratings: N
   Rating System: N/A
   Rating Criteria: N/A
   Site Annotation: Y    Who Annotates: Librarians

Galaxy
   Search: Y    Site Discrimination: N    Site Ratings: Y
   Rating System: N
   Rating Criteria: N/A
   Site Annotation: N/A    Who Annotates: N/A

ANNOTATED DIRECTORIES

Magellan
   Search: Y    Site Discrimination: Y    Site Ratings: Y
   Rating System: 30 pts.
   Rating Criteria: Content depth; Organization; Net appeal
   Site Annotation: Y    Who Annotates: Editors

Lycos (Top 5%)
   Search: N    Site Discrimination: Y    Site Ratings: Y
   Rating System: 50 pts.
   Rating Criteria: Content; Presentation; Experience
   Site Annotation: Y    Who Annotates: Editors

InterNIC Directory of Directories
   Search: Y    Site Discrimination: N    Site Ratings: N
   Rating System: N/A
   Rating Criteria: N/A
   Site Annotation: Y    Who Annotates: Submitters

SUBJECT GUIDES

Argus Clearinghouse
   Search: Y    Site Discrimination: Y    Site Ratings: Y
   Rating System: 1-5 check marks
   Rating Criteria: Level of resource description; Level of resource
      evolution; Guide design; Guide organization; Guide meta-information
   Site Annotation: Varies    Who Annotates: Guide maintainer

W3C Virtual Library
   Search: N    Site Discrimination: Y    Site Ratings: N
   Rating System: N/A
   Rating Criteria: N/A
   Site Annotation: Varies    Who Annotates: Guide maintainer

Key: Y = Yes    N = No    N/A = Not Applicable
Directories that list "Y" under annotation do not necessarily annotate every site in the directory.
Note that the Argus Clearinghouse rating system rates the guides, not the individual resources within the guides.