"This article is reprinted with permission from the InterNIC News, published by the InterNIC. This newsletter and its contents may not be sold for profit or incorporated in commercial documents without the written permission of the copyright holder. This material is based on work sponsored by the National Science Foundation under Cooperative Agreement #NCR-9218742. The Government has certain rights in this material."
These powerful search capabilities are necessary in order to retrieve just the requested items. The learning curve for these tools can be steep. Since users must pay for the use of these indexes, the librarian cannot just sit down at the terminal and "play around" with different searches. He or she must rigorously construct a search strategy before ever going online. Inefficiency is far too costly when using these services.
Web searching is becoming similar to proprietary data retrieval services in that users are trying to filter through terabytes of data in order to find just what they want. However, because Internet search indexes are free, users tend to take a fairly cavalier attitude about using them, seldom taking the time to learn their features. This kind of searching may return useful results, but it may also return a frustrating mass of irrelevant information.
In this article we will define the components of automatic search indexes, discuss procedures for making the most effective use of them, explain some basic search features that all search indexes should (but do not) explicitly contain, and identify which indexes are the best from the point of view of those search features. These features are summarized in a table at the end of this column.
As we discussed in an earlier column (June 1996), automated search indexes aren't necessarily the most effective way to find useful information. Someone who has already sifted through that information can offer the most precise searching pointers. But search indexes are among the most popular sites on the net, indicating that users have a need to seek out information on their own. So let's try to make some sense of how users might best use these indexes.
First, it's important to make the distinction between an automated search index and a web directory. Automated search indexes consist of three components: a "robot" of some sort that automatically collects links, titles, and text from Internet sites; a database where the resource information is stored; and a search engine which allows the user to query the database for sites. Most search indexes have added a browsable subject directory of some sort, but these sites are still primarily used to search, not browse, the net. All of the indexes collect large numbers of links, and this can be both an advantage and a disadvantage. The advantage is that everything on the Web is waiting for you to find it. The disadvantage is that you have to know how to find it.
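A minimal Python sketch of these three components may make the division of labor concrete. The page data here is invented for illustration; a real robot would gather it by crawling:

```python
# Toy illustration of a search index's three parts: a "robot" that gathers
# pages, a database (here a dict-based inverted index), and a search engine
# that queries it. The pages below are invented sample data.
pages = {
    "http://example.org/teens": "teenage drinking and alcohol abuse",
    "http://example.org/library": "searching a library card catalog",
}

def build_index(collected_pages):
    """Database component: map each word to the set of URLs containing it."""
    index = {}
    for url, text in collected_pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, term):
    """Search-engine component: return every URL whose text has the term."""
    return index.get(term.lower(), set())

index = build_index(pages)        # the "robot" would normally fill `pages`
print(search(index, "alcohol"))   # -> {'http://example.org/teens'}
```

The sketch also shows why such indexes collect everything: every word of every page is indexed, with no human judgment about which pages matter.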
Subject directories like Yahoo, Magellan, Galaxy, or Point, although they can be searched, are primarily categorizations of Internet resources. They are meant to be browsed through, just as you would browse the shelves of a library. Subject directories will be the focus of next month's column.
Here we will discuss basic search index features. Eight of the more popular and powerful search indexes will be compared in terms of their support of these features. Seven have been in existence for some time, and the eighth is a new product that is in beta testing (that is, it is available to the public but is not yet in final form). They are:
Alta Vista, OpenText, WebCrawler, Infoseek, Excite, Lycos, HotBot, and the new Infoseek Ultra.
Proprietary search engine workbooks suggest making a worksheet that connects the concepts you want to use before you start. For example, for information on teenage alcoholism, the two concepts to examine are:
   teenage AND alcoholism

However, there are more than just two terms for these concepts. Think about what they might be.

   teenage                        alcoholism
   OR adolescents            AND  OR alcohol abuse
   OR secondary school            OR alcoholic beverages
      students                    OR drinking
   OR youth

Then combine the queries:

   (teenage OR adolescents OR secondary school students OR youth) AND (alcoholism OR alcohol abuse OR alcoholic beverages OR drinking)
This (or a variation of it) allows you to use as many terms as possible to search for your concepts. Once you have done this, return to the computer. Now you'll want to know which search indexes can handle your query most effectively.
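As a sketch, the worksheet step can even be automated. The `combine` function below is our own illustration, not a feature of any search index; it joins each concept's synonyms with OR and the concept groups with AND:

```python
def combine(*concepts):
    """Join each concept's synonym list with OR, then the groups with AND."""
    groups = ["(" + " OR ".join(terms) + ")" for terms in concepts]
    return " AND ".join(groups)

# The two concepts from the teenage-alcoholism worksheet above.
query = combine(
    ["teenage", "adolescents", "secondary school students", "youth"],
    ["alcoholism", "alcohol abuse", "alcoholic beverages", "drinking"],
)
print(query)
```

The printed query is exactly the combined form shown above, ready to be adapted to whatever syntax a particular index requires.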
Since automated search indexes cover so many sites, they must contain query features that allow you to retrieve exactly the information you need. If an index contains 100 items about teenage alcoholism, ideally your query should retrieve those 100 items. You would then have everything in the database that relates to your query. A query's effectiveness in this regard is its "recall." While you want a search to deliver high recall, you also want all retrieved items to be specifically about teenage alcoholism; you don't want 100 retrievals, of which only 15 are about teenage alcoholism. A query's success in this sense is its "precision." Ideally, a query should return high recall and high precision. However, it is less frustrating to achieve high precision with less recall than to receive hundreds (or thousands) of sites, many of which may be only loosely (or not at all) connected to your query.
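The two measures can be made concrete with a short calculation, using the hypothetical numbers from the example above:

```python
def precision(retrieved_relevant, retrieved_total):
    """Fraction of the retrieved items that are actually relevant."""
    return retrieved_relevant / retrieved_total

def recall(retrieved_relevant, relevant_total):
    """Fraction of all relevant items in the database that were retrieved."""
    return retrieved_relevant / relevant_total

# Hypothetical figures: the database holds 100 relevant items; a poor query
# returns 100 results, of which only 15 are relevant.
print(precision(15, 100))  # 0.15 -- low precision
print(recall(15, 100))     # 0.15 -- low recall as well
```

A query that instead returned 20 results, all relevant, would have precision 1.0 and recall 0.2; by the argument above, many searchers would prefer that trade.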
Here is where the syntactical tools of the trade, the search features that each index allows you to use, become crucial. There are several of these features, of which we will discuss only a few. Judging the indexes by their provision of these features is one of the most important ways to analyze which index is the best for you. Remember, whether there are 250,000, 30 million, or 50 million items behind the curtain of the search index, you need to be able to retrieve just the items of use to you.
What really sets Alta Vista, Open Text, and possibly Ultra apart are their field search capabilities. The producers of these indexes realize that it is crucially important not only to provide millions of web pages, but also to give the end user the tools to achieve precise retrieval.
Of course, you might not agree with these "best" picks; no one search index is right for every user. The point is to find an index that is comfortable for you and that provides you with the best results. Hopefully, Table 1 will help you do that.
There are inherent problems with all these search indexes. Because they cannot discriminate between pages at the same site, using them can become the equivalent of searching a card catalog for "George Washington" and retrieving citations for every page of every book on which that name appears. Sites are often mirrored, and the index can return numerous duplicates. There is also the ever-present problem of quality. Even when you have found the indexes of choice for you, spent time learning their syntax, and sent queries that return a manageable number of results that appear relevant to your query, how do you know if the sites retrieved are good sites? Information quality in the Internet environment will be discussed in a later column.
For more information on automated search indexes, along with other ways to search the Internet, see the Scout Toolkit.
Form Based:
The feature is present but only through menu picks on a form-based interface. For field searches that are form based, the searchable fields are listed below the words "Form Based."
"...":
Type terms between double quotation marks.
NEAR/N:
This is a proximity operator in which the user specifies the maximum number of words that may separate one term from another. NEAR/10 would mean within 10 words.
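As an illustration, the check a NEAR/N operator performs can be sketched in a few lines of Python. Real engines' tokenization and indexing are more involved; this function and its sample text are ours:

```python
def near(text, term_a, term_b, n):
    """True if term_a and term_b occur within n words of each other."""
    words = text.lower().split()
    positions_a = [i for i, w in enumerate(words) if w == term_a]
    positions_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(i - j) <= n for i in positions_a for j in positions_b)

# Invented sample text: "teenage" and "drinking" are three words apart.
text = "studies of teenage patterns of drinking in secondary schools"
print(near(text, "teenage", "drinking", 10))  # True at NEAR/10
print(near(text, "teenage", "drinking", 2))   # False at NEAR/2
```

Smaller values of N thus trade recall for precision, just as the worksheet strategy above does with Boolean operators.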
Table 1. Search feature support in eight search indexes.

The Best

Feature        Alta Vista         OpenText           WebCrawler
------------   ----------------   ----------------   --------------------
Boolean AND    AND                Form Based         AND or ampersand
Boolean OR     OR or |            Form Based         OR
Boolean NOT    AND NOT or !       Form Based         NOT
Phrase         "..."              Form Based         "..."
Proximity      NEAR (10 words)    Form Based         ADJ; NEAR/N
                                                     (user specifies N)
Truncation     *                  *                  NA
Field Search   anchor:terms       Form Based         NA
               applet:terms       (Summary, Title,
               host:terms         1st Heading, URL)
               image:terms
               link:terms
               text:terms
               title:terms
               url:terms

-------------------------------------------------

The Rest (in no particular order)

Feature        Infoseek           Excite     Lycos        HotBot
------------   ----------------   --------   ----------   ------------
Boolean AND    +                  AND        Form Based   Form Based
Boolean OR     Type terms         OR         Form Based   Form Based
Boolean NOT    -                  AND NOT    Form Based   Form Based
Phrase         "..."              NA         NA           Form Based
Proximity      term-term          NA         NA           NA
               (adjacent);
               [term-term]
               (100 words)
Truncation     NA                 NA         NA           NA
Field Search   NA                 NA         NA           Form Based
                                                          (Media Type,
                                                          Location, URL)

-------------------------------------------------

And a new one that shows promise.

Feature        Infoseek Ultra
------------   ----------------
Boolean AND    +
Boolean OR     Type terms
Boolean NOT    -
Phrase         "..."
Proximity      NA
Truncation     NA
Field Search   link:terms
               site:terms
               url:terms
               title:terms
Ironically, one of the most difficult things about using the Internet for research is finding the information you need. In Part I we discussed automated search indexes, one way of finding Internet information. However, as we will see, there are also Internet search guides that are maintained by hand. The subject directories and hierarchies people maintain are, for all their shortcomings, more useful to users (especially new users) who are asking the question: "What can I find on the Internet about history, or economics, or women's studies, or medicine?" or any of hundreds of other subjects.
Automated search indexes are poor at answering these questions because they provide little organization or structure in the results they spit back in response to a query. Each search result receives a relevance score, but that is no substitute for organization and structure. The structure and organization of resources which have always helped traditional library users are available in other kinds of Internet search tools, which we call subject directories.
This month's column will be devoted to a discussion of subject directories. Subject directories are categorizations of Internet resources, and are meant to be browsed, although most can also be searched. As we discussed last month, search indexes are collections of Internet links, built by "spider" programs that automatically deposit links in a searchable database. Subject directories, on the other hand, are produced and maintained by people, and resources are collected by either resource-owner submission or selection by librarians, editors, or subject specialists.
Most of these directories contain search interfaces, but they are often more rudimentary than the ones discussed in last month's column, serving instead as a gateway to a subject hierarchy which the user can browse for information about a topic.
The main difference between subject directories and search indexes is the level of human intervention in the creation of the directory. It is this human intervention that filters and classifies resources so that busy researchers can quickly find what is of use to them, rather than searching every page of hundreds of thousands of sites. These directories (except for the very largest ones) contain far fewer resources than search indexes. However, this can actually be advantageous to the user. There is much less "chaff" to cut through to obtain the "wheat."
As with all things human, each directory is unique, with its own set of advantages and disadvantages. Which one is best for you is a personal preference, but we will point out some of the better ones.
We will categorize subject directories by the amount of human intervention. The categories are subject catalogs, annotated directories, and subject guides.
A "subject catalog" is very much like a library subject card catalog. Users look in the catalog under the subject heading that they are interested in and find resources.
An "annotated directory" has resources listed in a subject hierarchy, but each resource is further analyzed by an editor, librarian, or subject specialist. It is then annotated to give the user a more detailed idea of what the resource is, and, in some cases, rated based on an established set of criteria.
A "subject guide" contains a still deeper level of human analysis, in that a person or persons (editors, librarians, or subject experts) have filtered resources in a single subject and created a guide (sometimes annotated) to that subject. Implicit in the notion of a guide is that its resources will be of high quality because of the amount of filtering and the level of expertise of its author. Having a set of these guides at one site would give users the highest level of filtering and analysis, and thus the highest quality resources.
Eight directories that fall into these categories are old Internet veterans, well established and respected. The directories and their features are presented in Table 2 below for your convenience. We will not discuss the intricacies of using their search engines. Interested users should use last month's column as a guide. We will discuss certain features of the directories to help users analyze which ones are most applicable to them.
These features should help you to determine the amount of filtering and quality analysis that has taken place in each directory. Some of the features you should look for in a directory are:
   - whether the directory can be searched
   - whether sites are selectively chosen for inclusion (site discrimination)
   - whether sites are rated, and if so, under what rating system and criteria
   - whether sites are annotated, and who writes the annotations

These are the features compared for each directory in Table 2 below.
McKinley's Magellan is the best annotated directory because of both the number of annotated sites, and the level of annotation of each site.
While Yahoo is the most comprehensive subject catalog, it takes almost anything submitted and puts it into a hierarchy that is difficult to navigate without prior searching. It straddles the line between subject directory and search index, and many people use it both ways. A better, although much less comprehensive subject catalog is the Bulletin Board for Libraries (BUBL). Its producers provide the catalog in both Universal Decimal Classification and alphabetic subject format. Its selectors are librarians, and while this does not guarantee excellence, it does guarantee that people whose job it is to select and categorize information are doing that job.
You may not agree with these picks, or may feel there are better subject directories on the Internet than the ones discussed here. The point is to find the directory that is best for you, that consistently provides you with the best resources, and then use it. This quick comparison will show you that these directories, because they have different strengths, can be used in combination to provide better results. Yahoo, Galaxy, and the Internet Directory of Directories contain lots of resources but little filtering. Magellan and the Lycos Top 5% give high ratings to very different kinds of resources. Argus Clearinghouse and W3C Virtual Library produce entire guides on single subjects. The important thing is to know what you're looking at when you look at a subject directory.
As with search indexes, subject directories have inherent problems. The above-mentioned problem of arbitrary and uncontrolled hierarchies is the biggest. It is sometimes difficult to determine who puts resources where in the subject hierarchy--the resource submitters or the owners of the directory.
Selecting or not selecting a resource, rating it, and annotating it are very subjective processes. The fact that Magellan gives a site 28 points out of 30 ("four stars") does not guarantee the site is a quality site for every user. That determination must be made by the user.
However, the fact that resources have been categorized, and in some cases selected, rated, and annotated, means that users are likely to find more quality resources in these directories than by searching an automated index. Which directory contains the most quality resources? Which contains the highest quality resources? That is for the user to determine. Users must determine quality much more on the Internet than in other avenues of publication because the filters that have long existed in those avenues do not exist at this time on the Internet. This is, of course, good and bad. It is good in the sense that the Internet can be a publishing avenue for information that doesn't make it through publishing filters. It is bad in the sense that those publishing filters have long been perceived as quality filters as well. The Internet has been criticized for having a low quality of information. How does the user determine the quality of an information resource in a networked environment? We turn next to that question.
For more information on subject directories, see the Scout Toolkit.
Table 2. A comparison of filtering features for eight Internet subject directories.

SUBJECT CATALOGS

Yahoo!
   Search: Y    Site Discrimination: N    Site Ratings: Y
   Rating System: Glasses icon
   Rating Criteria: Presentation/Content
   Site Annotation: Brief    Who Annotates: Submitters

BUBL
   Search: Y    Site Discrimination: Y    Site Ratings: N
   Rating System: N/A
   Rating Criteria: N/A
   Site Annotation: Y    Who Annotates: Librarians

Galaxy
   Search: Y    Site Discrimination: N    Site Ratings: Y
   Rating System: N
   Rating Criteria: N/A
   Site Annotation: N/A    Who Annotates: N/A

ANNOTATED DIRECTORIES

Magellan
   Search: Y    Site Discrimination: Y    Site Ratings: Y
   Rating System: 30 pts.
   Rating Criteria: Content depth; Organization; Net appeal
   Site Annotation: Y    Who Annotates: Editors

Lycos (Top 5%)
   Search: N    Site Discrimination: Y    Site Ratings: Y
   Rating System: 50 pts.
   Rating Criteria: Content; Presentation; Experience
   Site Annotation: Y    Who Annotates: Editors

InterNIC Directory of Directories
   Search: Y    Site Discrimination: N    Site Ratings: N
   Rating System: N/A
   Rating Criteria: N/A
   Site Annotation: Y    Who Annotates: Submitters

SUBJECT GUIDES

Argus Clearinghouse
   Search: Y    Site Discrimination: Y    Site Ratings: Y
   Rating System: 1-5 check marks
   Rating Criteria: Level of resource description; Level of resource
      evolution; Guide design; Guide organization; Guide meta-information
   Site Annotation: Varies    Who Annotates: Guide maintainer

W3C Virtual Library
   Search: N    Site Discrimination: Y    Site Ratings: N
   Rating System: N/A
   Rating Criteria: N/A
   Site Annotation: Varies    Who Annotates: Guide maintainer

Key: Y = Yes    N = No    N/A = Not Applicable
Directories that list "Y" under annotation do not necessarily annotate every site in the directory.
Note that the Argus Clearinghouse rating system rates the guides, not the individual resources within the guides.