Indexing the Internet

Q: Which is stronger: your teeth or your lips?
A: Your lips, because your teeth can be broken or fall out.

- Chinese riddle

One of the things computers have not done for an organization is to be able to store random associations between disparate things, although this is something the brain has always done relatively well.
- Tim Berners-Lee

There is a wealth of options available to today’s Internet searcher. Computer-driven search engines offer the ultimate depth of indexing by crawling completely through Websites and compiling full-text databases. Internet directories present a more browsable record structure by cataloging sites within a hierarchical classification scheme. Giant portal pages are driven by databases with hundreds of millions of entries, in contrast to a range of specialized finding tools designed to provide comprehensive or high-quality coverage within limited areas. Other search sites offer variations, combinations, or compilations of these types of searching tools.

While acknowledging that no single search tool is right for all search needs, this paper will analyze the question of how best to index the Internet. Cosmetic concerns over user interface design and style will be addressed only insofar as they relate to the actual content representation of the search tools. It is my contention that, given the types of human and technological resources currently available, meeting the needs of Internet searchers requires a reliance on human-powered indexing methods - especially the classification and description of documents by topical experts - and probably always will. Allowing for the differences between the needs of Internet and library searchers, methods of library cataloging can be adapted to the Internet environment and enhanced with the advantages humans hold over machines in cataloging information.

Humans have spent thousands of years developing systems for classifying and retrieving information. Library cataloging systems such as the Dewey Decimal and Library of Congress classification schemes were developed in accordance with the physical limitations of the cataloged material. The Internet explosion, in contrast, can be partially credited to the virtual characteristics of online information. The flexibility of offering multiple access points to information makes archaic schemes like the Dewey Decimal System appear ill-suited to cataloging the Internet. Many users of the Internet, moreover, may want nothing to do with anything that resembles traditional library systems or research methodologies.

The sophisticated nature of library cataloging tools, which often involve many intermediary steps between searching and finding information - such as using a printed directory of reference works to locate a subject index in order to find a journal article - reflects their development by experts for use by expert searchers or by those with help from a trained librarian. The ease of use of online search engines, where terms can simply be entered and searched, has generated mass appeal by offering quick responses and alleviating the psychological vulnerability people may feel when asking a librarian for search help. Yet in many ways a need still exists to educate users so that they can make more productive and efficient use of searching resources. An analysis of search engine queries, for example, shows that searchers seldom use the available tools to hone search statements, such as Boolean operators or quotation marks to form phrases (Jansen et al., 1998), and a generally low retrieval effectiveness of search engine results has been documented (Gordon & Pathak, 1999). One solution to improving the results of ambiguous queries is the enforcement of more rigorous indexing systems.

Before applying library cataloging methods to the Internet outright, the differences between libraries and the Internet need to be considered. Without being drawn into a debate of libraries versus the Internet, it should be understood that while the exponentially increasing amount of material available on the Internet spans a broad range of subjects, serious Internet collections are usually vastly outnumbered by frivolous, dubious, or redundant indexable matter, and outweighed in caliber by the printed works found in any medium-sized public library. The major difference that follows is that while libraries depend on funding to build scholarly collections, the Internet, a more interactive and entertaining medium, is driven primarily by commercial or personal interests and operates with virtually no bibliographic control.

As a consequence of the economic tendencies of online information, many free Internet services, including the major search engines, are subsidized by corporate sponsorships. Users should not necessarily regard this as an overwhelming detriment to those services’ functionality, any more than paying taxes to support libraries is, so long as the advertising within Internet indexes is not disguised as, or confused with, the supposedly objective editorial content. The real problem is the presence of unwanted and unauthorized advertising, prevalent within the many automated Internet search tools that rely on author indexing and lack human editorial control, which allows deceptive and faulty database entries to be made successfully for commercial gain.

The production of fake entries in search engines exposes the vulnerabilities of cataloging methods that work so well in library automation when they are instead used in an anarchic commercial setting. The practices of Internet publishers ruthlessly out for commercial gain, such as ‘spamming the index’ - submitting pages stuffed with attributes, sometimes as comprehensive as a small dictionary, that do not reflect the true content of the site, in the hope of creating more access points - are constantly being countered by the editorial efforts of computerized databases, producing a seemingly endless technological arms race between the opposing sides. More aggressive methods of deception, such as ‘pagejacking’ - copying the work of reputable organizations and then forwarding Internet visitors to unrelated destinations, a practice an estimated 25 million pages employ - are being met with limited legal countermeasures, such as policing actions by the Federal Trade Commission in the name of consumer protection and the punishment of misleading Website advertisers (Sullivan, 1999b).

It should be emphasized that these spamming vulnerabilities of search engines are almost entirely due to their automated nature. Efforts to rank search results not only on author-supplied data, such as the frequency, positioning, and proximity of search terms, but also on more objective signals - the source domain of the indexed file, how often searchers choose the link, and especially a sophisticated form of citation analysis that charts authoritative pages and hubs by counting the number of links pointing to a page - do hold promise for offering more relevant search results (Brin & Page, 1998; Chakrabarti et al., 1999; Notess, 1999). It is reasonable to assume, however, that no matter how sophisticated the spamming countermeasures adopted by automated indexes become, new ways of fooling the machines could be crafted. [See Henzinger et al. (2002) for an update regarding this prediction.] Some amount of human editorial power therefore seems necessary.
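
To make this link-counting idea concrete, the following is a rough Python sketch of scoring pages by how much linking credit flows to them, in the spirit of the citation-analysis approaches cited above. The four-page graph, function name, and damping value are invented for illustration; this is not any engine's actual algorithm.

    # A minimal sketch of link-based scoring in the spirit of the citation
    # analysis described above; the graph and damping value are illustrative.
    def link_scores(links, damping=0.85, iterations=50):
        """links maps each page to the list of pages it points to."""
        pages = set(links) | {p for targets in links.values() for p in targets}
        score = {page: 1.0 / len(pages) for page in pages}
        for _ in range(iterations):
            new_score = {page: (1.0 - damping) / len(pages) for page in pages}
            for page, targets in links.items():
                if not targets:
                    continue
                share = damping * score[page] / len(targets)
                for target in targets:
                    new_score[target] += share  # credit flows along each outbound link
            score = new_score
        return score

    # A hypothetical four-page Web: the page the others point to ranks highest.
    web = {
        "a.html": ["hub.html"],
        "b.html": ["hub.html"],
        "c.html": ["hub.html", "a.html"],
        "hub.html": ["a.html"],
    }
    print(sorted(link_scores(web).items(), key=lambda kv: -kv[1]))

Even in this toy example, the page that the other pages point to receives the highest score, a signal that is harder for a lone spammer to fabricate than a list of keywords.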

Beyond the need for human control against fake records, there are potentially insurmountable difficulties in expecting computers to comprehend language. Language is replete with synonyms, polysemy, homonyms, spelling variations, and slang, and discourse is full of shifting contextual meanings and linguistic nuances such as puns, poetry, and sarcasm, making full-text databases rather blunt tools in their over-reaching attempts to process natural language. A human indexer examines a document and identifies its principal concepts with a controlled vocabulary, using a caliber of comprehension unparalleled by the most advanced computer science or any so-called artificial intelligence. Until computers can comprehend language and hold their own in a conversation, there will remain a gap in their ability to analyze and index text. Indeed, book indexing by computers, despite all its promises of mechanized efficiency, has remained unsuccessful (Korycinski & Newell, 1990) and continues to be the task of trained professionals. Technological aids already exist for humans to apply book indexing methods to cataloging the Internet, not only to recognize accurately author-cataloged sites, but to map the mental structure of search terms in a cross-referenced thesaurus of subject headings (Humphreys, 1999).
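
As a rough illustration of what a cross-referenced thesaurus gives an indexer, the sketch below maps a searcher's variant terms onto a single controlled subject heading before any lookup takes place. The headings and 'use for' terms are invented for the example.

    # A minimal sketch of a controlled vocabulary: variant ("use for") terms are
    # cross-referenced to a preferred subject heading. Headings are invented.
    THESAURUS = {
        "Automobiles": {"cars", "autos", "motorcars"},
        "Motion pictures": {"movies", "films", "cinema"},
        "Information retrieval": {"searching", "document retrieval"},
    }

    def preferred_heading(term):
        """Map a searcher's term to the controlled heading used by the index."""
        term = term.lower()
        for heading, variants in THESAURUS.items():
            if term == heading.lower() or term in variants:
                return heading
        return None  # no cross-reference exists for this term

    print(preferred_heading("movies"))  # -> "Motion pictures"
    print(preferred_heading("cinema"))  # -> "Motion pictures"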

Information scientists with experience managing searchable catalogs have learned how to enhance retrieval effectiveness by matching indexing tools with search needs. They are also familiar with related topics in the psychology of language, such as the frequency and distribution of subjects and vocabulary terms and the high variance of natural language terms for identical concepts, all of which illustrate how a hierarchical classification of information benefits search effectiveness (Bates, 1998). Another aid to search precision is narrowing search domains to specific subjects: honing the scope of what is searched, perhaps by limiting searches to certain source domains or languages, or by conducting specialized searches in subject-oriented search engines such as FindLaw.

One problem that diminishes with the hierarchical classification of Websites by humans is the difficulty of accurately ranking the results of simple searches against full-text databases. Although the exact ranking algorithms used by search tools are company secrets, the relationships of the search terms to general Webpage attributes, such as the frequency of search terms within the document, title, headings, or metatags, are calculated before search results are displayed. A survey of the quality of results ranking by five popular search engines, which measured the relevance of results from various topical searches, found a “generally good” presentation ranking of results, though not one without errors and inconsistencies (Courtois & Berry, 1999). Navigators of a hierarchy, using categorical searches and site descriptions, can avoid relying on automated results ranking and be sure of examining all entries within the selected categories.
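
The kind of attribute weighting described above can be sketched as follows; the field weights and sample pages are invented, since the real ranking algorithms are, as noted, kept secret.

    # A minimal sketch of attribute-weighted results ranking: a search term found
    # in the title counts for more than one found in the body text. The weights
    # are invented for illustration.
    FIELD_WEIGHTS = {"title": 5.0, "headings": 3.0, "meta_keywords": 2.0, "body": 1.0}

    def score_page(page, query_terms):
        """page maps field names to text; higher scores are displayed first."""
        score = 0.0
        for field, weight in FIELD_WEIGHTS.items():
            words = page.get(field, "").lower().split()
            for term in query_terms:
                score += weight * words.count(term.lower())
        return score

    pages = [
        {"url": "a.html", "title": "Library cataloging", "body": "cataloging rules and tools"},
        {"url": "b.html", "title": "Home page", "body": "cataloging cataloging cataloging"},
    ]
    ranked = sorted(pages, key=lambda p: score_page(p, ["cataloging"]), reverse=True)
    print([p["url"] for p in ranked])  # a.html outranks the keyword-stuffed b.html

Even in this toy ranking, the page that merely repeats the term in its body is outranked by the page carrying it in the title, which is one reason engines weight such attributes in the first place.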

Merely classifying and categorizing the Internet, rather than compiling a full-text index by computer, does leave users without the means to conduct the best searches on obscure topics, for which access to full-text databases is useful. Even so, the coverage of the largest search engines, which by a recent analysis have indexed no more than about 16% of the Web (Lawrence & Giles, 1999), still leaves much to be desired. Meta-search engines (which attempt to integrate the unique content of individual search engines and directories, each with its own features and interfaces, by presenting the search results of many services in response to a single query) do offer timesaving features for those doing an exhaustive search on an obscure topic. But because meta-search engines cannot take full advantage of the unique features of the individual search engines, the quality of a meta-search's results is in a sense only as strong as the weakest of the links offered. They may, however, help users discover which search tools are best for their type of search need. Conducting meta-searches may also often not be worth the effort for those seeking basic information on general topics, which can be found through almost any search engine. Likewise, even searchers with specific needs will find that these are better met using search tools devoted to specialized topics.
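
In essence, a meta-search engine forwards one query to several services and merges whatever comes back. The sketch below combines hypothetical ranked result lists with a simple reciprocal-rank vote; as noted above, such merging necessarily bypasses each engine's native search features.

    # A minimal sketch of meta-search merging: ranked lists from several engines
    # are combined by summing reciprocal-rank votes. The result lists are
    # hypothetical; a real meta-search would fetch them over the network.
    def merge_results(result_lists):
        """result_lists holds one ranked list of URLs per search engine."""
        votes = {}
        for ranked in result_lists:
            for position, url in enumerate(ranked, start=1):
                votes[url] = votes.get(url, 0.0) + 1.0 / position
        return sorted(votes, key=votes.get, reverse=True)

    engine_a = ["x.html", "y.html", "z.html"]
    engine_b = ["y.html", "x.html"]
    engine_c = ["y.html", "w.html"]
    print(merge_results([engine_a, engine_b, engine_c]))  # y.html ranks first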

Using specialized search engines or human-constructed directories does sacrifice the comprehensiveness of the large full-text databases. When weighing the impact of this sacrifice, we must again consider the characteristics of the Internet and the needs of its searchers. Unlike the doctoral student who scours all available library catalogs to exhaust the coverage of a topic, most public Internet searchers want just one good result per search. Topical clearinghouses that point to quality information are designed to serve these search needs, and may even hold more entries for their subjects than are available through comprehensive indexes. Concerns over not having comprehensive results are therefore outweighed by the need for the quality and relevance offered by individual Internet catalogers. [Added: Guernsey (2001) provides a good overview of this topic.]

Even if they were to survey the entire Internet successfully, wholly computerized Internet indexing systems are limited in that the automated spider robots that crawl through Websites retrieving data can only read open-text formats, such as HTML files, and cannot record any more than the basic file attributes of non-text formats, including PDF, sound, image, and video files. It is more difficult to mechanically extract cataloging data from multimedia objects because they are more complicated in format and more abstract in subject. While most comprehensive search engines do offer multimedia searching capabilities (Jacsó, 1999), the image, sound, video, and other collections of specialized media search engines, such as Corbis and MP3.com, have devoted fuller resources to creating databases with more useful methods for finding genre-specific file formats. Furthermore, most search engines cannot survey frame-based sites or dynamic pages, such as those pulled from a database with cgi or perl scripting, and have problems with pages in XML format (Sullivan, 1999a).
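
To illustrate why a spider sees only part of the Web, here is a minimal sketch of its fetching step: it honors a site's robots.txt exclusions and indexes only plain HTML, setting PDF, image, sound, and other formats aside. The URL and user-agent name are placeholders, not real services.

    # A minimal sketch of a crawler's fetch step: obey robots.txt and keep only
    # plain HTML, skipping PDF, image, sound, and other non-text formats.
    from urllib import request, robotparser
    from urllib.parse import urljoin

    def fetch_if_allowed(url, user_agent="example-spider"):
        robots = robotparser.RobotFileParser(urljoin(url, "/robots.txt"))
        robots.read()
        if not robots.can_fetch(user_agent, url):
            return None  # the site has opted out of automated indexing
        with request.urlopen(url) as response:
            if "text/html" not in response.headers.get("Content-Type", ""):
                return None  # PDF, image, audio, etc.: record the URL at most
            return response.read().decode("utf-8", errors="replace")

    html = fetch_if_allowed("http://example.com/")  # placeholder address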

It could be argued that a reliance on individual subject catalogers and specialized indexes results in an unacceptable variance among the array of available finding tools. A comprehensive but automated Internet indexing system, however, also varies in composition, since it relies on individual page owners to submit and properly code their pages. The use of keywords in the HTML meta tag, for example, has been shown to significantly improve the retrievability of a document (Turner & Brackbill, 1998), and more refined conventions modeled on the database fields in library catalogs, such as the Dublin Core (Weibel, 1997), offer more detailed descriptions and capabilities for specialized access points. Yet without a consistent format being adopted by the millions of Internet publishers, and given the inherent spamming vulnerabilities of an automated self-cataloging system, there remains little consistent benefit to using automated cataloging over human selection. Any information or codes hidden from the screen of Web browsers, such as meta tags, could just as easily be up to tricks like pagejacking as providing authentic cataloging information. Quality pages that are not properly tagged or submitted to a search engine, or especially those restricting access by automated indexing robots, may not be included in computerized indexes, whereas human indexers are more likely to include only important and relevant sites.
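
For concreteness, the sketch below reads author-supplied keyword and Dublin Core style meta tags out of an HTML page with Python's standard parser; whether those values can be trusted is precisely the editorial question raised above. The sample document is invented.

    # A minimal sketch of reading author-supplied metadata: collect <meta> tags,
    # including Dublin Core style "DC.*" names, from an HTML document.
    from html.parser import HTMLParser

    class MetaTagReader(HTMLParser):
        def __init__(self):
            super().__init__()
            self.metadata = {}

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                attrs = dict(attrs)
                name, content = attrs.get("name"), attrs.get("content")
                if name and content:
                    self.metadata[name.lower()] = content

    sample = """<html><head>
    <meta name="keywords" content="cataloging, indexing, metadata">
    <meta name="DC.Title" content="Indexing the Internet">
    <meta name="DC.Creator" content="An example author">
    </head><body>...</body></html>"""

    reader = MetaTagReader()
    reader.feed(sample)
    print(reader.metadata)  # {'keywords': ..., 'dc.title': ..., 'dc.creator': ...}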

One way to alleviate the problems of the improper use of metadata is to narrow the scope of indexed pages from Internet-wide searching services to more trusted domains. The full capabilities of automated cataloging tools such as metadata can best be harnessed within a realm of authors known to be responsible or inside well-regulated domains, such as corporate intranets or academic institutions. Human-driven quality control efforts within limited-scope search engines, such as Noesis [Offline] - a finding tool for online philosophical research that only accepts entries from authors with a doctoral degree - successfully demonstrate that “it is technologically possible and economically feasible to build a system of dissemination for academic resources that is completely administrated by the scholarly world without the intervention of economic interests” (Beavers, 1998). Specialized searching services can also focus on creating a directory of their topic on the Internet more easily than can the staffs of large search engines, who must maintain a broader coverage of links.

Considering the benefits of an Internet directory, we also find that many of the historical difficulties of library cataloging disappear when classification systems are implemented in a computerized format. Since a digitized hierarchy easily allows room for expansion when new terms and categories arise, and natural language queries can readily be mapped to terms in the classification system, many of the supposed disadvantages of using a classification scheme and its controlled vocabulary are avoided (Mitchell, 1998). Rather than letting searchers merely wade through an unstructured mass of open-text database entries (even if accompanied by automated tools that attempt to cross-link entries, such as ‘more like this’ links, which by their mechanized nature produce results of variable quality and accuracy), it is preferable to allow users to search and browse organized categories with multiple access points to information (Ellis & Vasconcelos, 1999).
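
One way to picture such a digitized hierarchy with multiple access points is as a tree in which a query term may match a category label at any depth, returning every path that leads to it. The categories below are invented for the example.

    # A minimal sketch of browsing a digitized classification hierarchy: a query
    # term is matched against category labels at any depth, and every path to a
    # match is returned as an access point. The categories are invented.
    HIERARCHY = {
        "Science": {
            "Chemistry": {"Organic chemistry": {}, "Databases": {}},
            "Physics": {},
        },
        "Reference": {"Databases": {}, "Libraries": {}},
    }

    def find_categories(tree, term, path=()):
        matches = []
        for label, subtree in tree.items():
            current = path + (label,)
            if term.lower() in label.lower():
                matches.append(" > ".join(current))
            matches.extend(find_categories(subtree, term, current))
        return matches

    print(find_categories(HIERARCHY, "databases"))
    # ['Science > Chemistry > Databases', 'Reference > Databases']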

The need for Internet directories is exemplified by AltaVista, one of the largest, oldest, and most reputable full-text search engines. In addition to its full-text database, AltaVista has added a hierarchical directory for searchers, taken from the Open Directory Project. Phrased questions presented to the search engine are also processed by the Ask Jeeves system - a database of human-indexed pointers to Websites providing the answers to many commonly asked questions.

Further alternatives to both search engines and hierarchical directories will always be available, because any Website can link to other pages. While simply surfing through links provides searchers with a random experience (although perhaps with the benefit of serendipity), more structured surfing methods are available. Webrings, for example, are collections of interlinked sites on a common topic, allowing their navigators to browse through related Websites via communally maintained connections (Casey, 1998).

The great promise of automated indexing tools is that they provide a level of detail greater than any humanly powered method of indexing can. Automated searching aids are therefore necessary, to a degree, to keep up with the millions of pages being added to the Internet. Finding dead links and maintaining timely information within any Internet search tool are needs for which automated solutions continue to improve. When it comes to sorting through all of this data, however, and making the best sense of what is available online, the power and assurance of human understanding and editorial control must also be called upon.

One possible objection to Internet finding tools driven by human power is that they cannot keep up with the capabilities of automated systems. While they cannot do full-text indexing, the combined efforts of the Internet publishers who maintain quality subject indexes do in fact meet most searching needs. Virtually every Website contains a list of links maintained by its author. Directing searchers to the best and most authoritative subject indexes available can produce a far greater information retrieval system than any single search tool. Links for Chemists, an index of over 8,000 Chemistry-related Websites, is a cooperative example of a broad subject index that includes subsection contributions from different editors. Such gateway pages can be found with intermediate finding tools such as Invisible Web, a database of over 10,000 specialized online search tools, or the Argus Clearinghouse, a selective index of quality Internet subject catalogs.

An Internet-wide example of a communally produced directory is the Open Directory Project. Maintained mostly by volunteers, the Open Directory has cataloged over 1.2 million Websites within a hierarchical classification system. In contrast to a staff-run directory such as Yahoo!, which seems more intent on retaining surfers for additional advertising opportunities - accomplished by offering services such as online games, fantasy league competitions, chat, and e-mail - than on devoting resources to maintaining a quality directory, the challenges of a human-maintained comprehensive Internet index seem to have been met by the combined efforts of the more than 30,000 Open Directory contributors (Dunn, 1999).

Depending on the type of information being sought, it may be best to use a large full-text search engine with sophisticated relevance ranking abilities, such as Google; an Internet-wide hierarchical classification system such as the Open Directory Project (which has been added to Google); or a quality subject index located through an intermediate directory such as the Argus Clearinghouse. Because computers can neither comprehend language nor exercise sound editorial control, human-powered cataloging systems remain, for now and the foreseeable future, essential tools for indexing the Internet.

References

Bates, M. (1998). Indexing and access for digital libraries and the Internet: Human, database, and domain factors. Journal of the American Society for Information Science, 49(13), 1185-1205.

Beavers, A. F. (1998). Evaluating search engine models for scholarly purposes. D-Lib Magazine, December 1998. Available: http://www.dlib.org/dlib/december98/12beavers.html.

Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Seventh International World Wide Web Conference, Brisbane, Australia, April 14-18. Available: http://infolab.stanford.edu/~backrub/google.html.

Casey, C. (1998). Web rings: An alternative to search engines. College & Research Libraries News, 59(10), 761-763.

Chakrabarti, S., Dom, B., Kumar, S. R., Raghavan, P., Rajagopalan, S., Tomkins, A., Kleinberg, J. M., & Gibson, D. (1999). Hypersearching the Web. Scientific American, June 1999.

Courtois, M. P., & Berry, M. W. (1999). Results-ranking in Web search engines. Online, 23(3), 39-40.

Dunn, A. (1999). Open Directory in search of the best of the Web. Los Angeles Times, 18 October 1999, C1.

Ellis, D., & Vasconcelos, A. (1999). Ranganathan and the Net: Using facet analysis to search and organise the World Wide Web. Aslib Proceedings, 51(1), 3-10.

Gordon, M., & Pathak, P. (1999). Finding information on the World Wide Web: The retrieval effectiveness of search engines. Information Processing & Management, 35(2), 141-180.

Guernsey, L. (2001). Mining the 'Deep Web' with specialized drills. New York Times, 25 January 2001. Available: http://www.nytimes.com/2001/01/25/technology/25SEAR.html.

Humphreys, N. K. (1999). Mind maps: Hot new tools proposed for cyberspace librarians. Searcher, 7(6).

Jacsó, P. (1999). Sorting out the wheat from the chaff. Information Today, 16(6), 38.

Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998). Real life information retrieval: A study of user queries on the Web. SIGIR Forum, 32(1), 5-17.

Korycinski, C., & Newell, A. F. (1990). Natural-language processing and automatic indexing. The Indexer, 17(1), 21-29.

Lawrence, S., & Giles, L. (1999). Accessibility of information on the Web. Nature, 400(6740), 107-109.

Mitchell, J. S. (1998). In this age of WWW is classification redundant? Catalogue & Index, 127, 5.

Notess, G. R. (1999). Rising relevance in search engines. Online, 23(3), 84-86.

Sullivan, D. (1999a). Crawling under the hood: An update on search engine technology. Online, 23(3), 30-32.

Sullivan, D. (1999b). FTC steps in to stop spamming. The Search Engine Report, 4 October 1999. Available: http://searchenginewatch.com/showPage.html?page=2167501.

Turner, T. P., & Brackbill, L. (1998). Rising to the top: Evaluating the use of the HTML meta tag to improve retrieval of World Wide Web documents through internet search engines. Library Resources & Technical Services, 42(4), 259-271.

Weibel, S. (1997). The Dublin Core: A simple content description model for electronic resources. Bulletin of the American Society for Information Science, 24(1), 9-11.

This essay was written by John Hubbard for the Drexel University College of Information Science and Technology course "INFO 622: Content Representation" in December 1999. Although changes such as updating Web links and numbers and appending additional references have been made, it has not been significantly altered from the original version.
