Page Last Updated: 13 April 1998
Site Maintenance/Management Tools
General | Syntax Verifiers/Validators | Link Validators | Mail/News Archive | Misc. Tools | Server Log Analysers
Web Indexing and Searching
Robots, Wanderers and Spiders | Indexing/Searching
Browsers and Document Processing
Browsers, Servers, Editors, Translators
This chapter describes programs and services useful in managing and maintaining collections of HTML documents or other resources. These descriptions are divided into six sections. The first section describes programs, mostly written in perl, useful for managing Web resources. There are tools that can generate a table of contents for a set of documents, convert mail or news archives into HTML files, check the accuracy of the HTML syntax of a given document, or verify the accuracy of hypertext references within a document. The next section summarizes HTTP server log analysis tools--tools that can analyze Web server usage and indicate how site resources are being accessed. Following is a brief section on the "robots" and other benign critters that roam the Web, and how to exclude them from a site. The fourth section looks at Web indexing and search tools--software that can index and catalog a Web site or collection of Web sites, and provide a searchable index for those resources. Fifth is a brief section on intranet suites--these are comprehensive Web/Internet software packages designed for automating large portions of business or enterprise operation. The sixth and final section is a URL resource list for topics and resources not covered in this chapter, or this book: in particular, resources related to Web browser and server software, HTML editors, and HTML document converters and translators.
URLs related to the topic or tool under discussion are located at the beginning of each section, right-aligned on the page. You should use these URLs to obtain up-to-date information on a particular resource, or to search for new resources not described here. In addition, the names of tools described in this chapter are italicized when they appear outside the section specific to the tool.
The programs listed here are useful in maintaining document collections, and can be used to automatically generate links, create HTML versions of mail archives, generate tables of contents, and so on. Most of these tools were developed under UNIX, and may require modification for other operating systems. Here you will find only brief descriptions of the packages; you are referred to the listed URLs for additional information.
Cap2html--Gopher Directory to HTML
Cap2html, by Victor Parada, is a simple UNIX shell script for converting Gopher *.cap directories into an HTML document with links to the relevant resources. This is useful when migrating data from a Gopher to an HTTP server.
Curl--Automatic Link Generator
Curl, by Andrew Davidson, is a C-language HTML document management tool that automatically creates links between HTML documents. Curl constructs links based on information maintained by the Web collection author in a special contents file. This file lists the names of the HTML documents along with the relationships between them. Curl takes this information and modifies each document to create the necessary links to its neighbors, parents, starting page, and so on. Curl can generate several different types of contents lists, which are, in turn, inserted into automatically generated contents pages. A search engine is included.
Dtd2html--HTML Analysis of a DTD
http://www.oac.uci.edu/indiv/ehood/dtd2html.doc.html
Dtd2html, part of the perlSGML package developed by Earl Hood, takes an SGML Document Type Definition file (DTD) and generates a collection of HTML documents explaining the structural relationship among the elements defined in the DTD. This is useful if you wish to learn about the structure of an HTML DTD.
Htmltoc--Table of Contents Generator
Htmltoc, by Earl Hood, is a perl program that can automatically generate a table of contents (ToC) for a single HTML document, or for a collection of related documents. Htmltoc uses the HTML H1-H6 headings to locate sections within a single document, and uses the heading levels (H1, H2, and so on) to determine the hierarchical structure of the ToC. This behavior, among many other features, can be significantly customized.
When htmltoc creates a table of contents, it creates hypertext links from the table of contents to the documents themselves. It does this by editing the original documents and adding the appropriate hypertext anchors. The original documents are saved in backup files during this process, so that the original material is never damaged or lost.
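The core idea behind a tool like htmltoc--scan for H1 through H6 headings, then nest the ToC entries by heading level--is straightforward to sketch. The following is an illustrative sketch only (the function name and structure are my own, not htmltoc's, and htmltoc itself is written in perl):

```python
import re

def make_toc(html):
    """Build a table of contents from H1-H6 headings.

    Returns the list of (level, title) pairs found, plus an HTML
    fragment of nested <ul> lists whose indentation follows the
    heading levels, roughly as a ToC generator would emit it.
    """
    headings = [(int(m.group(1)),
                 re.sub(r"<[^>]+>", "", m.group(2)).strip())
                for m in re.finditer(r"<[Hh]([1-6])[^>]*>(.*?)</[Hh]\1>",
                                     html, re.S)]
    out, depth = [], 0
    for level, title in headings:
        while depth < level:        # open lists until we reach this level
            out.append("<ul>")
            depth += 1
        while depth > level:        # close lists when the level drops
            out.append("</ul>")
            depth -= 1
        out.append("<li>%s</li>" % title)
    while depth > 0:                # close any lists still open
        out.append("</ul>")
        depth -= 1
    return headings, "\n".join(out)
```

A real tool must also insert anchor targets back into the source documents, as htmltoc does, so the ToC entries have somewhere to point.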
Many commercial HTML editors and document management systems incorporate HTML validation with the document authoring software. However, many authors do not use such tools, but still recognize the need to "validate" their documents and check for hidden errors. Fortunately, there are many stand-alone tools for checking document syntax. Most of these tools use the HTML Document Type Definition (DTD) to define valid HTML syntax. Thus, to use these tools, you will often need to obtain the latest HTML DTD. The "official" W3C home of HTML information (including DTDs) is:
A collection of HTML DTDs, including those for HTML 3 and other experimental versions of HTML, can also be obtained from:
New DTD files appear with each revision of the language. To test your documents' compatibility with each new standard, you need only download the revised DTD and plug it into your verification program.
TIP: Why Validate?
Why bother validating your HTML as long as the rendered result looks good, you may ask? Looks can be deceiving. For a brief discussion of the virtues of validating your HTML, see:
CSE 3310 HTML Validator
CSE 3310 HTML Validator is an inexpensive but capable product for Windows 95 and Windows NT. It can be used as a stand-alone program, or as an extension to Allaire's HomeSite HTML editor. It identifies a variety of syntactical errors--misspelled tags, tag attributes, and attribute values; invalid character entities; missing quotation marks; and unpaired or incorrectly nested tags--and presents the results in an easily understood format. Also included are filters to check documents against specific HTML versions (including versions that support browser-specific extensions), and a set of features usually associated with HTML editing suites: switch tag case, strip tags, UNIX-MacOS-DOS text file conversion, and a set of templates for applying changes to parts of documents. A business license for CSE 3310 HTML Validator is currently priced at $US40.
Doctor HTML, by Imagiware, Inc., is the HTML verification component from Imagiware's suite of products designed for optimizing Web document collections. It is available as an on-line service for checking documents accessible via an HTTP server, and supports server-based password access. Licenses for local use can be purchased through the company's Web site. While not a formal validator, Doctor HTML will report on errors in spelling, image syntax, and hyperlink accuracy, as well as form, table, and overall document structure (it primarily detects mismatched tags). Of particular note is the Show Command Hierarchy feature, which filters out non-HTML text and hierarchically indents the remaining code, making it easier to track down extraneous tags.
Kinder, Gentler Validator (KGV)
Like the Webtechs validation service, this interactive validator, by Gerald Oskoboiny of the University of Alberta, uses sgmls as the underlying syntax checker. However, this package uses a more complex error parsing interface, and the reported errors are designed to be more easily understood by non-specialists. The most recent version checks for compliance with the HTML 2.0, 3.0, and 4.0 DTDs.
Sgmls, written by James Clark, is a formal SGML syntax-checking program. This program takes an SGML file as input and checks the document structure against a specified document type definition (DTD). URLs pointing to the current HTML DTD were listed at the introduction to this section, and in the "References" section at the end of Chapter 6. As output, the program prints a list of syntax errors and the line numbers at which the errors occurred.
As an example, consider the file test.html shown in Figure 12.1. Figure 12.2 shows the output of sgmls after "testing" this file against the HTML DTD.
Figure 12.1 Example HTML document test.html. The line numbers (in italics) have been added for comparison with Figure 12.2.
1  <HTML>
2  <HEAD>
3  <TITLE> <em> Instructional </em>and Research Computing Home Page</TITLE>
4  </HEAD>
5  <BODY>
6
7  <h1> Instructional and Research Computing </H1>
8  <hr>
9  This is the Instructional and Research Computing Group <B>(IRC)</B>
10 World Wide Web home page. If you get lost try the
11 <a href="big%20dog.html"> big dog help </a> or
12 <a href="http://www.university.ca/</a>home.html"> right here </a>
13 <hr>
14 <oL>
15 <LI> consulting services in <A HREF="InsT/intro.html"> instructional
16 technology and applications</A>
17 applications</A>.<P>
18
19 </ol>
20 <HR>
21
22 </BODY>
23 </HTML>
The following command on a UNIX computer will validate the file test.html--note how the DTD is specified in the command line:
sgmls -s html.dtd test.html
Often the DTD comes in two parts: the DTD itself, plus a second file called the SGML declaration, often with a name like html.decl; on occasion it also comes with an SGML catalog, with a name such as html.cat or html.catalog. You usually need all these files for sgmls to work. You can either append them together (e.g., html.decl first, followed by html.dtd), or you can pass them as subsequent arguments to sgmls, as in:
sgmls -s html.decl html.dtd test.html
The output lists the errors and the line numbers at which they occurred. Figure 12.2 shows the sgmls output for the file in Figure 12.1.
Figure 12.2 Sgmls error output after parsing test.html (shown in Figure 12.1).
sgmls: SGML error at test.html, line 3 at ">": EM end-tag ignored: doesn't end any open element (current is TITLE)
sgmls: SGML error at test.html, line 3 at ">": Bad end-tag in R/CDATA element; treated as short (no GI) end-tag
sgmls: SGML error at test.html, line 3 at "d": HEAD end-tag implied by data; not minimizable
sgmls: SGML error at test.html, line 3 at ">": TITLE end-tag ignored: doesn't end any open element (current is HTML)
sgmls: SGML error at test.html, line 4 at ">": HEAD end-tag ignored: doesn't end any open element (current is HTML)
sgmls: SGML error at test.html, line 21 at ">": A end-tag ignored: doesn't end any open element (current is OL)
The errors at line 3 are due to the illegal character markup inside the TITLE element. The subsequent errors at lines 3 and 4 are a result of this same mistake. The error at line 21 is a very typical error: the file has a duplicate </A> ending tag.
There is also a successor to sgmls, called nsgmls, that comes as part of a larger SGML package called SP. For more information about SP and nsgmls, see:
Designed to pick "fluff" off Web pages, Weblint, by Neil Bowers, is a perl script that checks basic structure and identifies the following errors: unknown elements, unknown tag context, overlapped elements, illegally nested elements, mismatched opening and closing tags, unclosed elements, unpaired quotes, and unexpected heading order. This is not a rigorous syntax checker, but is very useful as a first pass on a document, and for picking out basic mistakes. A list of Weblint gateways is found at the second URL given here. Weblint currently checks for errors against HTML 3.2 by default, but also includes support for Netscape Navigator 4 and Microsoft Internet Explorer 4 HTML extensions. Error messages can be selectively enabled or disabled, and a site-wide configuration file allows multiple users to share a common configuration.
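One of the simplest of these tests--detecting overlapped, mismatched, or unclosed elements--amounts to a stack check over the tags. The sketch below illustrates the idea only; it is not Weblint's code (Weblint is a perl script), and its list of elements with optional end-tags is deliberately incomplete:

```python
import re

# Elements whose end-tags are optional or forbidden in HTML, so a
# missing close is not reported (an incomplete, illustrative list).
VOID_OR_OPTIONAL = {"br", "hr", "img", "p", "li", "input", "meta", "link"}

def find_nesting_errors(html):
    """Report mismatched or unclosed tags using a simple stack."""
    errors, stack = [], []
    for m in re.finditer(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>", html):
        closing, name = m.group(1) == "/", m.group(2).lower()
        if name in VOID_OR_OPTIONAL:
            continue                      # optional end-tag: skip
        if not closing:
            stack.append(name)            # open element
        elif stack and stack[-1] == name:
            stack.pop()                   # properly matched close
        else:
            errors.append("unexpected </%s>" % name)
    errors.extend("unclosed <%s>" % name for name in reversed(stack))
    return errors
```

Overlapped markup such as `<b><i>x</b></i>` shows up as an unexpected end-tag, which is essentially how a first-pass linter flags it.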
WebTechs HTML Validation Service
If you want to check your files but do not feel comfortable downloading and using sgmls, you can instead use it remotely via the WebTechs HTML Validation Service. This service, accessible via the Web due to the efforts of Mark Gaither, can check an entire document in place at a specific URL (you enter the URL into a fill-in form), or it can check a small sample of HTML, which you type or paste into a TEXTAREA input element. There are options for selecting "strictness" of the validation, and for specifying the version of HTML to check against--thus, you can check for valid Netscape or Microsoft enhancements. The validator returns all errors, à la sgmls. The service is available at a number of mirror sites, listed at the second URL given above.
SGML parsers such as sgmls can verify that the HTML tags are correctly placed, but cannot ensure that the hypertext links go to valid locations. To check hypertext links you need a "link verifier." This is a program that reads your document, extracts the hypertext links, and tests the validity of the URLs.
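The first half of that job--pulling the URLs out of a document--takes only a few lines. The sketch below covers just A HREF and IMG SRC attributes (real verifiers handle many more), and resolves relative references against the page's base URL:

```python
import re
from urllib.parse import urljoin

def extract_links(html, base_url):
    """Collect the targets of A HREF and IMG SRC attributes,
    resolving relative references against the page's base URL."""
    pattern = r"""<(?:a\s[^>]*href|img\s[^>]*src)\s*=\s*["']([^"']+)["']"""
    return [urljoin(base_url, u)
            for u in re.findall(pattern, html, re.I)]
```

The second half--actually testing each extracted URL against its server--is where the tools below differ most, in speed, politeness, and the URL schemes they support.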
InfoLink Link Checker, by Biggbyte Software, is a site maintenance package that monitors site integrity, with an emphasis on link management. InfoLink can check HTTP and FTP URLs. A variety of flexible views and file/project manager modes makes it easy to handle multiple document collections of almost any size. Reported data include image reference details, page size comparisons, and indices of files which have been modified since the last verification. InfoLink is currently priced at approximately $US50.
CyberSpyder is a shareware, Microsoft Windows-based link validator that can iterate through a Web site, looking for (and reporting) broken links. The package can check most standard URLs, including news and mailto (it is unclear from the documentation how it validates mailtos). The package runs on all Windows platforms, including Windows 3.1. Features worth noting include support for multiple configurations in execution and report generation, support for server-based password access, compatibility with the Robots Exclusion Protocol (see the section later in this chapter on "Robots, Wanderers, and Spiders"), an interrupt/resume function, intelligent indexing which only validates pages modified since the last run, and task scheduling.
Linkcheck is a perl program written by David Sibley of Pennsylvania State University. Linkcheck can check gopher, ftp, and http URLs in a document, but cannot verify other URL schemes. Linkcheck tests gopher URLs by fully accessing the indicated URL, which can be slow if the URL points to a large file. It tests ftp URLs by listing directory contents rather than fetching the document, which is a lot nicer. Http URLs are checked by using the HTTP HEAD method, which is just as nice. If the HEAD method access fails, linkcheck tries the GET method (some servers do not understand the HTTP HEAD method).
Linkcheck sometimes fails when checking partial URLs. The program has not been updated since 1994, so it does not properly understand many newer elements, such as EMBED and APPLET.
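The HEAD-then-GET strategy that linkcheck uses is easy to reproduce. The sketch below separates the policy from the transport so the logic can be tested without a network; `fetch` is a hypothetical callable (my invention, not part of linkcheck) that performs the request and returns an integer HTTP status code, raising on failure:

```python
def check_http_url(url, fetch):
    """Try a cheap HEAD request first; fall back to GET for servers
    that do not implement HEAD.  `fetch(method, url)` must return an
    integer status code, or raise on a connection/protocol error.
    Returns (ok, method_used)."""
    try:
        status = fetch("HEAD", url)
        if status < 400:
            return True, "HEAD"
    except Exception:
        pass  # server may not understand HEAD; retry below with GET
    try:
        return fetch("GET", url) < 400, "GET"
    except Exception:
        return False, "GET"
```

HEAD is preferred because it transfers only the response headers, not the document body--the same courtesy linkcheck extends by listing, rather than fetching, ftp resources.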
Linklint is a shareware link verifier, written in perl by James B. Bowlin. It has three modes of operation. "Local Site Check" is ideal for developers who wish to verify links in documents which are not accessible via an HTTP server. "Remote Site Check" retrieves "live" documents already located on and accessible via an HTTP server, including documents which are generated dynamically or ones to which redirection is required. "Remote URL Check" is useful for verifying if a remote resource exists or has recently been modified, but does not verify links contained in that resource.
Missinglink is a shareware link verifier written in perl and designed for UNIX systems. It does not process ftp, gopher, or telnet URLs, but does account for BASE element content in evaluating HTTP references. In addition to anchor element HREF and image element SRC links, this product also validates links in NCSA-style and client-side imagemaps as well as frame element SRC links, both locally and on other HTTP servers.
Lvrfy, by Preston Crow, is a program that verifies internal links by starting with a given page, parsing all the hyperlinks, including images, and then recursively checking linked documents. Lvrfy is a regular shell script, and uses the standard UNIX programs sed, awk, csh, touch, and rm. One drawback of this verifier is speed--it is slow, averaging several seconds per file. There are also a number of problems that manifest themselves if the HTML markup is not accurate.
Previously known as Netcarta WebMapper, Site Analyst is the Web site management component of Microsoft's Site Server software suite. This full-featured product offers sophisticated site and document resource mapping features, a variety of customized navigational views, powerful content searching and analysis tools, and extensive reporting capabilities. Microsoft Site Server runs under Windows 95/NT and is priced at approximately $US1499 ($US4999 for the Enterprise Edition, which includes Commerce Server and Usage Analyst).
SiteSweeper, by Site Technologies Inc., is a commercial, Windows 95/NT-based Web site management and analysis tool. Among other things, the package includes a link checker as well as tools that prepare regular daily reports on site problems and changes. Reports are generated as HTML documents and contain details of internal and external linkages, unsupported protocol usage, broken server connections, broken page links, temporarily moved pages, Robots Exclusion Protocol features, and more. The current release (Version 2.0) is priced at just under $US500, and the upgrade from Version 1.0 at approximately $US249.
WebAnalyzer, by InContext Inc., is a commercial Windows 95/NT-based link validator. The package can check most URLs, producing useful reports of Web site properties including broken links, overall site statistics, page summaries, internal and external links, image and multimedia resource galleries, duplicated resources, and more. The package includes a graphical interface with a variety of different content navigation views for selecting the portions of the site to be validated. This validator, currently priced at approximately $US200, has received fine reviews in both PC Week and PC Magazine. Windows 3.1 users should note that an earlier version of WebAnalyzer is available for this operating system.
WebMaster, by Coast Software, is a commercial Windows 95/NT-based visual tool for managing Web sites. The package includes a link validator, as well as tools for locating orphaned pages, and for managing the file structure of a Web site. Other key features include global search and replace, automated remote file updates, and HTML FORM testing. The software is currently priced at approximately $US495 or $CDN695.
Hypermail: Mail to HTML Archive
Hypermail, by Kevin Hughes of Enterprise Integration Technologies, is a C-language program that takes a file of mail messages in UNIX mailbox format, and generates a set of cross-referenced HTML documents. Hypermail converts each letter in the mailbox into a separate HTML file, with links to other, related articles. It also converts e-mail addresses and hypertext anchors in the original letters into HTML hypertext links. Periodic updating of hypermail archives is made significantly easier by the ability to update incrementally.
MHonArc: Mail to HTML Archive
MHonArc, by Earl Hood, is a perl package for converting Internet mail messages, both plain text and MIME encoded, into HTML documents. This can be extremely useful, for example, if you are archiving electronic mail messages or newsgroup postings and want to make them available on the WWW. The package uses the letter's subject line for the HTML TITLE and as an H1 heading in the HTML version of the letter, and converts relational headers such as References or In-Reply-To into the appropriate hypertext links, if possible. MHonArc can also sort letters according to their topical thread and connect them together with Next and Previous hypertext links. In addition, MHonArc creates an index of the letters or articles, and creates a link from each converted letter to this index (and vice versa).
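The threading step--turning In-Reply-To headers into links between converted letters--boils down to mapping Message-IDs. The sketch below shows the idea using Python's standard email parser; it is a simplification, not MHonArc's algorithm (MHonArc is a perl package, and also consults References headers and subject lines):

```python
import email

def thread_links(raw_messages):
    """Map each message's Message-ID to the ID it replies to, so an
    archive converter can emit Next/Previous-in-thread hyperlinks.

    Returns (subjects, parent): subjects maps Message-ID -> Subject,
    and parent maps a reply's Message-ID -> its parent's Message-ID.
    """
    subjects, parent = {}, {}
    for raw in raw_messages:
        msg = email.message_from_string(raw)
        mid = msg["Message-ID"]
        subjects[mid] = msg["Subject"]
        # Link the letter to its parent if the parent is in the archive.
        if msg["In-Reply-To"] in subjects:
            parent[mid] = msg["In-Reply-To"]
    return subjects, parent
```

With this map in hand, the converter can write each letter's HTML page with anchors pointing at the parent and at any replies.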
MOMspider: Web Maintainer and Indexer
The Multi-Owner Maintenance spider, or MOMspider, is a Web-roaming robot designed to help in maintaining distributed collections of HTML documents. A perl package written by Roy Fielding of the Department of Information and Computer Science at the University of California, Irvine, MOMspider traverses a list of webs, and constructs an index of the collection, recording the attributes and connections of the web within a special HTML map document. MOMspider can be used to report changes in web layout, to report linking and other problems, and to generate an overview of a large web collection. Since MOMspider explores links dynamically and autonomously, it is formally a robot, and obeys the Robots Exclusion Protocol. Robots and the Robots Exclusion Protocol are discussed later in this chapter.
TreeLink: Hypergraph of Links
Treelink, by Karten Gaier, is a tk/tcl package that draws a hypergraph of the hypertext links in a document web, starting from a given hypertext document. Treelink analyzes the connections and draws a tree-like graph until it reaches a certain predefined depth (number of links). This graph often gives useful insights into the connection and arrangement of links to a particular document. Treelink also has the ability to access and generate hypergraphs for remote documents.
WebCopy: Batch Document Retrieval
WebCopy, by Victor Parada, is a perl program that retrieves a specified HTTP URL. There are many control switches that permit recursive retrieval of documents (but only from the same server--WebCopy will not retrieve documents from domain names other than the one specified in the initial URL) and the retrieval of included inline image files. WebCopy does not comply with the Robots Exclusion Protocol (discussed later in this chapter), since it is largely intended for retrieving a single document, or a small number of documents. The numerous command line switches also allow retrieval of password-protected documents, FRAME-referenced documents, imagemap-referenced documents, and output generated by CGI scripts.
All HTTP servers produce log files that record information about each server request. Most HTTP servers use the same format, known as the common log file format, to record these data, and most analysis tools are designed around this standard. The programs listed in this section can read the log files and produce lists, charts, and graphs describing server usage.
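A common log format line has seven fields--remote host, identity, authenticated user, timestamp, request, status code, and bytes transferred--and most analyzers begin with a parse along these lines (a sketch, with field names of my own choosing):

```python
import re

# host ident user [timestamp] "request" status bytes
CLF = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)$'
)

def parse_clf(line):
    """Parse one common log format line into a dict, or None on mismatch."""
    m = CLF.match(line.rstrip("\n"))
    if not m:
        return None
    host, ident, user, when, request, status, nbytes = m.groups()
    return {
        "host": host, "ident": ident, "user": user, "time": when,
        "request": request, "status": int(status),
        "bytes": 0 if nbytes == "-" else int(nbytes),  # "-" means no body sent
    }
```

Everything the tools below report--hourly load, traffic by domain, most-requested pages--is an aggregation over records like these.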
There are many different analysis programs, and not all are listed below. For further and more up-to-date information, you should consult the URLs for the reference sites listed above.
A simple C-language package, 3Dstats analyzes Web server log files and generates three-dimensional VRML models charting server usage. These can be explored using any VRML browser. Perhaps not terribly useful, but very cool!
AccessWatch, by Dave Maher, is a perl 5 program that works as a gateway interface. It produces a graphical HTML representation of server usage, which can be customized to a particular user, or a particular collection of documents. Statistics generated include hourly server load, page demand, accesses by domain, accesses by host, and more. The package runs under both UNIX and NT systems. AccessWatch is free to noncommercial users only.
Fwgstat, by Jonathan Magid, is a perl script that can read any number of different log file types including FTP, Gopher, WAIS, Plexus, NCSA 1.1, and HTTP Common Log Format, and summarize the results in a single report. Data gathered include the numbers of real and anonymous users, hourly traffic rates, and traffic by path.
Getstats, by Kevin Hughes of Enterprise Integration Technologies Inc., is a C-language program that produces log summaries for standard HTTP server log files. The getstats package is exceptionally well documented at the indicated URL, which includes links to the software.
The package CreateStats is a useful front-end for getstats, and consists of a collection of tools for managing server log files. CreateStats can be obtained at:
http://www-bprc.mps.ohio-state.edu/usage/CreateStats.html
The shell script getstats_plot can convert getstats output into a graph illustrating server usage. Another similar program is getgraph.pl. These can be obtained, respectively, at:
http://infopad.EECS.Berkeley.EDU/~burd/software/getstats_plot/
Gwstat is a UNIX package of programs and scripts that can convert HTML output from the program wwwstat (see listing in this section) into GIF format graphs of server statistics. Gwstat does not do all this itself--it requires the packages Xmgr (a data plotting package), ImageMagick (an image format conversion package), ghostscript (a PostScript interpreter), and perl versions 4 or 5.
Hit List, by Marketwave, is a commercial log analyser and reporting tool. It can read log files from many servers, including Netscape's and Microsoft's, and perform a variety of different analyses. This package is marketed as being easier to run than shareware or freeware equivalents, and boasts a range of productivity features including superior scalability, a large collection of predefined report formats, and real-time report generation.
Mswc (Multi Server WebCharts), by Tobias Oetiker, is a perl 5 log analysis tool designed to measure usage across a number of different Web servers. This is useful when a Web site is actually running on multiple machines, such that there is no central log file for the entire collection. The latest version is HTTP 1.1 compatible.
Net.Analysis, from net.Genesis Corp., is a commercial log analysis tool for Windows NT and Solaris 2.4/2.5 platforms. This package uses an underlying database called the Net.Analysis DataStore that purportedly allows sophisticated analysis of the log information beyond what is possible with traditional freeware tools. The "Reporter module" for Windows 95/NT includes over 100 report formats and a set of filters that allows side-by-side comparisons of multiple sites or of different parts of the same site. Reports can be exported automatically to a variety of formats including HTML, Microsoft Word, and Microsoft Excel. The NT version of this software is priced at approximately $US2,495 and the Reporter module is around $US495.
PressVu for Windows NT/95 is a commercial package that imports logs in EMWACS HTTP format as well as Microsoft Standard Log Format and NCSA Common Log Format into an xBase database. An integrated xBase-compatible expression language allows highly flexible queries not possible with other log analysis packages. PressVu also includes a File Browser feature that can be used to automatically access the InterNIC WHOIS domain name registration database, and retrieve details on a particular record, including names and addresses of the companies and individuals to whom the domain is registered, and administrator e-mail addresses as well as technical and billing contacts. This product is priced at approximately $US55.
RefStats, by Benjamin Franz, is a perl program that analyses NCSA 1.4-format referer_log files and produces a list of referring URLs, along with a count of the number of times each referring URL was reported. This is useful for tracking bad links to your site. The sister package BrowserCounter monitors the agent_log file, and produces a report summarizing the types of browsers that are accessing the server.
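The core of a referer report is a one-pass tally. The sketch below assumes NCSA-style referer_log lines of the form `referring-URL -> local-document` (this layout matches the NCSA logs RefStats targets, but the code is illustrative, not RefStats itself, which is written in perl):

```python
from collections import Counter

def count_referers(lines):
    """Tally referring URLs from an NCSA-style referer_log, where each
    line reads 'referring-URL -> local-document'.  Returns (url, count)
    pairs sorted with the most frequent referrer first."""
    counts = Counter()
    for line in lines:
        referer, sep, _target = line.partition(" -> ")
        if sep:  # skip malformed lines
            counts[referer.strip()] += 1
    return counts.most_common()
```

Sorting by count puts the most important external links--and the most frequently followed broken ones--at the top of the report.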
WebTrends is a popular, commercial Windows 95/NT-based package for tracking server access. It is compatible with a wide variety of log file formats, and generates elegantly illustrated and organized reports formatted in HTML, Microsoft Word, Excel, and plain or comma-delimited text. Log data are stored in a database and analyzed in real time, reducing the delay associated with importing large log files and re-analyzing blocks of log data. WebTrends is also adept at handling high-traffic sites. Its Log Analysis cartridge can process log files as large as 10 GB at a rate of up to 30 MB per minute and handle sites receiving up to 40 million hits per day. WebTrends Professional Suite and Enterprise Suite offer additional features and functionality including proxy server and link analysis as well as the ability to export data from WebTrends' own FastTrends database into Oracle 7/8, Microsoft SQL, Sybase, Informix, and other ODBC-compliant formats, for further analysis. The base package retails at $US299; the Professional and Enterprise versions are priced at about $US499 and $US1499 respectively.
Wusage, by Thomas Boutell, is a C program that generates daily, weekly, or monthly usage reports in the form of HTML documents that include inline image graphs displaying server usage and the distribution of accesses by continent. A particularly nice feature is the ability to exclude irrelevant document retrievals (of inline images, from local machines, etc.) from the analysis. Wusage is designed to read log files in the common log format, the EMWAC log file format, and Microsoft's IIS log file format. Compiled versions are available for most UNIX platforms, and for Windows 95/NT. Evaluation copies expire after 50 days, at which time a single-domain, single-server copy is priced at $US75.
Wwwstat, by Roy Fielding, is a perl program that can read common log file formats (NCSA Version 1.2 or newer) and produce a log summary file as an HTML document suitable for publishing on your server. The package is remarkably simple to use, and has most of the required analysis features. The package gwstat (see listing in this section) can convert wwwstat output into graphical data.
Sometimes the output of wwwstat is a bit overpowering. Robin Thau has developed a perl 5 program, called metasummary, that produces a summary of wwwstat output. This useful tool is available at:
http://www.ai.mit.edu/tools/usum/usum.html
On the World Wide Web, robots, wanderers, and spiders are essentially synonyms, and indicate programs that automatically traverse the Web, successively retrieving HTML documents and then accessing the links contained within those documents. These are usually autonomous programs, in that they access links without human intervention. There are many uses for such programs, ranging from web mapping and link verifying programs (such as MOMspider), to programs that retrieve Web documents and generate searchable Web indexes (such as ALIweb, Harvest, and Lycos). Indeed, if it were not for robots, it would now be almost impossible to find anything on the Web, given the millions upon millions of resources that are now available.
However, sometimes a Web site administrator wants to keep robots away from certain collections of documents, perhaps because the documents are only temporary, and so should not be indexed, or perhaps because the documents are internal resources that should not be indexed outside of the site. Alternatively, a server may be heavily loaded with users, in which case you don't want your service to human customers slowed by a bunch of eager little robots, happily grabbing all your documents as fast as they can.
The Robot Exclusion Protocol
Martijn Koster developed a convention that lets a Web server administrator tell robots whether or not they are welcome to access the server and, if they are welcome, which files and directories they should avoid. This information is stored in a file named robots.txt, which must always be at the URL
http://domain.name.edu/robots.txt
where domain.name.edu is the site's domain name. Robots complying with the robot exclusion standard check this file, and use the contents to determine what they can access. An example robots.txt file is:

User-Agent: *              # Applies to all robots
Disallow: /localweb/docs/  # local web documents -- do not index
Disallow: /tmp/            # Temporary Files -- do not index
which tells all Web robots that they should avoid the indicated directories.
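A complying robot fetches and honors this file before retrieving any other resource from the site. As an illustrative sketch (not part of any tool described in this chapter), modern Python's standard urllib.robotparser module can parse the example file above and answer, for a given user agent, whether a URL may be retrieved:

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from the text, as a list of lines.
robots_txt = [
    "User-Agent: *              # Applies to all robots",
    "Disallow: /localweb/docs/  # Local Web documents -- do not index",
    "Disallow: /tmp/            # Temporary files -- do not index",
]

parser = RobotFileParser()
parser.parse(robots_txt)

# A complying robot checks every URL against the rules before fetching it.
print(parser.can_fetch("AnyBot", "http://domain.name.edu/tmp/scratch.html"))  # prints False
print(parser.can_fetch("AnyBot", "http://domain.name.edu/welcome.html"))      # prints True
```

In normal use a robot would call parser.set_url("http://domain.name.edu/robots.txt") and parser.read() to fetch the live file, then consult can_fetch() before each retrieval.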
Important Reference Sites
One common demand is for a Web index--after all, it is one thing to know that there is useful information out there, and quite another to actually find it. A growing number of tools have been developed to address this need. These tools provide mechanisms (most are robots, in the sense described in the previous section) for collecting and indexing large numbers of documents, and for making these indexes accessible over the Web. In some cases, these tools have been used to provide global Web indexes (such as the Lycos search engine, at www.lycos.com), but they are often also appropriate for indexing local, private collections. For example, Web indexers can index all servers at a particular company, to provide a local, searchable index of corporate Web resources.
The following is a brief review of some of these tools, but you are referred to the sites themselves for details. The references at the beginning of this section provide up-to-date information on search engines, as well as useful tips and advice.
ALIWeb is a Web indexing tool that only indexes sites that wish to be indexed. If a site wishes to be indexed, the administrator must contact the ALIWeb server, and register his or her site with the ALIWeb system using a FORM interface. In addition, the site administrator must construct a specially formatted index file for his or her site, and place this in a location accessible to ALIWeb. The ALIWeb data-collection robot then automatically visits the site and retrieves the index file, which is used to generate the ALIWeb index.
The AltaVista database system, a product of the Digital Equipment Corporation, has become one of the most popular search tools on the Internet, as the database is both fast and flexible. The AltaVista software is also available for commercial use, to index private or local webs, or to index material on personal workstations.
Excite is a combination database and Web indexing system that is reputedly both fast and accurate. An important component of Excite is its concept-based technology, which supposedly allows searches on concepts, as opposed to keywords. Excite also provides Web server software that lets a Web server administrator index server content and provide database access to this content. Information about this product (EWS-Excite for Webserver) is found at the second of the two URLs listed here.
An idealab! product, Go2 is the latest step in the evolution of one of the original Internet search engines--the World Wide Web Worm--developed by Oliver McBryan of the University of Colorado at Boulder.
The product of the Internet Research Task Force Research Group on Resource Discovery (IRTF-RD), based at the University of Colorado at Boulder, Harvest is a collection of tools for gathering, organizing, and indexing resources, combined with utilities that allow this information to be replicated and distributed across a number of different sites. Funding for the project was discontinued in August 1996, but links to information on ongoing volunteer-driven research and a number of commercial derivatives spawned by the original Harvest project can be found at the above URL. Current discussion on Harvest can be followed in the Harvest newsgroup, comp.infosystems.harvest.
Infoseek provides one of the large searchable Web indexes, available at the URL listed. Infoseek also offers Infoseek Desktop, a free toolbar for desktop Web searches on Windows 95 or Macintosh machines.
Co-founded by University of California at Berkeley researchers Eric Brewer and Paul Gauthier, Inktomi Corp. is a leader in high-performance scalable network applications, particularly those for managing and searching large-scale text databases. Their flagship products include Inktomi Search (the database used by the HotBot search engine), Inktomi Traffic Server, and Inktomi SmartCrawl.
The Lycos indexing system, now a commercial product of the Computer Science faculty of Carnegie-Mellon University, utilizes a Web robot that wanders the Web and retrieves documents, which are subsequently indexed by the Lycos search engine. The search engine provides a number of ways of searching through the database, and has become one of the most widely used global search tools on the Web.
OpenText Corp. is a provider of a wide variety of database and Web tools, including the famous OpenText index, which is one of the more complete indexes of the World Wide Web. OpenText provides full-text indexing database tools for indexing and managing Web sites, both as stand-alone packages, and as components of their LiveLink intranet suite.
Intranet is probably the most overblown word of the late 1990s. The word merely indicates the use of Internet- and Web-based technologies to support internal business operations, but this simple concept has grown into an efficiency mantra that defies rational thought. In a sense, the phrase intranet software can imply just about anything--from the simplest HTML editor to the most sophisticated database management system.
Intranet suites reflect efforts by software developers to create a collection of generic Internet tools that can be adapted to generic business needs. The components often include document management, access control, indexing and search capabilities, workflow monitoring, messaging (mail or other), and groupware. Of course, not all packages support all these features, while at the same time not all these features are needed--or wanted--in all workplaces. Nevertheless, they form the core of most intranet suites.
If you are looking for intranet tools, you should first research the issues involved--the URLs at the beginning of this section provide useful links to (largely) unbiased discussions. Table 12.1 lists some of the companies currently selling intranet suites, with URLs corresponding to their products. This list is of course incomplete, so you should supplement it using the references listed at the beginning of this section.
Table 12.1 Companies Providing Intranet Software Suites
Fulcrum Technologies    http://www.fulcrum.com
Hummingbird             http://www.hummingbird.com
Lotus Notes             http://www.lotus.com
Mustang Wildcat!        http://www.mustang.com
Netscape Collabra       http://www.netscape.com
OpenText LiveLink       http://www.opentext.com/livelink
Radnet Webshare         http://www.radnet.com/webshare/main_webshare.html
Speedware Dallas        http://dallas.speedware.com
Of course, there are many other resources Web developers need to know about, from browsers and servers to HTML editors, editing systems, document format translators, and converters. Refer to the following URL references, which provide information on these rapidly changing topics.
Web Browsers
http://www.browserwatch.com (List of all browsers)
http://www.webtrends.com/ (Browser usage surveys)

HTTP Servers
http://serverwatch.internet.com/ (List of all servers)
http://webcompare.internet.com/ (Server comparisons)
http://www.webtrends.com/ (Browser and server usage surveys)
http://www.netcraft.com/survey/ (Netcraft Web server usage survey)

HTML Editors and Editing Systems
http://www.utoronto.ca/webdocs/HTMLdocs/

HTML Translators and Converters
http://www.yahoo.com/Computers_and_Internet/Software/Internet/World_Wide_Web/HTML_Converters/
ftp://src.doc.ic.ac.uk/computing/information-systems/www/tools/translators/ (Program source code)
|The HTML Sourcebook -- Fourth Edition||© 1995-1998 by Ian S. Graham|