Chapter 11 -- Indexing a Web Site
by Rod Clark
CONTENTS
A Brief Introduction to Searching
Simple Searches
Concept-Based Searching
Search Functions as Part of Your Site
Search Links for Fast-Changing Subjects
Presenting Search Results in Context
Adding Keywords to Files
Searching a Single File, Line by Line
Simple Search Engines for Smaller Sites
ICE
SWISH, the Simple Web Indexing System for Humans
Hukilau 2
GLIMPSE, Briefly
An Overview of Search Engines for Business Sites
Dedicated Search Engines
Built-in Search Tools in Web Servers
Finding information tucked away in complicated, unfamiliar Web
sites takes time. Often enough, users want to correlate the information
in ways that the authors and the menu builders never envisioned.
Especially at large sites, no matter how good the navigation is,
finding all the files that mention a topic not listed separately
on the menus can be difficult and uncertain. After a few failed
attempts to find something at a new site, most users give up and
move on.
Even the best Web site with a good menu system can present a faster,
friendlier interface to its information by offering a supplementary
search tool. The good news is that more and more good search tools
are becoming available.
In this chapter, you'll learn about the following:
Choosing among the many search tools available
Installing and using some non-commercial search engines
Commercial search engines for your Web site
Adding keywords to files for more productive searching
CGI programming examples
Some search-related sites that you can examine on the Web
A Brief Introduction to Searching
Today's search tools have much to offer compared to the tools
of a few years ago. Many search techniques remain the same, but
there have been some new developments. One active area in search
engine development is concept-based searching.
Some newer search tools can cross-check many different words that
people tend to associate together, either by consulting thesauri
while carrying out their search operations, or by analyzing patterns
in the files in which the query terms appear and then looking
for similar documents. Some use a combination of both techniques.
The following sections discuss a few things to keep in mind when
considering search tools for your site.
Simple Searches
A review of some terminology and of common search functions may
help you better choose among the search tools available. You'll
also need to be familiar with what follows before you dive into
the source code for the Hukilau 2 Search Engine later in the chapter.
AND, OR, and Exact-Phrase Searches
Most search engines let you conduct searches in more than one
way. Some common options include AND, OR, and
exact-phrase searches. Each of these has its place, and it's hard
to get useful results in every situation if you can use only one
of them. Several search engines also allow you to use more complex
syntax and specify other operators, such as NOT and NEAR.
In general, to narrow a search in a broad subject area, you can
AND several search terms together. You might also search
for whole words instead of substrings. To narrow things even more,
you can search for an exact phrase and specify a case-sensitive
search.
To broaden the scope of a search, you can OR several
search terms together, use a substring search instead of a whole-word
search, and specify case-insensitive searching.
OR is the default for some popular search tools; AND
is the default for others. Because the results of an OR
search are much different from those of an AND search,
which you prefer depends on what you're trying to find.
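As a rough illustration of the difference, the following Python sketch
applies the same two terms first as an AND query and then as an OR
query, using case-insensitive substring matching. It isn't taken from
any of the search tools discussed in this chapter; the function name
matches and the sample documents are invented for the example.

```python
def matches(text, terms, mode="AND"):
    """Return True if text satisfies the query terms under the given mode."""
    haystack = text.lower()
    hits = [term.lower() in haystack for term in terms]
    # AND requires every term to be present; OR requires at least one.
    return all(hits) if mode == "AND" else any(hits)


# Hypothetical sample documents, keyed by file name.
docs = {
    "a.html": "Three collies romped in an open field.",
    "b.html": "The infield fly rule explained.",
    "c.html": "Romper room schedules for toddlers.",
}

terms = ["romp", "field"]
and_hits = sorted(name for name, text in docs.items() if matches(text, terms, "AND"))
or_hits = sorted(name for name, text in docs.items() if matches(text, terms, "OR"))
print(and_hits)  # only the file that contains both terms
print(or_hits)   # every file that contains either term
```

The AND query returns one file, and the OR query returns all three,
which is why the default setting matters so much to the user.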
If you consistently prefer to use a search method other than the
default for a given tool, and it runs as a CGI program on another
site, you can generally make a local copy of its search form and
edit its settings to whatever you like.
NOTE
Here's an example of this approach that sets consistent AND defaults for a number of Net search services. You can individually download these drop-in search forms and include them in other HTML pages.
http://www.aa.net/~rclark/search.html
Some search tools let you search for an exact phrase. For example,
the web-grep.cgi UNIX shell script in the "Searching a Single
File, Line by Line" section later in this chapter searches
only for exact phrases. But with it, you can type an exact phrase
(or word fragment) that also happens to be a substring of some
other words or phrases. The script then finds everything that
matches, whether or not the match is a separate whole word. But
this still isn't as flexible as many users would like.
Substring and Whole-Word Searches
Suppose that a friend mentions a reference to "dogs romping
in a field." It could be that what he actually saw, months
ago, was the phrase "while three collies merrily romped in
an open field." In a very literal search system, searching
for "dogs romping" could turn up nothing at all. The word "dogs"
doesn't match "collies," and "romping" doesn't match "romped."
But the query "romp field" might yield the exact reference,
if the same very literal tool searches for substrings.
Whole words start and stop at word boundaries. A word
boundary is a space, tab, period, comma, colon, semicolon,
hyphen, exclamation point, question mark, quotation mark, apostrophe,
line feed, carriage return, parenthesis, or other such word-beginning
or word-ending character.
Now let's say that you've searched for romp field and found
hundreds of references to romper rooms, left fielders, the infield
fly rule, and, of course, the three romping collies. To narrow
these search results further and gather the references to the
article about romping collies into a shorter search results list,
you could run an AND search for the whole words three
collies romped.
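The distinction between substring and whole-word matching can be
sketched in a few lines of Python. This is only an illustration; the
function name find_term is invented, and the regular expression
word-boundary marker \b is used here as an approximation of the word
boundaries listed above.

```python
import re


def find_term(text, term, whole_word=False):
    """Case-insensitive search for term, as a substring or as a whole word."""
    pattern = re.escape(term)
    if whole_word:
        # \b matches between a word character and a non-word character,
        # so spaces, punctuation, and line ends all count as boundaries.
        pattern = r"\b" + pattern + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


line = "while three collies merrily romped in an open field"
print(find_term(line, "romp"))                     # substring search
print(find_term(line, "romp", whole_word=True))    # whole-word search
print(find_term(line, "romped", whole_word=True))  # whole-word search
```

The substring search for romp finds the line, the whole-word search
for romp doesn't, and the whole-word search for romped does.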
Relevance Ranking
Many search engines rank search results from the most relevant
to the least relevant. No one agrees on the best relevance ranking
scheme. Some engines simply rank the results by how many instances
of the search keywords each file contains. The file with the most
keywords is listed first on the search results page.
Other search tools weight keywords found in headings and in other
emphasized text more than keywords found in plain text. Some programs
take into consideration the ratio of keywords to total text in
the file, and also weight the overall file size. All of these
methods are appropriate to consider when programming relatively
simple CGI search scripts.
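Here's a minimal sketch of one such scheme in Python. It counts
keyword occurrences, gives extra weight to keywords in the heading,
and divides by the length of the file. The weight of 3, the function
names, and the sample documents are all assumptions chosen for the
example, not values used by any particular search engine.

```python
def score(doc, keywords, heading_weight=3):
    """Score a document: body hits count once, heading hits count extra."""
    body_words = doc["body"].lower().split()
    heading_words = doc["heading"].lower().split()
    total = 0.0
    for kw in keywords:
        kw = kw.lower()
        total += body_words.count(kw)
        total += heading_weight * heading_words.count(kw)
    # Dividing by length favors short, focused files over long ones
    # that mention the keyword only in passing.
    length = len(body_words) + len(heading_words)
    return total / length if length else 0.0


def rank(docs, keywords):
    """Return document names ordered from most to least relevant."""
    return sorted(docs, key=lambda name: score(docs[name], keywords), reverse=True)


# Hypothetical documents, already split into heading and body text.
docs = {
    "pets.html": {"heading": "Pets", "body": "cats dogs collies and birds are pets"},
    "collies.html": {"heading": "Collies", "body": "collies love open fields"},
}
print(rank(docs, ["collies"]))
```

Both files mention collies once in the body, but collies.html ranks
first because the word also appears in its heading and the file is
shorter.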
Searching Stored Indexes
Search engines rarely search through the actual document files
on a Web site each time you submit a query. Instead, for the sake
of efficiency, they search separate index files that contain stored
information about the documents. Building index files is a slow
process. But once built, the index files' special format lets
the engine search them very fast. Sometimes the index files can
take up as much space on the server's disk drives as the original
document files.
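One common way to organize such an index is as an inverted index,
which maps each word to the files that contain it. The Python sketch
below shows the idea with an in-memory dictionary. Real engines use
their own on-disk formats, and the function names and sample
documents here are invented for illustration.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, terms):
    """AND search: return the documents that contain every term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()


# Hypothetical documents, keyed by file name.
docs = {
    "a.html": "Three collies romped in an open field.",
    "b.html": "Collies are herding dogs.",
    "c.html": "The infield fly rule.",
}
index = build_index(docs)  # the slow step, done once per indexing pass
print(sorted(search(index, ["collies", "field"])))  # fast lookups afterward
print(sorted(search(index, ["collies"])))
```

Because this index stores whole words, a query for field doesn't match
infield. Supporting substring searches would take a different index
design.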
The index files contain a snapshot of the contents of the document
files that was current whenever the search engine last ran an
indexing pass on the site. That might have been a few hours ago,
yesterday, or last week. Often, a search engine's indexing process
runs as an automatically scheduled job in the dead of night, when
it won't slow down more important activities. Sometimes you can
find out when the indexes were last updated at a site, and sometimes
you can't.
Some large, complex search engines continuously update their indexes,
incrementally. This doesn't mean that all the index entries are
always up to the minute. Some of the entries are very
current, and the rest range in age depending on how long it takes
the indexing software to traverse the entire document library.
There are many different formats for index files, and comparatively
few interchangeable standards. Some of the more complex search
engines can read several types of index files that were originally
generated by different kinds of indexing software, such as Adobe
PDF "catalogs."
Concept-Based Searching
Conventional query syntax follows some precise rules, even for
simple queries, as you saw in the preceding section. But as you
also saw, people don't usually think overtly in terms of putting
Boolean operators together to form queries.
Concept-based search tools can find related information even in
files that don't contain any of the words that a user specifies
in a search query. Such tools are particularly helpful for large
collections of existing documents that were never designed to
be searched.
Thesauri
One way to broaden the reach of a search is to use a thesaurus,
a separate file that links large numbers of words with lists of
their common equivalents. Some newer thesauri automatically add
and correlate all the new words that occur in the documents they
read as they go along. A thesaurus can be a help, especially to
users who aren't familiar with a specialized terminology. But
manually maintaining a large thesaurus is as difficult as maintaining
any other large reference work. That's why some new search engines'
self-maintaining thesauri statistically track the most common
cross-references for each word, so that the top few can be automatically
added to a user's query.
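The query-expansion step can be sketched as follows. This Python
example assumes a tiny hand-built thesaurus held in a dictionary; the
entries, the function name expand_query, and the limit parameter
(which stands in for "the top few" cross-references) are all invented
for the example.

```python
# Hypothetical thesaurus: each word maps to its most common equivalents,
# listed from most to least common.
THESAURUS = {
    "dog": ["collie", "hound", "puppy"],
    "romania": ["rumania"],
}


def expand_query(terms, thesaurus, limit=2):
    """Add up to `limit` equivalents for each query term, without duplicates."""
    expanded = []
    for term in terms:
        key = term.lower()
        for word in [key] + thesaurus.get(key, [])[:limit]:
            if word not in expanded:
                expanded.append(word)
    return expanded


print(expand_query(["Romania", "dog"], THESAURUS))
```

The expanded list would then be submitted as an OR query, so that a
file mentioning only Rumania or collie is still found.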
Stemming
Some, but not all, search engines offer stemming. Stemming
is trimming a word to its root, and then looking for other
words that match the same root. For example, wallpapering has
as its root the word wall. So does wallboard, which
the user might never have entered as a separate query. A stemmed
search might serve up unwanted additional references to wallflower,
wallbanger, wally, and walled city, but catching
the otherwise missed references to wallboard could be worth
wading through the extra noise.
Stemming has at least two advantages over plain substring searching.
First, it doesn't require the user to mentally determine and then
manually enter the root words. Second, it allows assigning higher
relevance scores to results that exactly match the entered query
and lower relevance scores to the other stemmed variants.
But stemming is language-specific, too. Human languages are complex,
and a search program can't simply trim English suffixes from words
in another language.
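To make the idea concrete, here's a deliberately crude Python sketch
that strips a few English suffixes and then scores exact matches
higher than stemmed matches. It is not a production stemming
algorithm, and it trims less aggressively than the wallpapering
example above (it would reduce wallpapering only to wallpaper). The
suffix list, the minimum root length, and the scores 2 and 1 are
assumptions made for the example.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # English only, checked in this order


def stem(word, min_root=3):
    """Trim one common English suffix, keeping at least min_root letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        root = word[: -len(suffix)]
        if word.endswith(suffix) and len(root) >= min_root:
            return root
    return word


def match_score(query, candidate):
    """Score 2 for an exact match, 1 for a stemmed match, 0 otherwise."""
    if query.lower() == candidate.lower():
        return 2
    if stem(query) == stem(candidate):
        return 1
    return 0


print(stem("romping"), stem("romped"), stem("romps"))
print(match_score("romping", "romping"))
print(match_score("romping", "romped"))
print(match_score("romping", "field"))
```

Even this small example shows the language dependence: the suffix
list means nothing outside English, and it already mishandles some
English words (collies becomes colli, which doesn't match collie).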
Finding Similar Documents
Several newer search engines concentrate on some more general
non-language-based techniques. One such technique is pattern
matching, used to find similar files. For example, given a
file about marmosets, a concept-based search engine might return
references to some other files about tamarins, even though those
files don't contain the word marmoset. But many other aspects
of the marmoset files and the tamarin files would be very similar.
(They're both South American monkeys.)
Thesauri can help provide this kind of capability, to an extent.
But some new tools can analyze a file even if it's in an unknown
language or in a new file format, and then find similar files
by searching for similar patterns in the files, no matter what
those patterns actually are. The patterns in the files might be
Swahili words, graphics with Arabic characters, or CAD symbols
for freeway interchanges, for all the search program knows.
Building specific language rules into a search engine is difficult.
What happens when the program encounters documents in a language
it hasn't seen before, for which the programmers haven't included
any language rules? There are people who have spent their whole
adult lives formally recording the rules for using English and
other languages, and they still aren't finished. We hardly think
of those rules, because we've learned (or accumulated) them in
our everyday human way, by drawing conclusions from comparing and
summing up a great many unconscious, unarticulated pattern-matching
events.
Even if you don't know or can't explain the rules for constructing
the patterns you see, whether those patterns are in human language,
graphics, or binary code, you can still rank them for similarity.
Yes, this one matches. No, that one doesn't. This one is very
similar, but not exact. This one matches a little. This one is
more exact than that one. This is the approach that some of
the newer search engines take to analyzing files for content similarity.
They look for patterns, nearness, and other such qualities, and
use fuzzy logic and a variety of weighting schemes.
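One simple way to get a graded "how similar is it" score without any
language rules is to compare how often short character sequences
occur in two texts. The Python sketch below uses three-character
sequences and cosine similarity. This is a stand-in chosen for
brevity, not the method used by any particular search engine and not
fuzzy logic as such; the function names and sample texts are
invented.

```python
import math
from collections import Counter


def trigrams(text):
    """Count overlapping three-character sequences; no language rules needed."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def similarity(a, b):
    """Cosine similarity between two trigram profiles, from 0.0 to 1.0."""
    pa, pb = trigrams(a), trigrams(b)
    dot = sum(count * pb[gram] for gram, count in pa.items())
    norm_a = math.sqrt(sum(c * c for c in pa.values()))
    norm_b = math.sqrt(sum(c * c for c in pb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical sample texts.
query = "a file about marmosets, small South American monkeys"
candidates = {
    "tamarins.html": "tamarins are small South American monkeys",
    "install.html": "how to install a CGI search script on a UNIX server",
}
ranked = sorted(candidates, key=lambda n: similarity(query, candidates[n]), reverse=True)
print(ranked)
```

The tamarin file ranks first even though it never mentions marmosets,
because the two texts share many of the same character patterns.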
NOTE
An active Usenet newsgroup, comp.ai.fuzzy, is devoted to explaining fuzzy logic. You can read what the experts have to say there to find out much more about this rapidly evolving area.
Search Functions as Part of Your Site
As businesses integrate their Web sites more into their everyday
activities, they're adding more and more Web-accessible documents.
At a busy site, it may be hard to keep up with the latest additions,
even from hour to hour. Search functions can supplement ordinary
links, to help users more easily sort out the flood of information.
TIP
If you offer a search capability at your site, you should consider making it easily accessible from any page. I've been to a few sites where it took a wild-goose chase to get back to the special page with the link to the search tool, among the welter of other pages on the site.
Search Links for Fast-Changing Subjects
In rapidly changing subject areas, it makes sense to link specific
documents to menu pages but to avoid or minimize links from within
documents to other specific documents, especially to inherently
dated ones. Such a design, which minimizes document-to-document
cross links and instead emphasizes links to menus and to a search
function, can help users find the most recent material, even from
pages that were built weeks or months ago. It also makes page
maintenance easier for the administrators who maintain the site.
To provide users with a search function that's tailored to a given
subject, you can use a hidden form that sends your search engine
a preset query about the subject.
The hidden form fits easily into a page design because its only
visible element is a submit button. To avoid confusion, you can
describe the search's special purpose in the button text, rather
than use the default Submit button text or a generic word such
as Search.
The first example, shown in figure 11.1, shows a button that's
part of a hidden search form. The form's hidden text field is
preloaded with the query keywords that you'd use to stamp all
new files on the related subject.
Figure 11.1 : This hidden form displays a search button that starts up a search engine, which produces a list of related documents.
Here's the HTML code for the hidden search form in figure 11.1:
<FORM METHOD="POST" ACTION="http://www.substitute_your.com/cgi-bin/hukilau.cgi">
<INPUT TYPE="HIDDEN" NAME="Command" VALUE="search">
<INPUT TYPE="HIDDEN" NAME="SearchText" value="Project-X">
<INPUT TYPE="SUBMIT" VALUE=" Project-X "><br>
</FORM>
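It may help to see what the form actually sends when the user clicks
the button. The following Python sketch uses the standard urllib.parse
module to build the URL-encoded request body for the two fields in
the form above, and then to decode it again as a CGI program would
need to. It illustrates the encoding only and isn't part of the
Hukilau script.

```python
from urllib.parse import parse_qs, urlencode

# What a browser would send in the POST body for the hidden form above.
body = urlencode({"Command": "search", "SearchText": "Project-X"})
print(body)

# What a CGI program might do to recover the fields on the server side.
fields = parse_qs(body)
query = fields["SearchText"][0]
print(query)
```

The hidden fields travel exactly as visible ones do; the only
difference is that the user never sees or edits them.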
The next example, shown in figure 11.2, shows the same drop-in
search form, but with a visible single-line text input box that's
preloaded with the same search keywords as in the first example.
The difference is that this form lets the user type some added
words, if needed, to narrow the search.
Figure 11.2 : This is the same form, but with a visible input box preloaded with a query keyword.
Here's the HTML code for the compact search form in figure 11.2
that includes a visible text input box:
<FORM METHOD="POST" ACTION="http://www.substitute_your.com/cgi-bin/hukilau.cgi">
<INPUT TYPE="HIDDEN" NAME="Command" VALUE="search">
<INPUT TYPE="SUBMIT" VALUE=" Project-X ">
<INPUT TYPE="TEXT" NAME="SearchText" SIZE="36" value="Project-X"><BR>
</FORM>
The next example shows the same drop-in form as before (see fig.
11.3). The only change is that here, an image is used as a button.
Figure 11.3 : This compact search form uses an image for a button.
The HTML code for the search form in figure 11.3 displays a visible
input box and uses an image instead of a text submit button:
<FORM METHOD="POST" ACTION="http://www.substitute_your.com/cgi-bin/hukilau.cgi">
<INPUT TYPE="HIDDEN" NAME="Command" VALUE="search">
<INPUT TYPE="IMAGE" SRC="http://www.aa.net/~rclark/button.gif" alt=" Project-X "
ALIGN="bottom" border="0"><b> Latest Project-X Reports</b><br>
<INPUT TYPE="TEXT" NAME="SearchText" SIZE="36" value="Project-X"><BR>
</FORM>
These forms call the Hukilau 2 search script, which is described
in the "Hukilau 2" section later in this chapter. This
search script doesn't use a stored index. Instead, it searches
through the HTML files in a specific directory (but not its subdirectories)
in real time. Although that's a slow way to search, sometimes
it can be useful because it always returns absolutely current
results.
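The general approach of searching files directly, without an index,
can be sketched in Python as follows. This is not the Hukilau 2
source code; it's a minimal illustration, and the function name
search_directory is invented. It reads every HTML file in one
directory on each call and ignores subdirectories.

```python
import os


def search_directory(directory, query):
    """Scan every .html file in one directory (not its subdirectories)
    and return the names of the files that contain the query text."""
    query = query.lower()
    found = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not (name.lower().endswith(".html") and os.path.isfile(path)):
            continue
        # Reading each file on every request is slow, but always current.
        with open(path, encoding="utf-8", errors="replace") as handle:
            if query in handle.read().lower():
                found.append(name)
    return found
```

A call such as search_directory("/home/user/public_html", "Project-X")
(the path is only a placeholder) would return the matching file names
as of that moment.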
A search script such as this one is a good tool to use when it's
okay to use the computer's resources inefficiently, to find the
very latest information. Although this kind of script lets you
see up-to-the-second file changes, site administrators might not
want too many users continually running it, because it exercises
the disk drives and otherwise consumes resources. Of course, you
can always use the same kind of hidden form to call a more efficient
search engine that uses a stored index.
Time Daily's Latest News page is a good example of embedding search
forms in a page. Each search button on the Time page brings up
a list of whatever articles are available in the archives about
the related subject, as of the moment you perform the search.
To view the Time Daily page, use the following URL:
http://pathfinder.com/time/daily/time/1995/latest.html
Presenting Search Results in Context
When searching for something, users often have in mind no more
than a few scattered and fragmentary details of what they want
to find. Offering only page titles sometimes isn't enough for
the user to make a good decision.
Showing context abstracts from the files reduces the number of
trial-and-error attempts that users make when choosing from the
search results list. Displaying large enough abstracts so that
the user makes the right choice the first time instead of the
second or third time is an important usability consideration.
In the interest of efficiency, programmers are often tempted to
display abstracts that are too small to spare users this kind of
trial-and-error file viewing.
Some search engines let the user choose the size of the context
abstracts, along with other search conditions, by using a drop-down
menu or radio buttons on the search form. This is a worthwhile
option to include, if the CGI program supports it.
Context abstracts taken from the text surrounding the user's search
keywords are often more useful than fixed abstracts taken from
the first few lines of a file. Not every search engine can produce
keyword-specific abstracts.
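The difference between the two kinds of abstract is easy to see in a
short Python sketch. The function below returns the text surrounding
the first occurrence of the keyword, and falls back to a fixed
abstract from the start of the text when the keyword doesn't occur.
The function name, the width parameter, and the sample text are
assumptions made for the example.

```python
def context_abstract(text, keyword, width=40):
    """Return the text surrounding the first match, or a fixed abstract
    from the start of the text when the keyword does not occur."""
    flat = " ".join(text.split())  # collapse line breaks and extra spaces
    pos = flat.lower().find(keyword.lower())
    if pos == -1:
        return flat[: 2 * width]
    start = max(0, pos - width)
    end = min(len(flat), pos + len(keyword) + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(flat) else ""
    return prefix + flat[start:end] + suffix


# Hypothetical sample text.
text = ("Welcome to the kennel club newsletter. This month we report on "
        "three collies that romped in an open field near the clubhouse. "
        "Membership dues are payable in January.")
abstract = context_abstract(text, "collies", width=20)
print(abstract)
```

Letting the user choose the abstract size then amounts to passing a
different width from the search form.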
Some of the simpler freeware search engines don't provide context
abstracts, but do rank files by relevance or report the numbers
of matching words found in each file.
Adding Keywords to Files
Adding keywords to files is particularly important when using
simpler search tools, many of which are very literal. But even
the simplest search scripts can work very well on pages that include
well-chosen keywords.
Keying files by hand is slow and tedious. It isn't of much use
when faced with a blizzard of seldom-read archival documents.
But new documents that you know will be searched online can be
stamped with an appropriate set of keywords when they're first
created. This provides a consistent set of words that users can
use to search for the material in related texts, in case the exact
wording in each text doesn't happen to include some of the relevant
general keywords. It's also helpful to use equivalent non-technical
terminology that's likely to be familiar to new users.
Sophisticated search engines can give good results when searching
documents with little or no intentional keying. But well-keyed
files produce better and more focused results with these search
tools, too. Even the best search engines, when they set out to
catch all the random, scattered unkeyed documents that you want
to find, can't help but return information that's liberally diluted
with info-noise. Adding keywords to your files helps keep them
from being missed in relevance-ranked lists of closely related
topics.
Keywords in Plain Text
To help find HTML pages, you can add an inconspicuous line at
the bottom of each page that lists the keywords for the page,
like this:
Poland Czechoslovakia Czech Republic Slovakia Hungary Romania Rumania
This is useful. But some search engines assign a higher relevance
to words in titles, headings, emphasized text, name=
tags and other areas that stand out from plain text. The next
few sections consider how to key your files in ways other than
by placing extra keywords in the body of the text.
Keywords in HTML <META> Tags
You can put more information than simply the page title in the
<HEAD> section of an HTML page. Specifically, you
can include a standard Keywords list in a <META>
tag in the <HEAD> section.
People sometimes use <META> tags for other non-standard
information. But search engines should ordinarily pay more attention
to the <META> Keywords list. The following
is an example:
<HEAD>
<META HTTP-EQUIV="Keywords" CONTENT="Romania, Rumania">
<TITLE>This is a Page Title</TITLE>
</HEAD>
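To show how an indexing program might read such a list, here's a
small Python sketch based on the standard html.parser module. It
accepts the HTTP-EQUIV form shown above and also the NAME="keywords"
form, which is another common way to write the tag. The class name
KeywordParser is invented for the example.

```python
from html.parser import HTMLParser


class KeywordParser(HTMLParser):
    """Collect the comma-separated Keywords list from <META> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag != "meta":
            return
        attrs = dict(attrs)
        label = (attrs.get("http-equiv") or attrs.get("name") or "").lower()
        if label == "keywords":
            content = attrs.get("content") or ""
            self.keywords += [w.strip() for w in content.split(",") if w.strip()]


page = ('<HEAD><META HTTP-EQUIV="Keywords" CONTENT="Romania, Rumania">'
        '<TITLE>This is a Page Title</TITLE></HEAD>')
parser = KeywordParser()
parser.feed(page)
print(parser.keywords)
```

An indexer could then give these words a higher weight than words
found in the body of the page.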
Keywords in HTML Comments
Many but not all search engines index comments in HTML files.
If yours does, putting "invisible" keywords in comments
is a more flexible way to add keywords than putting them in name=
statements, because comments have fewer syntax restrictions.
The next example shows some lines from an HTML file that lists
links to English-language newspapers. The visible link names on
the individual lines don't always include words that users would
likely choose as search queries. That makes no difference when
finding the entire file. But with a search tool that displays
matches on individual lines in the file, such as web-grep.cgi,
a query has to exactly match something in either a particular
line's URL or in its visible text. That's not too likely with
some of these lines. Only one of them comes up in a search for
Sri Lanka (see fig. 11.4). None come up in a search for South
Asia, which is the section head just above them in the file.
Figure 11.4 : In this unkeyed file, the search doesn't find all
<b><a href="http://www.lanka.net/lakehouse/anclweb/dailynew/
select.html">Sri Lanka Daily News</a></b><br>
<b><a href="http://www.is.lk/is/times/index.html">Sunday Times
</a></b><br>
<b><a href="http://www.is.lk/is/island/index.html">Sunday Island
</a></b><br>
<b><a href="http://www.powertech.no/~jeyaramk/insrep/">Inside Report:
Tamil Eelam News Review</a></b><i> - monthly</i><br>
To improve the search results, you can key each line with one
or more likely keywords. The keywords can be in <!--comments-->,
in name= statements, or in ordinary visible text. Some
of these approaches are more successful than others. The next
three code snippets show examples of each of these ways to add
keywords to individual lines in a file.
This first listing shows how you can add keywords as HTML comments:
<!--South Asia Sri Lanka--><b><a href="http://www.lanka.net/
lakehouse/anclweb/dailynew/select.html">Sri Lanka Daily News</a>
</b><br>
<!--South Asia Sri Lanka--><b><a href="http://www.is.lk/is/
times/index.html">Sunday Times</a></b><br>
<!--South Asia Sri Lanka--><b><a href="http://www.is.lk/is/
island/index.html">Sunday Island</a></b><br>
<!--South Asia Sri Lanka--><b><a href="http://www.powertech.no/
~jeyaramk/insrep/">Inside Report: Tamil Eelam News Review</a>
</b><i> - monthly</i><br>
The next listing shows similar keywords in name= statements.
But HTML doesn't allow spaces in name= statements, which
prevents searching for whole words instead of substrings. You
also can't include multiple identical name= statements
in the same file, to relate items together for searching, because
each name= statement must be unique. So overall, putting
keywords in name= statements isn't the best choice here,
although it might be workable with some search tools.
<b><a name="southasiasrilankadaily" href="http://www.lanka.net/
lakehouse/anclweb/dailynew/select.html">Sri Lanka Daily News</a>
</b><br>
<b><a name="southasiasrilankatimes" href="http://www.is.lk/is/
times/index.html">Sunday Times</a></b><br>
<b><a name="southasiasrilankaisland" href="http://www.is.lk/is/
island/index.html">Sunday Island</a></b><br>
<b><a name="southasiasrilankainside" href="http://www.powertech.no/
~jeyaramk/insrep/">Inside Report: Tamil Eelam News Review</a>
</b><i> - monthly</i><br>
The next listing illustrates some difficulties with adding consistent
search keywords to plain text. Repeating the keywords on several
lines can be awkward in lists like this one. For example, there's
no good way to repeat South Asia on each line here.
<b><a href="http://www.lanka.net/lakehouse/anclweb/dailynew/
select.html">Sri Lanka Daily News</a></b><br>
<b><a href="http://www.is.lk/is/times/index.html">Sri Lanka Sunday
Times</a></b><br>
<b><a href="http://www.is.lk/is/island/index.html">Sri Lanka Sunday
Island</a></b><br>
<b><a href="http://www.powertech.no/~jeyaramk/insrep/">Inside
Report: Tamil Eelam News Review, Sri Lanka </a></b><i> - monthly
</i><br>
The search results from the file with the keywords added in HTML
comments (see fig. 11.5) are more consistent than the search results
from the unkeyed file (refer to fig. 11.4).
Figure 11.5 : With the added keywords, the same search finds all the information that's been keyed together.
Searching a Single File, Line by Line
You can scan a file (which can be an HTML page) and display all
the matches found in it. The web-grep.cgi script, shown in listing
11.1, is a simple tool that you can use to do this. If the file
being searched contains hypertext links that are each written
on one line (rather than spread over several lines), each line
on web-grep's search results page will contain a valid link that
the user can click.
Listing 11.1 web-grep.cgi: UNIX Shell Script Using grep
#! /bin/sh
echo Content-type: text/html
echo
if [ $# = 0 ]
then
echo "<HTML>"
echo "<HEAD>"
echo "<TITLE>Search the News Page</TITLE>"
echo "</HEAD>"
echo "<BODY background=\"http://www.aa.net/~rclark/ivory.gif\">"
echo "<b><a href=\"http://www.aa.net/~rclark/\">Home</a></b><br>"
echo "<b><a href=\"http://www.aa.net/~rclark/news.html\">News
Page</a></b><br>"
echo "<b><a href=\"http://www.aa.net/~rclark/search.html\">Search
the Web</a></b><br>"
echo "<hr>"
echo "<H2>Search the News Page</H2>"
echo "<ISINDEX>"
echo "<p>"
echo "<dl><dt><dd>"
echo "The search program looks for the exact phrase you specify.<br>"
echo "<p>"
echo "You can search for <b>a phrase</b>, a whole <b>word</b> or
<b>sub</b>string.<br>"
echo "UPPER and lower case are equivalent.<br>"
echo "<p>"
echo "This program searches only the news listings page
itself.<BR>"
echo "Matches may be in publication names, URLs or section
headings.<br>"
echo "<p>"
echo "To search the Web in general, use <b>Search the Web</b> in
the menu above.<br>"
echo "<p>"
echo "</dd></dl>"
echo "<hr>"
echo "</BODY>"
echo "</HTML>"
else
echo "<HTML>"
echo "<HEAD>"
echo "<TITLE>Result of Search for \"$*\".</TITLE>"
echo "</HEAD>"
echo "<BODY background=\"http://www.aa.net/~rclark/ivory.gif\">"
echo "<b><a href=\"http://www.aa.net/~rclark/\">Home</a></b><br>"
echo "<hr>"
echo "<H2> Search Results: $*</H2>"
# -e keeps a query that begins with a hyphen from being read as a grep option
grep -i -e "$*" /home/rclark/public_html/news.html
echo "<p>"
echo "<hr>"
echo "<b><a href=\"http://www.aa.net/cgi-bin/rclark/
isindex.cgi\">Return to Searching the News Page</a></b><br>"
echo "</BODY>"
echo "</HTML>"
fi
web-grep is a UNIX shell script that uses the UNIX grep
utility. A script like this, or a version of it in Perl or C or
any other language, is a handy tool if you have Web pages with
long lists of links in them.
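For comparison, the core of the same line-by-line search can be
sketched in Python. Unlike grep, which treats the query as a regular
expression, this version matches the query as plain text. The
function name grep_lines is invented, and the two sample lines are
adapted from the keyed and unkeyed listings earlier in this chapter.

```python
def grep_lines(lines, phrase):
    """Return the lines that contain phrase, ignoring case."""
    phrase = phrase.lower()
    return [line for line in lines if phrase in line.lower()]


lines = [
    '<!--South Asia Sri Lanka--><b><a href="http://www.is.lk/is/times/'
    'index.html">Sunday Times</a></b><br>',
    '<b><a href="http://www.is.lk/is/island/index.html">Sunday Island'
    '</a></b><br>',
]

for line in grep_lines(lines, "sri lanka"):
    print(line)
```

Only the first line is printed, because only that line carries the
keywords in an HTML comment. A search for sunday would print both.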
This script uses the <ISINDEX> tag, because some
browsers still don't support forms. Using an <ISINDEX>
interface instead of a forms interface lets users whose browsers
lack forms capability conduct this particular search.
You can edit the script to include your own menu at the top of
the page and your own return link to the page that the script
searches. If the script doesn't produce the expected results after
you edit it, you can find some debugging help in Chapter 25, "Testing
and Debugging CGI Scripts."
Troubleshooting
When I edit and run this script, I get the message Document contains no data.
Look for syntax errors in the parts you edited. Missing double quotation marks at the ends of the lines can cause this.
Simple Search Engines for Smaller Sites
Most people with Web sites are customers of commercial Internet
providers. Most of those providers, especially the big ones, run
UNIX. The following sections discuss some simple search tools
for personal and small business sites hosted at commercial service
providers.
Business users who have their own Web servers and need more powerful
search tools can skip to the section "An Overview of Search
Engines for Business Sites." The following sections discuss
the ICE, SWISH, Hukilau 2, and GLIMPSE search engines.
ICE
Christian Neuss' ICE search engine is the easiest to install of
the programs mentioned here. ICE produces relevance-ranked results,
and it lists how many search keywords it finds in each file. It's
written in Perl.
There are two scripts. The indexing script, ice-idx.pl, creates
an index file that ICE can later search. The indexing script runs
from the UNIX shell prompt. It builds a plain ASCII index file,
unlike the binary index files that most other search engines use.
The search script, ice-form.pl, is a CGI script that searches
the index built by ice-idx.pl and displays the results on a Web
page.
The user input form for an ICE search includes a check box for
an optional external thesaurus (see fig. 11.6). Christian Neuss
notes that ICE has worked well with small thesauri of a few hundred
technical terms, but that anyone who wants to use a large thesaurus
should contact him for more information.
Figure 11.6 : ICE shows file dates and can search for files that are more recent than a specified number of days ago.
You can find the current version of ICE on the Net at the following
two distribution sites:
http://www.informatik.th-darmstadt.de/~neuss/ice/ice.html
http://ice.cornell-iowa.edu/
Indexing Your Files with ICE
ICE searches the directories that you specify in the script's
configuration section. When ICE indexes a given directory, it
also indexes all its subdirectories (see fig. 11.7).
Figure 11.7 : ICE ranks files by relevance and shows a summary of how many keywords (and longer variants of them) it found.
Five configuration items are at the top of the indexer script.
You'll need to edit three of them, as shown in the following code.
@SEARCHDIRS=(
"/home/user/somedir/subdir/",
"/home/user/thisis/another/",
"/home/user/andyet/more_stuff/"
);
$INDEXFILE="/home/user/somedir/index.idx";
# Minimum length of word to be indexed
$MINLEN=3;
The first directory path in @SEARCHDIRS is the default
that will appear on the search form. You can add more directory
lines in the style of the existing ones; or you can include only
one directory, if you want to limit what others can see of your
files.
NOTE
Remember that ICE automatically indexes and searches all the subdirectories of the directories you specify. You might want to move test, backup, and non-public files to a directory that ICE doesn't search.
After you set the configuration variables, run the script from
the command line to create the index. Whenever you want to update
the index, run the ice-idx.pl script again. It will overwrite
the existing index with the new one.
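The following Python sketch shows the general shape of such an
indexing pass: walk a directory and all its subdirectories, count the
words in each HTML file, skip words shorter than a minimum length,
and write the result to a plain-text file. It is not ICE's code and
does not reproduce ICE's index format. The function names and the
one-line-per-file layout are hypothetical; only the idea of a minimum
word length is borrowed from the configuration shown above.

```python
import os
import re
from collections import Counter

MINLEN = 3  # same idea as ICE's $MINLEN; everything else here is hypothetical


def index_tree(top):
    """Walk top and all of its subdirectories, counting words per HTML file."""
    entries = {}
    for folder, _subdirs, files in os.walk(top):
        for name in sorted(files):
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                text = re.sub(r"<[^>]*>", " ", handle.read())  # drop HTML tags
            words = re.findall(r"[a-z]+", text.lower())
            entries[path] = Counter(w for w in words if len(w) >= MINLEN)
    return entries


def write_index(entries, index_file):
    """Write one plain-text line per file: path, a tab, then word:count pairs."""
    with open(index_file, "w", encoding="utf-8") as out:
        for path in sorted(entries):
            pairs = " ".join(f"{w}:{n}" for w, n in sorted(entries[path].items()))
            out.write(f"{path}\t{pairs}\n")
```

Running the two functions again simply overwrites the old index file
with a new one, which mirrors the update procedure described above.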
Searching from a Web Browser with ICE
The search form presents a choice of directories in a drop-down
selection box (see listing 11.2). You can specify these directories
in the script.
Listing 11.2 ICE Configuration Variables
# Title or name of your server: