trunk/lib/WAIT.pm

#!/usr/bin/perl
#                              -*- Mode: Cperl -*-
# $Basename: WAIT.pm $
# $Revision: 1.6 $
# Author          : Ulrich Pfeifer
# Created On      : Wed Nov  5 16:59:32 1997
# Last Modified By: Ulrich Pfeifer
# Last Modified On: Wed Nov 12 18:26:44 1997
# Language        : CPerl
# Update Count    : 4
# Status          : Unknown, Use with caution!
#
# (C) Copyright 1997, Ulrich Pfeifer, all rights reserved.
#
#

package WAIT;
require DynaLoader;
use vars qw($VERSION @ISA);
@ISA = qw(DynaLoader);

$VERSION = sprintf '%.4f', map $_/10,'$ProjectVersion: 17.1 $ ' =~ /([\d.]+)/;

bootstrap WAIT $VERSION;

__END__

=head1 NAME

WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS

=head1 SYNOPSIS

A Synopsis is not yet available.

=head1 Status of this document

In my spare time I started writing down some information about the
implementation before I forget it. The material is incomplete at
best. Additions and corrections are welcome.

=head1 PURPOSE

As you might know, I developed and maintained B<freeWAIS-sf> (with the
help of many people on the Net). FreeWAIS-sf is based on B<freeWAIS>,
maintained by the Clearinghouse for Networked Information Discovery
and Retrieval (CNIDR), which in turn is based on B<wais-8-b5>,
implemented by Thinking Machines et al. During this long history
(implementation started about 1989) many people contributed to the
distribution and added features not foreseen by the original design.
While the system fulfills its task now, the code has reached a state
where adding new features is nearly impossible, and even fixing
longstanding bugs and removing limitations has become very
time-consuming.

Therefore I decided to pass the maintenance to WSC Inc. and build a
new system from scratch. For obvious reasons I chose Perl as the
implementation language.

=head1 DESCRIPTION

The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users might
want to build. Obviously the framework limits the class of systems
that can be built.

       +------+     +-----+     +------+
   ==> |Access| ==> |Parse| ==> |      |
       +------+     +-----+     |      |
                       ||       |      |     +-----+
                       ||       |Filter| ==> |Index|
                       \/       |      |     +-----+
      +-------+     +-----+     |      |
   <= |Display| <== |Query| <-> |      |
      +-------+     +-----+     +------+

A collection (aka table) is defined by the instances of the B<access>
and B<parse> modules together with the B<filter definitions>. At query
time, a B<query> and a B<display> module must be chosen in addition.

=head2 Access

The access module defines which documents are members of a database.
Usually an access module is a tied hash, whose keys are the IDs of the
documents (did = document id) and whose values are the documents
themselves. The indexing process loops over the keys using C<FIRSTKEY>
and C<NEXTKEY>. Documents are retrieved with C<FETCH>.

By convention access modules should be members of the
C<WAIT::Document> hierarchy. Have a look at the
C<WAIT::Document::Split> module to get the idea.
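The following is a minimal sketch of such a tied hash, not code taken
from the C<WAIT::Document> hierarchy. The package name
C<Example::Document::Memory> and the in-memory storage are assumptions
made for illustration only. It shows how C<FIRSTKEY>, C<NEXTKEY> and
C<FETCH> cooperate when an indexer walks over a collection.

```perl
use strict;
use warnings;

# Hypothetical in-memory access module; the name is illustrative only.
package Example::Document::Memory;

sub TIEHASH {
  my ($class, %docs) = @_;
  return bless { docs => {%docs} }, $class;
}

# Return the document stored under the given document id (did).
sub FETCH {
  my ($self, $did) = @_;
  return $self->{docs}{$did};
}

# Start a new iteration over all document ids.
sub FIRSTKEY {
  my ($self) = @_;
  my $count = keys %{ $self->{docs} };   # calling keys resets the each() iterator
  return scalar each %{ $self->{docs} };
}

# Continue the iteration; returns undef when all ids have been seen.
sub NEXTKEY {
  my ($self) = @_;
  return scalar each %{ $self->{docs} };
}

package main;

tie my %collection, 'Example::Document::Memory',
  1 => "TI: First document\n",
  2 => "TI: Second document\n";

# What an indexing loop would do: visit every did, then fetch the document.
my %seen;
for my $did (keys %collection) {       # keys() drives FIRSTKEY/NEXTKEY
  $seen{$did} = $collection{$did};     # the subscript drives FETCH
}
```

A real access module would typically read the documents from files or
another store instead of keeping them in memory.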


=head2 Parse

The task of the parse module is to split the documents into logical
parts via the C<split> method. For example, C<WAIT::Parse::Nroff>
splits manuals piped through B<nroff>(1) into the sections I<name>,
I<synopsis>, I<options>, I<description>, I<author>, I<example>,
I<bugs>, I<text>, I<see>, and I<environment>. Below are a sample
record in a pretty simple tagged format and the implementation of
C<WAIT::Parse::Base>, which handles such documents:

  AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
  TI: Searching Structured Documents with the Enhanced Retrieval
      Functionality of freeWAIS-sf and SFgate
  ER: D. Kroemker
  BT: Computer Networks and ISDN Systems; Proceedings of the third
      International World-Wide Web Conference
  PN: Elsevier
  PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
  PP: 1027-1036
  PY: 1995

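  sub split {                     # called as method
    my %result;
    my $fld;                      # attribute currently being filled

    for (split /\n/, $_[1]) {     # $_[1] is the document text
      if (s/^(\S+):\s*//) {       # a "XX:" prefix starts a new attribute
        $fld = lc $1;
      }
      # lines without a prefix continue the current attribute
      $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
  }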
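
As a usage sketch, the C<split> method shown above can be tried on a
short record. The package name C<Example::Parse::Tagged> and the
constructor C<new> are assumptions added to make the example
self-contained. The builtin is written as C<CORE::split> only to make
clear that the method does not call itself.

```perl
use strict;
use warnings;

package Example::Parse::Tagged;   # hypothetical package name

sub new { return bless {}, shift }

# Same logic as the split method shown above.
sub split {
  my %result;
  my $fld;

  for (CORE::split /\n/, $_[1]) {
    if (s/^(\S+):\s*//) {
      $fld = lc $1;
    }
    $result{$fld} .= $_ if defined $fld;
  }
  return \%result;
}

package main;

my $record = join "\n",
  'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
  'TI: Searching Structured Documents',
  '    with freeWAIS-sf',
  'PY: 1995';

# The result is a hash reference: lower-cased field name => field text.
my $attr = Example::Parse::Tagged->new->split($record);
print "$_ => $attr->{$_}\n" for sort keys %$attr;
```

Note that continuation lines are appended unchanged, so the leading
blanks of the second C<TI> line are what separates it from the first.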

Since the original document cannot be reconstructed from its
attributes, we need a second method (I<tag>) which marks the regions
of the document with tags for the different attributes. This tagged
form is used by the display module to highlight search terms in the
documents. Besides the tags for the attributes, the method might assign
the special tags C<_b> and C<_i> for indicating bold and italic
regions.

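  sub tag {
    my @result;                   # flat list: hash ref, text, hash ref, ...
    my $tag;                      # attribute of the current region

    for (split /\n/, $_[1]) {
      next if /^\w\w:\s*$/;       # skip empty fields
      if (s/^(\S+)://) {
        push @result, {_b => 1}, "$1:";   # field label is marked bold
        $tag = lc $1;
      }
      if (defined $tag) {
        push @result, {$tag => 1}, "$_\n";
      } else {
        push @result, {}, "$_\n"; # text before the first field
      }
    }
    return @result;               # we don't go for speed
  }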
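
To illustrate how such a tagged list might be consumed, here is a
small sketch of a renderer. It is not one of the WAIT display modules.
The function name C<render_plain> and the choice of marking bold
regions with asterisks are assumptions made for this example, and the
input list is written by hand in the shape that C<tag> returns.

```perl
use strict;
use warnings;

# Hand-written list in the shape returned by tag():
# hash ref, text, hash ref, text, ...
my @tagged = (
  { _b => 1 }, 'AU:',
  { au => 1 }, " Pfeifer, U.\n",
  { _b => 1 }, 'PY:',
  { py => 1 }, " 1995\n",
);

# Hypothetical renderer: wrap bold regions in asterisks and pass all
# other regions through unchanged.
sub render_plain {
  my @pairs = @_;
  my $out = '';
  while (my ($tags, $text) = splice @pairs, 0, 2) {
    $out .= $tags->{_b} ? "*$text*" : $text;
  }
  return $out;
}

print render_plain(@tagged);
```

A display module that highlights search terms could inspect the
attribute tags in the same loop.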

Obviously one could implement C<split> via C<tag>. The reason for
having two functions is speed. C<split> must be called for each
document when indexing a collection, so its speed is essential.
C<tag>, on the other hand, is called only to display a single document
and may be a little slower. It can therefore afford to tag bold and
italic regions. See C<WAIT::Parse::Nroff> for an example of how this
can decrease performance.
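The remark that C<split> could be implemented via C<tag> can be
sketched as follows. The package name C<Example::Parse::Tagged>, the
constructor C<new> and the method name C<split_via_tag> are
assumptions made for this example. The C<tag> method is a copy of the
one shown above.

```perl
use strict;
use warnings;

package Example::Parse::Tagged;   # hypothetical package name

sub new { return bless {}, shift }

# Copy of the tag method shown above.
sub tag {
  my @result;
  my $tag;

  for (CORE::split /\n/, $_[1]) {
    next if /^\w\w:\s*$/;
    if (s/^(\S+)://) {
      push @result, {_b => 1}, "$1:";
      $tag = lc $1;
    }
    if (defined $tag) {
      push @result, {$tag => 1}, "$_\n";
    } else {
      push @result, {}, "$_\n";
    }
  }
  return @result;
}

# Hypothetical: collect the text of each attribute from the tagged
# list, ignoring the special tags that start with an underscore.
sub split_via_tag {
  my ($self, $doc) = @_;
  my %result;
  my @pairs = $self->tag($doc);
  while (my ($tags, $text) = splice @pairs, 0, 2) {
    for my $name (grep { !/^_/ } keys %$tags) {
      $result{$name} .= $text;
    }
  }
  return \%result;
}

package main;

my $parser = Example::Parse::Tagged->new;
my $attr   = $parser->split_via_tag("AU: Pfeifer, U.\nPY: 1995\n");
print "$_ =>$attr->{$_}" for sort keys %$attr;
```

Unlike the dedicated C<split>, this variant keeps the blank after the
colon and the trailing newline, and it builds an intermediate list for
every document, which is the overhead that the separate C<split>
method avoids.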


=head2 Filter definition

From the Information Retrieval perspective, the hardest part of the
system is the filter module. The database administrator defines, for
each attribute, how its contents should be processed before they are
stored in the index. Usually the processing includes steps to restrict
the character set, transform case, split the text into words, and
reduce the words to their stems. In WAIT these steps are defined
naturally as a pipeline of processing steps. The pipelines are made up
of functions in the package B<WAIT::Filter>, which is pre-populated
with the most common functions but may be extended at any time.

The equivalent of typical freeWAIS-sf processing would be this
pipeline:

        [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function C<isotr> replaces unknown characters with blanks. C<isolc>
transforms to lower case. C<split2> splits into words and removes
words shorter than two characters. C<stop> removes the freeWAIS-sf
stopwords, and C<Stem> applies the Porter algorithm to compute the
stems of the words.
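
The pipeline idea can be sketched in a few lines of Perl. The
functions below are simplified stand-ins and NOT the real
C<WAIT::Filter> functions. The names C<to_blank>, C<lowercase> and
C<run_pipeline>, the short stopword list, and the ASCII-only character
handling are all assumptions made for this example. Stemming is left
out.

```perl
use strict;
use warnings;

# Stand-in filters, NOT the real WAIT::Filter functions.
# Each takes a list of strings and returns a list of strings.
my %filter = (
  # replace everything except ASCII letters and digits with blanks
  to_blank  => sub { map { (my $s = $_) =~ tr/a-zA-Z0-9/ /c; $s } @_ },
  # transform to lower case
  lowercase => sub { map { lc $_ } @_ },
  # split into words and drop words shorter than two characters
  split2    => sub { grep { length($_) >= 2 } map { split ' ', $_ } @_ },
  # drop words from a tiny, made-up stopword list
  stop      => sub {
    my %stop = map { $_ => 1 } qw(the of and with);
    grep { !$stop{$_} } @_;
  },
);

# Apply the named steps in order; the output of one step is the
# input of the next.
sub run_pipeline {
  my ($steps, @values) = @_;
  for my $name (@$steps) {
    my $code = $filter{$name} or die "unknown filter '$name'";
    @values = $code->(@values);
  }
  return @values;
}

my @terms = run_pipeline(
  [ 'to_blank', 'lowercase', 'split2', 'stop' ],
  'Searching Structured Documents with the Enhanced Retrieval',
);
print "@terms\n";
```

Because every step has the same list-in, list-out shape, steps can be
reordered or replaced without touching the code that runs the
pipeline.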

The filter definition for a collection defines a set of pipelines for
the attributes and modifies the pipelines which should be used for
prefix and interval searches.

Several complete working examples come with WAIT in the script
directory. It is recommended to follow the pattern of the scripts
smakewhatis and sman.

=cut

194	ulpfr	13	=cut
195	ulpfr	10