trunk/lib/WAIT.pm

#!/usr/bin/perl
#                              -*- Mode: Cperl -*-
# $Basename: WAIT.pm $
# $Revision: 1.7 $
# Author          : Ulrich Pfeifer
# Created On      : Wed Nov  5 16:59:32 1997
# Last Modified By: Ulrich Pfeifer
# Last Modified On: Tue Apr 16 23:28:52 2002
# Language        : CPerl
# Update Count    : 8
# Status          : Unknown, Use with caution!
#
# (C) Copyright 1997, Ulrich Pfeifer, all rights reserved.
#
#

package WAIT;
require DynaLoader;
use vars qw($VERSION @ISA);
@ISA = qw(DynaLoader);

$VERSION = '1.900';


bootstrap WAIT $VERSION;

__END__

=head1 NAME

WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS

=head1 SYNOPSIS

A Synopsis is not yet available.

=head1 Status of this document

In my spare time I started writing down some information about the
implementation before I forget it. The material is incomplete, to say
the least. Any additions or corrections are welcome.

=head1 PURPOSE

As you might know, I developed and maintained B<freeWAIS-sf> (with the
help of many people on the Net). FreeWAIS-sf is based on B<freeWAIS>,
maintained by the Clearinghouse for Networked Information Discovery
and Retrieval (CNIDR), which in turn is based on B<wais-8-b5>,
implemented by Thinking Machines et al. During this long history
(implementation started about 1989) many people contributed to the
distribution and added features not foreseen by the original design.
While the system fulfills its task now, the code has reached a state
where adding new features is nearly impossible, and even fixing
longstanding bugs and removing limitations has become a very
time-consuming task.

Therefore I decided to pass the maintenance to WSC Inc. and to build a
new system from scratch. For obvious reasons I chose Perl as the
implementation language.

=head1 DESCRIPTION

The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users might
want to build. Obviously the framework limits the class of systems
that can be built.

       +------+     +-----+     +------+
   ==> |Access| ==> |Parse| ==> |      |
       +------+     +-----+     |      |
                       ||       |      |     +-----+
                       ||       |Filter| ==> |Index|
                       \/       |      |     +-----+
      +-------+     +-----+     |      |
   <= |Display| <== |Query| <-> |      |
      +-------+     +-----+     +------+

A collection (aka table) is defined by the instances of the B<access>
and B<parse> modules together with the B<filter definitions>. At query
time, a B<query> and a B<display> module must be chosen in addition.

=head2 Access

The access module defines which documents are members of a database.
Usually an access module is a tied hash, whose keys are the Ids of the
documents (did = document id) and whose values are the documents
themselves. The indexing process loops over the keys using C<FIRSTKEY>
and C<NEXTKEY>. Documents are retrieved with C<FETCH>.

By convention access modules should be members of the
C<WAIT::Document> hierarchy. Have a look at the
C<WAIT::Document::Split> module to get the idea.
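
As a minimal sketch of the tied-hash convention described above, the
following class serves documents from an in-memory hash. The package
name C<My::Access::InMemory> and the sample documents are made up for
illustration and are not part of WAIT; a real access module would live
in the C<WAIT::Document> hierarchy and would typically read documents
from files.

```perl
use strict;
use warnings;

# Hypothetical access module for illustration only; not part of WAIT.
package My::Access::InMemory;

sub TIEHASH {
  my ($class, %docs) = @_;
  # Keep a sorted key list so that iteration order is predictable.
  my $self = { docs => {%docs}, keys => [sort keys %docs], pos => 0 };
  return bless $self, $class;
}

sub FETCH {                       # return the document for a did
  my ($self, $did) = @_;
  return $self->{docs}{$did};
}

sub FIRSTKEY {                    # restart the iteration
  my ($self) = @_;
  $self->{pos} = 0;
  return $self->NEXTKEY;
}

sub NEXTKEY {                     # next did, or undef when exhausted
  my ($self) = @_;
  return $self->{keys}[ $self->{pos}++ ];
}

package main;

tie my %collection, 'My::Access::InMemory',
  doc1 => "AU: Pfeifer, U.\nPY: 1995\n",
  doc2 => "AU: Fuhr, N.\nPY: 1996\n";

# An indexing loop walks the dids and fetches each document in turn.
my (@dids, %size);
while (defined(my $did = each %collection)) {
  push @dids, $did;
  $size{$did} = length $collection{$did};
}
```

The loop relies only on C<FIRSTKEY>, C<NEXTKEY> and C<FETCH>, so any
storage backend that can enumerate its document ids and return one
document at a time fits the same pattern.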


=head2 Parse

The task of the parse module is to split the documents into logical
parts via the C<split> method. For example, C<WAIT::Parse::Nroff>
splits manuals piped through B<nroff>(1) into the sections I<name>,
I<synopsis>, I<options>, I<description>, I<author>, I<example>,
I<bugs>, I<text>, I<see>, and I<environment>. Below are a sample
record in a pretty simple tagged format and the implementation of
C<WAIT::Parse::Base>, which handles such documents:

  AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
  TI: Searching Structured Documents with the Enhanced Retrieval
      Functionality of freeWAIS-sf and SFgate
  ER: D. Kroemker
  BT: Computer Networks and ISDN Systems; Proceedings of the third
      International World-Wide Web Conference
  PN: Elsevier
  PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
  PP: 1027-1036
  PY: 1995

  sub split {                     # called as method
    my %result;
    my $fld;

    for (split /\n/, $_[1]) {
      if (s/^(\S+):\s*//) {
        $fld = lc $1;
      }
      $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
  }
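
To make the return value concrete, here is a small usage sketch. It
copies the method above into a hypothetical package C<Demo::Parse>
(the name is an assumption made for this example) and applies it to a
shortened record.

```perl
use strict;
use warnings;

package Demo::Parse;              # hypothetical package name

sub split {                       # same body as the method shown above
  my %result;
  my $fld;

  for (split /\n/, $_[1]) {
    if (s/^(\S+):\s*//) {
      $fld = lc $1;
    }
    $result{$fld} .= $_ if defined $fld;
  }
  return \%result;
}

package main;

my $record = join "\n",
  'AU: Pfeifer, U.; Fuhr, N.',
  'TI: Searching Structured Documents',
  '    with freeWAIS-sf',
  'PY: 1995';

my $fields = Demo::Parse->split($record);

# Print one line per attribute, sorted by attribute name.
for my $name (sort keys %$fields) {
  print "$name => $fields->{$name}\n";
}
```

The result is a hash reference with the keys C<au>, C<py> and C<ti>.
Continuation lines are appended to the current field together with
their leading white space, so the C<ti> value keeps the indentation of
the second title line.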

Since the original document cannot be reconstructed from its
attributes, we need a second method (I<tag>) which marks the regions
of the document with tags for the different attributes. This tagged
form is used by the display module to highlight search terms in the
documents. Besides the tags for the attributes, the method may assign
the special tags C<_b> and C<_i> to indicate bold and italic regions.

  sub tag {
    my @result;
    my $tag;

    for (split /\n/, $_[1]) {
      next if /^\w\w:\s*$/;
      if (s/^(\S+)://) {
        push @result, {_b => 1}, "$1:";
        $tag = lc $1;
      }
      if (defined $tag) {
        push @result, {$tag => 1}, "$_\n";
      } else {
        push @result, {}, "$_\n";
      }
    }
    return @result;               # we don't go for speed
  }

Obviously one could implement C<split> via C<tag>. The reason for
having two functions is speed. C<split> is called for each document
when indexing a collection, so it must be fast. C<tag>, on the other
hand, is called only to display a single document and may be a little
slower, so it can afford to tag bold and italic regions. See
C<WAIT::Parse::Nroff> for how this might decrease performance.
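
The remark that C<split> could be implemented via C<tag> can be
sketched as follows. The helper C<split_via_tag> is hypothetical and
not part of WAIT. It assumes that C<tag> returns a flat list of
alternating hash references and text pieces, as the implementation
above does, and it treats every tag starting with an underscore as
markup rather than as an attribute.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WAIT: rebuild the attribute hash
# from the flat (tags, text, tags, text, ...) list returned by tag().
sub split_via_tag {
  my @tagged = @_;
  my %result;

  while (my ($tags, $text) = splice @tagged, 0, 2) {
    for my $name (keys %$tags) {
      next if $name =~ /^_/;      # skip markup tags such as _b and _i
      $result{$name} .= $text;
    }
  }
  return \%result;
}

# Hand-written input in the shape tag() produces for a two-field record.
my @tagged = (
  { _b => 1 }, 'AU:',
  { au => 1 }, " Pfeifer, U.\n",
  { _b => 1 }, 'PY:',
  { py => 1 }, " 1995\n",
);

my $fields = split_via_tag(@tagged);
```

Unlike the dedicated C<split>, this variant keeps the white space and
newlines that C<tag> preserves for display, and it pays for building
the intermediate list, which illustrates why indexing uses a separate,
leaner method.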


=head2 Filter definition

From the Information Retrieval perspective, the hardest part of the
system is the filter module. For each attribute, the database
administrator defines how the contents should be processed before
they are stored in the index. Usually the processing includes steps
for restricting the character set, case transformation, splitting
into words, and reducing words to their stems. In WAIT these steps
are defined naturally as a pipeline of processing steps. The
pipelines are made up of functions in the package B<WAIT::Filter>,
which is pre-populated with the most common functions but may be
extended at any time.

The equivalent of typical freeWAIS-sf processing would be this
pipeline:

        [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function C<isotr> replaces unknown characters with blanks.
C<isolc> transforms to lower case. C<split2> splits into words and
removes words shorter than two characters. C<stop> removes the
freeWAIS-sf stopwords, and C<Stem> applies the Porter algorithm to
compute the stems of the words.
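
The idea of running named steps one after another can be sketched as
follows. The functions below are simplified stand-ins with made-up
names (C<demo_tr>, C<demo_lc>, C<demo_split>, C<demo_stop>); they are
not the real C<WAIT::Filter> functions, the stopword list is invented,
and stemming is left out. The calling convention (each step takes a
list of strings and returns a list) is an assumption made for this
example.

```perl
use strict;
use warnings;

# Simplified stand-ins, NOT the real WAIT::Filter functions. Each one
# takes a list of strings and returns a list (assumed convention).
my %filter = (
  # replace everything except ASCII letters, digits and blanks
  demo_tr    => sub { map { (my $s = $_) =~ s/[^A-Za-z0-9 ]/ /g; $s } @_ },
  # fold to lower case
  demo_lc    => sub { map { lc $_ } @_ },
  # split into words and drop words shorter than two characters
  demo_split => sub { grep { length($_) >= 2 } map { split ' ', $_ } @_ },
  # drop words from a tiny, made-up stopword list
  demo_stop  => sub { my %stop = map { $_ => 1 } qw(the of and);
                      grep { !$stop{$_} } @_ },
);

# Run the named filters from left to right over the input.
sub run_pipeline {
  my ($pipeline, @input) = @_;
  for my $name (@$pipeline) {
    my $code = $filter{$name} or die "unknown filter '$name'";
    @input = $code->(@input);
  }
  return @input;
}

my @terms = run_pipeline(
  [ 'demo_tr', 'demo_lc', 'demo_split', 'demo_stop' ],
  'Searching Structured Documents: the Functionality of freeWAIS-sf',
);
```

With this input C<@terms> holds the six words C<searching>,
C<structured>, C<documents>, C<functionality>, C<freewais> and C<sf>.
Because each step only sees the output of the previous one, the order
of the names in the array reference matters: the stopword step has to
come after case folding and word splitting.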

The filter definition for a collection defines a set of pipelines for
the attributes and modifies the pipelines which should be used for
prefix and interval searches.

Several complete working examples come with WAIT in the F<script>
directory. It is recommended to follow the pattern of the scripts
F<smakewhatis> and F<sman>.

=cut
