trunk/lib/WAIT.pm

#!/usr/bin/perl
#                              -*- Mode: Perl -*- 
# $Basename: WAIT.pm $
# $Revision: 1.4 $
# Author          : Ulrich Pfeifer
# Created On      : Wed Nov  5 16:59:32 1997
# Last Modified By: Ulrich Pfeifer
# Last Modified On: Wed Nov 12 18:26:44 1997
# Language        : CPerl
# Update Count    : 4
# Status          : Unknown, Use with caution!
# 
# (C) Copyright 1997, Ulrich Pfeifer, all rights reserved.
# 
# 

package WAIT;
require DynaLoader;
use vars qw($VERSION @ISA);
@ISA = qw(DynaLoader);

$VERSION = sprintf '%5.3f', map $_/10,'$ProjectVersion: 16.2 $ ' =~ /([\d.]+)/;

bootstrap WAIT $VERSION;

__END__

=head1 NAME

WAIT - a rewrite of the freeWAIS-sf engine in Perl

=head1 Status of this document

I started writing down some information about the implementation in
my spare time, before I forget it. The text is incomplete at best.
Any additions or corrections are welcome.

=head1 PURPOSE

As you might know, I developed and maintained B<freeWAIS-sf> (with the
help of many people on the Net). FreeWAIS-sf is based on B<freeWAIS>,
maintained by the Clearinghouse for Networked Information Discovery
and Retrieval (CNIDR), which in turn is based on B<wais-8-b5>,
implemented by Thinking Machines et al. During this long history
(implementation started about 1989) many people contributed to the
distribution and added features not foreseen by the original design.
While the system fulfills its task now, the code has reached a state
where adding new features is nearly impossible, and even fixing
longstanding bugs and removing limitations has become a very
time-consuming task.

Therefore I decided to pass the maintenance to WSC Inc. and build a
new system from scratch. For obvious reasons I chose Perl as the
implementation language.

=head1 DESCRIPTION

The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users might
want to build. Obviously the framework limits the class of systems
that can be built.

       +------+     +-----+     +------+
   ==> |Access| ==> |Parse| ==> |      |
       +------+     +-----+     |      |
                       ||       |      |     +-----+
                       ||       |Filter| ==> |Index|
                       \/       |      |     +-----+
      +-------+     +-----+     |      |
   <= |Display| <== |Query| <-> |      |
      +-------+     +-----+     +------+

A collection (aka table) is defined by the instances of the B<access>
and B<parse> modules together with the B<filter definitions>. At query
time, a B<query> and a B<display> module must be chosen in addition.

=head2 Access

The access module defines which documents are members of a
database. Usually an access module is a tied hash whose keys are the
ids of the documents (did = document id) and whose values are the
documents themselves. The indexing process loops over the keys using
C<FIRSTKEY> and C<NEXTKEY>. Documents are retrieved with C<FETCH>.

By convention access modules should be members of the
C<WAIT::Document> hierarchy. Have a look at the
C<WAIT::Document::Split> module to get the idea.
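
To make the tied-hash convention concrete, here is a minimal sketch of
an access module that keeps its documents in memory. The package name
C<My::Document::Memory> and its constructor arguments are hypothetical
and not part of WAIT; a real module would live in the
C<WAIT::Document> hierarchy and would usually read its documents from
files.

```perl
use strict;
use warnings;

# Hypothetical in-memory access module; not part of the WAIT distribution.
package My::Document::Memory;

# Usage: tie %docs, 'My::Document::Memory', did => text, ...
sub TIEHASH {
  my ($class, %docs) = @_;
  return bless { docs => { %docs }, keys => [] }, $class;
}

# Return the document for a given document id (did).
sub FETCH {
  my ($self, $did) = @_;
  return $self->{docs}{$did};
}

# Start an iteration: snapshot the ids in a stable order.
sub FIRSTKEY {
  my ($self) = @_;
  $self->{keys} = [ sort keys %{ $self->{docs} } ];
  return shift @{ $self->{keys} };
}

# Continue the iteration; returns undef when the ids are exhausted.
sub NEXTKEY {
  my ($self) = @_;
  return shift @{ $self->{keys} };
}

package main;

tie my %docs, 'My::Document::Memory',
  1 => "TI: first document\n",
  2 => "TI: second document\n";

# An indexing loop only needs FIRSTKEY, NEXTKEY and FETCH.
my @seen;
while (my ($did, $text) = each %docs) {
  push @seen, $did;
  print "$did: $text";
}
```

Only the three methods named above plus C<TIEHASH> are implemented,
because that is all a read-only traversal of the collection requires.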


=head2 Parse

The task of the parse module is to split the documents into logical
parts via the C<split> method. For example, C<WAIT::Parse::Nroff>
splits manuals piped through B<nroff>(1) into the sections I<name>,
I<synopsis>, I<options>, I<description>, I<author>, I<example>,
I<bugs>, I<text>, I<see>, and I<environment>. C<WAIT::Parse::Base>
handles documents in a pretty simple tagged format. A sample record
is shown first, followed by the implementation of its C<split> method:

  AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
  TI: Searching Structured Documents with the Enhanced Retrieval
      Functionality of freeWAIS-sf and SFgate
  ER: D. Kroemker
  BT: Computer Networks and ISDN Systems; Proceedings of the third
      International World-Wide Web Conference
  PN: Elsevier
  PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
  PP: 1027-1036
  PY: 1995

  sub split {                     # called as method
    my %result;
    my $fld;
  
    for (split /\n/, $_[1]) {
      if (s/^(\S+):\s*//) {
        $fld = lc $1;
      }
      $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
  } 
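
As a usage sketch, the method can be exercised on a shortened record
in the tagged format shown above. The package name
C<My::Parse::Tagged> is a hypothetical stand-in, used only to make the
example self-contained; the body of C<split> is the one given above.

```perl
use strict;
use warnings;

package My::Parse::Tagged;     # hypothetical stand-in for the parse class

sub split {                    # same body as above
  my %result;
  my $fld;

  for (split /\n/, $_[1]) {
    if (s/^(\S+):\s*//) {
      $fld = lc $1;
    }
    $result{$fld} .= $_ if defined $fld;
  }
  return \%result;
}

package main;

my $doc = join "\n",
  'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
  'TI: Searching Structured Documents with the Enhanced Retrieval',
  '    Functionality of freeWAIS-sf and SFgate',
  'PY: 1995';

my $attr = My::Parse::Tagged->split($doc);

# Field names are lower-cased; continuation lines are appended to the
# preceding field with their leading white space intact.
print "$_ => $attr->{$_}\n" for sort keys %$attr;
```

Lines that appear before the first field label are dropped, because
C<$fld> is still undefined at that point.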

Since the original document cannot be reconstructed from its
attributes, we need a second method (I<tag>) which marks the regions
of the document with tags for the different attributes. It returns a
flat list of alternating tag hashes and text chunks. This tagged
form is used by the display module to highlight search terms in the
documents. Besides the tags for the attributes, the method may assign
the special tags C<_b> and C<_i> to indicate bold and italic
regions.

  sub tag {
    my @result;
    my $tag;
    
    for (split /\n/, $_[1]) {
      next if /^\w\w:\s*$/;
      if (s/^(\S+)://) {
        push @result, {_b => 1}, "$1:";
        $tag = lc $1;
      }
      if (defined $tag) {
        push @result, {$tag => 1}, "$_\n";
      } else {
        push @result, {}, "$_\n";
      }
    }
    return @result;               # we don't go for speed
  } 

Obviously one could implement C<split> via C<tag>. The reason for
having two functions is speed. C<split> is called for every document
when a collection is indexed, so it must be fast. C<tag>, on the
other hand, is called only to display a single document and may be a
little slower, which leaves room for tagging bold and italic regions.
See C<WAIT::Parse::Nroff> for how this can decrease performance.
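
The remark that C<split> could be built on top of C<tag> can be
sketched as follows. The method name C<split_via_tag> and the package
C<My::Parse::Tagged> are hypothetical; the body of C<tag> is the one
shown above. Note that the result is not byte-identical to that of
the fast C<split>: the text chunks produced by C<tag> keep the white
space after the colon and a trailing newline.

```perl
use strict;
use warnings;

package My::Parse::Tagged;     # hypothetical stand-in for the parse class

sub tag {                      # same body as above
  my @result;
  my $tag;

  for (split /\n/, $_[1]) {
    next if /^\w\w:\s*$/;
    if (s/^(\S+)://) {
      push @result, {_b => 1}, "$1:";
      $tag = lc $1;
    }
    if (defined $tag) {
      push @result, {$tag => 1}, "$_\n";
    } else {
      push @result, {}, "$_\n";
    }
  }
  return @result;
}

# Slow variant of split: concatenate the text of all regions carrying
# an attribute tag, ignoring the markup tags _b and _i.
sub split_via_tag {
  my $self = shift;
  my @regions = $self->tag(@_);
  my %result;

  while (my ($tags, $text) = splice @regions, 0, 2) {
    for my $attr (grep { !/^_[bi]$/ } keys %$tags) {
      $result{$attr} .= $text;
    }
  }
  return \%result;
}

package main;

my $doc  = "AU: Pfeifer, U.\nTI: Searching Structured Documents\nPY: 1995\n";
my $attr = My::Parse::Tagged->split_via_tag($doc);
print "$_ =>$attr->{$_}" for sort keys %$attr;
```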


=head2 Filter definition

From the Information Retrieval perspective, the hardest part of the
system is the filter module. The database administrator defines, for
each attribute, how its contents should be processed before they are
stored in the index. Usually the processing includes steps to restrict
the character set, transform case, split the text into words, and
reduce the words to their stems. In WAIT these steps are defined
naturally as a pipeline of processing steps. The pipelines are made up
of functions in the package B<WAIT::Filter>, which is pre-populated
with the most common functions but may be extended at any time.

The equivalent of typical freeWAIS-sf processing would be this
pipeline:

        [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function C<isotr> replaces unknown characters with blanks. C<isolc>
transforms to lower case. C<split2> splits into words and removes
words shorter than two characters. C<stop> removes the freeWAIS-sf
stopwords, and C<Stem> applies the Porter algorithm to compute the
stem of each word.
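
How such a list of names might be turned into processing can be
sketched as below. Everything here is an assumption made for
illustration: the package C<My::Filter>, the helper
C<apply_pipeline>, the calling convention (each step takes a list of
strings and returns a list of strings), and the toy stopword list.
The three steps are simplified, ASCII-only stand-ins that borrow the
names used above; they are not the C<WAIT::Filter> implementations.

```perl
use strict;
use warnings;

package My::Filter;            # hypothetical; not WAIT::Filter

my %STOP = map { $_ => 1 } qw(a an and of the with);   # toy stopword list

sub isolc  { return map { lc($_) } @_ }                 # ASCII-only here
sub split2 { return grep { length($_) >= 2 } map { split /\W+/, $_ } @_ }
sub stop   { return grep { !$STOP{$_} } @_ }

# Run the named steps left to right; each step maps a list to a list.
sub apply_pipeline {
  my ($pipeline, @items) = @_;
  for my $name (@$pipeline) {
    my $step = My::Filter->can($name)
      or die "unknown filter '$name'";
    @items = $step->(@items);
  }
  return @items;
}

package main;

my @terms = My::Filter::apply_pipeline(
  [ 'isolc', 'split2', 'stop' ],
  'Searching Structured Documents with the Enhanced Retrieval',
);
print "@terms\n";   # searching structured documents enhanced retrieval
```

Looking the steps up by name keeps the pipeline definition a plain
list of strings, so new steps can be added to the package without
touching the code that runs the pipeline.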

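The filter definition for a collection defines a set of pipelines for
the attributes and can override the pipelines used for prefix and
interval searches. In the example below the overrides are given in a
leading hash with the keys C<prefix> and C<intervall> (spelled with a
double l).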

Here is a complete example:


  my $stem  = [{
                'prefix'    => ['unroff', 'isotr', 'isolc'],
                'intervall' => ['unroff', 'isotr', 'isolc'],
               },'unroff', 'isotr', 'isolc', 'split2', 'stop', 'Stem'];
  my $text  = [{
                'prefix'    => ['unroff', 'isotr', 'isolc'],
                'intervall' => ['unroff', 'isotr', 'isolc'],
               },
                'unroff', 'isotr', 'isolc', 'split2', 'stop'];
  my $sound = ['unroff', 'isotr', 'isolc', 'split2', 'Soundex'];
  
  my $spec  = [
      'name'         => $stem,
      'synopsis'     => $stem,
      'bugs'         => $stem,
      'description'  => $stem,
      'text'         => $stem,
      'environment'  => $text,
      'example'      => $text,  'example' => $stem,
      'author'       => $sound, 'author'  => $stem,
     ];
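
Note that the specification is an array reference rather than a hash
reference, which is what allows an attribute such as C<example> or
C<author> to appear more than once. The following sketch walks such
a list pairwise and groups the pipelines by attribute. The helper
C<pipelines_by_attribute> and the keys C<options> and C<steps> are
hypothetical; how WAIT itself consumes the specification is not shown
here.

```perl
use strict;
use warnings;

my $text  = [{
              'prefix'    => ['unroff', 'isotr', 'isolc'],
              'intervall' => ['unroff', 'isotr', 'isolc'],
             },
             'unroff', 'isotr', 'isolc', 'split2', 'stop'];
my $sound = ['unroff', 'isotr', 'isolc', 'split2', 'Soundex'];

my $spec  = [
    'example' => $text,
    'author'  => $sound,
    'author'  => $text,
   ];

# Hypothetical helper: group the pipelines by attribute, keeping order.
sub pipelines_by_attribute {
  my ($spec) = @_;
  my @pairs = @$spec;          # copy, so the spec itself is not consumed
  my %by_attr;

  while (my ($attr, $def) = splice @pairs, 0, 2) {
    my @steps = @$def;
    # A leading hash ref carries the prefix/intervall overrides.
    my $opts = ref($steps[0]) eq 'HASH' ? shift @steps : {};
    push @{ $by_attr{$attr} }, { options => $opts, steps => \@steps };
  }
  return \%by_attr;
}

my $by_attr = pipelines_by_attribute($spec);
for my $attr (sort keys %$by_attr) {
  for my $p (@{ $by_attr->{$attr} }) {
    print "$attr: @{ $p->{steps} }\n";
  }
}
```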

210