#!/usr/bin/perl
#                              -*- Mode: Cperl -*-
# $Basename: WAIT.pm $
# $Revision: 1.7 $
# Author          : Ulrich Pfeifer
# Created On      : Wed Nov  5 16:59:32 1997
# Last Modified By: Ulrich Pfeifer
# Last Modified On: Tue Apr 16 23:28:52 2002
# Language        : CPerl
# Update Count    : 8
# Status          : Unknown, Use with caution!
#
# (C) Copyright 1997, Ulrich Pfeifer, all rights reserved.
#
#

package WAIT;
require DynaLoader;
use vars qw($VERSION @ISA);
@ISA = qw(DynaLoader);

$VERSION = '1.900';


bootstrap WAIT $VERSION;

__END__

=head1 NAME

WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS

=head1 SYNOPSIS

A Synopsis is not yet available.

=head1 Status of this document

In my spare time I started writing down some information about the
implementation before I forget it. The material is incomplete, to say
the least. Additions and corrections are welcome.

=head1 PURPOSE

As you might know, I developed and maintained B<freeWAIS-sf> (with the
help of many people on the Net). FreeWAIS-sf is based on B<freeWAIS>,
maintained by the Clearinghouse for Networked Information Discovery
and Retrieval (CNIDR), which in turn is based on B<wais-8-b5>,
implemented by Thinking Machines et al. During this long history
(implementation started about 1989) many people contributed to the
distribution and added features not foreseen by the original design.
While the system fulfills its task now, the code has reached a state
where adding new features is nearly impossible, and even fixing
longstanding bugs and removing limitations has become very time
consuming.

Therefore I decided to pass the maintenance to WSC Inc. and built a
new system from scratch. For obvious reasons I chose Perl as the
implementation language.

=head1 DESCRIPTION

The central idea of the system is to provide a framework and the
building blocks for any indexing and search system users might
want to build. Obviously, the framework limits the class of systems
that can be built.

       +------+     +-----+     +------+
   ==> |Access| ==> |Parse| ==> |      |
       +------+     +-----+     |      |
                       ||       |      |     +-----+
                       ||       |Filter| ==> |Index|
                       \/       |      |     +-----+
      +-------+     +-----+     |      |
   <= |Display| <== |Query| <-> |      |
      +-------+     +-----+     +------+

A collection (aka table) is defined by the instances of the B<access>
and B<parse> modules together with the B<filter definitions>. At query
time, a B<query> and a B<display> module must be chosen in addition.

=head2 Access

The access module defines which documents are members of a database.
Usually an access module is a tied hash whose keys are the IDs of the
documents (did = document ID) and whose values are the documents
themselves. The indexing process loops over the keys using C<FIRSTKEY>
and C<NEXTKEY>. Documents are retrieved with C<FETCH>.

By convention access modules should be members of the
C<WAIT::Document> hierarchy. Have a look at the
C<WAIT::Document::Split> module to get the idea.
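
As a minimal sketch of this interface, the hypothetical class
C<My::Document::Memory> below (it is not part of WAIT) implements a
read-only tied hash over a few in-memory records. It assumes nothing
beyond Perl's standard C<tie> protocol; real access modules such as
C<WAIT::Document::Split> may be organized quite differently.

  ```perl
  use strict;
  use warnings;

  # Hypothetical access module; the package name is illustrative only.
  package My::Document::Memory;

  sub TIEHASH {
    my ($class, %docs) = @_;
    # Keep the keys in a fixed order so that iteration is repeatable.
    my $self = { docs => {%docs}, keys => [sort keys %docs], pos => 0 };
    return bless $self, $class;
  }

  sub FETCH {                     # did => document text
    my ($self, $did) = @_;
    return $self->{docs}{$did};
  }

  sub FIRSTKEY {                  # restart the iteration
    my $self = shift;
    $self->{pos} = 0;
    return $self->NEXTKEY;
  }

  sub NEXTKEY {                   # next did, or undef at the end
    my $self = shift;
    return $self->{keys}[ $self->{pos}++ ];
  }

  package main;

  tie my %collection, 'My::Document::Memory',
    d1 => "AU: Pfeifer, U.\nPY: 1995\n",
    d2 => "AU: Fuhr, N.\nPY: 1996\n";

  # The indexer's view: walk the dids, then fetch each document.
  my @dids = keys %collection;    # uses FIRSTKEY and NEXTKEY
  for my $did (@dids) {
    my $document = $collection{$did};   # uses FETCH
    print "$did: ", length($document), " bytes\n";
  }
  ```

Because the indexing loop only relies on C<FIRSTKEY>, C<NEXTKEY> and
C<FETCH>, the same loop would work unchanged if the documents lived in
files or in a database instead of in memory.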


=head2 Parse

The task of the parse module is to split the documents into logical
parts via the C<split> method. For example, C<WAIT::Parse::Nroff>
splits manuals piped through B<nroff>(1) into the sections I<name>,
I<synopsis>, I<options>, I<description>, I<author>, I<example>,
I<bugs>, I<text>, I<see>, and I<environment>. Here is a sample
document in the pretty simple tagged format handled by
C<WAIT::Parse::Base>, followed by the implementation of its C<split>
method:

  AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
  TI: Searching Structured Documents with the Enhanced Retrieval
      Functionality of freeWAIS-sf and SFgate
  ER: D. Kroemker
  BT: Computer Networks and ISDN Systems; Proceedings of the third
      International World-Wide Web Conference
  PN: Elsevier
  PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
  PP: 1027-1036
  PY: 1995

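  sub split {                     # called as method
    my %result;
    my $fld;                      # attribute of the current field

    for (split /\n/, $_[1]) {
      if (s/^(\S+):\s*//) {       # "TAG: " starts a new field
        $fld = lc $1;
      }
      # Lines without a tag continue the current field.
      $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
  }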
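
To illustrate what the method returns, the following sketch wraps a
copy of the routine in a hypothetical package C<My::Parse::Tagged>
(the package name and the C<new> constructor are assumptions made for
this example) and applies it to a shortened record. The built-in is
spelled C<CORE::split> here only to keep C<use warnings> from
complaining about the method of the same name.

  ```perl
  use strict;
  use warnings;

  package My::Parse::Tagged;      # hypothetical name, for illustration

  sub new { return bless {}, shift }

  sub split {                     # same logic as the routine above
    my %result;
    my $fld;

    for (CORE::split /\n/, $_[1]) {
      if (s/^(\S+):\s*//) {
        $fld = lc $1;
      }
      $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
  }

  package main;

  my $record = join "\n",
    'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
    'TI: Searching Structured Documents',
    '    with freeWAIS-sf',
    'PY: 1995';

  my $fields = My::Parse::Tagged->new->split($record);

  # One entry per attribute, keyed by the lower-cased tag.
  for my $name (sort keys %$fields) {
    print "$name => '$fields->{$name}'\n";
  }
  ```

The result has the keys C<au>, C<py> and C<ti>. The continuation line
is appended to C<ti> with its leading blanks intact, since the routine
does not insert a separator of its own.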

Since the original document cannot be reconstructed from its
attributes, we need a second method (C<tag>) which marks the regions
of the document with tags for the different attributes. This tagged
form is used by the display module to highlight search terms in the
documents. Besides the tags for the attributes, the method may assign
the special tags C<_b> and C<_i> to indicate bold and italic
regions.

  sub tag {
    my @result;                   # pairs of (tag set, text)
    my $tag;

    for (split /\n/, $_[1]) {
      next if /^\w\w:\s*$/;       # skip fields without content
      if (s/^(\S+)://) {
        push @result, {_b => 1}, "$1:";   # field label in bold
        $tag = lc $1;
      }
      if (defined $tag) {
        push @result, {$tag => 1}, "$_\n";
      } else {
        push @result, {}, "$_\n"; # text before the first field
      }
    }
    return @result;               # we don't go for speed
  }
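
The returned list alternates between a hash reference holding the
tags and the text those tags apply to. The sketch below shows how a
consumer might walk such a list. The C<render> function and the
package name C<My::Parse::Tagged> are hypothetical and only stand in
for a real display module, whose interface is not described here.

  ```perl
  use strict;
  use warnings;

  package My::Parse::Tagged;      # hypothetical name, for illustration

  sub tag {                       # copy of the routine shown above
    my @result;
    my $tag;

    for (split /\n/, $_[1]) {
      next if /^\w\w:\s*$/;
      if (s/^(\S+)://) {
        push @result, {_b => 1}, "$1:";
        $tag = lc $1;
      }
      if (defined $tag) {
        push @result, {$tag => 1}, "$_\n";
      } else {
        push @result, {}, "$_\n";
      }
    }
    return @result;
  }

  package main;

  # Hypothetical renderer: wrap bold chunks in '*' and upper-case the
  # chunks of one chosen attribute; everything else passes through.
  sub render {
    my ($attribute, @tagged) = @_;
    my $out = '';
    while (my ($tags, $text) = splice @tagged, 0, 2) {
      if    ($tags->{_b})         { $out .= "*$text*" }
      elsif ($tags->{$attribute}) { $out .= uc $text }
      else                        { $out .= $text }
    }
    return $out;
  }

  my @tagged = My::Parse::Tagged->tag("AU: Fuhr, N.\nPY: 1995\n");
  print render('au', @tagged);
  ```

For this two-line record C<tag> returns four pairs: a bold label and
the field text for each of the two fields.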

Obviously one could implement C<split> via C<tag>. The reason for
having two functions is speed. C<split> must be called for each
document when indexing a collection, so its speed is essential. On
the other hand, C<tag> is called only to display a single document
and may be a little slower; it can afford to tag bold and italic
regions. See C<WAIT::Parse::Nroff> for how this might decrease
performance.


=head2 Filter definition

From the Information Retrieval perspective, the hardest part of the
system is the filter module. For each attribute, the database
administrator defines how the contents should be processed before
they are stored in the index. Usually the processing includes steps
for restricting the character set, case transformation, splitting
into words, and reducing words to their stems. In WAIT these steps
are defined naturally as a pipeline of processing steps. The
pipelines are made up of functions in the package B<WAIT::Filter>,
which is pre-populated with the most common functions but may be
extended at any time.

The equivalent of typical freeWAIS-sf processing would be this
pipeline:

        [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function C<isotr> replaces unknown characters with blanks. C<isolc>
transforms to lower case. C<split2> splits into words and removes
words shorter than two characters. C<stop> removes the freeWAIS-sf
stopwords, and C<Stem> applies the Porter algorithm to compute the
stems of the words.
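
The idea of a pipeline can be sketched in a few lines. Everything in
the example below is hypothetical: the package C<My::Filter>, its
functions, the tiny stopword list, and the calling convention (a list
of strings in, a list of strings out) are assumptions made for
illustration and are not the actual C<WAIT::Filter> implementation.
Stemming is left out to keep the sketch short.

  ```perl
  use strict;
  use warnings;

  # Hypothetical stand-ins; these are NOT the WAIT::Filter functions,
  # and the calling convention (list in, list out) is an assumption.
  package My::Filter;

  my %STOP = map { $_ => 1 } qw(a an and of the with);

  sub only_ascii { map { (my $s = $_) =~ tr/a-zA-Z0-9/ /c; $s } @_ }
  sub lower      { map { lc $_ } @_ }
  sub words2     { grep { length($_) >= 2 } map { split ' ', $_ } @_ }
  sub stop       { grep { !$STOP{$_} } @_ }

  package main;

  # Apply the named steps from left to right.
  sub run_pipeline {
    my ($steps, @data) = @_;
    for my $name (@$steps) {
      my $code = My::Filter->can($name)
        or die "unknown filter '$name'";
      @data = $code->(@data);
    }
    return @data;
  }

  my @pipeline = qw(only_ascii lower words2 stop);
  my @terms = run_pipeline(\@pipeline,
    'Searching Structured Documents with the Retrieval of X');
  print join(' ', @terms), "\n";
  ```

Looking the steps up by name keeps a pipeline definition a plain list
of strings, as in the example above, and lets new steps be added by
defining another function in the filter package.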

The filter definition for a collection defines a set of pipelines for
the attributes and modifies the pipelines which should be used for
prefix and interval searches.

Several complete working examples come with WAIT in the script
directory. It is recommended to follow the pattern of the scripts
C<smakewhatis> and C<sman>.

=cut
