Contents of /branches/CPAN/lib/WAIT.pm

Revision 19 - (show annotations)
Tue May 9 11:29:45 2000 UTC (24 years ago) by ulpfr
File size: 6903 byte(s)
Import of WAIT-1.800

#!/usr/bin/perl
# -*- Mode: Cperl -*-
# $Basename: WAIT.pm $
# $Revision: 1.7 $
# Author : Ulrich Pfeifer
# Created On : Wed Nov 5 16:59:32 1997
# Last Modified By: Ulrich Pfeifer
# Last Modified On: Mon May 31 22:34:35 1999
# Language : CPerl
# Update Count : 5
# Status : Unknown, Use with caution!
#
# (C) Copyright 1997, Ulrich Pfeifer, all rights reserved.
#
#

package WAIT;
require DynaLoader;
use vars qw($VERSION @ISA);
@ISA = qw(DynaLoader);

# $Format: "$\VERSION = sprintf '%5.3f', ($ProjectMajorVersion$ * 100 + ($ProjectMinorVersion$-1))/1000;"$
$VERSION = sprintf '%5.3f', (18 * 100 + (1-1))/1000;


bootstrap WAIT $VERSION;

__END__

=head1 NAME

WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS

=head1 SYNOPSIS

A Synopsis is not yet available.

=head1 Status of this document

I started writing down some information about the implementation in
my spare time, before I forget it. The document is incomplete at
best. Any additions or corrections are welcome.

=head1 PURPOSE

As you might know, I developed and maintained B<freeWAIS-sf> (with the
help of many people in The Net). FreeWAIS-sf is based on B<freeWAIS>
maintained by the Clearing House for Network Information Retrieval
(CNIDR) which in turn is based on B<wais-8-b5> implemented by Thinking
Machines et al. During this long history - implementation started about
1989 - many people contributed to the distribution and added features
not foreseen by the original design. While the system fulfills its
task now, the code has reached a state where adding new features is
nearly impossible and even fixing longstanding bugs and removing
limitations has become a very time-consuming task.

Therefore I decided to pass the maintenance to WSC Inc. and build a
new system from scratch. For obvious reasons I chose Perl as
implementation language.

=head1 DESCRIPTION

The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users might
want to build. Obviously the framework limits the class of systems
which can be built.

             +------+     +-----+     +------+
         ==> |Access| ==> |Parse| ==> |      |
             +------+     +-----+     |      |
                ||                    |      |     +-----+
                ||                    |Filter| ==> |Index|
                \/                    |      |     +-----+
            +-------+     +-----+     |      |
         <= |Display| <== |Query| <-> |      |
            +-------+     +-----+     +------+

A collection (aka table) is defined by the instances of the B<access>
and B<parse> module together with the B<filter definitions>. At query
time a B<query> and a B<display> module must be chosen in addition.

=head2 Access

The access module defines which documents are members of a database.
Usually an access module is a tied hash, whose keys are the ids of the
documents (did = document id) and whose values are the documents
themselves. The indexing process loops over the keys using C<FIRSTKEY>
and C<NEXTKEY>. Documents are retrieved with C<FETCH>.

By convention access modules should be members of the
C<WAIT::Document> hierarchy. Have a look at the
C<WAIT::Document::Split> module to get the idea.
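
The tied hash interface described above can be sketched as follows.
This is a minimal illustration only: the package name
C<My::Access::Memory> is hypothetical, the documents live in memory,
and nothing here is taken from the C<WAIT::Document> modules.

```perl
package My::Access::Memory;   # hypothetical name, not part of WAIT
use strict;
use warnings;

# TIEHASH receives the documents as a hash reference: did => text.
sub TIEHASH {
    my ($class, $docs) = @_;
    return bless { docs => { %$docs }, keys => [] }, $class;
}

# FETCH returns the document stored under the given document id.
sub FETCH {
    my ($self, $did) = @_;
    return $self->{docs}{$did};
}

# FIRSTKEY restarts the iteration and returns the first document id.
sub FIRSTKEY {
    my ($self) = @_;
    $self->{keys} = [ sort keys %{ $self->{docs} } ];
    return shift @{ $self->{keys} };
}

# NEXTKEY returns the next document id, or undef when exhausted.
sub NEXTKEY {
    my ($self) = @_;
    return shift @{ $self->{keys} };
}

package main;

tie my %collection, 'My::Access::Memory',
    { doc1 => "TI: First title\n", doc2 => "TI: Second title\n" };

# An indexer could walk the collection like this: keys() drives
# FIRSTKEY/NEXTKEY, the hash lookup drives FETCH.
my @seen;
for my $did (keys %collection) {
    push @seen, "$did => $collection{$did}";
}
print @seen;
```

A real access module would typically read the documents from files or
another store instead of keeping them all in memory.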


=head2 Parse

The task of the parse module is to split the documents into logical
parts via the C<split> method. For example, C<WAIT::Parse::Nroff>
splits manuals piped through B<nroff>(1) into the sections I<name>,
I<synopsis>, I<options>, I<description>, I<author>, I<example>,
I<bugs>, I<text>, I<see>, and I<environment>. Here is a sample
document in the simple tagged format handled by C<WAIT::Parse::Base>,
followed by the implementation of its C<split> method:

  AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
  TI: Searching Structured Documents with the Enhanced Retrieval
      Functionality of freeWAIS-sf and SFgate
  ER: D. Kroemker
  BT: Computer Networks and ISDN Systems; Proceedings of the third
      International World-Wide Web Conference
  PN: Elsevier
  PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
  PP: 1027-1036
  PY: 1995

  sub split {                   # called as method
    my %result;                 # attribute name => text
    my $fld;                    # attribute currently being filled

    for (split /\n/, $_[1]) {   # $_[1] is the document text
      if (s/^(\S+):\s*//) {     # a "XX:" prefix starts a new attribute
        $fld = lc $1;
      }
      $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
  }
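
To illustrate what the method returns, the sketch below applies the
same logic to a shortened version of the sample record. The package
name C<My::Parse::Tagged> and its C<new> constructor are placeholders
added so that the example is self-contained; the builtin is written as
C<CORE::split> only to keep it visibly apart from the method of the
same name.

```perl
package My::Parse::Tagged;    # placeholder name for this sketch
use strict;
use warnings;

sub new { return bless {}, shift }

sub split {                   # same logic as the method shown above
    my %result;
    my $fld;

    for (CORE::split /\n/, $_[1]) {
        if (s/^(\S+):\s*//) {
            $fld = lc $1;
        }
        $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
}

package main;

my $record = join "\n",
    'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
    'TI: Searching Structured Documents with the Enhanced Retrieval',
    '    Functionality of freeWAIS-sf and SFgate',
    'PY: 1995';

# The result is a hash reference keyed by the lower-cased tag.
my $fields = My::Parse::Tagged->new->split($record);
print "$_: $fields->{$_}\n" for sort keys %$fields;
```

Continuation lines carry no tag of their own, so they are appended to
the most recent attribute together with their leading whitespace.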

Since the original document cannot be reconstructed from its
attributes, we need a second method (I<tag>) which marks the regions
of the document with tags for the different attributes. This tagged
form is used by the display module to highlight search terms in the
documents. Besides the tags for the attributes, the method might assign
the special tags C<_b> and C<_i> for indicating bold and italic
regions.

  sub tag {
    my @result;                 # alternating list: tag hash, text
    my $tag;                    # attribute of the current region

    for (split /\n/, $_[1]) {
      next if /^\w\w:\s*$/;     # skip attributes without content
      if (s/^(\S+)://) {
        push @result, {_b => 1}, "$1:";   # show the prefix in bold
        $tag = lc $1;
      }
      if (defined $tag) {
        push @result, {$tag => 1}, "$_\n";
      } else {
        push @result, {}, "$_\n";
      }
    }
    return @result;             # we don't go for speed
  }

Obviously one could implement C<split> via C<tag>. The reason for
having two functions is speed. We need to call C<split> for each
document when indexing a collection. Therefore speed is essential. On
the other hand, C<tag> is called in order to display a single document
and may be a little slower. It may care about tagging bold and italic
regions. See C<WAIT::Parse::Nroff> for an example of how this can
decrease performance.
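
For comparison, here is one way C<split> could be derived from C<tag>.
It is a sketch under assumptions, not code from the distribution: the
package name C<My::Parse::Slow> and the C<new> constructor are
placeholders, and the C<tag> method repeats the logic shown earlier.

```perl
package My::Parse::Slow;      # placeholder name for this sketch
use strict;
use warnings;

sub new { return bless {}, shift }

sub tag {                     # same logic as the tag method shown earlier
    my @result;
    my $tag;

    for (CORE::split /\n/, $_[1]) {
        next if /^\w\w:\s*$/;
        if (s/^(\S+)://) {
            push @result, {_b => 1}, "$1:";
            $tag = lc $1;
        }
        if (defined $tag) {
            push @result, {$tag => 1}, "$_\n";
        } else {
            push @result, {}, "$_\n";
        }
    }
    return @result;
}

# split derived from tag: collect the text of every attribute region.
sub split {
    my ($self, $doc) = @_;
    my %result;
    my @tagged = $self->tag($doc);

    # The list alternates between a tag hash and the tagged text.
    while (my ($tags, $text) = splice @tagged, 0, 2) {
        for my $name (grep { !/^_/ } keys %$tags) {   # skip _b and _i
            $result{$name} .= $text;
        }
    }
    return \%result;
}

package main;

my $fields = My::Parse::Slow->new->split("AU: Pfeifer, U.\nPY: 1995\n");
print "$_:$fields->{$_}" for sort keys %$fields;
```

Unlike the direct C<split>, this variant keeps the blank after the
colon and the trailing newline of each line, and it has to build the
complete tag list before it can collect the attributes.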


=head2 Filter definition

From the Information Retrieval perspective, the hardest part of the
system is the filter module. The database administrator defines, for
each attribute, how its content should be processed before it is
stored in the index. Usually the processing contains steps to restrict
the character set, transform case, split the text into words, and
reduce the words to their stems. In WAIT these steps are defined
naturally as a pipeline of processing steps. The pipelines are made up
of functions in the package B<WAIT::Filter>, which is pre-populated
with the most common functions but may be extended at any time.

The equivalent of a typical freeWAIS-sf processing would be this
pipeline:

  [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function C<isotr> replaces unknown characters by blanks. C<isolc>
transforms to lower case. C<split2> splits into words and removes
words shorter than two characters. C<stop> removes the freeWAIS-sf
stopwords and C<Stem> applies the Porter algorithm for computing the
stem of the words.
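
The pipeline idea itself can be sketched in a few lines of plain Perl.
The step names C<lowercase>, C<split_words> and C<drop_stopwords>, the
tiny stoplist and the helper C<run_pipeline> are all assumptions made
for this illustration. They only roughly mirror C<isolc>, C<split2>
and C<stop>, they do not call B<WAIT::Filter>, and character
translation and stemming are left out.

```perl
use strict;
use warnings;

# Simplified stand-ins for filter steps; each maps a list to a list.
# These are assumptions for illustration, not the WAIT::Filter code.
my %stop = map { $_ => 1 } qw(the of and with);   # tiny sample stoplist

my %step = (
    lowercase      => sub { map  { lc $_ } @_ },
    split_words    => sub { grep { length($_) >= 2 }
                            map  { split /\W+/, $_ } @_ },
    drop_stopwords => sub { grep { !$stop{$_} } @_ },
);

# Apply the named steps from left to right; the output of one step
# is the input of the next.
sub run_pipeline {
    my ($pipeline, @data) = @_;
    @data = $step{$_}->(@data) for @$pipeline;
    return @data;
}

my @terms = run_pipeline(
    [qw(lowercase split_words drop_stopwords)],
    'Searching Structured Documents with the Enhanced Retrieval',
);
print "@terms\n";
```

Keeping every step as a function from a list to a list is what allows
the steps to be reordered or replaced per attribute.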

The filter definition for a collection defines a set of pipelines for
the attributes and modifies the pipelines that should be used for
prefix and interval searches.

Several complete working examples come with WAIT in the script
directory. It is recommended to follow the pattern of the scripts
smakewhatis and sman.

=cut

