WAIT 1.8

Copyright (c) 1996-2000, Ulrich Pfeifer

------------------------------------------------------------------------
This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
------------------------------------------------------------------------

News:

Locking
=======

WAIT now supports some basic locking.

Speed
=====

Searching large collections is now considerably faster:

    $table->search({attr  => 'text',
                    cont  => $query,
                    top   => 1,
                    picky => 0});

Table indices may now be tuned to improve search performance. The
index tuning can be switched on with $table->set(top=>1) and off with
$table->set(top=>0), so that it can be disabled during bulk inserts.

Documentation
=============

WAIT is still not really documented, but Andreas König took the
trouble to comment the example scripts. This will help you
implement your own applications. I added some tiny scripts to
index, e.g., your .yow file or the fortune databases.

SourceForge
===========

WAIT is registered on SourceForge now:

    http://wait.sourceforge.net/
    https://sourceforge.net/project/?group_id=4814

I will keep the CVS repository up to date. If you have some spare
tuits, feel free to contribute.

Ulrich Pfeifer <upf@wait.de>

------------------------------------------------------------------------
NAME
WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS

SYNOPSIS
A synopsis is not yet available.

Status of this document
In my spare time I started writing down some information about the
implementation before I forget it. The material is incomplete at
best. Any additions, corrections, ... are welcome.

PURPOSE
As you might know, I developed and maintained freeWAIS-sf (with the help
of many people on the Net). freeWAIS-sf is based on freeWAIS, maintained
by the Clearinghouse for Networked Information Discovery and Retrieval
(CNIDR), which in turn is based on wais-8-b5, implemented by Thinking
Machines et al. During this long history (implementation started about
1989) many people contributed to the distribution and added features not
foreseen by the original design. While the system fulfills its task now,
the code has reached a state where adding new features is nearly
impossible, and even fixing longstanding bugs and removing limitations
has become a very time-consuming task.

Therefore I decided to pass the maintenance to WSC Inc. and build a new
system from scratch. For obvious reasons I chose Perl as the
implementation language.

DESCRIPTION
The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users might want
to build. Obviously the framework limits the class of systems that can
be built.

         +------+     +-----+     +------+
     ==> |Access| ==> |Parse| ==> |      |
         +------+     +-----+     |      |
            ||                    |      |     +-----+
            ||                    |Filter| ==> |Index|
            \/                    |      |     +-----+
        +-------+     +-----+     |      |
     <= |Display| <== |Query| <-> |      |
        +-------+     +-----+     +------+

A collection (aka table) is defined by the instances of the access and
parse modules together with the filter definitions. At query time, a
query module and a display module must be chosen in addition.

Access

The access module defines which documents are members of a database.
Usually an access module is a tied hash whose keys are the IDs of the
documents (did = document id) and whose values are the documents
themselves. The indexing process loops over the keys using `FIRSTKEY'
and `NEXTKEY'. Documents are retrieved with `FETCH'.

By convention access modules should be members of the `WAIT::Document'
hierarchy. Have a look at the `WAIT::Document::Split' module to get the
idea.
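
To make the tied-hash interface concrete, here is a minimal sketch of an
access module that keeps its documents in memory. It is not part of WAIT:
the package name `My::Access::Memory', its constructor arguments and the
sample documents are assumptions made for illustration only, and a real
module would rather live in the `WAIT::Document' hierarchy.

```perl
use strict;
use warnings;

# Hypothetical access module (not part of WAIT): a read-only,
# in-memory collection exposed through the tied-hash interface.
package My::Access::Memory;

sub TIEHASH {
  my ($class, %docs) = @_;
  # Fix the iteration order once so FIRSTKEY/NEXTKEY are stable.
  my $self = { docs => {%docs}, order => [sort keys %docs], pos => 0 };
  return bless $self, $class;
}

sub FETCH {                      # did -> document text
  my ($self, $did) = @_;
  return $self->{docs}{$did};
}

sub FIRSTKEY {                   # restart the iteration
  my ($self) = @_;
  $self->{pos} = 0;
  return $self->{order}[ $self->{pos}++ ];
}

sub NEXTKEY {                    # undef signals the end of the collection
  my ($self) = @_;
  return $self->{order}[ $self->{pos}++ ];
}

package main;

tie my %collection, 'My::Access::Memory',
  doc1 => "TI: First document\n",
  doc2 => "TI: Second document\n";

# Conceptually what an indexing loop does: walk the dids, fetch each document.
my @seen;
for my $did (keys %collection) {       # calls FIRSTKEY, then NEXTKEY
  my $doc = $collection{$did};         # calls FETCH
  push @seen, $did if defined $doc;
}
```

Because only `FIRSTKEY', `NEXTKEY' and `FETCH' are needed to walk a
collection, the same loop would work for documents kept in files or in a
database, as long as the module answers these three calls.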

Parse

The task of the parse module is to split the documents into logical
parts via the `split' method. For example, `WAIT::Parse::Nroff' splits
manuals piped through nroff(1) into the sections *name*, *synopsis*,
*options*, *description*, *author*, *example*, *bugs*, *text*, *see*,
and *environment*. Here is the implementation of `WAIT::Parse::Base',
which handles documents with a pretty simple tagged format:

    AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
    TI: Searching Structured Documents with the Enhanced Retrieval
        Functionality of freeWAIS-sf and SFgate
    ER: D. Kroemker
    BT: Computer Networks and ISDN Systems; Proceedings of the third
        International World-Wide Web Conference
    PN: Elsevier
    PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
    PP: 1027-1036
    PY: 1995

    sub split {                 # called as method
      my %result;               # attribute name => text
      my $fld;                  # attribute of the current line

      for (split /\n/, $_[1]) { # $_[1] is the document text
        if (s/^(\S+):\s*//) {   # a "XX:" prefix starts a new attribute
          $fld = lc $1;
        }
        # lines without a prefix continue the current attribute
        $result{$fld} .= $_ if defined $fld;
      }
      return \%result;
    }

Since the original document cannot be reconstructed from its attributes,
we need a second method (*tag*) which marks the regions of the document
with tags for the different attributes. This tagged form is used by the
display module to highlight search terms in the documents. Besides the
tags for the attributes, the method might assign the special tags `_b'
and `_i' to indicate bold and italic regions.

    sub tag {
      my @result;               # alternating list: tag hash, text
      my $tag;                  # attribute of the current line

      for (split /\n/, $_[1]) {
        next if /^\w\w:\s*$/;   # skip empty fields
        if (s/^(\S+)://) {
          push @result, {_b => 1}, "$1:";   # field label in bold
          $tag = lc $1;
        }
        if (defined $tag) {
          push @result, {$tag => 1}, "$_\n";
        } else {
          push @result, {}, "$_\n";         # text before the first field
        }
      }
      return @result;           # we don't go for speed
    }

Obviously one could implement `split' via `tag'. The reason for having
two functions is speed. `split' is called for every document when
indexing a collection, so its speed is essential. `tag', on the other
hand, is called only to display a single document and may be a little
slower, so it can afford to tag bold and italic regions. See
`WAIT::Parse::Nroff' for how this might decrease performance.
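
To illustrate the first sentence, the following sketch derives the
attribute hash from the tagged form. The package `My::Parse::Tagged' and
the helper `split_via_tag' are hypothetical names, not part of WAIT; the
`tag' method is copied from the listing above so that the example runs
on its own. Note that the result is only roughly what `split' returns:
the text keeps the whitespace after the colon and the line ends.

```perl
use strict;
use warnings;

package My::Parse::Tagged;   # hypothetical package, not part of WAIT

# Copy of the `tag' method from the listing above.
sub tag {
  my @result;
  my $tag;

  for (split /\n/, $_[1]) {
    next if /^\w\w:\s*$/;
    if (s/^(\S+)://) {
      push @result, {_b => 1}, "$1:";
      $tag = lc $1;
    }
    if (defined $tag) {
      push @result, {$tag => 1}, "$_\n";
    } else {
      push @result, {}, "$_\n";
    }
  }
  return @result;
}

# Hypothetical helper: collect the text of each attribute from the
# alternating (tag hash, text) list returned by `tag'.
sub split_via_tag {
  my ($self, $doc) = @_;
  my @tagged = $self->tag($doc);
  my %result;
  while (my ($tags, $text) = splice @tagged, 0, 2) {
    for my $attr (grep { !/^_/ } keys %$tags) {   # skip `_b' and `_i'
      $result{$attr} .= $text;
    }
  }
  return \%result;
}

package main;

my $doc   = "AU: Pfeifer, U.\nTI: Searching Structured Documents\nPY: 1995\n";
my $attrs = My::Parse::Tagged->split_via_tag($doc);
```

The detour through the tagged list builds and discards one hash per
region, which is the kind of overhead the separate `split' method avoids
during indexing.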

Filter definition

From the Information Retrieval perspective, the hardest part of the
system is the filter module. For each attribute, the database
administrator defines how the contents should be processed before they
are stored in the index. Usually the processing contains steps to
restrict the character set, transform case, split the text into words
and reduce the words to their stems. In WAIT these steps are defined
naturally as a pipeline of processing steps. The pipelines are made up
of functions in the package WAIT::Filter, which is pre-populated with
the most common functions but may be extended at any time.

The equivalent of typical freeWAIS-sf processing would be this
pipeline:

    [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function `isotr' replaces unknown characters with blanks. `isolc'
transforms to lower case. `split2' splits into words and removes words
shorter than two characters. `stop' removes the freeWAIS-sf stopwords
and `Stem' applies the Porter algorithm to compute the stems of the
words.
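
The pipeline idea can be sketched in a few lines of plain Perl. The
package `My::Filter', the function `apply_pipeline', the step names
`tr_blank' and `lower', and the tiny stop list are all assumptions made
for this example; they are simplified stand-ins and not the functions
from WAIT::Filter. Stemming is left out entirely.

```perl
use strict;
use warnings;

package My::Filter;   # hypothetical stand-in, not WAIT::Filter

# Tiny made-up stop list; the real freeWAIS-sf list is much longer.
my %STOP = map { $_ => 1 } qw(a an and of the to with);

sub tr_blank {                       # replace non-alphanumerics by blanks
  my @out = @_;
  tr/a-zA-Z0-9/ /c for @out;
  return @out;
}

sub lower  { return map { lc($_) } @_ }

sub split2 {                         # split into words, drop words
  return grep { length($_) >= 2 }    # shorter than two characters
         map  { split ' ', $_ } @_;
}

sub stop   { return grep { !$STOP{$_} } @_ }

# Run the named steps in order; each step maps a list of strings
# to a new list of strings.
sub apply_pipeline {
  my ($pipeline, @data) = @_;
  for my $step (@$pipeline) {
    my $code = My::Filter->can($step) or die "unknown filter '$step'";
    @data = $code->(@data);
  }
  return @data;
}

package main;

my @terms = My::Filter::apply_pipeline(
  [ 'tr_blank', 'lower', 'split2', 'stop' ],
  "A Search of the freeWAIS-sf Index!",
);
print "@terms\n";   # prints "search freewais sf index"
```

Since every step takes a list and returns a list, steps can be added,
dropped or reordered by editing the array of names alone.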

The filter definition for a collection defines a set of pipelines for
the attributes and modifies the pipelines that should be used for
prefix and interval searches.

Several complete working examples come with WAIT in the script
directory. It is recommended to follow the pattern of the scripts
smakewhatis and sman.
