Annotation of /trunk/lib/WAIT.pm

Revision 81
Sat Apr 20 15:02:33 2002 UTC by ulpfr
Original Path: cvs-head/lib/WAIT.pm
File size: 6763 byte(s)
Bump version number

#!/usr/bin/perl
# -*- Mode: Cperl -*-
# $Basename: WAIT.pm $
# $Revision: 1.7 $
# Author : Ulrich Pfeifer
# Created On : Wed Nov 5 16:59:32 1997
# Last Modified By: Ulrich Pfeifer
# Last Modified On: Tue Apr 16 23:28:52 2002
# Language : CPerl
# Update Count : 8
# Status : Unknown, Use with caution!
#
# (C) Copyright 1997, Ulrich Pfeifer, all rights reserved.
#
#

package WAIT;
require DynaLoader;
use vars qw($VERSION @ISA);
@ISA = qw(DynaLoader);

$VERSION = '1.900';

bootstrap WAIT $VERSION;

__END__

=head1 NAME

WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS

=head1 SYNOPSIS

A Synopsis is not yet available.

=head1 Status of this document

I started writing down some information about the implementation in
my spare time, before I forget it. The material is incomplete at
best. Any additions or corrections are welcome.

=head1 PURPOSE

As you might know, I developed and maintained B<freeWAIS-sf> (with the
help of many people on the Net). FreeWAIS-sf is based on B<freeWAIS>,
maintained by the Clearing House for Network Information Retrieval
(CNIDR), which in turn is based on B<wais-8-b5>, implemented by Thinking
Machines et al. During this long history (implementation started about
1989) many people contributed to the distribution and added features
not foreseen by the original design. While the system fulfills its
task now, the code has reached a state where adding new features is
nearly impossible, and even fixing longstanding bugs and removing
limitations has become a very time-consuming task.

Therefore I decided to pass the maintenance to WSC Inc. and to build a
new system from scratch. For obvious reasons I chose Perl as the
implementation language.

=head1 DESCRIPTION

The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users might
want to build. Obviously the framework limits the class of systems
that can be built.

      +------+     +-----+     +------+
  ==> |Access| ==> |Parse| ==> |      |
      +------+     +-----+     |      |
         ||                    |      |     +-----+
         ||                    |Filter| ==> |Index|
         \/                    |      |     +-----+
     +-------+     +-----+     |      |
  <= |Display| <== |Query| <-> |      |
     +-------+     +-----+     +------+

A collection (aka table) is defined by the instances of the B<access>
and B<parse> modules together with the B<filter definitions>. At query
time, a B<query> and a B<display> module must be chosen in addition.

=head2 Access

The access module defines which documents are members of a database.
Usually an access module is a tied hash whose keys are the IDs of the
documents (did = document id) and whose values are the documents
themselves. The indexing process loops over the keys using C<FIRSTKEY>
and C<NEXTKEY>. Documents are retrieved with C<FETCH>.

By convention access modules should be members of the
C<WAIT::Document> hierarchy. Have a look at the
C<WAIT::Document::Split> module to get the idea.

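The tied-hash interface described above can be sketched as follows. This is
only a minimal illustration, not code from the WAIT distribution: the package
name C<WAIT::Document::InMemory>, its constructor arguments, and the sample
documents are all hypothetical. Only C<TIEHASH>, C<FETCH>, C<FIRSTKEY> and
C<NEXTKEY> are standard Perl tie methods.

```perl
use strict;
use warnings;

# Hypothetical access module, NOT part of the WAIT distribution.
# It serves documents from an in-memory hash.
package WAIT::Document::InMemory;

sub TIEHASH {
    my ($class, %docs) = @_;
    # Keep the document ids sorted so that iteration order is deterministic.
    my $self = { docs => {%docs}, keys => [ sort keys %docs ], pos => 0 };
    return bless $self, $class;
}

# Retrieve one document by its document id (did).
sub FETCH {
    my ($self, $did) = @_;
    return $self->{docs}{$did};
}

# Restart the iteration and return the first document id.
sub FIRSTKEY {
    my ($self) = @_;
    $self->{pos} = 0;
    return $self->{keys}[ $self->{pos}++ ];
}

# Return the next document id, or undef when the ids are exhausted.
sub NEXTKEY {
    my ($self) = @_;
    return $self->{keys}[ $self->{pos}++ ];
}

package main;

tie my %collection, 'WAIT::Document::InMemory',
    doc1 => "TI: First title\n",
    doc2 => "TI: Second title\n";

# An indexing-style loop: keys() drives FIRSTKEY/NEXTKEY,
# the hash lookup drives FETCH.
my @seen;
for my $did (keys %collection) {
    my $document = $collection{$did};
    push @seen, "$did => $document";
}
print @seen;
```

A real access module would presumably fetch documents from files or another
store instead of memory; the loop at the end would stay the same.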

=head2 Parse

The task of the parse module is to split the documents into logical
parts via the C<split> method. For example, C<WAIT::Parse::Nroff> splits
manuals piped through B<nroff>(1) into the sections I<name>,
I<synopsis>, I<options>, I<description>, I<author>, I<example>,
I<bugs>, I<text>, I<see>, and I<environment>. C<WAIT::Parse::Base>
handles documents with a pretty simple tagged format such as this:

  AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
  TI: Searching Structured Documents with the Enhanced Retrieval
      Functionality of freeWAIS-sf and SFgate
  ER: D. Kroemker
  BT: Computer Networks and ISDN Systems; Proceedings of the third
      International World-Wide Web Conference
  PN: Elsevier
  PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
  PP: 1027-1036
  PY: 1995

Here is its implementation of C<split>:

  sub split {                   # called as method
    my %result;
    my $fld;

    for (split /\n/, $_[1]) {
      if (s/^(\S+):\s*//) {
        $fld = lc $1;
      }
      $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
  }

Since the original document cannot be reconstructed from its
attributes, we need a second method (C<tag>) which marks the regions
of the document with tags for the different attributes. This tagged
form is used by the display module to highlight search terms in the
documents. Besides the tags for the attributes, the method might assign
the special tags C<_b> and C<_i> to indicate bold and italic
regions.

  sub tag {
    my @result;
    my $tag;

    for (split /\n/, $_[1]) {
      next if /^\w\w:\s*$/;
      if (s/^(\S+)://) {
        push @result, {_b => 1}, "$1:";
        $tag = lc $1;
      }
      if (defined $tag) {
        push @result, {$tag => 1}, "$_\n";
      } else {
        push @result, {}, "$_\n";
      }
    }
    return @result;             # we don't go for speed
  }

155     Obviously one could implement C<split> via C<tag>. The reason for
156     having two functions is speed. We need to call C<split> for each
157     document when indexing a collection. Therefore speed is essential. On
158     the other hand, C<tag> is called in order to display a single document
159     and may be a little slower. It may care about tagging bold and italic
160     regions. See C<WAIT::Parse::Nroff> how this might decrease
161     performance.
162    
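To illustrate the remark that C<split> could be implemented via C<tag>, here
is a rough sketch. The package name C<My::Parse::Sketch>, its constructor, and
the method name C<split_via_tag> are hypothetical; the C<tag> method is the one
shown above. Note that the result is not byte-identical to that of C<split>:
C<tag> keeps the blank after the colon and appends a newline to every line.

```perl
use strict;
use warnings;

# Hypothetical package for illustration, NOT part of the WAIT distribution.
package My::Parse::Sketch;

sub new { return bless {}, shift }

# The tag method as shown in the documentation above.
sub tag {
    my @result;
    my $tag;

    for (split /\n/, $_[1]) {
        next if /^\w\w:\s*$/;
        if (s/^(\S+)://) {
            push @result, {_b => 1}, "$1:";
            $tag = lc $1;
        }
        if (defined $tag) {
            push @result, {$tag => 1}, "$_\n";
        } else {
            push @result, {}, "$_\n";
        }
    }
    return @result;
}

# Derive the attribute hash from the (tags, text) pairs returned by tag.
# Tags starting with an underscore (_b, _i) are markup, not attributes.
sub split_via_tag {
    my ($self, $text) = @_;
    my %result;
    my @pairs = $self->tag($text);

    while (my ($tags, $chunk) = splice @pairs, 0, 2) {
        for my $attr (grep { !/^_/ } keys %$tags) {
            $result{$attr} .= $chunk;
        }
    }
    return \%result;
}

package main;

my $record = "AU: Pfeifer, U.\n"
           . "TI: Searching Structured\n"
           . "    Documents\n"
           . "PY: 1995\n";

my $fields = My::Parse::Sketch->new->split_via_tag($record);
print "$_ =>$fields->{$_}" for sort keys %$fields;
```

The sketch walks the whole tagged list and builds an intermediate array, which
is the kind of overhead the dedicated C<split> method avoids during indexing.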

=head2 Filter definition

From the Information Retrieval perspective, the hardest part of the
system is the filter module. The database administrator defines, for
each attribute, how the contents should be processed before they are
stored in the index. Usually the processing contains steps to restrict
the character set, transform case, split into words, and
reduce words to their stems. In WAIT these steps are defined naturally
as a pipeline of processing steps. The pipelines are made up of
functions in the package B<WAIT::Filter>, which is pre-populated with the
most common functions but may be extended at any time.

The equivalent of typical freeWAIS-sf processing would be this
pipeline:

  [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function C<isotr> replaces unknown characters by blanks. C<isolc>
transforms to lower case. C<split2> splits into words and removes
words shorter than two characters. C<stop> removes the freeWAIS-sf
stopwords and C<Stem> applies the Porter algorithm for computing the
stem of the words.

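The pipeline idea can be sketched in a few lines of Perl. The functions below
are toy stand-ins written for this illustration; they are NOT the functions
shipped in B<WAIT::Filter>, and their names (C<toy_tr>, C<toy_lc>,
C<toy_split2>, C<toy_stop>), the tiny stop list, and the helper
C<run_pipeline> are all hypothetical. The sketch assumes that every step takes
a list of strings and returns a list of strings; stemming is left out.

```perl
use strict;
use warnings;

# Hypothetical stop list; the real freeWAIS-sf list is much longer.
my %stopword = map { $_ => 1 } qw(the of and with);

# Replace everything except ASCII letters and digits by blanks.
sub toy_tr {
    my @out = @_;
    tr/a-zA-Z0-9/ /c for @out;
    return @out;
}

# Transform to lower case.
sub toy_lc { return map { lc $_ } @_ }

# Split into words and drop words shorter than two characters.
sub toy_split2 { return grep { length($_) >= 2 } map { split ' ', $_ } @_ }

# Remove stop words.
sub toy_stop { return grep { !$stopword{$_} } @_ }

# Registry mapping step names to functions, so that a pipeline
# can be written as a plain list of names.
my %toy_filter = (
    toy_tr     => \&toy_tr,
    toy_lc     => \&toy_lc,
    toy_split2 => \&toy_split2,
    toy_stop   => \&toy_stop,
);

# Feed the output of each step into the next one.
sub run_pipeline {
    my ($pipeline, @values) = @_;
    for my $name (@$pipeline) {
        my $step = $toy_filter{$name} or die "unknown filter '$name'";
        @values = $step->(@values);
    }
    return @values;
}

my @terms = run_pipeline(
    [ 'toy_tr', 'toy_lc', 'toy_split2', 'toy_stop' ],
    'Searching Structured Documents with the Enhanced Retrieval '
      . 'Functionality of freeWAIS-sf',
);
print join(' ', @terms), "\n";
```

Because a pipeline is just a list of names, a different attribute can get a
different treatment by reordering or omitting steps without touching the step
functions themselves.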

The filter definition for a collection defines a set of pipelines for
the attributes and modifies the pipelines which should be used for
prefix and interval searches.

Several complete working examples come with WAIT in the script
directory. It is recommended to follow the pattern of the scripts
C<smakewhatis> and C<sman>.

=cut
