
Annotation of /trunk/lib/WAIT.pm


Revision 13
Fri Apr 28 15:42:44 2000 UTC (24 years ago) by ulpfr
Original Path: branches/CPAN/lib/WAIT.pm
File size: 6821 byte(s)
Import of WAIT-1.710

#!/usr/bin/perl
#                              -*- Mode: Cperl -*-
# $Basename: WAIT.pm $
# $Revision: 1.6 $
# Author          : Ulrich Pfeifer
# Created On      : Wed Nov 5 16:59:32 1997
# Last Modified By: Ulrich Pfeifer
# Last Modified On: Wed Nov 12 18:26:44 1997
# Language        : CPerl
# Update Count    : 4
# Status          : Unknown, Use with caution!
#
# (C) Copyright 1997, Ulrich Pfeifer, all rights reserved.
#
#

package WAIT;
require DynaLoader;
use vars qw($VERSION @ISA);
@ISA = qw(DynaLoader);

$VERSION = sprintf '%.4f', map $_/10,'$ProjectVersion: 17.1 $ ' =~ /([\d.]+)/;

bootstrap WAIT $VERSION;

__END__

=head1 NAME

WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS

=head1 SYNOPSIS

A Synopsis is not yet available.

=head1 Status of this document

In my spare time I started writing down some information about the
implementation before I forget it. The material is incomplete, to say
the least. Any additions or corrections are welcome.

=head1 PURPOSE

As you might know, I developed and maintained B<freeWAIS-sf> (with the
help of many people on the Net). FreeWAIS-sf is based on B<freeWAIS>,
maintained by the Clearinghouse for Networked Information Discovery
and Retrieval (CNIDR), which in turn is based on B<wais-8-b5>,
implemented by Thinking Machines et al. During this long history
(implementation started about 1989) many people contributed to the
distribution and added features not foreseen by the original design.
While the system fulfills its task now, the code has reached a state
where adding new features is nearly impossible, and even fixing
longstanding bugs and removing limitations has become a very
time-consuming task.

Therefore I decided to pass the maintenance to WSC Inc. and build a
new system from scratch. For obvious reasons I chose Perl as the
implementation language.

=head1 DESCRIPTION

The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users might
want to build. Obviously the framework limits the class of systems
that can be built.

           +------+     +-----+     +------+
       ==> |Access| ==> |Parse| ==> |      |
           +------+     +-----+     |      |
                          ||        |      |     +-----+
                          ||        |Filter| ==> |Index|
                          \/        |      |     +-----+
          +-------+     +-----+     |      |
       <= |Display| <== |Query| <-> |      |
          +-------+     +-----+     +------+

A collection (aka table) is defined by the instances of the B<access>
and B<parse> modules together with the B<filter definitions>. At query
time, a B<query> and a B<display> module must be chosen in addition.

=head2 Access

The access module defines which documents are members of a database.
Usually an access module is a tied hash, whose keys are the Ids of the
documents (did = document id) and whose values are the documents
themselves. The indexing process loops over the keys using C<FIRSTKEY>
and C<NEXTKEY>. Documents are retrieved with C<FETCH>.

By convention access modules should be members of the
C<WAIT::Document> hierarchy. Have a look at the
C<WAIT::Document::Split> module to get the idea.

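As a rough illustration of the tied-hash idea described above, here is a
minimal in-memory access module. The package name
C<My::Document::Memory> and its constructor arguments are invented for
this sketch and are not part of WAIT; a real access module would live in
the C<WAIT::Document> hierarchy and read its documents from files.

```perl
package My::Document::Memory;   # hypothetical name, not part of WAIT
use strict;
use warnings;

# Constructor used by tie(): takes a list of did => document pairs.
sub TIEHASH {
    my ($class, %docs) = @_;
    return bless { docs => {%docs} }, $class;
}

# Retrieve one document by its document id.
sub FETCH {
    my ($self, $did) = @_;
    return $self->{docs}{$did};
}

# Start an iteration over all document ids.
sub FIRSTKEY {
    my ($self) = @_;
    my $reset = keys %{ $self->{docs} };    # reset the each() iterator
    return each %{ $self->{docs} };
}

# Continue the iteration started by FIRSTKEY.
sub NEXTKEY {
    my ($self) = @_;
    return each %{ $self->{docs} };
}

package main;

tie my %collection, 'My::Document::Memory',
    1 => "TI: First document\n",
    2 => "TI: Second document\n";

# An indexing process would loop over the keys in just this way.
for my $did (sort keys %collection) {
    print "$did => $collection{$did}";
}
```

The loop over C<keys %collection> is what triggers C<FIRSTKEY> and
C<NEXTKEY>, and each hash lookup goes through C<FETCH>.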

=head2 Parse

The task of the parse module is to split the documents into logical
parts via the C<split> method. For example, C<WAIT::Parse::Nroff> splits
manuals piped through B<nroff>(1) into the sections I<name>,
I<synopsis>, I<options>, I<description>, I<author>, I<example>,
I<bugs>, I<text>, I<see>, and I<environment>. Here is the
implementation of C<WAIT::Parse::Base>, which handles documents with a
pretty simple tagged format:

  AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
  TI: Searching Structured Documents with the Enhanced Retrieval
      Functionality of freeWAIS-sf and SFgate
  ER: D. Kroemker
  BT: Computer Networks and ISDN Systems; Proceedings of the third
      International World-Wide Web Conference
  PN: Elsevier
  PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
  PP: 1027-1036
  PY: 1995

  sub split {                     # called as method
    my %result;
    my $fld;

    for (split /\n/, $_[1]) {
      if (s/^(\S+):\s*//) {
        $fld = lc $1;
      }
      $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
  }
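
To see what the C<split> method above returns, the following sketch wraps
the same logic in a stand-alone class and runs it on a short record. The
package name C<My::Parse::Tagged> and its C<new> constructor are
assumptions made for this example only; the builtin is called as
C<CORE::split> so that it cannot be confused with the method of the same
name.

```perl
package My::Parse::Tagged;      # hypothetical stand-in for WAIT::Parse::Base
use strict;
use warnings;

sub new { return bless {}, shift }

sub split {                     # same logic as the method shown above
    my %result;
    my $fld;

    for (CORE::split /\n/, $_[1]) {
        if (s/^(\S+):\s*//) {   # a new field starts on this line
            $fld = lc $1;
        }
        $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
}

package main;

my $record = join "\n",
    'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
    'PN: Elsevier',
    'PY: 1995';

my $fields = My::Parse::Tagged->new->split($record);
print "$_ = $fields->{$_}\n" for sort keys %$fields;
```

The result is a hash reference keyed by the lower-cased field names.
Continuation lines are appended exactly as they appear, so any
separating whitespace has to come from the document itself.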

Since the original document cannot be reconstructed from its
attributes, we need a second method (I<tag>) which marks the regions
of the document with tags for the different attributes. This tagged
form is used by the display module to highlight search terms in the
documents. Besides the tags for the attributes, the method might assign
the special tags C<_b> and C<_i> for indicating bold and italic
regions.

  sub tag {
    my @result;
    my $tag;

    for (split /\n/, $_[1]) {
      next if /^\w\w:\s*$/;
      if (s/^(\S+)://) {
        push @result, {_b => 1}, "$1:";
        $tag = lc $1;
      }
      if (defined $tag) {
        push @result, {$tag => 1}, "$_\n";
      } else {
        push @result, {}, "$_\n";
      }
    }
    return @result;               # we don't go for speed
  }

Obviously one could implement C<split> via C<tag>. The reason for
having two functions is speed. We need to call C<split> for each
document when indexing a collection, so its speed is essential. On
the other hand, C<tag> is called in order to display a single document
and may be a little slower; it can afford to tag bold and italic
regions. See C<WAIT::Parse::Nroff> for how this might decrease
performance.

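The remark that C<split> could be implemented via C<tag> can be sketched
as follows. The helper C<split_via_tag>, the package name
C<My::Parse::Tagged> and the C<new> constructor are hypothetical and only
serve this illustration. Note that the result is not identical to the
output of the dedicated C<split> method: the text pieces keep their
leading blank and trailing newline, because C<tag> preserves them for
display.

```perl
package My::Parse::Tagged;      # hypothetical stand-in for WAIT::Parse::Base
use strict;
use warnings;

sub new { return bless {}, shift }

sub tag {                       # same logic as the method shown above
    my @result;
    my $tag;

    for (CORE::split /\n/, $_[1]) {
        next if /^\w\w:\s*$/;
        if (s/^(\S+)://) {
            push @result, {_b => 1}, "$1:";
            $tag = lc $1;
        }
        if (defined $tag) {
            push @result, {$tag => 1}, "$_\n";
        } else {
            push @result, {}, "$_\n";
        }
    }
    return @result;
}

# Hypothetical helper: rebuild the attribute hash from the tagged form.
sub split_via_tag {
    my ($self, $doc) = @_;
    my %result;
    my @tagged = $self->tag($doc);

    # The tagged form is a flat list of (tag hash, text) pairs.
    while (my ($tags, $text) = splice @tagged, 0, 2) {
        for my $name (grep { !/^_/ } keys %$tags) {   # skip _b and _i
            $result{$name} .= $text;
        }
    }
    return \%result;
}

package main;

my $doc    = "AU: Pfeifer, U.\nPY: 1995\n";
my $fields = My::Parse::Tagged->new->split_via_tag($doc);
print "$_ =$fields->{$_}" for sort keys %$fields;
```

Walking the list of pairs for every document is extra work compared to
the direct C<split>, which is the speed argument made above.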

=head2 Filter definition

From the Information Retrieval perspective, the hardest part of the
system is the filter module. The database administrator defines for
each attribute how its contents should be processed before they are
stored in the index. Usually the processing includes steps to restrict
the character set, transform case, split into words, and reduce words
to their stems. In WAIT these steps are defined naturally as a
pipeline of processing steps. The pipelines are made up of functions
in the package B<WAIT::Filter>, which is pre-populated with the most
common functions but may be extended at any time.

The equivalent of typical freeWAIS-sf processing would be this
pipeline:

  [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function C<isotr> replaces unknown characters by blanks. C<isolc>
transforms to lower case. C<split2> splits into words and removes
words shorter than two characters. C<stop> removes the freeWAIS-sf
stopwords and C<Stem> applies the Porter algorithm for computing the
stem of the words.
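
The pipeline idea can be illustrated with a few lines of plain Perl. The
package C<My::Filter>, its C<apply> function and the tiny stopword list
are invented for this sketch; the four filter functions are simplified,
ASCII-only stand-ins that merely borrow the names used above and are not
the implementations shipped in B<WAIT::Filter>. The C<Stem> step is left
out because a Porter stemmer is beyond the scope of an example.

```perl
package My::Filter;             # hypothetical stand-in for WAIT::Filter
use strict;
use warnings;

my %STOP = map { $_ => 1 } qw(a an and of the with);   # toy stopword list

# Replace everything except ASCII letters and digits by blanks.
sub isotr  { map { (my $s = $_) =~ tr/a-zA-Z0-9/ /c; $s } @_ }

# Transform to lower case.
sub isolc  { map { lc $_ } @_ }

# Split into words and drop words shorter than two characters.
sub split2 { grep { length($_) >= 2 } map { split ' ', $_ } @_ }

# Remove stopwords.
sub stop   { grep { !$STOP{$_} } @_ }

# Feed the output of each step into the next one.
sub apply {
    my ($pipeline, @values) = @_;
    for my $step (@$pipeline) {
        my $code = My::Filter->can($step)
            or die "unknown filter '$step'";
        @values = $code->(@values);
    }
    return @values;
}

package main;

my @terms = My::Filter::apply(
    [ 'isotr', 'isolc', 'split2', 'stop' ],
    'Searching Structured Documents with the Enhanced Retrieval',
);
print "@terms\n";
```

Because every step takes a list and returns a list, steps can be
reordered or replaced per attribute simply by editing the array of
names.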

The filter definition for a collection defines a set of pipelines for
the attributes and modifies the pipelines which should be used for
prefix and interval searches.

Several complete working examples come with WAIT in the script
directory. It is recommended to follow the pattern of the scripts
smakewhatis and sman.

=cut

Properties

Name Value
cvs2svn:cvs-rev 1.1.1.2
