--- branches/CPAN/README 2000/04/28 15:41:10 11
+++ branches/CPAN/README 2000/05/09 11:29:45 19
@@ -1,6 +1,6 @@
-                         WAIT 1.6
+                         WAIT 1.8

-             Copyright (c) 1996, Ulrich Pfeifer
+             Copyright (c) 1996-2000, Ulrich Pfeifer

 ------------------------------------------------------------------------
 This program is free software; you can redistribute it and/or
@@ -11,48 +11,81 @@
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 ------------------------------------------------------------------------

-This software is not actively maintained by it's author.
+News:

-For more two years now I tried to steal some time to clean this up
-without any luck. So I decided to pass the baton on. I consider the
-input part pretty satisfying. The query part - despite being operable
-and useful - needs a major overhaul. To provide a forum for further
-discussions an to coordinate further developement, I did setup a
-mailinglist. Drop me a line if you want to participate.
+Locking
+=======
+
+WAIT now supports some basic locking.
+
+Speed
+=====
+
+Searching large collections is now considerably faster:
+
+  $table->search({attr  => 'text',
+                  cont  => $query,
+                  top   => 1,
+                  picky => 0});
+
+Table indices may now be tuned to improve search performance. The
+index tuning can be switched on and off using $table->set(top=>1/0) to
+allow for bulk inserts.
+
+Documentation
+=============
+
+WAIT is still not documented really. But Andreas König took the
+trouble to comment the example scripts. This will help you
+implementing your own applications. I added some tiny scripts to
+index e.g. your .yow file or the fourtune databases.
+
+SourceForge
+===========
+
+WAIT is registered on SourceForge now:
+
+  http://wait.sourceforge.net/
+  https://sourceforge.net/project/?group_id=4814
+
+I will keep the CVS repository up to date. If you have some spare
+tuits, feel free to contribute.
 Ulrich Pfeifer
 ------------------------------------------------------------------------
 NAME
-    WAIT - a rewrite of the freeWAIS-sf engine in Perl
+    WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS
+
+SYNOPSIS
+    A Synopsis is not yet available.

 Status of this document
-    I started writing down some information about the implementation
-    before I forget them in my spare time. The stuff is incomplete
-    at least. Any additions, corrections, ... welcome.
+    I started writing down some information about the implementation before
+    I forget them in my spare time. The stuff is incomplete at least. Any
+    additions, corrections, ... welcome.

 PURPOSE
-    As you might know, I developed and maintained freeWAIS-sf (with
-    the help of many people in The Net). FreeWAIS-sf is based on
-    freeWAIS maintained by the Clearing House for Network
-    Information Retrieval (CNIDR) which in turn is based on wais-8-
-    b5 implemented by Thinking Machine et al. During this long
-    history - implementation started about 1989 - many people
-    contributed to the distribution and added features not foreseen
-    by the original design. While the system fulfills its task now,
-    the code has reached a state where adding new features is nearly
-    impossible and even fixing longstanding bugs and removing
-    limitations has become a very time consuming task.
-
-    Therefore I decided to pass the maintenance to WSC Inc. and
-    built a new system from scratch. For obvious reasons I choosed
-    Perl as implementation language.
+    As you might know, I developed and maintained freeWAIS-sf (with the help
+    of many people in The Net). FreeWAIS-sf is based on freeWAIS maintained
+    by the Clearing House for Network Information Retrieval (CNIDR) which in
+    turn is based on wais-8-b5 implemented by Thinking Machine et al. During
+    this long history - implementation started about 1989 - many people
+    contributed to the distribution and added features not foreseen by the
+    original design. While the system fulfills its task now, the code has
+    reached a state where adding new features is nearly impossible and even
+    fixing longstanding bugs and removing limitations has become a very time
+    consuming task.
+
+    Therefore I decided to pass the maintenance to WSC Inc. and built a new
+    system from scratch. For obvious reasons I choosed Perl as
+    implementation language.

 DESCRIPTION
     The central idea of the system is to provide a framework and the
-    building blocks for any indexing and search system the users
-    might want to build. Obviously the framework limits the class of
-    system which can be build.
+    building blocks for any indexing and search system the users might want
+    to build. Obviously the framework limits the class of system which can
+    be build.

              +------+     +-----+     +------+
          ==> |Access| ==> |Parse| ==> |      |
@@ -64,33 +97,30 @@
          <=  |Display| <== |Query| <-> |      |
              +-------+     +-----+     +------+

-    A collection (aka table) is defined by the instances of the
-    access and parse module together with the filter definitions. At
-    query time in addition a query and a display module must be
-    choosen.
+    A collection (aka table) is defined by the instances of the access and
+    parse module together with the filter definitions. At query time in
+    addition a query and a display module must be choosen.

 Access
-    The access module defines which documents where members of a
-    database. Usually an access module is a tied hash, whose keys
-    are the Ids of the documents (did = document id) and whose
-    values are the documents themselves. The indexing process loops
-    over the keys using `FIRSTKEY' and `NEXTKEY'. Documents are
-    retrieved with `FETCH'.
-
-    By convention access modules should be members of the
-    `WAIT::Document' hierarchy. Have a look at the
-    `WAIT::Document::Split' module to get the idea.
+    The access module defines which documents are members of a database.
+    Usually an access module is a tied hash, whose keys are the Ids of the
+    documents (did = document id) and whose values are the documents
+    themselves. The indexing process loops over the keys using `FIRSTKEY'
+    and `NEXTKEY'. Documents are retrieved with `FETCH'.
+
+    By convention access modules should be members of the `WAIT::Document'
+    hierarchy. Have a look at the `WAIT::Document::Split' module to get the
+    idea.

 Parse
-    The task parse module is to split the documents into logical
-    parts via the `split' method. E.g. the `WAIT::Parse::Nroff'
-    splits manuals piped through nroff(1) into the sections *name*,
-    *synopsis*, *options*, *description*, *author*, *example*,
-    *bugs*, *text*, *see*, and *environment*. Here is the
-    implementation of `WAIT::Parse::Base' which handes documents
-    with a pretty simple tagged format:
+    The task of the parse module is to split the documents into logical
+    parts via the `split' method. E.g. the `WAIT::Parse::Nroff' splits
+    manuals piped through nroff(1) into the sections *name*, *synopsis*,
+    *options*, *description*, *author*, *example*, *bugs*, *text*, *see*,
+    and *environment*. Here is the implementation of `WAIT::Parse::Base'
+    which handles documents with a pretty simple tagged format:

       AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
       TI: Searching Structured Documents with the Enhanced Retrieval
@@ -106,7 +136,7 @@
       sub split { # called as method
         my %result;
         my $fld;
-
+
         for (split /\n/, $_[1]) {
           if (s/^(\S+):\s*//) {
             $fld = lc $1;
@@ -114,20 +144,19 @@
           $result{$fld} .= $_ if defined $fld;
         }
         return \%result;
-      }
+      }

-    Since the original document cannot be reconstructed from its
-    attributes, we need a second method (*tag*) which marks the
-    regions of the document with tags for the different attributes.
-    This tagged form is used by the display module to hilight search
-    terms in the documents. Besides the tags for the attributes, the
-    method might assign the special tags `_b' and `_i' for
-    indicating bold and italic regions.
+    Since the original document cannot be reconstructed from its attributes,
+    we need a second method (*tag*) which marks the regions of the document
+    with tags for the different attributes. This tagged form is used by the
+    display module to hilight search terms in the documents. Besides the
+    tags for the attributes, the method might assign the special tags `_b'
+    and `_i' for indicating bold and italic regions.

       sub tag {
         my @result;
         my $tag;
-
+
         for (split /\n/, $_[1]) {
           next if /^\w\w:\s*$/;
           if (s/^(\S+)://) {
@@ -141,65 +170,43 @@
           }
         }
         return @result; # we don't go for speed
-      }
+      }

-    Obviously one could implement `split' via `tag'. The reason for
-    having two functions is speed. We need to call `split' for each
-    document when indexing a collection. Therefore speed is
-    essential. On the other hand, `tag' is called in order to
-    display a single document and may be a little slower. It may
-    care about tagging bold and italic regions. See
+    Obviously one could implement `split' via `tag'. The reason for having
+    two functions is speed. We need to call `split' for each document when
+    indexing a collection. Therefore speed is essential. On the other hand,
+    `tag' is called in order to display a single document and may be a
+    little slower. It may care about tagging bold and italic regions. See
     `WAIT::Parse::Nroff' how this might decrease performance.

 Filter definition
-    From the Information Retrieval perspective, the hardest part of
-    the system is the filter module. The database administrator
-    defines for each attribute, how the contents should be processed
-    before it is stored in the index. Usually the processing
-    contains steps to restrict the character set, case
-    transformation, splitting to words and transforming to word
-    stems. In WAIT these steps are defined naturally as a pipeline
-    of processing steps. The pipelines are made up by functions in
-    the package WAIT::Filter which is pre-populated by the most
-    common functions but may be extended any time.
+    From the Information Retrieval perspective, the hardest part of the
+    system is the filter module. The database administrator defines for each
+    attribute, how the contents should be processed before it is stored in
+    the index. Usually the processing contains steps to restrict the
+    character set, case transformation, splitting to words and transforming
+    to word stems. In WAIT these steps are defined naturally as a pipeline
+    of processing steps. The pipelines are made up by functions in the
+    package WAIT::Filter which is pre-populated by the most common functions
+    but may be extended any time.

-    The equivalent for a typical freeWAIS-sf processing would be
-    this pipeline:
+    The equivalent for a typical freeWAIS-sf processing would be this
+    pipeline:

       [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

-    The function `isotr' replaces unknown characters by blanks.
-    `isolc' transforms to lower case. `split2' splits into words and
-    removes words shorter than two characters. `stop' removes the
-    freeWAIS-sf stopwords and `Stem' applies the Porter algorithm
-    for computing the stem of the words.
-
-    The filter definition for a collection defines a set of piplines
-    for the attributes and modifies the pipelines which should be
-    used for prefix and interval searches.
-
-    Here is a complete example:
-
-      my $stem = [{
-                   'prefix'    => ['unroff', 'isotr', 'isolc'],
-                   'intervall' => ['unroff', 'isotr', 'isolc'],
-                  },'unroff', 'isotr', 'isolc', 'split2', 'stop', 'Stem'];
-      my $text = [{
-                   'prefix'    => ['unroff', 'isotr', 'isolc'],
-                   'intervall' => ['unroff', 'isotr', 'isolc'],
-                  },
-                  'unroff', 'isotr', 'isolc', 'split2', 'stop'];
-      my $sound = ['unroff', 'isotr', 'isolc', 'split2', 'Soundex'];
-
-      my $spec = [
-                  'name'        => $stem,
-                  'synopsis'    => $stem,
-                  'bugs'        => $stem,
-                  'description' => $stem,
-                  'text'        => $stem,
-                  'environment' => $text,
-                  'example'     => $text, 'example' => $stem,
-                  'author'      => $sound, 'author' => $stem,
-                 ]
+    The function `isotr' replaces unknown characters by blanks. `isolc'
+    transforms to lower case. `split2' splits into words and removes words
+    shorter than two characters. `stop' removes the freeWAIS-sf stopwords
+    and `Stem' applies the Porter algorithm for computing the stem of the
+    words.
+
+    The filter definition for a collection defines a set of pipelines for
+    the attributes and modifies the pipelines which should be used for
+    prefix and interval searches.
+
+    Several complete working examples come with WAIT in the script
+    directory. It is recommended to follow the pattern of the scripts
+    smakewhatis and sman.
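
Note on the filter pipeline described in the patched README: the idea of
running an attribute's text through a list of named filters can be
sketched in plain Perl as below. This is only an illustration under
stated assumptions, not code from WAIT. The %filter table, the
run_pipeline helper and the tiny stopword list are hypothetical, and the
filter bodies are simplified ASCII-only stand-ins for what the README
says `isotr', `isolc', `split2' and `stop' do (`Stem' is left out).

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the filters named in the README. These are
# NOT the WAIT::Filter implementations; they only mimic the described
# behaviour on plain ASCII input.
my %filter = (
    # replace every character outside a small known set by a blank
    isotr  => sub { map { (my $s = $_) =~ tr/a-zA-Z0-9/ /c; $s } @_ },
    # transform to lower case
    isolc  => sub { map { lc $_ } @_ },
    # split into words and drop words shorter than two characters
    split2 => sub { grep { length($_) >= 2 } map { split ' ', $_ } @_ },
    # remove stopwords (illustrative list, not the freeWAIS-sf one)
    stop   => sub {
        my %stop = map { $_ => 1 } qw(the of and in);
        grep { !$stop{$_} } @_;
    },
);

# Run a list of strings through the named filters, left to right.
sub run_pipeline {
    my ($pipeline, @input) = @_;
    for my $name (@$pipeline) {
        my $code = $filter{$name} or die "unknown filter '$name'";
        @input = $code->(@input);
    }
    return @input;
}

my @terms = run_pipeline([qw(isotr isolc split2 stop)],
                         'Searching Structured Documents in the WAIT-System');
print join(' ', @terms), "\n";
# prints: searching structured documents wait system
```

Keeping each filter as a list-in, list-out function is what lets a
pipeline be written as a simple array of names, as in the README.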