
=pod

=head1 NAME

WAIS Queries in WAIT

=head1 Formulating a Query

WAIT uses an extension of the original WAIS protocol for queries, which
encodes new and richer query semantics in the query string. This means
that a query now has to obey a certain syntax (see the section on
L<Query Syntax>), and consequently a user might get a syntax error when
submitting a query. Our goal was therefore to make the syntax as easy as
possible and, in particular, to keep simple free text queries valid.

The simplest queries take the form of a B<free text query>, which is a
list of search terms separated by spaces.

It should be possible to select the categories to be searched for each
term in the query. To keep the original queries valid (and to support
casual users), we provide a default category, which is used if no
category is specified in the query. What follows is an outline of the
query language.

The atomic search expressions of the language are terms, term
wild-cards and phrases (e.g. C<information>, C<inform*>,
C<"information retrieval">).

Stemming is handled transparently for the client. Terms searched in a
stemmed category are searched using their word stem automatically. For
a wildcard (only tail truncation is implemented), all matching words
from the dictionary are used as search terms. Phrase search looks up
the words in the string. At least one of them must be an index term.
Then the server scans the documents containing this word for string
matches for the complete phrase. This means that string search can only
work if the server has access to the documents. For type C<URL> this
is not the case.
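
The wildcard expansion described above can be sketched as follows. This
is only an illustration in Python, not the WAIT implementation: the
function name C<expand_tail_wildcard> and the representation of the
dictionary as a sorted list of index terms are assumptions made for the
example.

```python
import bisect


def expand_tail_wildcard(pattern, dictionary):
    """Expand a tail-truncated pattern such as 'inform*'.

    `dictionary` is assumed to be a sorted list of index terms
    (a hypothetical stand-in for the real index dictionary).
    """
    if not pattern.endswith("*"):
        # Not a wildcard: the term matches only itself, if indexed.
        return [pattern] if pattern in dictionary else []
    prefix = pattern[:-1]
    # All words sharing the prefix are adjacent in a sorted list.
    start = bisect.bisect_left(dictionary, prefix)
    matches = []
    for word in dictionary[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches


dictionary = sorted(["index", "inform", "information", "informed",
                     "retrieval", "system", "systems"])
search_terms = expand_tail_wildcard("inform*", dictionary)
```

Every word returned this way is then used as a search term, as if the
user had typed all of them.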

Additionally, the prefix operators C<soundex> and C<phonix> convert the
query term that follows into its Soundex or Phonix code. This is useful,
for example, when searching phonebooks if the exact spelling of a name
is not known.

These atomic expressions can be combined arbitrarily with the I<binary>
Boolean operators C<and>, C<or> and C<not>. Here C<not> means C<and not>
in Boolean logic and is therefore a binary operation, too (see the
examples below). Parentheses can be used for grouping. For
compatibility with the original syntax, C<or> may be omitted.
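
To illustrate what the C<soundex> prefix operator does to a term, here
is a minimal Python sketch of the classic American Soundex coding. It
assumes that WAIT follows the standard algorithm; the server's actual
coding may differ in details, and Phonix is not covered.

```python
# Letter groups of the classic Soundex algorithm.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code of `word`."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        digit = SOUNDEX_CODES.get(letter, "")
        # Skip vowels and repeats of the previous consonant group.
        if digit and digit != previous:
            code += digit
        # H and W do not separate letters of the same group; vowels do.
        if letter not in "HW":
            previous = digit
    # Pad with zeros and truncate to a letter plus three digits.
    return (code + "000")[:4]
```

Under this coding, C<salatan> and C<Salton> both map to C<S435>, which
is why a Soundex search for one finds the other.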

For each expression, a semantic category (field) can be defined using
the C<I<category> I<pred>> operator, where I<pred> is C<=> for text
categories and one of C<=>, C<E<lt>>, C<E<gt>> for numeric
categories.

Here are some examples:

=over

=item C<information retrieval>

free text query

=item C<information or retrieval>

same as above

=item C<ti=information retrieval>

C<information> must be in the title

=item C<ti=(information retrieval)>

one of them in title

=item C<ti=(information or retrieval)>

same as above

=item C<ti=(information and retrieval)>

both of them in title

=item C<ti=(information not retrieval)>

C<information> in title and C<retrieval> not in title

=item C<py=1990>

numerically equal

=item C<pyE<lt>1990>

numerically less

=item C<pyE<gt>1990>

numerically greater

=item C<edE<lt>19930101>

Date search. Format is yyyymmdd

=item C<au=(soundex salatan)>

Soundex search, matches e.g. C<Salton>

=item C<ti=("information retrieval")>

phrase search

=item C<ti=(information system*)>

wild-card search

=item C<nuclear PROX_UNORDERED 10 waste>

Unordered proximity search (similar to the Lexis/Nexis search syntax):
finds all stories that have C<nuclear> and C<waste> within 10 words of
each other.

=item C<nuclear PROX_ORDERED 10 waste>

Ordered proximity search, for when the order of the two words is
important: finds all stories that have C<nuclear> up to 10 words before
C<waste>. Proximity also works within fields; for instance,
C<byline=(dan PROX_ORDERED 2 woods)> finds every story with a byline
that has C<dan> at most 2 words before C<woods>. Note that you must use
parentheses around the words you want to look for in the field.

=item C<PROX_ATLEAST 20 clinton>

I<At least> search: finds every story that has at least 20 occurrences
of C<clinton>.

=back
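
The three proximity operators from the examples above can be pictured
with the following Python sketch. It works on a document that has
already been split into words; the function names and the exact way the
distance is counted (difference of word positions) are assumptions made
for illustration and are not taken from the WAIT source.

```python
def positions(tokens, word):
    """Return the indexes at which `word` occurs in the token list."""
    return [i for i, token in enumerate(tokens) if token == word]


def prox_unordered(tokens, left, right, distance):
    """True if `left` and `right` occur within `distance` words, any order."""
    return any(i != j and abs(i - j) <= distance
               for i in positions(tokens, left)
               for j in positions(tokens, right))


def prox_ordered(tokens, left, right, distance):
    """True if `left` occurs at most `distance` words before `right`."""
    return any(0 < j - i <= distance
               for i in positions(tokens, left)
               for j in positions(tokens, right))


def prox_atleast(tokens, word, count):
    """True if `word` occurs at least `count` times."""
    return len(positions(tokens, word)) >= count


story = "nuclear power plants produce waste".split()
```

With this story, C<prox_unordered(story, "waste", "nuclear", 10)> holds,
while C<prox_ordered(story, "waste", "nuclear", 10)> does not, because
C<waste> comes after C<nuclear>.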

=head1 Query Weighting

For now, let us disregard the Boolean operators and assume that a query
is simply a list of terms. Query weighting is done using the B<Vector
Space> model. Each term in the query is associated with a B<query term
weight>. Currently this weight is always 1. On the other hand, the
terms in each document get a B<document term weight>. This weight is
the product of a document-specific weight and the B<inverse document
frequency>. The latter is defined as C<idf = log(I<N>/I<n>)>, where
I<N> is the number of documents in the database and I<n> is the number
of documents the term occurs in.

The other part of the document weight is computed as follows. Let I<tf>
be the number of occurrences of the term in the document and I<maxtf>
the maximum frequency of any term in the document. A preliminary weight
is computed according to C<I<w> = (0.5 * I<tf>)/(1 + I<maxtf>)>. These
weights are then normalized by dividing them by the square root of the
sum of the squares of all preliminary weights for terms in this
document, so that the document-specific weights make up a vector of
length 1. The final document term weight is obtained by multiplying
this weight by the I<idf>.
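
As a worked illustration of the formulas above, here is a short Python
sketch. It assumes the natural logarithm for the I<idf> and assumes
that normalization divides by the Euclidean norm, since that is what
makes the resulting vector have length 1. The function names and the
sample numbers are invented for the example.

```python
import math
from collections import Counter


def unit_weights(doc_tokens):
    """Document-specific weights, normalized to a vector of length 1."""
    tf = Counter(doc_tokens)
    if not tf:
        return {}
    maxtf = max(tf.values())
    # Preliminary weight: w = (0.5 * tf) / (1 + maxtf)
    prelim = {term: (0.5 * freq) / (1 + maxtf) for term, freq in tf.items()}
    norm = math.sqrt(sum(w * w for w in prelim.values()))
    return {term: w / norm for term, w in prelim.items()}


def document_term_weights(doc_tokens, doc_freq, num_docs):
    """Multiply each normalized weight by idf = log(N / n)."""
    return {term: w * math.log(num_docs / doc_freq[term])
            for term, w in unit_weights(doc_tokens).items()}


# Hypothetical collection: 1000 documents, 'information' occurs in 10
# of them and 'retrieval' in 100.
doc = ["information", "retrieval", "information"]
weights = document_term_weights(
    doc, {"information": 10, "retrieval": 100}, 1000)
```

In this example the rarer and more frequent term C<information> ends up
with a clearly higher weight than C<retrieval>.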

For simple queries (no Booleans), the weight of a document is computed
by multiplying the query term weight by the document term weight for
each term in the query and summing up the results. This is often
referred to as the scalar product or vector product (hence the name of
the model).

Now for the Booleans. The C<or> operator is simply dropped, so
C<information or retrieval> yields exactly the same weight as
C<information retrieval>. To interpret the vector product in another
way, you can say that the C<or> operator just sums up the weights of
its arguments. For the C<and> operator, the weights of both arguments
are computed and the final weight is the minimum of these weights.
Similarly, the I<binary> C<not> operator returns the minimum of the
weight of the left argument and 1 minus the weight of the right
argument.
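
The combination rules above can be summarized in a few lines of Python.
This is a sketch only: the nested-tuple representation of a query and
the function name C<score> are hypothetical, and the per-term weights
are assumed to have been computed already for one document.

```python
def score(node, weights):
    """Weight of one document for a query tree.

    `node` is ("word", term) or (op, left, right) with op in
    "or", "and", "not"; `weights` maps terms to document term weights.
    """
    if node[0] == "word":
        # Query term weight is 1, so the product is the document weight.
        return weights.get(node[1], 0.0)
    op, left, right = node
    left_w = score(left, weights)
    right_w = score(right, weights)
    if op == "or":
        return left_w + right_w
    if op == "and":
        return min(left_w, right_w)
    if op == "not":
        return min(left_w, 1.0 - right_w)
    raise ValueError("unknown operator: %r" % (op,))


# Hypothetical document term weights for a single document.
weights = {"information": 0.6, "retrieval": 0.3}
info = ("word", "information")
retr = ("word", "retrieval")
```

Here C<score(("and", info, retr), weights)> gives 0.3, the smaller of
the two weights, and C<score(("not", info, retr), weights)> gives 0.6,
the minimum of 0.6 and 1 - 0.3.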

=head1 Query Syntax

The query syntax below is taken from the file C<waisquery.y> included
in the distribution.

  query           : expression
                  ;
  
  expression      : term
                  | expression OR term 
                  ;
  
  term            : factor
                  | term AND factor 
                  | term NOT factor 
                  ;
  
  factor          : unit
                  | unit PROX_ORDERED unit
                  | unit PROX_UNORDERED unit
                  | PROX_ATLEAST unit
                  ;
  
  unit            : w_unit 
                  | '(' expression ')' 
                  | WORD '=' '(' s_expression ')'
                  | WORD '=' w_unit 
                  | WORD '<' WORD
                  | WORD '>' WORD
                  | WORD '[' WORD ',' WORD ']'
                  ;

  phonsound       : PHONIX  
                  | SOUNDEX 
                  ;

  s_expression    : s_term 
                  | s_expression OR s_term 
                  ;
  
  s_term          : s_factor 
                  | s_term AND s_factor 
                  | s_term NOT s_factor 
                  ;
  
  s_factor        : s_unit
                  | s_unit PROX_ORDERED s_unit
                  | s_unit PROX_UNORDERED s_unit
                  | PROX_ATLEAST s_unit
                  ;
  
  s_unit          : w_unit
                  | '(' s_expression ')' 
                  ;

  a_unit          : WORD 
                  | phonsound WORD 
                  ;

  w_unit          : a_unit 
                  | a_unit ASSIGN FLOAT
                  ;
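
To show how the C<expression>, C<term> and C<unit> rules nest, here is
a small recursive-descent sketch in Python. It is not derived from
C<waisquery.y> itself: it covers only C<OR>, C<AND>, C<NOT>,
parentheses and C<WORD '=' ...>, it is slightly more permissive than
the grammar after C<=>, and it leaves out the proximity operators,
C<phonsound>, the numeric comparisons and C<ASSIGN>. Matching the
operator keywords case-insensitively is an assumption, and all names
are invented for the example.

```python
import re

TOKEN_RE = re.compile(r'\s*(?:([()=])|("[^"]*"|[^\s()=]+))')
KEYWORDS = ("or", "and", "not")


def tokenize(query):
    """Split a query into punctuation, words and quoted phrases."""
    query = query.rstrip()
    tokens, pos = [], 0
    while pos < len(query):
        match = TOKEN_RE.match(query, pos)
        if match is None:
            raise SyntaxError("cannot tokenize at position %d" % pos)
        tokens.append(match.group(1) or match.group(2))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None:
            raise SyntaxError("unexpected end of query")
        if expected is not None and token != expected:
            raise SyntaxError("expected %r, got %r" % (expected, token))
        self.pos += 1
        return token

    def expression(self):
        # expression : term | expression OR term
        node = self.term()
        while self.peek() is not None and self.peek().lower() == "or":
            self.take()
            node = ("or", node, self.term())
        return node

    def term(self):
        # term : factor | term AND factor | term NOT factor
        node = self.unit()
        while self.peek() is not None and self.peek().lower() in ("and", "not"):
            op = self.take().lower()
            node = (op, node, self.unit())
        return node

    def unit(self):
        # unit : WORD | '(' expression ')' | WORD '=' unit
        if self.peek() == "(":
            self.take("(")
            node = self.expression()
            self.take(")")
            return node
        word = self.take()
        if word in ("(", ")", "=") or word.lower() in KEYWORDS:
            raise SyntaxError("unexpected token %r" % word)
        if self.peek() == "=":
            self.take("=")
            return ("field", word, self.unit())
        return ("word", word)


def parse(query):
    parser = Parser(tokenize(query))
    node = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError("unexpected token %r" % parser.peek())
    return node
```

Because the sketch follows only the rules printed above, a query with
an omitted C<or>, such as C<information retrieval>, is rejected here.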

=head1 Authors

This document is based on the Queries chapter of the freeWAIS-sf
documentation written by Ulrich Pfeifer, modified to match WAIT. All
errors and omissions should be blamed on Dobrica Pavlinusic.

=cut