
=pod

=head1 NAME

WAIS Queries in WAIT

=head1 Formulating a Query

WAIT uses an extension of the original WAIS protocol for queries, which
encodes new and richer query semantics in the query string. This means
that a query now has to obey a certain syntax (see the section on
L<Query Syntax>), and consequently a user might get a syntax error
when submitting a query. Our goal was therefore to make the syntax as
easy as possible and, in particular, to keep simple free text queries
valid.

The simplest queries take the form of a B<free text query>, which is a
list of search terms separated by spaces.

Within a query, the category to be searched can be selected for each
term. To keep the original queries valid (and to support casual users),
we provide a default category, which is used if no category is
specified in the query. The following is an outline of the query
language.

The atomic search expressions of the language are terms, term
wild-cards and phrases (e.g. C<information>, C<inform*>,
C<"information retrieval">).

Stemming is handled transparently for the client. Terms searched in a
stemmed category are searched using their word stem automatically. For
a wildcard (only tail truncation is implemented), all matching words
from the dictionary are used as search terms. Phrase search looks up
the words in the string. At least one of them must be an index term.
Then the server scans the documents containing this word for string
matches for the complete phrase. This means that string search can only
work if the server has access to the documents. For type C<URL> this
is not the case.
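
The wildcard expansion described above can be sketched as follows. This
is only an illustration, not the code WAIT uses: the function name
C<expand_wildcard> and the sorted word list standing in for the
dictionary are assumptions made for the example.

```python
from bisect import bisect_left


def expand_wildcard(pattern, dictionary):
    """Return all dictionary words matching a tail-truncated pattern.

    `dictionary` is a sorted list of index terms (a hypothetical stand-in
    for the real dictionary). Only one trailing `*` is a wildcard.
    """
    if not pattern.endswith("*"):
        # No wildcard: the term matches only itself, if it is indexed.
        return [pattern] if pattern in dictionary else []
    prefix = pattern[:-1]
    # Words sharing a prefix are contiguous in a sorted list.
    start = bisect_left(dictionary, prefix)
    matches = []
    for word in dictionary[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches


words = sorted(["index", "inform", "information", "informed", "retrieval"])
print(expand_wildcard("inform*", words))
```

Every word returned this way would then be used as a search term, as if
the user had typed the words one after another.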

Additionally, the prefix operators C<soundex> and C<phonix> convert
the following query term into its Soundex or Phonix code. This is very
useful, for example, when searching phone books if the exact spelling
of a name is not known.

Arbitrary Boolean combinations of these atomic expressions with the
I<binary> operators C<and>, C<or> and C<not> are allowed. Note that
C<not> means C<and not> in Boolean logic and is therefore a binary
operation, too; see the examples below. Parentheses can be used for
grouping. For compatibility with the original syntax, C<or> may be
omitted.
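
To give an idea of what the C<soundex> prefix operator does to a term,
here is a minimal sketch of the classic American Soundex code. That
WAIT uses exactly this variant is an assumption; the Phonix code is
not covered, and the names C<CODES> and C<soundex> are chosen for the
example only.

```python
# Consonant groups of classic American Soundex (assumed variant).
CODES = {}
for digit, letters in enumerate(
    ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1
):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(name):
    """Return the four-character Soundex code of `name`."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with equal codes
        code = CODES.get(c, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets `previous` to ""
    # Pad with zeros and cut to one letter plus three digits.
    return (result + "000")[:4]


print(soundex("salatan"), soundex("Salton"))
```

Both spellings map to the same code, which is why a Soundex search for
one can match the other.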

For each expression, a semantic category (field) can be defined using
the C<I<category> I<pred>> operator, where I<pred> is C<=> for text
categories and one of C<=>, C<E<lt>>, C<E<gt>> for numeric
categories.

Here are some examples:

=over

=item C<information retrieval>

free text query

=item C<information or retrieval>

same as above

=item C<ti=information retrieval>

C<information> must be in the title

=item C<ti=(information retrieval)>

one of them in title

=item C<ti=(information or retrieval)>

same as above

=item C<ti=(information and retrieval)>

both of them in title

=item C<ti=(information not retrieval)>

C<information> in title and C<retrieval> not in title

=item C<py=1990>

numerically equal

=item C<pyE<lt>1990>

numerically less

=item C<pyE<gt>1990>

numerically greater

=item C<edE<lt>19930101>

Date search. Format is yyyymmdd

=item C<au=(soundex salatan)>

soundex search, matches e.g. C<Salton>

=item C<ti=("information retrieval")>

phrase search

=item C<ti=(information system*)>

wild-card search

=item C<nuclear PROX_UNORDERED 10 waste>

Proximity search (like the Lexis/Nexis search syntax): finds all
stories that have C<nuclear> and C<waste> within 10 words of each
other, in either order.

=item C<nuclear PROX_ORDERED 10 waste>

Ordered proximity search: if the order of the two words is important,
C<nuclear PROX_ORDERED 10 waste> finds all stories that have
C<nuclear> up to 10 words before C<waste>. Proximity also works within
fields; for instance, C<byline=(dan PROX_ORDERED 2 woods)> finds every
story with a byline that has C<dan> up to 2 words before C<woods>.
Note that you must put parentheses around the words you want to look
for in the field.

=item C<PROX_ATLEAST 20 clinton>

I<At least> search: finds every story that has at least 20 occurrences
of C<clinton>.

=back
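
The three proximity operators from the examples above can be pictured
with the following sketch over a list of words. How WAIT counts the
distance between two words is not stated here, so the sketch assumes
that the distance is the difference of the word positions; all
function names are made up for the illustration.

```python
def positions(words, term):
    """Return the 0-based positions at which `term` occurs in `words`."""
    return [i for i, w in enumerate(words) if w == term]


def prox_unordered(words, left, right, distance):
    """True if `left` and `right` occur within `distance` words, any order."""
    return any(
        0 < abs(j - i) <= distance
        for i in positions(words, left)
        for j in positions(words, right)
    )


def prox_ordered(words, left, right, distance):
    """True if `left` occurs at most `distance` words before `right`."""
    return any(
        0 < j - i <= distance
        for i in positions(words, left)
        for j in positions(words, right)
    )


def prox_atleast(words, term, count):
    """True if `term` occurs at least `count` times."""
    return len(positions(words, term)) >= count


story = "the waste from the nuclear plant".split()
print(prox_unordered(story, "nuclear", "waste", 10))
print(prox_ordered(story, "nuclear", "waste", 10))
```

In this story C<waste> comes before C<nuclear>, so the unordered check
succeeds while the ordered one does not.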

=head1 Query Weighting

For now, let us disregard the Boolean operators and assume that a
query is simply a list of terms. Query weighting is done using the
B<Vector Space> model. Each term in the query is associated with a
B<query term weight>; currently this weight is always 1. On the other
hand, the terms in each document get a B<document term weight>. This
weight is the product of a document specific weight and the B<inverse
document frequency>. The latter is defined as
C<idf = log(I<N>/I<n>)>, where I<N> is the number of documents in the
database and I<n> is the number of documents the term occurs in.
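
As a small worked example of the I<idf> formula, consider the sketch
below. The base of the logarithm is not given above, so the natural
logarithm is assumed; the function name C<idf> and the numbers are
illustrative only.

```python
import math


def idf(num_docs, docs_with_term):
    """Inverse document frequency: log(N / n), natural log assumed."""
    if docs_with_term <= 0 or docs_with_term > num_docs:
        raise ValueError("need 0 < n <= N")
    return math.log(num_docs / docs_with_term)


# A term found in 10 of 1000 documents is weighted higher than one
# found in 100 of them; a term found everywhere gets weight 0.
print(idf(1000, 10), idf(1000, 100), idf(1000, 1000))
```

Rare terms therefore contribute more to the weight of a document than
terms that occur almost everywhere.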

The other part of the document term weight is computed as follows. Let
I<tf> be the number of occurrences of the term in the document and
I<maxtf> the maximum frequency of any term in the document. A
preliminary weight is computed according to
C<I<w> = (0.5 * I<tf>)/(1 + I<maxtf>)>. These weights are then
normalized by dividing them by the square root of the sum of the
squares of all preliminary weights for terms in this document, so that
the document specific weights make up a vector of length 1. The final
document term weight is obtained by multiplying this weight by the
I<idf>.
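
The computation of the document term weights can be sketched as
follows. The sketch divides by the square root of the sum of the
squares (the Euclidean norm), since that is what gives a vector of
length 1; this reading, the function names, and the I<idf> values
passed in are assumptions made for the example.

```python
import math
from collections import Counter


def document_weights(terms):
    """Return the normalized document specific weight of each term."""
    if not terms:
        return {}
    tf = Counter(terms)
    maxtf = max(tf.values())
    # Preliminary weight: w = (0.5 * tf) / (1 + maxtf)
    prelim = {t: (0.5 * f) / (1 + maxtf) for t, f in tf.items()}
    # Divide by the Euclidean norm so the weight vector has length 1.
    norm = math.sqrt(sum(w * w for w in prelim.values()))
    return {t: w / norm for t, w in prelim.items()}


def document_term_weights(terms, idf):
    """Multiply each document specific weight by the idf of its term."""
    return {t: w * idf[t] for t, w in document_weights(terms).items()}


doc = ["nuclear", "waste", "nuclear", "plant"]
weights = document_weights(doc)
print(weights)
print(sum(w * w for w in weights.values()))  # close to 1
```

Here C<idf> is a plain dictionary mapping each term to its inverse
document frequency, computed beforehand for the whole database.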

For simple queries (no Booleans), the weight of a document is computed
by multiplying the query term weight by the document term weight for
each term in the query and summing up the results. This is often
referred to as the vector product (hence the name of the model) or
scalar product.
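
For a simple query this amounts to the following sketch, which
multiplies each query term weight (currently 1) by the matching
document term weight and adds up the products. The function name
C<score> and the example weights are invented for the illustration.

```python
def score(query_terms, doc_term_weights, query_term_weight=1.0):
    """Scalar product of the query vector and a document vector.

    Terms that do not occur in the document contribute 0.
    """
    return sum(
        query_term_weight * doc_term_weights.get(term, 0.0)
        for term in query_terms
    )


# Hypothetical document term weights for one document.
doc = {"information": 0.5, "retrieval": 0.25}
print(score(["information", "retrieval"], doc))
print(score(["information", "systems"], doc))
```

With all query term weights equal to 1, the score is simply the sum of
the document term weights of the query terms found in the document.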

Now let us turn to the Booleans. The C<or> operator is simply dropped,
so C<information or retrieval> yields exactly the same weight as
C<information retrieval>. To interpret the vector product in another
way, you can say that the C<or> operator just sums up the weights of
its arguments. For the C<and> operator, the weights of both arguments
are computed and the final weight is the minimum of these weights.
Similarly, the I<binary> C<not> operator returns the minimum of the
weight of the left argument and 1 minus the weight of the right
argument.
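
The three rules can be put together in a small evaluator. This is a
sketch only: the nested tuples used to represent a parsed query and the
function name C<evaluate> are assumptions for the example, not WAIT
data structures.

```python
def evaluate(node, weights):
    """Return the weight of a query tree for one document.

    A node is either a term (str) or a tuple (operator, left, right).
    `weights` maps terms to their weight in the document.
    """
    if isinstance(node, str):
        return weights.get(node, 0.0)
    op, left, right = node
    lw = evaluate(left, weights)
    rw = evaluate(right, weights)
    if op == "or":
        return lw + rw          # or: sum of both weights
    if op == "and":
        return min(lw, rw)      # and: minimum of both weights
    if op == "not":
        return min(lw, 1 - rw)  # binary not: left "and not" right
    raise ValueError("unknown operator: " + op)


doc = {"information": 0.5, "retrieval": 0.25}
print(evaluate(("or", "information", "retrieval"), doc))
print(evaluate(("and", "information", "retrieval"), doc))
print(evaluate(("not", "information", "retrieval"), doc))
```

Because C<and> and C<not> take a minimum, a document lacking one of
the terms of an C<and> gets weight 0, while C<or> still rewards it for
the term it does contain.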

=head1 Query Syntax

The query syntax is taken from the file C<waisquery.y> included in the
distribution.

  query           : expression
                  ;
  
  expression      : term
                  | expression OR term 
                  ;
  
  term            : factor
                  | term AND factor 
                  | term NOT factor 
                  ;
  
  factor          : unit
                  | unit PROX_ORDERED unit
                  | unit PROX_UNORDERED unit
                  | PROX_ATLEAST unit
                  ;
  
  unit            : w_unit 
                  | '(' expression ')' 
                  | WORD '=' '(' s_expression ')'
                  | WORD '=' w_unit 
                  | WORD '<' WORD
                  | WORD '>' WORD
                  | WORD '[' WORD ',' WORD ']'
                  ;

  phonsound       : PHONIX  
                  | SOUNDEX 
                  ;

  s_expression    : s_term 
                  | s_expression or s_term 
                  ;
  
  s_term          : s_factor 
                  | s_term AND s_factor 
                  | s_term NOT s_factor 
                  ;
  
  s_factor        : s_unit
                  | s_unit PROX_ORDERED s_unit
                  | s_unit PROX_UNORDERED s_unit
                  | PROX_ATLEAST s_unit
                  ;
  
  s_unit          : w_unit
                  | '(' s_expression ')' 
                  ;

  a_unit          : WORD 
                  | phonsound WORD 
                  ;

  w_unit          : a_unit 
                  | a_unit ASSIGN FLOAT

=head1 Authors

This document is based on the Queries chapter of the freeWAIS-sf
documentation written by Ulrich Pfeifer, and was then modified to match
WAIT. All errors and omissions should be blamed on Dobrica Pavlinusic.

=cut