1 |
dpavlin |
92 |
=pod |
2 |
|
|
|
3 |
|
|
=head1 NAME |
4 |
|
|
|
5 |
|
|
WAIS Queries is WAIT |
6 |
|
|
|
7 |
|
|
=head1 Formulating a Query |
8 |
|
|
|
9 |
|
|
WAIT uses extension of original WAIS Protocol for queries which |
10 |
|
|
encodes new and richer query semantics in the query |
11 |
|
|
string. This means that the query now has to obey a certain syntax |
12 |
|
|
(see section on L<Query Syntax>) |
13 |
|
|
and consequently a user might get a syntax error |
14 |
|
|
when submitting a query. So our goal was to make the syntax as easy as |
15 |
|
|
possible and especially leave simple free text queries valid. |
16 |
|
|
|
17 |
|
|
Simpliest query are in form B<free text query> which means a list of search |
18 |
|
|
term separated by spaces. |
19 |
|
|
|
20 |
|
|
In the query the categories to be searched should be selectable for |
21 |
|
|
each term. To leave the original queries valid (and to support casual |
22 |
|
|
users) we provided a default category, which is used if no category is |
23 |
|
|
specified in the query. Now we give an outline of the query language: |
24 |
|
|
|
25 |
|
|
The atomic search expressions of the language are terms, term |
26 |
|
|
wild-cards and phrases (e.g. C<information>, C<inform*>, |
27 |
|
|
C<"information retrieval">). |
28 |
|
|
|
29 |
|
|
Stemming is handled transparently for the client. Terms searched in a |
30 |
|
|
stemmed category are searched using their word stem automatically. For |
31 |
|
|
a wildcard (only tail truncation is implemented), all matching words |
32 |
|
|
from the dictionary are used as search terms. Phrase search looks up |
33 |
|
|
the words in the string. At least one of them must be an index term. |
34 |
|
|
Then the server scans the documents containing this word for string |
35 |
|
|
matches for the complete phrase. This means that string search can only |
36 |
|
|
work if the server has access to the documents. For type C<URL> this |
37 |
|
|
is not the case. |
38 |
|
|
|
39 |
|
|
Additionally the prefix operators C<soundex> and C<phonix> are |
40 |
|
|
allowed for converting the following query term into its Soundex/Phonix |
41 |
|
|
code. This is for example very useful when searching in phonebooks if |
42 |
|
|
the exact spelling of a name is not known. Arbitrary Boolean |
43 |
|
|
combination of these atomic expressions with the I<binary> operators |
44 |
|
|
C<and>, C<or> and C<not> (C<not> means C<and not> in Boolean |
45 |
|
|
logic and is therefore a binary operation, too. See the examples |
46 |
|
|
below.) are allowed. Parentheses can be used for grouping. For |
47 |
|
|
compatibility with the original syntax, C<or> may be omitted. |
48 |
|
|
|
49 |
|
|
For each expression, a semantic category (field) can be defined using |
50 |
|
|
the C<I<category> I<pred>> operator, where I<pred> is C<=> for text |
51 |
|
|
categories and one of C<=>, C<E<lt>>, C<E<gt>> for numeric |
52 |
|
|
categories. |
53 |
|
|
|
54 |
|
|
Here are some examples: |
55 |
|
|
|
56 |
|
|
=over |
57 |
|
|
|
58 |
|
|
=item C<information retrieval> |
59 |
|
|
|
60 |
|
|
free text query |
61 |
|
|
|
62 |
|
|
=item C<information or retrieval> |
63 |
|
|
|
64 |
|
|
same as above |
65 |
|
|
|
66 |
|
|
=item C<ti=information retrieval> |
67 |
|
|
|
68 |
|
|
C<information> must be in the title |
69 |
|
|
|
70 |
|
|
=item C<ti=(information retrieval)> |
71 |
|
|
|
72 |
|
|
one of them in title |
73 |
|
|
|
74 |
|
|
=item C<ti=(information or retrieval)> |
75 |
|
|
|
76 |
|
|
same as above |
77 |
|
|
|
78 |
|
|
=item C<ti=(information and retrieval)> |
79 |
|
|
|
80 |
|
|
both of them in title |
81 |
|
|
|
82 |
|
|
=item C<ti=(information not retrieval)> |
83 |
|
|
|
84 |
|
|
C<information> in title and C<retrieval> not in title |
85 |
|
|
|
86 |
|
|
=item C<py=1990> |
87 |
|
|
|
88 |
|
|
numerically equal |
89 |
|
|
|
90 |
|
|
=item C<pyE<lt>1990> |
91 |
|
|
|
92 |
|
|
numerically less |
93 |
|
|
|
94 |
|
|
=item C<pyE<gt>1990> |
95 |
|
|
|
96 |
|
|
numerically greater |
97 |
|
|
|
98 |
|
|
=item C<edE<lt>19930101> |
99 |
|
|
|
100 |
|
|
Date search. Format is yyyymmdd |
101 |
|
|
|
102 |
|
|
=item C<au=(soundex salatan)> |
103 |
|
|
|
104 |
|
|
soundex search, matches eg. C<Salton> |
105 |
|
|
|
106 |
|
|
=item C<ti=("information retrieval")> |
107 |
|
|
|
108 |
|
|
phrase search |
109 |
|
|
|
110 |
|
|
=item C<ti=(information system*)> |
111 |
|
|
|
112 |
|
|
wild-card search |
113 |
|
|
|
114 |
|
|
=item C<nuclear PROX_UNORDERED 10 waste> |
115 |
|
|
|
116 |
|
|
With this |
117 |
|
|
feature, you could search for C<nuclear PROX_UNORDERED 10 waste> (like the |
118 |
|
|
Lexis/Nexis search syntax) to find all stories that have C<nuclear> |
119 |
|
|
and C<waste> within 10 words of each other. |
120 |
|
|
|
121 |
|
|
=item C<nuclear PROX_ORDERED 10 waste> |
122 |
|
|
|
123 |
|
|
Proximity Searches If the order of the two words is important, then |
124 |
|
|
C<nuclear PROX_ORDERED 10 waste> will find all stories that have C<nuclear> |
125 |
|
|
up to 10 words before C<waste>. Proximity also works within fields; |
126 |
|
|
for instance, C<byline=(dan PROX_ORDERED woods)> will find every story with a |
127 |
|
|
byline that has C<dan> within 2 words of C<woods>. Note that you |
128 |
|
|
must use parentheses around the words you want to look for in the |
129 |
|
|
field. |
130 |
|
|
|
131 |
|
|
=item C<PROX_ATLEAST 20 clinton> |
132 |
|
|
|
133 |
|
|
I<At Least> Searches C<PROX_ATLEAST 20 clinton> finds every story that has |
134 |
|
|
at least 20 occurrences of C<clinton>. |
135 |
|
|
|
136 |
|
|
=back |
137 |
|
|
|
138 |
|
|
=head1 Query Weighting |
139 |
|
|
|
140 |
|
|
Let's for now disregard the boolean operators and assume that a query |
141 |
|
|
is simply a list of terms. Query weighting is done using the B<Vector |
142 |
|
|
Space> model. Each term in the query is associated with a B<query term |
143 |
|
|
weight>. Currently this weight is constantly 1. On the other side, the |
144 |
|
|
terms in each document get a B<document term weight>. This weight is |
145 |
|
|
the product of a document specific weight and the B<inverse document |
146 |
|
|
frequency>. The latter is defined as C<idf = log(I<N>/I<n>)> where |
147 |
|
|
I<N> is the number of documents in the database and I<n> the number of |
148 |
|
|
documents the term occurs in. |
149 |
|
|
|
150 |
|
|
The other part of the document weight is computed as follows: Let I<tf> |
151 |
|
|
be the number of occurances of the term in document and I<maxtf> the |
152 |
|
|
maximum frequency of any term in the document. A preliminary weight is |
153 |
|
|
computed according to C<I<w> = (0.5 * I<tf>)/(1 + I<maxtf>)>. Then |
154 |
|
|
these weights are normalize by dividing them by the sum of the squares |
155 |
|
|
of all preliminary weights for terms in this document. So the document |
156 |
|
|
specific weights make up a vector of length 1. The final document term |
157 |
|
|
weight is yielded by multiplying this weight to the I<idf>. |
158 |
|
|
|
159 |
|
|
For simple queries (no booleans) the weight of a document is computed |
160 |
|
|
by multiplying the query term weight to the query term weight for each |
161 |
|
|
term in the query and summing up the results. This is often referred to |
162 |
|
|
as the vector product (hence the name of the model) or scalar product. |
163 |
|
|
|
164 |
|
|
Now let's get to the booleans. The C<or> operator is just dropped. So |
165 |
|
|
C<information or retrieval> yields exactly the same weight than |
166 |
|
|
C<information retrieval>. To interpret the vector product in another |
167 |
|
|
way, you can say the C<or> operator just sums up the weights of its |
168 |
|
|
arguments. For the C<and> operator the weights of both arguments is |
169 |
|
|
computed and the final weight is just the minimum of these weights. |
170 |
|
|
Similar the I<binary> C<not> operator returns the minimum of the |
171 |
|
|
weight of the left argument and 1 - the weight of the right argument. |
172 |
|
|
|
173 |
|
|
=head1 Query Syntax |
174 |
|
|
|
175 |
|
|
Query syntax is taken from file C<waisquery.y> included in distribution. |
176 |
|
|
|
177 |
|
|
query : expression |
178 |
|
|
; |
179 |
|
|
|
180 |
|
|
expression : term |
181 |
|
|
| expression OR term |
182 |
|
|
; |
183 |
|
|
|
184 |
|
|
term : factor |
185 |
|
|
| term AND factor |
186 |
|
|
| term NOT factor |
187 |
|
|
; |
188 |
|
|
|
189 |
|
|
factor : unit |
190 |
|
|
| unit PROX_ORDERED unit |
191 |
|
|
| unit PROX_UNORDERED unit |
192 |
|
|
| PROX_ATLEAST unit |
193 |
|
|
; |
194 |
|
|
|
195 |
|
|
unit : w_unit |
196 |
|
|
| '(' expression ')' |
197 |
|
|
| WORD '=' '(' s_expression ')' |
198 |
|
|
| WORD '=' w_unit |
199 |
|
|
| WORD '<' WORD |
200 |
|
|
| WORD '>' WORD |
201 |
|
|
| WORD '[' WORD ',' WORD ']' |
202 |
|
|
; |
203 |
|
|
|
204 |
|
|
phonsound : PHONIX |
205 |
|
|
| SOUNDEX |
206 |
|
|
; |
207 |
|
|
|
208 |
|
|
s_expression : s_term |
209 |
|
|
| s_expression or s_term |
210 |
|
|
; |
211 |
|
|
|
212 |
|
|
s_term : s_factor |
213 |
|
|
| s_term AND s_factor |
214 |
|
|
| s_term NOT s_factor |
215 |
|
|
; |
216 |
|
|
|
217 |
|
|
s_factor : s_unit |
218 |
|
|
| s_unit PROX_ORDERED s_unit |
219 |
|
|
| s_unit PROX_UNORDERED s_unit |
220 |
|
|
| PROX_ATLEAST s_unit |
221 |
|
|
; |
222 |
|
|
|
223 |
|
|
s_unit : w_unit |
224 |
|
|
| '(' s_expression ')' |
225 |
|
|
; |
226 |
|
|
|
227 |
|
|
a_unit : WORD |
228 |
|
|
| phonsound WORD |
229 |
|
|
; |
230 |
|
|
|
231 |
|
|
w_unit : a_unit |
232 |
|
|
| a_unit ASSIGN FLOAT |
233 |
|
|
|
234 |
|
|
=head1 Authors |
235 |
|
|
|
236 |
|
|
This document is based on Queries chapter from freeWAIS-sf documentation |
237 |
|
|
written by Ulrich Pfeifer, and then modified to match WAIT. All errors |
238 |
|
|
and omissions should be blamed on Dobrica Pavlinusic. |
239 |
|
|
|
240 |
|
|
=cut |