/[wait]/trunk/doc/queries.pod
This is repository of my old source code which isn't updated any more. Go to git.rot13.org for current projects!
ViewVC logotype

Contents of /trunk/doc/queries.pod

Parent Directory Parent Directory | Revision Log Revision Log


Revision 92 - (show annotations)
Tue May 25 16:03:10 2004 UTC (19 years, 11 months ago) by dpavlin
File size: 7849 byte(s)
First try to document WAIT query syntax. Basically, slightly modified chapter
from freeWAIS-sf documentation

1 =pod
2
3 =head1 NAME
4
5 WAIS Queries is WAIT
6
7 =head1 Formulating a Query
8
9 WAIT uses extension of original WAIS Protocol for queries which
10 encodes new and richer query semantics in the query
11 string. This means that the query now has to obey a certain syntax
12 (see section on L<Query Syntax>)
13 and consequently a user might get a syntax error
14 when submitting a query. So our goal was to make the syntax as easy as
15 possible and especially leave simple free text queries valid.
16
17 Simpliest query are in form B<free text query> which means a list of search
18 term separated by spaces.
19
20 In the query the categories to be searched should be selectable for
21 each term. To leave the original queries valid (and to support casual
22 users) we provided a default category, which is used if no category is
23 specified in the query. Now we give an outline of the query language:
24
25 The atomic search expressions of the language are terms, term
26 wild-cards and phrases (e.g. C<information>, C<inform*>,
27 C<"information retrieval">).
28
29 Stemming is handled transparently for the client. Terms searched in a
30 stemmed category are searched using their word stem automatically. For
31 a wildcard (only tail truncation is implemented), all matching words
32 from the dictionary are used as search terms. Phrase search looks up
33 the words in the string. At least one of them must be an index term.
34 Then the server scans the documents containing this word for string
35 matches for the complete phrase. This means that string search can only
36 work if the server has access to the documents. For type C<URL> this
37 is not the case.
38
39 Additionally the prefix operators C<soundex> and C<phonix> are
40 allowed for converting the following query term into its Soundex/Phonix
41 code. This is for example very useful when searching in phonebooks if
42 the exact spelling of a name is not known. Arbitrary Boolean
43 combination of these atomic expressions with the I<binary> operators
44 C<and>, C<or> and C<not> (C<not> means C<and not> in Boolean
45 logic and is therefore a binary operation, too. See the examples
46 below.) are allowed. Parentheses can be used for grouping. For
47 compatibility with the original syntax, C<or> may be omitted.
48
49 For each expression, a semantic category (field) can be defined using
50 the C<I<category> I<pred>> operator, where I<pred> is C<=> for text
51 categories and one of C<=>, C<E<lt>>, C<E<gt>> for numeric
52 categories.
53
54 Here are some examples:
55
56 =over
57
58 =item C<information retrieval>
59
60 free text query
61
62 =item C<information or retrieval>
63
64 same as above
65
66 =item C<ti=information retrieval>
67
68 C<information> must be in the title
69
70 =item C<ti=(information retrieval)>
71
72 one of them in title
73
74 =item C<ti=(information or retrieval)>
75
76 same as above
77
78 =item C<ti=(information and retrieval)>
79
80 both of them in title
81
82 =item C<ti=(information not retrieval)>
83
84 C<information> in title and C<retrieval> not in title
85
86 =item C<py=1990>
87
88 numerically equal
89
90 =item C<pyE<lt>1990>
91
92 numerically less
93
94 =item C<pyE<gt>1990>
95
96 numerically greater
97
98 =item C<edE<lt>19930101>
99
100 Date search. Format is yyyymmdd
101
102 =item C<au=(soundex salatan)>
103
104 soundex search, matches eg. C<Salton>
105
106 =item C<ti=("information retrieval")>
107
108 phrase search
109
110 =item C<ti=(information system*)>
111
112 wild-card search
113
114 =item C<nuclear PROX_UNORDERED 10 waste>
115
116 With this
117 feature, you could search for C<nuclear PROX_UNORDERED 10 waste> (like the
118 Lexis/Nexis search syntax) to find all stories that have C<nuclear>
119 and C<waste> within 10 words of each other.
120
121 =item C<nuclear PROX_ORDERED 10 waste>
122
123 Proximity Searches If the order of the two words is important, then
124 C<nuclear PROX_ORDERED 10 waste> will find all stories that have C<nuclear>
125 up to 10 words before C<waste>. Proximity also works within fields;
126 for instance, C<byline=(dan PROX_ORDERED woods)> will find every story with a
127 byline that has C<dan> within 2 words of C<woods>. Note that you
128 must use parentheses around the words you want to look for in the
129 field.
130
131 =item C<PROX_ATLEAST 20 clinton>
132
133 I<At Least> Searches C<PROX_ATLEAST 20 clinton> finds every story that has
134 at least 20 occurrences of C<clinton>.
135
136 =back
137
138 =head1 Query Weighting
139
140 Let's for now disregard the boolean operators and assume that a query
141 is simply a list of terms. Query weighting is done using the B<Vector
142 Space> model. Each term in the query is associated with a B<query term
143 weight>. Currently this weight is constantly 1. On the other side, the
144 terms in each document get a B<document term weight>. This weight is
145 the product of a document specific weight and the B<inverse document
146 frequency>. The latter is defined as C<idf = log(I<N>/I<n>)> where
147 I<N> is the number of documents in the database and I<n> the number of
148 documents the term occurs in.
149
150 The other part of the document weight is computed as follows: Let I<tf>
151 be the number of occurances of the term in document and I<maxtf> the
152 maximum frequency of any term in the document. A preliminary weight is
153 computed according to C<I<w> = (0.5 * I<tf>)/(1 + I<maxtf>)>. Then
154 these weights are normalize by dividing them by the sum of the squares
155 of all preliminary weights for terms in this document. So the document
156 specific weights make up a vector of length 1. The final document term
157 weight is yielded by multiplying this weight to the I<idf>.
158
159 For simple queries (no booleans) the weight of a document is computed
160 by multiplying the query term weight to the query term weight for each
161 term in the query and summing up the results. This is often referred to
162 as the vector product (hence the name of the model) or scalar product.
163
164 Now let's get to the booleans. The C<or> operator is just dropped. So
165 C<information or retrieval> yields exactly the same weight than
166 C<information retrieval>. To interpret the vector product in another
167 way, you can say the C<or> operator just sums up the weights of its
168 arguments. For the C<and> operator the weights of both arguments is
169 computed and the final weight is just the minimum of these weights.
170 Similar the I<binary> C<not> operator returns the minimum of the
171 weight of the left argument and 1 - the weight of the right argument.
172
173 =head1 Query Syntax
174
175 Query syntax is taken from file C<waisquery.y> included in distribution.
176
177 query : expression
178 ;
179
180 expression : term
181 | expression OR term
182 ;
183
184 term : factor
185 | term AND factor
186 | term NOT factor
187 ;
188
189 factor : unit
190 | unit PROX_ORDERED unit
191 | unit PROX_UNORDERED unit
192 | PROX_ATLEAST unit
193 ;
194
195 unit : w_unit
196 | '(' expression ')'
197 | WORD '=' '(' s_expression ')'
198 | WORD '=' w_unit
199 | WORD '<' WORD
200 | WORD '>' WORD
201 | WORD '[' WORD ',' WORD ']'
202 ;
203
204 phonsound : PHONIX
205 | SOUNDEX
206 ;
207
208 s_expression : s_term
209 | s_expression or s_term
210 ;
211
212 s_term : s_factor
213 | s_term AND s_factor
214 | s_term NOT s_factor
215 ;
216
217 s_factor : s_unit
218 | s_unit PROX_ORDERED s_unit
219 | s_unit PROX_UNORDERED s_unit
220 | PROX_ATLEAST s_unit
221 ;
222
223 s_unit : w_unit
224 | '(' s_expression ')'
225 ;
226
227 a_unit : WORD
228 | phonsound WORD
229 ;
230
231 w_unit : a_unit
232 | a_unit ASSIGN FLOAT
233
234 =head1 Authors
235
236 This document is based on Queries chapter from freeWAIS-sf documentation
237 written by Ulrich Pfeifer, and then modified to match WAIT. All errors
238 and omissions should be blamed on Dobrica Pavlinusic.
239
240 =cut

  ViewVC Help
Powered by ViewVC 1.1.26