/[wait]/trunk/doc/queries.pod
This is repository of my old source code which isn't updated any more. Go to git.rot13.org for current projects!
ViewVC logotype

Annotation of /trunk/doc/queries.pod

Parent Directory Parent Directory | Revision Log Revision Log


Revision 92 - (hide annotations)
Tue May 25 16:03:10 2004 UTC (20 years ago) by dpavlin
File size: 7849 byte(s)
First try to document WAIT query syntax. Basically, slightly modified chapter
from freeWAIS-sf documentation

1 dpavlin 92 =pod
2    
3     =head1 NAME
4    
5     WAIS Queries is WAIT
6    
7     =head1 Formulating a Query
8    
9     WAIT uses extension of original WAIS Protocol for queries which
10     encodes new and richer query semantics in the query
11     string. This means that the query now has to obey a certain syntax
12     (see section on L<Query Syntax>)
13     and consequently a user might get a syntax error
14     when submitting a query. So our goal was to make the syntax as easy as
15     possible and especially leave simple free text queries valid.
16    
17     Simpliest query are in form B<free text query> which means a list of search
18     term separated by spaces.
19    
20     In the query the categories to be searched should be selectable for
21     each term. To leave the original queries valid (and to support casual
22     users) we provided a default category, which is used if no category is
23     specified in the query. Now we give an outline of the query language:
24    
25     The atomic search expressions of the language are terms, term
26     wild-cards and phrases (e.g. C<information>, C<inform*>,
27     C<"information retrieval">).
28    
29     Stemming is handled transparently for the client. Terms searched in a
30     stemmed category are searched using their word stem automatically. For
31     a wildcard (only tail truncation is implemented), all matching words
32     from the dictionary are used as search terms. Phrase search looks up
33     the words in the string. At least one of them must be an index term.
34     Then the server scans the documents containing this word for string
35     matches for the complete phrase. This means that string search can only
36     work if the server has access to the documents. For type C<URL> this
37     is not the case.
38    
39     Additionally the prefix operators C<soundex> and C<phonix> are
40     allowed for converting the following query term into its Soundex/Phonix
41     code. This is for example very useful when searching in phonebooks if
42     the exact spelling of a name is not known. Arbitrary Boolean
43     combination of these atomic expressions with the I<binary> operators
44     C<and>, C<or> and C<not> (C<not> means C<and not> in Boolean
45     logic and is therefore a binary operation, too. See the examples
46     below.) are allowed. Parentheses can be used for grouping. For
47     compatibility with the original syntax, C<or> may be omitted.
48    
49     For each expression, a semantic category (field) can be defined using
50     the C<I<category> I<pred>> operator, where I<pred> is C<=> for text
51     categories and one of C<=>, C<E<lt>>, C<E<gt>> for numeric
52     categories.
53    
54     Here are some examples:
55    
56     =over
57    
58     =item C<information retrieval>
59    
60     free text query
61    
62     =item C<information or retrieval>
63    
64     same as above
65    
66     =item C<ti=information retrieval>
67    
68     C<information> must be in the title
69    
70     =item C<ti=(information retrieval)>
71    
72     one of them in title
73    
74     =item C<ti=(information or retrieval)>
75    
76     same as above
77    
78     =item C<ti=(information and retrieval)>
79    
80     both of them in title
81    
82     =item C<ti=(information not retrieval)>
83    
84     C<information> in title and C<retrieval> not in title
85    
86     =item C<py=1990>
87    
88     numerically equal
89    
90     =item C<pyE<lt>1990>
91    
92     numerically less
93    
94     =item C<pyE<gt>1990>
95    
96     numerically greater
97    
98     =item C<edE<lt>19930101>
99    
100     Date search. Format is yyyymmdd
101    
102     =item C<au=(soundex salatan)>
103    
104     soundex search, matches eg. C<Salton>
105    
106     =item C<ti=("information retrieval")>
107    
108     phrase search
109    
110     =item C<ti=(information system*)>
111    
112     wild-card search
113    
114     =item C<nuclear PROX_UNORDERED 10 waste>
115    
116     With this
117     feature, you could search for C<nuclear PROX_UNORDERED 10 waste> (like the
118     Lexis/Nexis search syntax) to find all stories that have C<nuclear>
119     and C<waste> within 10 words of each other.
120    
121     =item C<nuclear PROX_ORDERED 10 waste>
122    
123     Proximity Searches If the order of the two words is important, then
124     C<nuclear PROX_ORDERED 10 waste> will find all stories that have C<nuclear>
125     up to 10 words before C<waste>. Proximity also works within fields;
126     for instance, C<byline=(dan PROX_ORDERED woods)> will find every story with a
127     byline that has C<dan> within 2 words of C<woods>. Note that you
128     must use parentheses around the words you want to look for in the
129     field.
130    
131     =item C<PROX_ATLEAST 20 clinton>
132    
133     I<At Least> Searches C<PROX_ATLEAST 20 clinton> finds every story that has
134     at least 20 occurrences of C<clinton>.
135    
136     =back
137    
138     =head1 Query Weighting
139    
140     Let's for now disregard the boolean operators and assume that a query
141     is simply a list of terms. Query weighting is done using the B<Vector
142     Space> model. Each term in the query is associated with a B<query term
143     weight>. Currently this weight is constantly 1. On the other side, the
144     terms in each document get a B<document term weight>. This weight is
145     the product of a document specific weight and the B<inverse document
146     frequency>. The latter is defined as C<idf = log(I<N>/I<n>)> where
147     I<N> is the number of documents in the database and I<n> the number of
148     documents the term occurs in.
149    
150     The other part of the document weight is computed as follows: Let I<tf>
151     be the number of occurances of the term in document and I<maxtf> the
152     maximum frequency of any term in the document. A preliminary weight is
153     computed according to C<I<w> = (0.5 * I<tf>)/(1 + I<maxtf>)>. Then
154     these weights are normalize by dividing them by the sum of the squares
155     of all preliminary weights for terms in this document. So the document
156     specific weights make up a vector of length 1. The final document term
157     weight is yielded by multiplying this weight to the I<idf>.
158    
159     For simple queries (no booleans) the weight of a document is computed
160     by multiplying the query term weight to the query term weight for each
161     term in the query and summing up the results. This is often referred to
162     as the vector product (hence the name of the model) or scalar product.
163    
164     Now let's get to the booleans. The C<or> operator is just dropped. So
165     C<information or retrieval> yields exactly the same weight than
166     C<information retrieval>. To interpret the vector product in another
167     way, you can say the C<or> operator just sums up the weights of its
168     arguments. For the C<and> operator the weights of both arguments is
169     computed and the final weight is just the minimum of these weights.
170     Similar the I<binary> C<not> operator returns the minimum of the
171     weight of the left argument and 1 - the weight of the right argument.
172    
173     =head1 Query Syntax
174    
175     Query syntax is taken from file C<waisquery.y> included in distribution.
176    
177     query : expression
178     ;
179    
180     expression : term
181     | expression OR term
182     ;
183    
184     term : factor
185     | term AND factor
186     | term NOT factor
187     ;
188    
189     factor : unit
190     | unit PROX_ORDERED unit
191     | unit PROX_UNORDERED unit
192     | PROX_ATLEAST unit
193     ;
194    
195     unit : w_unit
196     | '(' expression ')'
197     | WORD '=' '(' s_expression ')'
198     | WORD '=' w_unit
199     | WORD '<' WORD
200     | WORD '>' WORD
201     | WORD '[' WORD ',' WORD ']'
202     ;
203    
204     phonsound : PHONIX
205     | SOUNDEX
206     ;
207    
208     s_expression : s_term
209     | s_expression or s_term
210     ;
211    
212     s_term : s_factor
213     | s_term AND s_factor
214     | s_term NOT s_factor
215     ;
216    
217     s_factor : s_unit
218     | s_unit PROX_ORDERED s_unit
219     | s_unit PROX_UNORDERED s_unit
220     | PROX_ATLEAST s_unit
221     ;
222    
223     s_unit : w_unit
224     | '(' s_expression ')'
225     ;
226    
227     a_unit : WORD
228     | phonsound WORD
229     ;
230    
231     w_unit : a_unit
232     | a_unit ASSIGN FLOAT
233    
234     =head1 Authors
235    
236     This document is based on Queries chapter from freeWAIS-sf documentation
237     written by Ulrich Pfeifer, and then modified to match WAIT. All errors
238     and omissions should be blamed on Dobrica Pavlinusic.
239    
240     =cut

  ViewVC Help
Powered by ViewVC 1.1.26