1 |
dpavlin |
604 |
Malete query expressions: |
2 |
|
|
|
3 |
|
|
Like CDS/ISIS, Malete supports both index based searching and |
4 |
|
|
record based filtering. Unlike in CDS/ISIS, the basic filtering syntax is not |
5 |
|
|
based on the formatting language, but a simple extension of the search syntax. |
6 |
|
|
|
7 |
|
|
However, a more general record manipulation language, the "patchwork", |
8 |
|
|
is planned, which will also act as a filter based on complex conditions. |
9 |
|
|
|
10 |
|
|
|
11 |
|
|
In a query expression, everything up to a '?' is a search expression, |
12 |
|
|
the remaining part defines a filter. |
13 |
|
|
|
14 |
|
|
|
15 |
|
|
In general, an expression defines a result set, consisting of pointers |
16 |
|
|
to occurrences of keys in records. For searching, these pointers are |
17 |
|
|
fetched from the "inverted file", the .mqd B-Tree. On filtering, |
18 |
|
|
the expression creates pointers as needed from the current record |
19 |
|
|
and matches the record, if the result set is not empty. |
20 |
|
|
|
21 |
|
|
|
22 |
|
|
An expression consists of |
23 |
|
|
- simple expressions creating result sets |
24 |
|
|
- operators modifying result sets |
25 |
|
|
- parentheses to control evaluation order |
26 |
|
|
|
27 |
|
|
|
28 |
|
|
* pointers |
29 |
|
|
|
30 |
|
|
A pointer (to an occurrence of a key in a record) |
31 |
|
|
is a 4-tuple (a point in a 4 dimensional space) (r,t,o,p) of |
32 |
|
|
- 1 the record's id |
33 |
|
|
- 2 the tag of the field where the key was found, |
34 |
|
|
so Smith in an author field can be distinguished from Smith in title |
35 |
|
|
- 3 the "occurrence" (repeat count) of field, |
36 |
|
|
so one can demand Mark and Twain in the same author field, |
37 |
|
|
instead of Mark and Twain in two author fields |
38 |
|
|
- 4 the position (nth word) within the field |
39 |
|
|
|
40 |
|
|
|
41 |
|
|
* simple expressions |
42 |
|
|
|
43 |
|
|
A set of results can be created by |
44 |
|
|
- [rel][whitespace]term |
45 |
|
|
a relation to a term, producing all pointers to keys that have the given |
46 |
|
|
relation (as listed below) to term. Relation defaults to equality =. |
47 |
|
|
term$ is a legacy form of %term. |
48 |
|
|
Term is a series of "word characters", which are ASCII letters and digits, |
49 |
|
|
the underscore and all non-ASCII characters (code greater than 127), |
50 |
|
|
or any characters in double quotes ". |
51 |
|
|
Between the quotes, two quotes evaluate to a single quote. |
52 |
|
|
- #n |
53 |
|
|
a reference to the nth query. |
54 |
|
|
Inserts the search or filter expression, as applicable, of the nth query. |
55 |
|
|
(Might not be fully implemented). |
56 |
|
|
In a search expression, an existing search result set might be used. |
57 |
|
|
As a special case, if the whole query expression is only #n, |
58 |
|
|
the nth query is used as is, including search, filter and cursor. |
59 |
|
|
|
60 |
|
|
Relations are: |
61 |
|
|
$ |
62 |
|
|
= equality (not needed) |
63 |
|
|
% prefix match |
64 |
|
|
> greater than |
65 |
|
|
>= greater than or equal |
66 |
|
|
< less than (not yet implemented, behaves like '>'. For range queries |
67 |
|
|
use '-' rather than '<' '*' '>' ) |
68 |
|
|
<= less than or equal (not yet implemented, behaves like '>=') |
69 |
|
|
: contains (filtering only) |
70 |
|
|
~ regular expression match (optional, filtering only) |
71 |
|
|
$ |
72 |
|
|
The regexp match will not be available in all environments |
73 |
|
|
(since it requires a regexp implementation, probably the one from Tcl). |
74 |
|
|
Yet, it's just too tempting ... |
75 |
|
|
|
76 |
|
|
|
77 |
|
|
The less/greater than (or equal) operators are not used alone, |
78 |
|
|
but in "key identity" expressions to define ranges. |
79 |
|
|
|
80 |
|
|
|
81 |
|
|
* key identity |
82 |
|
|
|
83 |
|
|
The semantics of pointers is to denote a word position in an occurrence |
84 |
|
|
of a tag in a record. We assume that a pointer is only found at the one |
85 |
|
|
key being (some transformation of) the word at that position, |
86 |
|
|
i.e. pointers are unique. |
87 |
|
|
|
88 |
|
|
|
89 |
|
|
Therefore, matching arbitrary result sets on pointer identity is of little |
90 |
|
|
use. However, when applied to terms, we can use a pointer identity condition |
91 |
|
|
as a key identity condition, effectively limiting the range of keys searched: |
92 |
|
|
- - key identity |
93 |
|
|
is a pseudo operator, syntactically operating on result sets, |
94 |
|
|
but actually implemented as joining and modifying term relations. |
95 |
|
|
|
96 |
|
|
Both operands of - must be terms, it refuses to operate on other expressions. |
97 |
|
|
If the left hand term has the = relation, it is modified to >=. |
98 |
|
|
If the right hand has the = relation, it is modified to <. |
99 |
|
|
The expression 'A - B' will search all keys greater or equal "A" |
100 |
|
|
and less than "B". To include "B", use 'A - <=B'. |
101 |
|
|
The prefix '%ABC' is actually just a shorthand for 'ABC - ABD'. |
102 |
|
|
If multiple upper or lower bounds, including those from a prefix, |
103 |
|
|
are found in the operands, the lowest lower and highest upper ones apply. |
104 |
|
|
|
105 |
|
|
|
106 |
|
|
Currently, only the default case greater or equal to A and less than B |
107 |
|
|
is implemented. |
108 |
|
|
|
109 |
|
|
|
110 |
|
|
* tag filter |
111 |
|
|
|
112 |
|
|
The tag filter selects pointers with certain tags by restricting |
113 |
|
|
the second dimension to one or more tag values: |
114 |
|
|
- expr/tag or expr/(tag[,tag...]) |
115 |
|
|
where tags are literal numbers. In other words, it intersects the left hand |
116 |
|
|
result set with the set {(r,t,o,p)|t element of taglist} |
117 |
|
|
|
118 |
|
|
As a special case, a tag filter with an empty left hand expression can be used |
119 |
|
|
at the very start of (the outermost level of) an expression: |
120 |
|
|
- in filtering to select the fields for final output. |
121 |
|
|
This is a proper part of filtering, |
122 |
|
|
since a record with no matching fields is skipped. |
123 |
|
|
- in searching to force a given ordering. |
124 |
|
|
This is not a proper part of searching, but rather defines a cursor |
125 |
|
|
for record retrieval, which then uses the search result set as filter. |
126 |
|
|
We propably should define an extended syntax for ordering, |
127 |
|
|
like we have for filtering, to allow key range conditions here. |
128 |
|
|
Since this does a full index scan while retrieving records, |
129 |
|
|
results are not made unique, and ordering will be most useful only |
130 |
|
|
with a single tag (or some alternative tags) of a non-repeatable field. |
131 |
|
|
Currently not implemented. |
132 |
|
|
|
133 |
|
|
The tag filter, like the pointer identity, is another pseudo operator, |
134 |
|
|
actually modifying its subexpressions rather than their result sets. |
135 |
|
|
See below for details. |
136 |
|
|
|
137 |
|
|
|
138 |
|
|
* binary operators |
139 |
|
|
|
140 |
|
|
Additive operator: |
141 |
|
|
- + is the logical OR |
142 |
|
|
calculating the union of result sets |
143 |
|
|
|
144 |
|
|
All other operators reduce their left hand result set by selecting elements |
145 |
|
|
based on certain relations they have to those of the right hand set. |
146 |
|
|
In other words they are intersections of the left hand with a certain |
147 |
|
|
set defined (but not actually created) based on the right hand set. |
148 |
|
|
|
149 |
|
|
|
150 |
|
|
Multiplicative operators select pointers from their left hand set |
151 |
|
|
based upon their equality to pointers in the right hand set |
152 |
|
|
in the first n dimensions: |
153 |
|
|
- * ("AND") is 1-dimensional equality |
154 |
|
|
selects pointers with the same record id, i.e. those pointer from the |
155 |
|
|
left hand set for which there exists a pointer with the same record id |
156 |
|
|
in the right hand set |
157 |
|
|
- ; or (G) is 2-dimensional equality |
158 |
|
|
selecting on same (record and) tag |
159 |
|
|
- , or (F) is 3-dimensional equality |
160 |
|
|
selecting on same (record and tag and) occurrence |
161 |
|
|
Mathematically, these operators create an intersection of the left hand |
162 |
|
|
with a projection of the right hand to the first n dimensions |
163 |
|
|
cross product with the full space of the remaining dimensions. |
164 |
|
|
E.g. A;B := A intersect {(r,t,x,y)|exist o,p with (r,t,o,p) in B}, |
165 |
|
|
which we can denote as the (G) set of B. |
166 |
|
|
|
167 |
|
|
The full 4-dimensional equality operator, pointer identity, |
168 |
|
|
is a pseudo operator as described above. |
169 |
|
|
|
170 |
|
|
|
171 |
|
|
The difference operator |
172 |
|
|
- ^ ("NOT" or "WITHOUT") |
173 |
|
|
selects based on 1-dimensional inequality, i.e. those pointer from the |
174 |
|
|
left hand set for which there exists NO pointer with the same record id |
175 |
|
|
in the right hand set. |
176 |
|
|
A^B := A intersect {(r,x,y,z)|not exist t,o,p with (r,t,o,p) in B} |
177 |
|
|
(the NOT set of B). |
178 |
|
|
Note that the literal words OR, AND and NOT are not operators, but terms. |
179 |
|
|
|
180 |
|
|
|
181 |
|
|
The distance operators are based on a proximity relation in the 4th dimension, |
182 |
|
|
requiring 3-dimensional equality (same record, tag and occurrence): |
183 |
|
|
- .[...] or (n) max distance |
184 |
|
|
A sequence of n dots or a number in parentheses selects pointers from the |
185 |
|
|
left hand, for which there is a pointer in the right hand where the word |
186 |
|
|
position p differs by at most n. (0) is pointer identity. |
187 |
|
|
A(n)B := A intersect |
188 |
|
|
{(r,t,o,x)|exists p with abs(p-x) <=n and (r,t,o,p) in B} |
189 |
|
|
(the dist set of B). |
190 |
|
|
- $$[$$] exact distance: |
191 |
|
|
likewise with exact position difference n. |
192 |
|
|
(A single $ could be confused with prefix match, but is the same as a .). |
193 |
|
|
|
194 |
|
|
Due to the problems of word counting during filtering, |
195 |
|
|
the distance operators may silently behave like 3-dimensional equality. |
196 |
|
|
|
197 |
|
|
|
198 |
|
|
* operator precedence |
199 |
|
|
|
200 |
|
|
Adjacent terms with no operator in between are combined by *. |
201 |
|
|
The operator precedence can by overriden by enclosing any expression |
202 |
|
|
in parentheses (). |
203 |
|
|
|
204 |
|
|
|
205 |
|
|
The operators in decreasing precedence: |
206 |
|
|
- pointer/key identity - |
207 |
|
|
- positional distance . and $$ |
208 |
|
|
- field level match , and ; ((F) and (G)) |
209 |
|
|
- filter / |
210 |
|
|
- record level match * and ^ |
211 |
|
|
- additive + |
212 |
|
|
On the same precedence level, |
213 |
|
|
the distance operators .. and $$ associate right ('A.B.C' is 'A.(B.C)'), |
214 |
|
|
all others associate left ('A B C' is '(A*B)*C'). |
215 |
|
|
|
216 |
|
|
The tag filter / has highest precedence to its right hand, |
217 |
|
|
since the tag list is not subject to any operator evaluation. |
218 |
|
|
|
219 |
|
|
|
220 |
|
|
* detailed semantics of searching |
221 |
|
|
|
222 |
|
|
First of all, tag filters are *defined* to be applied fully distributive. |
223 |
|
|
For operators of higher precedence, they are logically distributive: |
224 |
|
|
'(A (G) B)/101' yields the same as '(A/101) (G) (B/101)', |
225 |
|
|
hence their lower precedence. For '(A ^ B)/100', |
226 |
|
|
it looks like we should first select all results for A, |
227 |
|
|
strip those where B is (anywhere) in the same record, |
228 |
|
|
and then strip all in fields other than 100. |
229 |
|
|
This, however, would be better written as 'A/100 ^ B': |
230 |
|
|
there is no point in delaying a tag filter. |
231 |
|
|
The only reasonable use of applying a tag filter to a lower precedence |
232 |
|
|
operation (which requires use of parentheses anyway) is to distribute it. |
233 |
|
|
So, '(A ^ B)/100' is *defined* as 'A/100 ^ B/100'. |
234 |
|
|
|
235 |
|
|
Second, we define tag filters to not nest, but override, |
236 |
|
|
so that only the innermost applies. E.g. in '(A/101 B C)/102', |
237 |
|
|
only B and C have to be found in field 102, while A is checked in 101. |
238 |
|
|
Again, this is because there would be no point in intersecting nested |
239 |
|
|
tag lists: it's shorter to write what should be in effect in the first place. |
240 |
|
|
|
241 |
|
|
|
242 |
|
|
To summarize, |
243 |
|
|
- tag filters are always distributed down to the term level, |
244 |
|
|
where they are applied most efficiently, and |
245 |
|
|
- there is always at most one tag filter in effect. |
246 |
|
|
In the further reasoning, tag filters are not considered. |
247 |
|
|
|
248 |
|
|
|
249 |
|
|
The final result set of a search expression is 1-dimensional |
250 |
|
|
(record id only), in other words Malete does not maintain |
251 |
|
|
> http://lcweb.loc.gov/z3950/agency/markup/23.html Extended Result Sets |
252 |
|
|
, but discards the other dimensions after evaluating a search. |
253 |
|
|
If higher dimensional operators (including tag filter) are applied to #n |
254 |
|
|
references, the original search has to be recomputed. |
255 |
|
|
(Which might not be supported). |
256 |
|
|
|
257 |
|
|
|
258 |
|
|
The reason for this is: |
259 |
|
|
- it reduces space requirements between queries |
260 |
|
|
- delivering records from search results is easier, |
261 |
|
|
especially support for free cursor movement |
262 |
|
|
- commutative properties allow for a couple of optimizations |
263 |
|
|
|
264 |
|
|
Intermediate results during query evaluation may have to contain higher |
265 |
|
|
dimensional data, depending on the operations which will be applied to them: |
266 |
|
|
- 1 record id only for OR, AND, NOT |
267 |
|
|
- 2 plus tag for ; (G) |
268 |
|
|
- 3 plus occurrence for , (F) |
269 |
|
|
- 4 plus position in field for .. and $$ |
270 |
|
|
We say that an expression is evaluated in a n-dimensional context, |
271 |
|
|
if n is the highest dimension of an operator to be applied to its result set. |
272 |
|
|
(The OR here is considered neutral in that it keeps data in all dimensions, |
273 |
|
|
but does not require it). This is an important property, |
274 |
|
|
because it allows to ignore what happens to other dimensions. |
275 |
|
|
|
276 |
|
|
|
277 |
|
|
Not all operators are associative or commutative, |
278 |
|
|
in general, the order of evaluation and operands matter. |
279 |
|
|
For example, should 'A.B.C' select a C adjacent to A or to B |
280 |
|
|
or to any of these? By the right associative property, 'A.B.C' is 'A.(B.C)', |
281 |
|
|
which by definition first selects those pointers to (occurrences of) B, |
282 |
|
|
which are adjacent to C, then pointers to A adjacent to one of those Bs, |
283 |
|
|
which is probably what you expected. |
284 |
|
|
It is important for 'B.C' to create pointers to B, not to C, |
285 |
|
|
so we can not use 'A.(C.B)' nor '(A.B).C', |
286 |
|
|
both of which would select As adjacent to C. |
287 |
|
|
|
288 |
|
|
|
289 |
|
|
* properties of operators |
290 |
|
|
|
291 |
|
|
Yet, there are some useful properties of operators. |
292 |
|
|
|
293 |
|
|
|
294 |
|
|
The multiplicative operators and the difference are based on equality relations, |
295 |
|
|
which are reflexive (p=p), symmetric (if p=q then q=p) and transitive |
296 |
|
|
(if p=q and q=r, then p=r). |
297 |
|
|
|
298 |
|
|
The proximity relation is symmetric, but not transitive: |
299 |
|
|
If A is adjacent to B, and B adjacent to C, A is usually not adjacent to C |
300 |
|
|
(actually it can only be adjacent to another C like in a field "C A B C"). |
301 |
|
|
Exact distance is not even reflexive. |
302 |
|
|
|
303 |
|
|
|
304 |
|
|
Therefore: |
305 |
|
|
- intersection operators are (left and right) distributive over union: |
306 |
|
|
'(A+B) op C' = '(A op C)+(B op C)' and 'A op (B+C)' = '(A op B)+(A op C)' |
307 |
|
|
for all but the NOT, which reads 'A^(B+C)' = 'A^B^C'. |
308 |
|
|
- multiplicative operators and NOT are (self) associative: |
309 |
|
|
'(A op B) op C' = 'A op (B op C)', where op is one of * , ; or ^. |
310 |
|
|
This also holds where the operators are not the same, |
311 |
|
|
but the first is not NOT and the second has no higher dimension. |
312 |
|
|
- multiplicative operators are relative commutative in their dimension: |
313 |
|
|
'A op B' = 'B op A' if evaluated in no higher dimensional context |
314 |
|
|
(since the results are based on a n-dimensional equality, |
315 |
|
|
they are equal in the first n dimensions). |
316 |
|
|
- for all intersection operators o,p we also have a property similar to |
317 |
|
|
associativity: '(A o B) p C' = '(A p C) o B', since both expressions |
318 |
|
|
intersect A with both the o-set of B and the p-set of C. |
319 |
|
|
In a commutative context, this also equals 'B o (A p C)', |
320 |
|
|
which can be used to get an existing left hand result set B |
321 |
|
|
inside a right hand subexpression. |
322 |
|
|
- distance is commutative in lower dimensional contexts: |
323 |
|
|
e.g. '(A.B),C' = '(B.A),C', since it requires 3-dimensional equality, |
324 |
|
|
and, with the above, also '(A.B).C' = 'B.(A.C)'. |
325 |
|
|
|
326 |
|
|
Note that the context dimension depends not only on the direct parent operator, |
327 |
|
|
but the highest dimension of any ancestor in the parse tree: |
328 |
|
|
While '(A.B),C' = '(B.A),C' as final expressions have the same |
329 |
|
|
(1-dimensional) results, '((A.B),C)..D' != '((B.A),C)..D', |
330 |
|
|
since D's distance will be tested against A or B, resp. |
331 |
|
|
|
332 |
|
|
|
333 |
|
|
* processing search expressions |
334 |
|
|
|
335 |
|
|
Some notes on how a search expression is evaluated might be helpful |
336 |
|
|
in order to write efficient queries. |
337 |
|
|
|
338 |
|
|
|
339 |
|
|
Basic naive query processing proceeds by |
340 |
|
|
- recursively calculating the result sets of subexpressions |
341 |
|
|
- combining these result sets according to the operator |
342 |
|
|
|
343 |
|
|
Techniques used in combining result sets are |
344 |
|
|
- merging |
345 |
|
|
This is used by the OR to create the union of sets. |
346 |
|
|
Duplicates with regard to the required dimensions are eliminated. |
347 |
|
|
- marking |
348 |
|
|
First calculate the left hand result set. Iterate over the right hand set, |
349 |
|
|
look up and mark elements having the relation in the left hand set. |
350 |
|
|
Finally, eliminate all unmarked (or marked, for NOT) elements. |
351 |
|
|
This can be used to implement all relation based operators. |
352 |
|
|
However, it is most useful where the right hand expression can loop |
353 |
|
|
its elements without actually creating the set (direct marking). |
354 |
|
|
- filtering |
355 |
|
|
First calculate the right hand result set. Then, while generating |
356 |
|
|
elements for the left hand, look up the relation in the right hand |
357 |
|
|
set and store left hand elements only as appropriate. |
358 |
|
|
Note that a filter applies only while creating sets, |
359 |
|
|
not during marking or merging. |
360 |
|
|
All techniques require the result sets to be ordered according to the |
361 |
|
|
dimensions, i.e. by record id then tag then occ and pos, to run efficiently. |
362 |
|
|
|
363 |
|
|
|
364 |
|
|
The actual implementation tries to avoid the naive approach. |
365 |
|
|
Instead of calculating subexpression results independently, |
366 |
|
|
the results of previous calculations and the mode in which they |
367 |
|
|
finally will be combined are used where possible. |
368 |
|
|
- tag filters are always applied at the lowest (term) level, |
369 |
|
|
thus do not require an additional filtering step |
370 |
|
|
- the dimensional context is used to purge unnecessary duplicates and, |
371 |
|
|
optionally, can be used to reorder expressions according to their properties. |
372 |
|
|
|
373 |
|
|
|
374 |
|
|
Creating result sets: |
375 |
|
|
- for a ref, in an one-dimensional search expression, we have the results set. |
376 |
|
|
Else, it can be created by processing the query independently or inlining |
377 |
|
|
its text. Anyway, in most cases, especially where we create under a filter, |
378 |
|
|
we need to create a copy. |
379 |
|
|
- for an exact match term, the result set can be directly fetched |
380 |
|
|
from the index. Other term relations are an OR over all keys they match. |
381 |
|
|
- the result set of an OR is created by creating both parts recursively |
382 |
|
|
and merging them. Any filter is applied to all subexpressions. |
383 |
|
|
- for all other expressions, if not filtering and the right hand does |
384 |
|
|
not support direct marking, create it and use as left hand filter. |
385 |
|
|
Else create the left hand (using any filter), evaluate the right hand |
386 |
|
|
using direct marking, if possible, else create it and mark. |
387 |
|
|
|
388 |
|
|
Direct marking in subexpressions: |
389 |
|
|
- all simple expressions can efficiently loop their elements |
390 |
|
|
and thus support direct marking. |
391 |
|
|
- OR and NOT support direct marking, if the left hand supports it |
392 |
|
|
and the right hand is a simple expression |
393 |
|
|
|
394 |
|
|
Notes |
395 |
|
|
- OR supports marking also if the right hand is a plain OR. |
396 |
|
|
However, this case (A+(B+C)) should be expressed as ((A+B)+C). |
397 |
|
|
- marking can be extended to AND and OR, NOT with complex subexpressions |
398 |
|
|
using multiple mark values. (G) and (F) can support marking like AND when |
399 |
|
|
evaluated in their dimension; the only non-trivial case of this is when |
400 |
|
|
embedded in an OR. This extended marking is an optional feature. |
401 |
|
|
- in terms of buffers, filtering is most efficient in an intersection, |
402 |
|
|
because the filter set can be released early. |
403 |
|
|
It is most efficient, where the right hand support direct marking. |
404 |
|
|
In associative situations, it may be possible to rewrite '(A op B) op C' |
405 |
|
|
as 'A op (B op C)', so C's result can be released early during the |
406 |
|
|
evaluation of B and likewise B's result in A. |
407 |
|
|
|
408 |
|
|
|
409 |
|
|
Reordering expressions is by far the most important technique, |
410 |
|
|
and the best thing about it is that you can do it. |
411 |
|
|
To be most efficient, the parse tree must be as unbalanced as possible: |
412 |
|
|
In the best case, it is fully left associative like |
413 |
|
|
'((((A op B) op C) op D) op E) op F' |
414 |
|
|
(with A-F simple expressions supporting direct marking). |
415 |
|
|
That way we can always directly operate at the outermost level on |
416 |
|
|
the result set of the previous expression. |
417 |
|
|
|
418 |
|
|
|
419 |
|
|
On the default optimization level 0, no automatic expressions rewriting is done. |
420 |
|
|
Higher optimization levels will be specified, e.g. we would expect level 1 |
421 |
|
|
to recognize a common case 'A.B.C' in a lower dimensional context, |
422 |
|
|
and rewrite the proper expression 'A.(B.C)' to '(B.C).A', |
423 |
|
|
so we can reuse B's details once using marking. |
424 |
|
|
|
425 |
|
|
Full optimization should also use all commutattive properties |
426 |
|
|
to choose subexpressions for marking and filtering. |
427 |
|
|
|
428 |
|
|
|
429 |
|
|
* processing filter expressions |
430 |
|
|
|
431 |
|
|
For filter expressions, the result sets are created from the current record |
432 |
|
|
and thus are assumed to be rather small, so that buffer space is not an issue. |
433 |
|
|
|
434 |
|
|
Instead, short cut evaluation can be assumed to be much more important, |
435 |
|
|
since for an interesting filter expression we expect it to not match |
436 |
|
|
in most cases. All multiplicative operators should evaluate a term before |
437 |
|
|
a complex subexpression and stop processing if it evaluates to the empty set. |
438 |
|
|
|
439 |
|
|
|
440 |
|
|
Probably the most important optimization here would be to detect terms with |
441 |
|
|
exact, prefix or contains relation which are required by multiplicative |
442 |
|
|
operators and to do a quick check for their occurence as soon as possible. |
443 |
|
|
|
444 |
|
|
Especially where such a filter is applied to all records of a db, |
445 |
|
|
it should be applied while streaming the record file, |
446 |
|
|
hopefully approaching the performance of grep. |
447 |
|
|
|
448 |
|
|
|
449 |
|
|
* limits |
450 |
|
|
|
451 |
|
|
Several limits apply to query and especially search evaluation. |
452 |
|
|
Some are currently hardcoded, but may become configuration options |
453 |
|
|
in future releases. |
454 |
|
|
|
455 |
|
|
- the number of subexpressions (terminals and operators) in any |
456 |
|
|
expression is limited to 500. |
457 |
|
|
- the nesting depth during parsing is limited to 50. |
458 |
|
|
(This is usually lower then the depth of the final parse tree). |
459 |
|
|
- the number of open queries is limited by session option q (default 20). |
460 |
|
|
- the size (number of records) of a search result set is limited |
461 |
|
|
by session option s (default 10.000). |
462 |
|
|
|
463 |
|
|
|
464 |
|
|
--- |
465 |
|
|
$Id: Query.txt,v 1.19 2004/07/16 12:42:39 istr Exp $ |
466 |
|
|
|
467 |
|
|
For all those mathematical terms, |
468 |
|
|
> http://mathforum.org/dr.math/faq/faq.property.glossary.html ask Dr. Math |