1 |
Malete query expressions: |
2 |
|
3 |
Like CDS/ISIS, Malete supports both index based searching and |
4 |
record based filtering. Unlike in CDS/ISIS, the basic filtering syntax is not |
5 |
based on the formatting language, but a simple extension of the search syntax. |
6 |
|
7 |
However, a more general record manipulation language, the "patchwork", |
8 |
is planned, which will also act as a filter based on complex conditions. |
9 |
|
10 |
|
11 |
In a query expression, everything up to a '?' is a search expression, |
12 |
the remaining part defines a filter. |
13 |
|
14 |
|
15 |
In general, an expression defines a result set, consisting of pointers |
16 |
to occurrences of keys in records. For searching, these pointers are |
17 |
fetched from the "inverted file", the .mqd B-Tree. On filtering, |
18 |
the expression creates pointers as needed from the current record |
19 |
and matches the record, if the result set is not empty. |
20 |
|
21 |
|
22 |
An expression consists of |
23 |
- simple expressions creating result sets |
24 |
- operators modifying result sets |
25 |
- parentheses to control evaluation order |
26 |
|
27 |
|
28 |
* pointers |
29 |
|
30 |
A pointer (to an occurrence of a key in a record) |
31 |
is a 4-tuple (a point in a 4 dimensional space) (r,t,o,p) of |
32 |
- 1 the record's id |
33 |
- 2 the tag of the field where the key was found, |
34 |
so Smith in an author field can be distinguished from Smith in title |
35 |
- 3 the "occurrence" (repeat count) of field, |
36 |
so one can demand Mark and Twain in the same author field, |
37 |
instead of Mark and Twain in two author fields |
38 |
- 4 the position (nth word) within the field |
39 |
|
40 |
|
41 |
* simple expressions |
42 |
|
43 |
A set of results can be created by |
44 |
- [rel][whitespace]term |
45 |
a relation to a term, producing all pointers to keys that have the given |
46 |
relation (as listed below) to term. Relation defaults to equality =. |
47 |
term$ is a legacy form of %term. |
48 |
Term is a series of "word characters", which are ASCII letters and digits, |
49 |
the underscore and all non-ASCII characters (code greater than 127), |
50 |
or any characters in double quotes ". |
51 |
Between the quotes, two quotes evaluate to a single quote. |
52 |
- #n |
53 |
a reference to the nth query. |
54 |
Inserts the search or filter expression, as applicable, of the nth query. |
55 |
(Might not be fully implemented). |
56 |
In a search expression, an existing search result set might be used. |
57 |
As a special case, if the whole query expression is only #n, |
58 |
the nth query is used as is, including search, filter and cursor. |
59 |
|
60 |
Relations are: |
61 |
$ |
62 |
= equality (not needed) |
63 |
% prefix match |
64 |
> greater than |
65 |
>= greater than or equal |
66 |
< less than (not yet implemented, behaves like '>'. For range queries |
67 |
use '-' rather than '<' '*' '>' ) |
68 |
<= less than or equal (not yet implemented, behaves like '>=') |
69 |
: contains (filtering only) |
70 |
~ regular expression match (optional, filtering only) |
71 |
$ |
72 |
The regexp match will not be available in all environments |
73 |
(since it requires a regexp implementation, probably the one from Tcl). |
74 |
Yet, it's just too tempting ... |
75 |
|
76 |
|
77 |
The less/greater than (or equal) operators are not used alone, |
78 |
but in "key identity" expressions to define ranges. |
79 |
|
80 |
|
81 |
* key identity |
82 |
|
83 |
The semantics of pointers is to denote a word position in an occurrence |
84 |
of a tag in a record. We assume that a pointer is only found at the one |
85 |
key being (some transformation of) the word at that position, |
86 |
i.e. pointers are unique. |
87 |
|
88 |
|
89 |
Therefore, matching arbitrary result sets on pointer identity is of little |
90 |
use. However, when applied to terms, we can use a pointer identity condition |
91 |
as a key identity condition, effectively limiting the range of keys searched: |
92 |
- - key identity |
93 |
is a pseudo operator, syntactically operating on result sets, |
94 |
but actually implemented as joining and modifying term relations. |
95 |
|
96 |
Both operands of - must be terms, it refuses to operate on other expressions. |
97 |
If the left hand term has the = relation, it is modified to >=. |
98 |
If the right hand has the = relation, it is modified to <. |
99 |
The expression 'A - B' will search all keys greater or equal "A" |
100 |
and less than "B". To include "B", use 'A - <=B'. |
101 |
The prefix '%ABC' is actually just a shorthand for 'ABC - ABD'. |
102 |
If multiple upper or lower bounds, including those from a prefix, |
103 |
are found in the operands, the lowest lower and highest upper ones apply. |
104 |
|
105 |
|
106 |
Currently, only the default case greater or equal to A and less than B |
107 |
is implemented. |
108 |
|
109 |
|
110 |
* tag filter |
111 |
|
112 |
The tag filter selects pointers with certain tags by restricting |
113 |
the second dimension to one or more tag values: |
114 |
- expr/tag or expr/(tag[,tag...]) |
115 |
where tags are literal numbers. In other words, it intersects the left hand |
116 |
result set with the set {(r,t,o,p)|t element of taglist} |
117 |
|
118 |
As a special case, a tag filter with an empty left hand expression can be used |
119 |
at the very start of (the outermost level of) an expression: |
120 |
- in filtering to select the fields for final output. |
121 |
This is a proper part of filtering, |
122 |
since a record with no matching fields is skipped. |
123 |
- in searching to force a given ordering. |
124 |
This is not a proper part of searching, but rather defines a cursor |
125 |
for record retrieval, which then uses the search result set as filter. |
126 |
We propably should define an extended syntax for ordering, |
127 |
like we have for filtering, to allow key range conditions here. |
128 |
Since this does a full index scan while retrieving records, |
129 |
results are not made unique, and ordering will be most useful only |
130 |
with a single tag (or some alternative tags) of a non-repeatable field. |
131 |
Currently not implemented. |
132 |
|
133 |
The tag filter, like the pointer identity, is another pseudo operator, |
134 |
actually modifying its subexpressions rather than their result sets. |
135 |
See below for details. |
136 |
|
137 |
|
138 |
* binary operators |
139 |
|
140 |
Additive operator: |
141 |
- + is the logical OR |
142 |
calculating the union of result sets |
143 |
|
144 |
All other operators reduce their left hand result set by selecting elements |
145 |
based on certain relations they have to those of the right hand set. |
146 |
In other words they are intersections of the left hand with a certain |
147 |
set defined (but not actually created) based on the right hand set. |
148 |
|
149 |
|
150 |
Multiplicative operators select pointers from their left hand set |
151 |
based upon their equality to pointers in the right hand set |
152 |
in the first n dimensions: |
153 |
- * ("AND") is 1-dimensional equality |
154 |
selects pointers with the same record id, i.e. those pointer from the |
155 |
left hand set for which there exists a pointer with the same record id |
156 |
in the right hand set |
157 |
- ; or (G) is 2-dimensional equality |
158 |
selecting on same (record and) tag |
159 |
- , or (F) is 3-dimensional equality |
160 |
selecting on same (record and tag and) occurrence |
161 |
Mathematically, these operators create an intersection of the left hand |
162 |
with a projection of the right hand to the first n dimensions |
163 |
cross product with the full space of the remaining dimensions. |
164 |
E.g. A;B := A intersect {(r,t,x,y)|exist o,p with (r,t,o,p) in B}, |
165 |
which we can denote as the (G) set of B. |
166 |
|
167 |
The full 4-dimensional equality operator, pointer identity, |
168 |
is a pseudo operator as described above. |
169 |
|
170 |
|
171 |
The difference operator |
172 |
- ^ ("NOT" or "WITHOUT") |
173 |
selects based on 1-dimensional inequality, i.e. those pointer from the |
174 |
left hand set for which there exists NO pointer with the same record id |
175 |
in the right hand set. |
176 |
A^B := A intersect {(r,x,y,z)|not exist t,o,p with (r,t,o,p) in B} |
177 |
(the NOT set of B). |
178 |
Note that the literal words OR, AND and NOT are not operators, but terms. |
179 |
|
180 |
|
181 |
The distance operators are based on a proximity relation in the 4th dimension, |
182 |
requiring 3-dimensional equality (same record, tag and occurrence): |
183 |
- .[...] or (n) max distance |
184 |
A sequence of n dots or a number in parentheses selects pointers from the |
185 |
left hand, for which there is a pointer in the right hand where the word |
186 |
position p differs by at most n. (0) is pointer identity. |
187 |
A(n)B := A intersect |
188 |
{(r,t,o,x)|exists p with abs(p-x) <=n and (r,t,o,p) in B} |
189 |
(the dist set of B). |
190 |
- $$[$$] exact distance: |
191 |
likewise with exact position difference n. |
192 |
(A single $ could be confused with prefix match, but is the same as a .). |
193 |
|
194 |
Due to the problems of word counting during filtering, |
195 |
the distance operators may silently behave like 3-dimensional equality. |
196 |
|
197 |
|
198 |
* operator precedence |
199 |
|
200 |
Adjacent terms with no operator in between are combined by *. |
201 |
The operator precedence can by overriden by enclosing any expression |
202 |
in parentheses (). |
203 |
|
204 |
|
205 |
The operators in decreasing precedence: |
206 |
- pointer/key identity - |
207 |
- positional distance . and $$ |
208 |
- field level match , and ; ((F) and (G)) |
209 |
- filter / |
210 |
- record level match * and ^ |
211 |
- additive + |
212 |
On the same precedence level, |
213 |
the distance operators .. and $$ associate right ('A.B.C' is 'A.(B.C)'), |
214 |
all others associate left ('A B C' is '(A*B)*C'). |
215 |
|
216 |
The tag filter / has highest precedence to its right hand, |
217 |
since the tag list is not subject to any operator evaluation. |
218 |
|
219 |
|
220 |
* detailed semantics of searching |
221 |
|
222 |
First of all, tag filters are *defined* to be applied fully distributive. |
223 |
For operators of higher precedence, they are logically distributive: |
224 |
'(A (G) B)/101' yields the same as '(A/101) (G) (B/101)', |
225 |
hence their lower precedence. For '(A ^ B)/100', |
226 |
it looks like we should first select all results for A, |
227 |
strip those where B is (anywhere) in the same record, |
228 |
and then strip all in fields other than 100. |
229 |
This, however, would be better written as 'A/100 ^ B': |
230 |
there is no point in delaying a tag filter. |
231 |
The only reasonable use of applying a tag filter to a lower precedence |
232 |
operation (which requires use of parentheses anyway) is to distribute it. |
233 |
So, '(A ^ B)/100' is *defined* as 'A/100 ^ B/100'. |
234 |
|
235 |
Second, we define tag filters to not nest, but override, |
236 |
so that only the innermost applies. E.g. in '(A/101 B C)/102', |
237 |
only B and C have to be found in field 102, while A is checked in 101. |
238 |
Again, this is because there would be no point in intersecting nested |
239 |
tag lists: it's shorter to write what should be in effect in the first place. |
240 |
|
241 |
|
242 |
To summarize, |
243 |
- tag filters are always distributed down to the term level, |
244 |
where they are applied most efficiently, and |
245 |
- there is always at most one tag filter in effect. |
246 |
In the further reasoning, tag filters are not considered. |
247 |
|
248 |
|
249 |
The final result set of a search expression is 1-dimensional |
250 |
(record id only), in other words Malete does not maintain |
251 |
> http://lcweb.loc.gov/z3950/agency/markup/23.html Extended Result Sets |
252 |
, but discards the other dimensions after evaluating a search. |
253 |
If higher dimensional operators (including tag filter) are applied to #n |
254 |
references, the original search has to be recomputed. |
255 |
(Which might not be supported). |
256 |
|
257 |
|
258 |
The reason for this is: |
259 |
- it reduces space requirements between queries |
260 |
- delivering records from search results is easier, |
261 |
especially support for free cursor movement |
262 |
- commutative properties allow for a couple of optimizations |
263 |
|
264 |
Intermediate results during query evaluation may have to contain higher |
265 |
dimensional data, depending on the operations which will be applied to them: |
266 |
- 1 record id only for OR, AND, NOT |
267 |
- 2 plus tag for ; (G) |
268 |
- 3 plus occurrence for , (F) |
269 |
- 4 plus position in field for .. and $$ |
270 |
We say that an expression is evaluated in a n-dimensional context, |
271 |
if n is the highest dimension of an operator to be applied to its result set. |
272 |
(The OR here is considered neutral in that it keeps data in all dimensions, |
273 |
but does not require it). This is an important property, |
274 |
because it allows to ignore what happens to other dimensions. |
275 |
|
276 |
|
277 |
Not all operators are associative or commutative, |
278 |
in general, the order of evaluation and operands matter. |
279 |
For example, should 'A.B.C' select a C adjacent to A or to B |
280 |
or to any of these? By the right associative property, 'A.B.C' is 'A.(B.C)', |
281 |
which by definition first selects those pointers to (occurrences of) B, |
282 |
which are adjacent to C, then pointers to A adjacent to one of those Bs, |
283 |
which is probably what you expected. |
284 |
It is important for 'B.C' to create pointers to B, not to C, |
285 |
so we can not use 'A.(C.B)' nor '(A.B).C', |
286 |
both of which would select As adjacent to C. |
287 |
|
288 |
|
289 |
* properties of operators |
290 |
|
291 |
Yet, there are some useful properties of operators. |
292 |
|
293 |
|
294 |
The multiplicative operators and the difference are based on equality relations, |
295 |
which are reflexive (p=p), symmetric (if p=q then q=p) and transitive |
296 |
(if p=q and q=r, then p=r). |
297 |
|
298 |
The proximity relation is symmetric, but not transitive: |
299 |
If A is adjacent to B, and B adjacent to C, A is usually not adjacent to C |
300 |
(actually it can only be adjacent to another C like in a field "C A B C"). |
301 |
Exact distance is not even reflexive. |
302 |
|
303 |
|
304 |
Therefore: |
305 |
- intersection operators are (left and right) distributive over union: |
306 |
'(A+B) op C' = '(A op C)+(B op C)' and 'A op (B+C)' = '(A op B)+(A op C)' |
307 |
for all but the NOT, which reads 'A^(B+C)' = 'A^B^C'. |
308 |
- multiplicative operators and NOT are (self) associative: |
309 |
'(A op B) op C' = 'A op (B op C)', where op is one of * , ; or ^. |
310 |
This also holds where the operators are not the same, |
311 |
but the first is not NOT and the second has no higher dimension. |
312 |
- multiplicative operators are relative commutative in their dimension: |
313 |
'A op B' = 'B op A' if evaluated in no higher dimensional context |
314 |
(since the results are based on a n-dimensional equality, |
315 |
they are equal in the first n dimensions). |
316 |
- for all intersection operators o,p we also have a property similar to |
317 |
associativity: '(A o B) p C' = '(A p C) o B', since both expressions |
318 |
intersect A with both the o-set of B and the p-set of C. |
319 |
In a commutative context, this also equals 'B o (A p C)', |
320 |
which can be used to get an existing left hand result set B |
321 |
inside a right hand subexpression. |
322 |
- distance is commutative in lower dimensional contexts: |
323 |
e.g. '(A.B),C' = '(B.A),C', since it requires 3-dimensional equality, |
324 |
and, with the above, also '(A.B).C' = 'B.(A.C)'. |
325 |
|
326 |
Note that the context dimension depends not only on the direct parent operator, |
327 |
but the highest dimension of any ancestor in the parse tree: |
328 |
While '(A.B),C' = '(B.A),C' as final expressions have the same |
329 |
(1-dimensional) results, '((A.B),C)..D' != '((B.A),C)..D', |
330 |
since D's distance will be tested against A or B, resp. |
331 |
|
332 |
|
333 |
* processing search expressions |
334 |
|
335 |
Some notes on how a search expression is evaluated might be helpful |
336 |
in order to write efficient queries. |
337 |
|
338 |
|
339 |
Basic naive query processing proceeds by |
340 |
- recursively calculating the result sets of subexpressions |
341 |
- combining these result sets according to the operator |
342 |
|
343 |
Techniques used in combining result sets are |
344 |
- merging |
345 |
This is used by the OR to create the union of sets. |
346 |
Duplicates with regard to the required dimensions are eliminated. |
347 |
- marking |
348 |
First calculate the left hand result set. Iterate over the right hand set, |
349 |
look up and mark elements having the relation in the left hand set. |
350 |
Finally, eliminate all unmarked (or marked, for NOT) elements. |
351 |
This can be used to implement all relation based operators. |
352 |
However, it is most useful where the right hand expression can loop |
353 |
its elements without actually creating the set (direct marking). |
354 |
- filtering |
355 |
First calculate the right hand result set. Then, while generating |
356 |
elements for the left hand, look up the relation in the right hand |
357 |
set and store left hand elements only as appropriate. |
358 |
Note that a filter applies only while creating sets, |
359 |
not during marking or merging. |
360 |
All techniques require the result sets to be ordered according to the |
361 |
dimensions, i.e. by record id then tag then occ and pos, to run efficiently. |
362 |
|
363 |
|
364 |
The actual implementation tries to avoid the naive approach. |
365 |
Instead of calculating subexpression results independently, |
366 |
the results of previous calculations and the mode in which they |
367 |
finally will be combined are used where possible. |
368 |
- tag filters are always applied at the lowest (term) level, |
369 |
thus do not require an additional filtering step |
370 |
- the dimensional context is used to purge unnecessary duplicates and, |
371 |
optionally, can be used to reorder expressions according to their properties. |
372 |
|
373 |
|
374 |
Creating result sets: |
375 |
- for a ref, in an one-dimensional search expression, we have the results set. |
376 |
Else, it can be created by processing the query independently or inlining |
377 |
its text. Anyway, in most cases, especially where we create under a filter, |
378 |
we need to create a copy. |
379 |
- for an exact match term, the result set can be directly fetched |
380 |
from the index. Other term relations are an OR over all keys they match. |
381 |
- the result set of an OR is created by creating both parts recursively |
382 |
and merging them. Any filter is applied to all subexpressions. |
383 |
- for all other expressions, if not filtering and the right hand does |
384 |
not support direct marking, create it and use as left hand filter. |
385 |
Else create the left hand (using any filter), evaluate the right hand |
386 |
using direct marking, if possible, else create it and mark. |
387 |
|
388 |
Direct marking in subexpressions: |
389 |
- all simple expressions can efficiently loop their elements |
390 |
and thus support direct marking. |
391 |
- OR and NOT support direct marking, if the left hand supports it |
392 |
and the right hand is a simple expression |
393 |
|
394 |
Notes |
395 |
- OR supports marking also if the right hand is a plain OR. |
396 |
However, this case (A+(B+C)) should be expressed as ((A+B)+C). |
397 |
- marking can be extended to AND and OR, NOT with complex subexpressions |
398 |
using multiple mark values. (G) and (F) can support marking like AND when |
399 |
evaluated in their dimension; the only non-trivial case of this is when |
400 |
embedded in an OR. This extended marking is an optional feature. |
401 |
- in terms of buffers, filtering is most efficient in an intersection, |
402 |
because the filter set can be released early. |
403 |
It is most efficient, where the right hand support direct marking. |
404 |
In associative situations, it may be possible to rewrite '(A op B) op C' |
405 |
as 'A op (B op C)', so C's result can be released early during the |
406 |
evaluation of B and likewise B's result in A. |
407 |
|
408 |
|
409 |
Reordering expressions is by far the most important technique, |
410 |
and the best thing about it is that you can do it. |
411 |
To be most efficient, the parse tree must be as unbalanced as possible: |
412 |
In the best case, it is fully left associative like |
413 |
'((((A op B) op C) op D) op E) op F' |
414 |
(with A-F simple expressions supporting direct marking). |
415 |
That way we can always directly operate at the outermost level on |
416 |
the result set of the previous expression. |
417 |
|
418 |
|
419 |
On the default optimization level 0, no automatic expressions rewriting is done. |
420 |
Higher optimization levels will be specified, e.g. we would expect level 1 |
421 |
to recognize a common case 'A.B.C' in a lower dimensional context, |
422 |
and rewrite the proper expression 'A.(B.C)' to '(B.C).A', |
423 |
so we can reuse B's details once using marking. |
424 |
|
425 |
Full optimization should also use all commutattive properties |
426 |
to choose subexpressions for marking and filtering. |
427 |
|
428 |
|
429 |
* processing filter expressions |
430 |
|
431 |
For filter expressions, the result sets are created from the current record |
432 |
and thus are assumed to be rather small, so that buffer space is not an issue. |
433 |
|
434 |
Instead, short cut evaluation can be assumed to be much more important, |
435 |
since for an interesting filter expression we expect it to not match |
436 |
in most cases. All multiplicative operators should evaluate a term before |
437 |
a complex subexpression and stop processing if it evaluates to the empty set. |
438 |
|
439 |
|
440 |
Probably the most important optimization here would be to detect terms with |
441 |
exact, prefix or contains relation which are required by multiplicative |
442 |
operators and to do a quick check for their occurence as soon as possible. |
443 |
|
444 |
Especially where such a filter is applied to all records of a db, |
445 |
it should be applied while streaming the record file, |
446 |
hopefully approaching the performance of grep. |
447 |
|
448 |
|
449 |
* limits |
450 |
|
451 |
Several limits apply to query and especially search evaluation. |
452 |
Some are currently hardcoded, but may become configuration options |
453 |
in future releases. |
454 |
|
455 |
- the number of subexpressions (terminals and operators) in any |
456 |
expression is limited to 500. |
457 |
- the nesting depth during parsing is limited to 50. |
458 |
(This is usually lower then the depth of the final parse tree). |
459 |
- the number of open queries is limited by session option q (default 20). |
460 |
- the size (number of records) of a search result set is limited |
461 |
by session option s (default 10.000). |
462 |
|
463 |
|
464 |
--- |
465 |
$Id: Query.txt,v 1.19 2004/07/16 12:42:39 istr Exp $ |
466 |
|
467 |
For all those mathematical terms, |
468 |
> http://mathforum.org/dr.math/faq/faq.property.glossary.html ask Dr. Math |