/[webpac]/openisis/0.9.9e/doc/Query.txt
This is repository of my old source code which isn't updated any more. Go to git.rot13.org for current projects!
ViewVC logotype

Annotation of /openisis/0.9.9e/doc/Query.txt

Parent Directory Parent Directory | Revision Log Revision Log


Revision 604 - (hide annotations)
Mon Dec 27 21:49:01 2004 UTC (19 years, 6 months ago) by dpavlin
File MIME type: text/plain
File size: 19950 byte(s)
import of new openisis release, 0.9.9e

1 dpavlin 604 Malete query expressions:
2    
3     Like CDS/ISIS, Malete supports both index based searching and
4     record based filtering. Unlike in CDS/ISIS, the basic filtering syntax is not
5     based on the formatting language, but a simple extension of the search syntax.
6    
7     However, a more general record manipulation language, the "patchwork",
8     is planned, which will also act as a filter based on complex conditions.
9    
10    
11     In a query expression, everything up to a '?' is a search expression,
12     the remaining part defines a filter.
13    
14    
15     In general, an expression defines a result set, consisting of pointers
16     to occurrences of keys in records. For searching, these pointers are
17     fetched from the "inverted file", the .mqd B-Tree. On filtering,
18     the expression creates pointers as needed from the current record
19     and matches the record, if the result set is not empty.
20    
21    
22     An expression consists of
23     - simple expressions creating result sets
24     - operators modifying result sets
25     - parentheses to control evaluation order
26    
27    
28     * pointers
29    
30     A pointer (to an occurrence of a key in a record)
31     is a 4-tuple (a point in a 4 dimensional space) (r,t,o,p) of
32     - 1 the record's id
33     - 2 the tag of the field where the key was found,
34     so Smith in an author field can be distinguished from Smith in title
35     - 3 the "occurrence" (repeat count) of field,
36     so one can demand Mark and Twain in the same author field,
37     instead of Mark and Twain in two author fields
38     - 4 the position (nth word) within the field
39    
40    
41     * simple expressions
42    
43     A set of results can be created by
44     - [rel][whitespace]term
45     a relation to a term, producing all pointers to keys that have the given
46     relation (as listed below) to term. Relation defaults to equality =.
47     term$ is a legacy form of %term.
48     Term is a series of "word characters", which are ASCII letters and digits,
49     the underscore and all non-ASCII characters (code greater than 127),
50     or any characters in double quotes ".
51     Between the quotes, two quotes evaluate to a single quote.
52     - #n
53     a reference to the nth query.
54     Inserts the search or filter expression, as applicable, of the nth query.
55     (Might not be fully implemented).
56     In a search expression, an existing search result set might be used.
57     As a special case, if the whole query expression is only #n,
58     the nth query is used as is, including search, filter and cursor.
59    
60     Relations are:
61     $
62     = equality (not needed)
63     % prefix match
64     > greater than
65     >= greater than or equal
66     < less than (not yet implemented, behaves like '>'. For range queries
67     use '-' rather than '<' '*' '>' )
68     <= less than or equal (not yet implemented, behaves like '>=')
69     : contains (filtering only)
70     ~ regular expression match (optional, filtering only)
71     $
72     The regexp match will not be available in all environments
73     (since it requires a regexp implementation, probably the one from Tcl).
74     Yet, it's just too tempting ...
75    
76    
77     The less/greater than (or equal) operators are not used alone,
78     but in "key identity" expressions to define ranges.
79    
80    
81     * key identity
82    
83     The semantics of pointers is to denote a word position in an occurrence
84     of a tag in a record. We assume that a pointer is only found at the one
85     key being (some transformation of) the word at that position,
86     i.e. pointers are unique.
87    
88    
89     Therefore, matching arbitrary result sets on pointer identity is of little
90     use. However, when applied to terms, we can use a pointer identity condition
91     as a key identity condition, effectively limiting the range of keys searched:
92     - - key identity
93     is a pseudo operator, syntactically operating on result sets,
94     but actually implemented as joining and modifying term relations.
95    
96     Both operands of - must be terms, it refuses to operate on other expressions.
97     If the left hand term has the = relation, it is modified to &gt;=.
98     If the right hand has the = relation, it is modified to &lt;.
99     The expression 'A - B' will search all keys greater or equal "A"
100     and less than "B". To include "B", use 'A - &lt;=B'.
101     The prefix '%ABC' is actually just a shorthand for 'ABC - ABD'.
102     If multiple upper or lower bounds, including those from a prefix,
103     are found in the operands, the lowest lower and highest upper ones apply.
104    
105    
106     Currently, only the default case greater or equal to A and less than B
107     is implemented.
108    
109    
110     * tag filter
111    
112     The tag filter selects pointers with certain tags by restricting
113     the second dimension to one or more tag values:
114     - expr/tag or expr/(tag[,tag...])
115     where tags are literal numbers. In other words, it intersects the left hand
116     result set with the set {(r,t,o,p)|t element of taglist}
117    
118     As a special case, a tag filter with an empty left hand expression can be used
119     at the very start of (the outermost level of) an expression:
120     - in filtering to select the fields for final output.
121     This is a proper part of filtering,
122     since a record with no matching fields is skipped.
123     - in searching to force a given ordering.
124     This is not a proper part of searching, but rather defines a cursor
125     for record retrieval, which then uses the search result set as filter.
126     We propably should define an extended syntax for ordering,
127     like we have for filtering, to allow key range conditions here.
128     Since this does a full index scan while retrieving records,
129     results are not made unique, and ordering will be most useful only
130     with a single tag (or some alternative tags) of a non-repeatable field.
131     Currently not implemented.
132    
133     The tag filter, like the pointer identity, is another pseudo operator,
134     actually modifying its subexpressions rather than their result sets.
135     See below for details.
136    
137    
138     * binary operators
139    
140     Additive operator:
141     - + is the logical OR
142     calculating the union of result sets
143    
144     All other operators reduce their left hand result set by selecting elements
145     based on certain relations they have to those of the right hand set.
146     In other words they are intersections of the left hand with a certain
147     set defined (but not actually created) based on the right hand set.
148    
149    
150     Multiplicative operators select pointers from their left hand set
151     based upon their equality to pointers in the right hand set
152     in the first n dimensions:
153     - * ("AND") is 1-dimensional equality
154     selects pointers with the same record id, i.e. those pointer from the
155     left hand set for which there exists a pointer with the same record id
156     in the right hand set
157     - ; or (G) is 2-dimensional equality
158     selecting on same (record and) tag
159     - , or (F) is 3-dimensional equality
160     selecting on same (record and tag and) occurrence
161     Mathematically, these operators create an intersection of the left hand
162     with a projection of the right hand to the first n dimensions
163     cross product with the full space of the remaining dimensions.
164     E.g. A;B := A intersect {(r,t,x,y)|exist o,p with (r,t,o,p) in B},
165     which we can denote as the (G) set of B.
166    
167     The full 4-dimensional equality operator, pointer identity,
168     is a pseudo operator as described above.
169    
170    
171     The difference operator
172     - ^ ("NOT" or "WITHOUT")
173     selects based on 1-dimensional inequality, i.e. those pointer from the
174     left hand set for which there exists NO pointer with the same record id
175     in the right hand set.
176     A^B := A intersect {(r,x,y,z)|not exist t,o,p with (r,t,o,p) in B}
177     (the NOT set of B).
178     Note that the literal words OR, AND and NOT are not operators, but terms.
179    
180    
181     The distance operators are based on a proximity relation in the 4th dimension,
182     requiring 3-dimensional equality (same record, tag and occurrence):
183     - .[...] or (n) max distance
184     A sequence of n dots or a number in parentheses selects pointers from the
185     left hand, for which there is a pointer in the right hand where the word
186     position p differs by at most n. (0) is pointer identity.
187     A(n)B := A intersect
188     {(r,t,o,x)|exists p with abs(p-x) &lt;=n and (r,t,o,p) in B}
189     (the dist set of B).
190     - $$[$$] exact distance:
191     likewise with exact position difference n.
192     (A single $ could be confused with prefix match, but is the same as a .).
193    
194     Due to the problems of word counting during filtering,
195     the distance operators may silently behave like 3-dimensional equality.
196    
197    
198     * operator precedence
199    
200     Adjacent terms with no operator in between are combined by *.
201     The operator precedence can by overriden by enclosing any expression
202     in parentheses ().
203    
204    
205     The operators in decreasing precedence:
206     - pointer/key identity -
207     - positional distance . and $$
208     - field level match , and ; ((F) and (G))
209     - filter /
210     - record level match * and ^
211     - additive +
212     On the same precedence level,
213     the distance operators .. and $$ associate right ('A.B.C' is 'A.(B.C)'),
214     all others associate left ('A B C' is '(A*B)*C').
215    
216     The tag filter / has highest precedence to its right hand,
217     since the tag list is not subject to any operator evaluation.
218    
219    
220     * detailed semantics of searching
221    
222     First of all, tag filters are *defined* to be applied fully distributive.
223     For operators of higher precedence, they are logically distributive:
224     '(A (G) B)/101' yields the same as '(A/101) (G) (B/101)',
225     hence their lower precedence. For '(A ^ B)/100',
226     it looks like we should first select all results for A,
227     strip those where B is (anywhere) in the same record,
228     and then strip all in fields other than 100.
229     This, however, would be better written as 'A/100 ^ B':
230     there is no point in delaying a tag filter.
231     The only reasonable use of applying a tag filter to a lower precedence
232     operation (which requires use of parentheses anyway) is to distribute it.
233     So, '(A ^ B)/100' is *defined* as 'A/100 ^ B/100'.
234    
235     Second, we define tag filters to not nest, but override,
236     so that only the innermost applies. E.g. in '(A/101 B C)/102',
237     only B and C have to be found in field 102, while A is checked in 101.
238     Again, this is because there would be no point in intersecting nested
239     tag lists: it's shorter to write what should be in effect in the first place.
240    
241    
242     To summarize,
243     - tag filters are always distributed down to the term level,
244     where they are applied most efficiently, and
245     - there is always at most one tag filter in effect.
246     In the further reasoning, tag filters are not considered.
247    
248    
249     The final result set of a search expression is 1-dimensional
250     (record id only), in other words Malete does not maintain
251     > http://lcweb.loc.gov/z3950/agency/markup/23.html Extended Result Sets
252     , but discards the other dimensions after evaluating a search.
253     If higher dimensional operators (including tag filter) are applied to #n
254     references, the original search has to be recomputed.
255     (Which might not be supported).
256    
257    
258     The reason for this is:
259     - it reduces space requirements between queries
260     - delivering records from search results is easier,
261     especially support for free cursor movement
262     - commutative properties allow for a couple of optimizations
263    
264     Intermediate results during query evaluation may have to contain higher
265     dimensional data, depending on the operations which will be applied to them:
266     - 1 record id only for OR, AND, NOT
267     - 2 plus tag for ; (G)
268     - 3 plus occurrence for , (F)
269     - 4 plus position in field for .. and $$
270     We say that an expression is evaluated in a n-dimensional context,
271     if n is the highest dimension of an operator to be applied to its result set.
272     (The OR here is considered neutral in that it keeps data in all dimensions,
273     but does not require it). This is an important property,
274     because it allows to ignore what happens to other dimensions.
275    
276    
277     Not all operators are associative or commutative,
278     in general, the order of evaluation and operands matter.
279     For example, should 'A.B.C' select a C adjacent to A or to B
280     or to any of these? By the right associative property, 'A.B.C' is 'A.(B.C)',
281     which by definition first selects those pointers to (occurrences of) B,
282     which are adjacent to C, then pointers to A adjacent to one of those Bs,
283     which is probably what you expected.
284     It is important for 'B.C' to create pointers to B, not to C,
285     so we can not use 'A.(C.B)' nor '(A.B).C',
286     both of which would select As adjacent to C.
287    
288    
289     * properties of operators
290    
291     Yet, there are some useful properties of operators.
292    
293    
294     The multiplicative operators and the difference are based on equality relations,
295     which are reflexive (p=p), symmetric (if p=q then q=p) and transitive
296     (if p=q and q=r, then p=r).
297    
298     The proximity relation is symmetric, but not transitive:
299     If A is adjacent to B, and B adjacent to C, A is usually not adjacent to C
300     (actually it can only be adjacent to another C like in a field "C A B C").
301     Exact distance is not even reflexive.
302    
303    
304     Therefore:
305     - intersection operators are (left and right) distributive over union:
306     '(A+B) op C' = '(A op C)+(B op C)' and 'A op (B+C)' = '(A op B)+(A op C)'
307     for all but the NOT, which reads 'A^(B+C)' = 'A^B^C'.
308     - multiplicative operators and NOT are (self) associative:
309     '(A op B) op C' = 'A op (B op C)', where op is one of * , ; or ^.
310     This also holds where the operators are not the same,
311     but the first is not NOT and the second has no higher dimension.
312     - multiplicative operators are relative commutative in their dimension:
313     'A op B' = 'B op A' if evaluated in no higher dimensional context
314     (since the results are based on a n-dimensional equality,
315     they are equal in the first n dimensions).
316     - for all intersection operators o,p we also have a property similar to
317     associativity: '(A o B) p C' = '(A p C) o B', since both expressions
318     intersect A with both the o-set of B and the p-set of C.
319     In a commutative context, this also equals 'B o (A p C)',
320     which can be used to get an existing left hand result set B
321     inside a right hand subexpression.
322     - distance is commutative in lower dimensional contexts:
323     e.g. '(A.B),C' = '(B.A),C', since it requires 3-dimensional equality,
324     and, with the above, also '(A.B).C' = 'B.(A.C)'.
325    
326     Note that the context dimension depends not only on the direct parent operator,
327     but the highest dimension of any ancestor in the parse tree:
328     While '(A.B),C' = '(B.A),C' as final expressions have the same
329     (1-dimensional) results, '((A.B),C)..D' != '((B.A),C)..D',
330     since D's distance will be tested against A or B, resp.
331    
332    
333     * processing search expressions
334    
335     Some notes on how a search expression is evaluated might be helpful
336     in order to write efficient queries.
337    
338    
339     Basic naive query processing proceeds by
340     - recursively calculating the result sets of subexpressions
341     - combining these result sets according to the operator
342    
343     Techniques used in combining result sets are
344     - merging
345     This is used by the OR to create the union of sets.
346     Duplicates with regard to the required dimensions are eliminated.
347     - marking
348     First calculate the left hand result set. Iterate over the right hand set,
349     look up and mark elements having the relation in the left hand set.
350     Finally, eliminate all unmarked (or marked, for NOT) elements.
351     This can be used to implement all relation based operators.
352     However, it is most useful where the right hand expression can loop
353     its elements without actually creating the set (direct marking).
354     - filtering
355     First calculate the right hand result set. Then, while generating
356     elements for the left hand, look up the relation in the right hand
357     set and store left hand elements only as appropriate.
358     Note that a filter applies only while creating sets,
359     not during marking or merging.
360     All techniques require the result sets to be ordered according to the
361     dimensions, i.e. by record id then tag then occ and pos, to run efficiently.
362    
363    
364     The actual implementation tries to avoid the naive approach.
365     Instead of calculating subexpression results independently,
366     the results of previous calculations and the mode in which they
367     finally will be combined are used where possible.
368     - tag filters are always applied at the lowest (term) level,
369     thus do not require an additional filtering step
370     - the dimensional context is used to purge unnecessary duplicates and,
371     optionally, can be used to reorder expressions according to their properties.
372    
373    
374     Creating result sets:
375     - for a ref, in an one-dimensional search expression, we have the results set.
376     Else, it can be created by processing the query independently or inlining
377     its text. Anyway, in most cases, especially where we create under a filter,
378     we need to create a copy.
379     - for an exact match term, the result set can be directly fetched
380     from the index. Other term relations are an OR over all keys they match.
381     - the result set of an OR is created by creating both parts recursively
382     and merging them. Any filter is applied to all subexpressions.
383     - for all other expressions, if not filtering and the right hand does
384     not support direct marking, create it and use as left hand filter.
385     Else create the left hand (using any filter), evaluate the right hand
386     using direct marking, if possible, else create it and mark.
387    
388     Direct marking in subexpressions:
389     - all simple expressions can efficiently loop their elements
390     and thus support direct marking.
391     - OR and NOT support direct marking, if the left hand supports it
392     and the right hand is a simple expression
393    
394     Notes
395     - OR supports marking also if the right hand is a plain OR.
396     However, this case (A+(B+C)) should be expressed as ((A+B)+C).
397     - marking can be extended to AND and OR, NOT with complex subexpressions
398     using multiple mark values. (G) and (F) can support marking like AND when
399     evaluated in their dimension; the only non-trivial case of this is when
400     embedded in an OR. This extended marking is an optional feature.
401     - in terms of buffers, filtering is most efficient in an intersection,
402     because the filter set can be released early.
403     It is most efficient, where the right hand support direct marking.
404     In associative situations, it may be possible to rewrite '(A op B) op C'
405     as 'A op (B op C)', so C's result can be released early during the
406     evaluation of B and likewise B's result in A.
407    
408    
409     Reordering expressions is by far the most important technique,
410     and the best thing about it is that you can do it.
411     To be most efficient, the parse tree must be as unbalanced as possible:
412     In the best case, it is fully left associative like
413     '((((A op B) op C) op D) op E) op F'
414     (with A-F simple expressions supporting direct marking).
415     That way we can always directly operate at the outermost level on
416     the result set of the previous expression.
417    
418    
419     On the default optimization level 0, no automatic expressions rewriting is done.
420     Higher optimization levels will be specified, e.g. we would expect level 1
421     to recognize a common case 'A.B.C' in a lower dimensional context,
422     and rewrite the proper expression 'A.(B.C)' to '(B.C).A',
423     so we can reuse B's details once using marking.
424    
425     Full optimization should also use all commutattive properties
426     to choose subexpressions for marking and filtering.
427    
428    
429     * processing filter expressions
430    
431     For filter expressions, the result sets are created from the current record
432     and thus are assumed to be rather small, so that buffer space is not an issue.
433    
434     Instead, short cut evaluation can be assumed to be much more important,
435     since for an interesting filter expression we expect it to not match
436     in most cases. All multiplicative operators should evaluate a term before
437     a complex subexpression and stop processing if it evaluates to the empty set.
438    
439    
440     Probably the most important optimization here would be to detect terms with
441     exact, prefix or contains relation which are required by multiplicative
442     operators and to do a quick check for their occurence as soon as possible.
443    
444     Especially where such a filter is applied to all records of a db,
445     it should be applied while streaming the record file,
446     hopefully approaching the performance of grep.
447    
448    
449     * limits
450    
451     Several limits apply to query and especially search evaluation.
452     Some are currently hardcoded, but may become configuration options
453     in future releases.
454    
455     - the number of subexpressions (terminals and operators) in any
456     expression is limited to 500.
457     - the nesting depth during parsing is limited to 50.
458     (This is usually lower then the depth of the final parse tree).
459     - the number of open queries is limited by session option q (default 20).
460     - the size (number of records) of a search result set is limited
461     by session option s (default 10.000).
462    
463    
464     ---
465     $Id: Query.txt,v 1.19 2004/07/16 12:42:39 istr Exp $
466    
467     For all those mathematical terms,
468     > http://mathforum.org/dr.math/faq/faq.property.glossary.html ask Dr. Math

  ViewVC Help
Powered by ViewVC 1.1.26