1 |
|
2 |
With the emerging use of unicode in ISIS databases, |
3 |
the 30 byte maximum on keys might become a somewhat |
4 |
narrow limit for languages using 2 or 3 UTF-8 bytes per |
5 |
character while still having lots of characters per word. |
6 |
Where UTF-16/UCS-2 is used, all languages would be |
7 |
equally affected by an effective maximum key of |
8 |
15 characters. |
9 |
|
10 |
Therefore, there is a clear need for a new index |
11 |
structure supporting longer keys for some environments. |
12 |
This would in itself not raise any compatibility issues, |
13 |
since an index is always redundant and can be reconstructed |
14 |
by any plattform in a suitable format for this plattform |
15 |
(of course according to the plattform's builtin limits). |
16 |
|
17 |
While it would be straightforward to just create some n03/l03 |
18 |
pair of files or change 01/02 parameters, and some tools |
19 |
might even care for the control-file contents and work without |
20 |
modification using such indexes, some other tools wouldn't. |
21 |
|
22 |
|
23 |
In this situation, one may want to spend some time thinking |
24 |
about how an index should be organised. For programmers having |
25 |
coped with ISAM and databases in theory and practice, |
26 |
the reasons for some design decisions of traditional |
27 |
ISIS index structures are less than obvious. |
28 |
|
29 |
|
30 |
* blank padding |
31 |
a typical index-block lookup routine does not benefit in speed |
32 |
from blank-padding. while the fixed-array organisation allows |
33 |
for very easy Pascal implementation (where you can't cast |
34 |
some part of a block to a struct), C code doesn't care. |
35 |
instead, the additional space accounts for |
36 |
additional IO and therefore slows down index access. |
37 |
considering that many DBs even allow for compressed index files, |
38 |
where every key is stored as a number of leading bytes it |
39 |
shares with it's predecessor followed by the trailing difference, |
40 |
one should expect some performance improvement from at least |
41 |
eliminating all those blanks. |
42 |
|
43 |
* block alignment |
44 |
the node and leaves files are made up of blocks not aligned |
45 |
at any page boundaries. Every B*tree in textbooks as well as |
46 |
in real-world databases uses page-aligned blocks for best IO |
47 |
performance. potential reader/writer concurrency problems |
48 |
(readers reading partially updated blocks) are also reduced |
49 |
for most systems when IO occurs at filesystem page boundaries. |
50 |
|
51 |
* multiple files |
52 |
The separation of one index structures nodes and leaves is |
53 |
due to the fact that they do have different block sizes. |
54 |
The obvious advantage of having two index file structures is to |
55 |
reduce the space wasted by blank padding. |
56 |
Using page-aligned blocks of fix size with unpadded keys |
57 |
would allow the complete index to go into one single file. |
58 |
Moreover, the postings could be stored together with the |
59 |
leaves, further reducing IO. |
60 |
|
61 |
--- |
62 |
$Id: longidx.txt,v 1.2 2002/11/13 18:52:55 kripke Exp $ |