OpenIsis/Malete field definition and record structures


* overview

A Malete record is a sequence of one or more fields.
The first one is called the header, all others are identified by a numeric tag.

As far as the Malete database core is concerned,
a field may contain any bytes except newline characters.
Assuming anything about the structure of field data,
including any encoding of binary data,
is solely at the application's discretion.

As Malete is designed to be a multi-purpose database engine,
there is no special schema enforced.
However, there is a schema suggested and used by the OpenIsis application.
In the database's
> MetaData metadata record,
fields with tag (00)6 are reserved for this purpose (abuse at your own risk).


The rationale of this field definition is to provide enough flexibility
to efficiently support representations of all structures found in Z39.2
based systems (including but transcending the traditional CDS/ISIS software),
especially the various MARC formats, as well as full representations of
data commonly stored and transmitted in a couple of other formats like
MIME and XML.

The term "representation" means that Malete will not bother to
directly support XML's angle brackets nor XML's/MIME's foo="bar" options
nor the subfield delimiter characters of MARC or CDS/ISIS.
Rather, for any such data there should be a lossless transformation to an
efficient representation in some format described by this field definition.


* structure of fields

While fields may be used to hold a single value,
it is a common technique to treat them as a sequence of subfields.
("A data element considered as a component of a field.", Z39.2).

A field may contain, in that order:
- 0 or more positional subfields of fixed length
- 1 or more positional subfields of variable length
- 0 or more identified subfields of variable length

Fixed length subfields end after as many bytes (not characters!) as given by
their length. They are typically used for data coded as ASCII values.
Neither multi-byte UTF-8 characters nor the delimiter character should be stored
in fixed length subfields (however, it's up to the application to exercise care).

Variable length subfields end at a delimiter character or end of field.
Malete by default uses a tabulator as delimiter,
and import of CDS/ISIS databases converts the caret (hat '^') to tabs,
however applications are free to use any delimiter they want.


Positional subfields are identified by their position within the field,
i.e. by counting that many bytes and delimiters.
Of course, there is only one nth position within a field,
i.e. every positional subfield can occur at most once.
Since the first n bytes and first m delimited subfields are used as the
positional subfields, they may be omitted only if end of field is seen,
i.e. all subsequent subfields are omitted as well.

Identified subfields, on the other hand, start with a single character
identifying the subfield, just like fields in a record are identified by a tag.
Applications unaware of UTF-8 may demand a single byte as identifier.
Where portability is an issue, only ASCII letters and digits should be used.
Since there is at least one positional variable subfield,
identified subfields always start after a delimiter (in accordance with Z39.2).
An identified subfield may occur zero, one or more times in a field.


The MAIN VALUE of a field contains the fixed length subfields together with
the first positional variable subfield. Sloppy applications may use anything
up to the first delimiter, assuming that fixed subfields do not contain it.
In the common situation of having no fixed length subfields,
the main value equals the first positional subfield.
The main value in a field is very similar to a record's header
and commonly used as a key to select a field in a record.
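
The subfield layout described in this section can be sketched in Python.
This is only an illustration, not Malete code: the function name split_field
and its parameters are hypothetical, the lengths of the fixed subfields and
the number of positional variable subfields are assumed to be known from the
field definition, and identifiers are assumed to be single bytes.

```python
def split_field(data, fixed=(), positional=1, delim=b"\t"):
    """Split raw field bytes into subfields (illustrative sketch only).

    fixed      -- byte lengths of the fixed length subfields, in order
    positional -- number of positional variable subfields (at least 1)
    delim      -- delimiter byte; tab is the Malete default
    """
    fixed_parts = []
    pos = 0
    for length in fixed:
        # Fixed subfields are counted in bytes, not characters.
        fixed_parts.append(data[pos:pos + length])
        pos += length
    parts = data[pos:].split(delim)
    variable = parts[:positional]
    # Assumption: every identified subfield starts with a one-byte identifier.
    identified = [(p[:1], p[1:]) for p in parts[positional:] if p]
    # Main value: the fixed subfields plus the first positional variable subfield.
    main_value = data[:pos] + (variable[0] if variable else b"")
    return fixed_parts, variable, identified, main_value


# Two fixed subfields of length 1, one positional subfield, two identified ones.
print(split_field(b"10title\taX\taY", fixed=(1, 1)))
```

Working on bytes rather than decoded text keeps the fixed length arithmetic
in line with the byte-based definition above.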


The properties of subfields stated so far are consequences of their very
definition. Additional properties, e.g. the main value being empty
or an identified subfield having a fixed length and/or occurring exactly once,
may be demanded by the field definition.
It is the application's responsibility to make sure records do not violate
the field definition; the Malete server will happily store whatever it receives.


* definition of fields

The field definition uses fields of the metadata record,
one per field and one per subfield.
These fields themselves do not use fixed length subfields.
The main value is a (non-unique) key:
- 'tag' for a field definition,
where tag is an integer. Negative numbers are reserved for counted structures.
By convention, general application data fields should
> TagUse use tags
100 - 999.
- 'tag#len' for a fixed subfield,
where len is a positive integer.
- 'tag#' for an additional variable positional subfield.
The first variable positional subfield's type, values and xref
are defined with the main field definition.
- 'tag^i' for a subfield identified by character i
('^' is the actual hat character, which is NOT the subfield delimiter;
the field definition uses tabulators).

All other subfields in the field definition are identified and optional:
- n name
A name by which a field or subfield can be referred to.
Field names must be unique and subfield names must be unique in their field.
It is strongly recommended to only use C identifiers,
i.e. ASCII letters, digits and the underscore, not starting with a digit.
- d description
Some textual description suitable for the database users.
- m min/mandatory
The sub/field must occur at least as many times as given by this option's
value (empty=1, absent=0).
- r repeatable
The sub/field may occur at most as many times as given by this option's
value (empty=any, absent=1). A value preceded by '+' (including a single
'+' for any) implies the mandatory option (at least one occurrence).
- v value
Every occurrence of this repeatable option is of the form name=value,
associating the symbolic name with a legal value for the sub/field.
The first such value is used as a default where the sub/field is created
for some reason.
- t type
Type of this sub/field; see further below.
Defaults to any (non-control) characters.
Applications might support repeated alternative types.
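
How the m and r options combine into occurrence bounds can be sketched as
follows. The function name occurrence_bounds and the dictionary input are
assumptions made for this illustration: an absent option is a missing key,
an empty option is an empty string, and None stands for "any number".

```python
def occurrence_bounds(options):
    """Return (minimum, maximum) occurrences from the m and r options.

    options maps an option letter to its string value; maximum is None
    when any number of occurrences is allowed. Illustrative sketch only.
    """
    # m: absent = 0, empty = 1, otherwise the given number
    if "m" not in options:
        minimum = 0
    elif options["m"] == "":
        minimum = 1
    else:
        minimum = int(options["m"])

    # r: absent = at most once, empty = any, otherwise the given number
    if "r" not in options:
        maximum = 1
    else:
        value = options["r"]
        if value.startswith("+"):
            # A leading '+' implies the mandatory option.
            minimum = max(minimum, 1)
            value = value[1:]
        maximum = int(value) if value else None
    return minimum, maximum


print(occurrence_bounds({"r": "+3"}))
```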

* types of subfields

Note that a field's type actually defines the type of its first
positional variable subfield (which is usually the main value).
If there are no subfields defined for a field,
the field's value equals its main value.


A simple type definition consists of a single letter indicating
a character type, optionally followed by some digits giving a repeat count.
Unlike the byte-based length restrictions of fixed length subfields,
the repeat count should be assumed in terms of characters.

For the terms "alphabetic" and "digit", it's up to the application's
Unicode support to properly check these attributes for non-ASCII characters.
Simple environments may assume any code greater than 127 to be alphabetic.

Basic character types are:
- c character
Any character with a code value greater than or equal to 32 (i.e. no C0 controls).
- a alpha
Any alphabetic character.
- d digit
ASCII digits '0'-'9'.
- n numeric
Digits and optional leading minus sign.
- w word
Alpha, digits and underscore.

Extended character/byte types, possibly not supported by all environments, are:
- b bit/boolean
ASCII digits '0' or '1'. If a subfield of this type is absent, '0' should
be assumed, but a '1' if it's present and empty.
- r raw
Raw bytes using newline/vertical tab encoding as suggested by the
> Protocol
- i integer
Binary coded fixed-point decimal numbers using two decimal digits per byte
(128-99 .. 128+99) and starting with a byte 144 plus the number of bytes before
the decimal point (minus for negative numbers).
Such integers sort properly, avoid newlines and tabs, and the first byte
(for up to 30 decimal digits) is not valid in UTF-8 or any ISO charset.
- t time
Date and time as GTF integer. Up to 8 digits before the decimal point
for date YYYYMMDD, after the decimal point hhmmss...

For all simple type definitions, the same letter may be used uppercase.
With lowercase, the repeat count gives a maximum and defaults to any.
With an uppercase type letter, the repeat count is exact and defaults to 1.


Complex type definitions include the following:
- = pattern
Pattern is a sequence of simple type definitions of basic character types.
E.g. 'A3a6' denotes 3 to 9 alphabetic characters
(exactly 3, followed by up to 6 more).
Any special character in pattern denotes itself (typically as separator).
- ~ regexp
Depending on the regexp package used.
- " literal
Must have one of the values listed with the field's v option.
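
As an illustration of the pattern type, the following Python sketch turns a
pattern into a regular expression. The name pattern_to_regex and the chosen
character classes are assumptions of this example, not part of Malete.
The n type is left out because the text does not say how a repeat count
interacts with the leading minus sign, and characters that are not letters
(stray digits included) are treated as literals.

```python
import re

# Assumed regex equivalents of the basic character types.
CLASSES = {
    "c": r"[^\x00-\x1f]",          # any character with code >= 32
    "a": r"[^\W\d_]",              # alphabetic, per Python's Unicode tables
    "d": r"[0-9]",                 # ASCII digits
    "w": r"(?:[^\W\d_]|[0-9_])",   # alpha, ASCII digits and underscore
}


def pattern_to_regex(pattern):
    """Compile a '=' pattern such as 'A3a6' (illustrative sketch only)."""
    out = []
    for m in re.finditer(r"([A-Za-z])(\d*)|(.)", pattern, re.S):
        letter, count, other = m.groups()
        if other is not None:
            # Special characters denote themselves.
            out.append(re.escape(other))
            continue
        cls = CLASSES.get(letter.lower())
        if cls is None:
            raise ValueError("unsupported type letter %r" % letter)
        if letter.isupper():
            # Uppercase: exact count, defaulting to 1.
            out.append("%s{%s}" % (cls, count or "1"))
        elif count:
            # Lowercase with count: the count is a maximum.
            out.append("%s{0,%s}" % (cls, count))
        else:
            # Lowercase without count: any number of characters.
            out.append(cls + "*")
    return re.compile("".join(out))


rx = pattern_to_regex("A3a6")
print(bool(rx.fullmatch("abc")), bool(rx.fullmatch("ab")))
```

Matching with fullmatch checks the whole subfield value against the pattern,
which seems the natural reading of a type constraint.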

The field definition of the basic field definition itself is:
$
6 6 nfdt dfield definition r t=Nc
6 6^n nname dsub/field name
6 6^d ndesc dsub/field description
6 6^m nmin dmin number of occurrences tn
6 6^r nrep dmax number of occurrences
6 6^v nval dnamed values r
6 6^t ntype dsub/field type
$


* advanced field definition

There are some advanced field definition options which are probably
not supported by all applications.
Where used, however, the following formats are recommended:
- b base
The key or name of another sub/field definition in this metadata record
from which options (and, for a field, subfield definitions) should be
used for this entity. Obviously just a convenience feature.
- x xref
Definition of some other entity referred to by the value of this sub/field.
Described elsewhere.
- s structure a.k.a. subrecord
The field introduces a structure in the record; see further below.
- c child
This repeatable option specifies a tag or name of a legal child field.
Applications might support this being followed by '[:min][-max]'
to specify a min and/or max count of occurrences of this child,
or one of the characters '+' (at least once), '?' (at most once),
'!' (exactly once) or '*' (any number of times, default).
In the definition of those children, r0 may be used to indicate that they
should not occur in the record except where explicitly listed as legal child.

* structures

The structure option indicates that a field is the header of a structure,
meaning that some fields following it in the record somehow belong to it.
("A group of fields within a record that may be treated as a logical entity.
(When a record describes more than one entity, the descriptions of individual
entities may be treated as subrecords.)", Z39.2).


While in general there are a couple of ways to mark a sequence of fields
as logically being one entity, there are three methods supported by
the field definition:
- counted structures
If the s option's value is empty,
the field's tag is the negative number of fields belonging to the
structure, including the header. This is the means used by the
> Protocol
to efficiently and transparently embed any records in messages.
Obviously counted structures cannot be accessed by their tag.
They are defined as some negative tags.
Some known format of their main value (especially a literal)
may be used to access them by key.
- delimited structures
If the s option's value is '+', the field has one additional initial
subfield of fixed length 1. For a given occurrence of this field,
this subfield must contain either '-', indicating that there are
no children, be absent (i.e. the field is completely empty),
or contain a '+', indicating that all fields up to a matching
empty field of the same tag are the structure's children.
- fixed structures
If the s option's value is a number, the structure has exactly as
many children as given by this number. Note that the number of fields
may be greater if the children are structures themselves. Rarely used.

Note that while the field definition in general does not specify
the ordering of fields, the children of a structure are always
a consecutive range according to the structure's definition.
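
Reading delimited structures back into a tree can be sketched in Python as
below. The function name build_tree, the (tag, value) pair input and the
dictionary node layout are assumptions of this example. Tag 0 is treated as
a plain field without the leading '+' or '-' subfield, which is a guess at
how text fields would be handled rather than something the definition states.

```python
def build_tree(fields, plain_tags=(0,)):
    """Nest (tag, value) pairs using delimited structures (sketch only)."""
    root = {"tag": None, "value": "", "children": []}
    stack = [root]
    for tag, value in fields:
        if tag in plain_tags:
            # Plain fields carry no structure marker and have no children.
            stack[-1]["children"].append(
                {"tag": tag, "value": value, "children": []})
            continue
        if value == "":
            # A completely empty field closes the open structure of same tag.
            if stack[-1]["tag"] != tag:
                raise ValueError("unmatched closing field %r" % (tag,))
            stack.pop()
            continue
        marker, rest = value[0], value[1:]
        if marker not in "+-":
            raise ValueError("field %r lacks a '+' or '-' marker" % (tag,))
        node = {"tag": tag, "value": rest, "children": []}
        stack[-1]["children"].append(node)
        if marker == "+":
            # '+' opens a structure; following fields become its children.
            stack.append(node)
    if len(stack) != 1:
        raise ValueError("unclosed structure %r" % (stack[-1]["tag"],))
    return root["children"]


fields = [
    (100, "+\tw100%"), (101, "+"), (102, "+\tvtop\tw160"),
    (0, "this is the textbody"), (103, "-"), (0, "of the td node"),
    (102, ""), (101, ""), (100, ""),
]
tree = build_tree(fields)
print(tree[0]["tag"], len(tree[0]["children"]))
```

A stack is enough here because a closing field always matches the innermost
open structure.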


Z39.2 reserves control field 002 for "subrecord purposes",
e.g. listing the offsets of such "groups of fields".


* recommendations

- fixed subfields should contain only bytes 32 to 126, inclusive
- if delimited structures are used, they should be used consistently,
i.e. all fields (but 0) should have that type
- fixed structures should only be used for internal purposes

* examples

The headers of email or other MIME messages like
$
Subject: hi there
Content-Type: text/plain; charset="iso8859-1"
$
using a field definition of
$
6 10 nsubject
6 11 ncontent-type
6 11^c ncharset
$
map to
$
10 hi there
11 text/plain ciso8859-1
$
Value options could be used to encode common values like text/plain.
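
The mapping of the MIME example can be sketched in Python. The tables
FIELD_TAGS and SUBFIELD_IDS simply restate the field definition shown above,
and the function name header_to_field is hypothetical. The parameter
splitting is deliberately naive: it ignores semicolons inside quoted values
and raises KeyError for headers or parameters without a definition.

```python
# Restates the example field definition: names to tags, parameters to ids.
FIELD_TAGS = {"subject": 10, "content-type": 11}
SUBFIELD_IDS = {11: {"charset": "c"}}


def header_to_field(line):
    """Convert one 'Name: value; key=value' header line (sketch only)."""
    name, _, body = line.partition(":")
    tag = FIELD_TAGS[name.strip().lower()]
    if tag not in SUBFIELD_IDS:
        # No subfields defined: the whole body is the main value.
        return tag, body.strip()
    main, *params = [part.strip() for part in body.split(";")]
    subfields = [main]
    for param in params:
        key, _, value = param.partition("=")
        ident = SUBFIELD_IDS[tag][key.strip().lower()]
        # Identified subfield: identifier character followed by the value.
        subfields.append(ident + value.strip().strip('"'))
    # Subfields are joined with the default tab delimiter.
    return tag, "\t".join(subfields)


print(header_to_field('Content-Type: text/plain; charset="iso8859-1"'))
```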


Using delimited structures, a typical HTML table definition starting with
$
<table width="100%" cellpadding="0" cellspacing="0"
marginwidth="0" marginheight="0" topmargin="0" leftmargin="0" border="0">
<tr>
<td valign="top" width="160">
this is the textbody <br/> of the td node
</td>
</tr>
...
$
using
$
6 100 ntable s+
6 100^w nwidth
...
6 101 ntr
...
$
will be compacted to
$
100 + w100% p0 s0 m0 h0 t0 l0 b0
101 +
102 + vtop w160
0 this is the textbody
103 -
0 of the td node
102
101
...
$
which could save half of the internet's bandwidth.

Some strict XML parsers limit a node to at most one textnode child,
which then should be stored in the node's main value.


* conformance

Most features of Z39.2 (a.k.a. ISO2709 a.k.a. IIF) map directly to
Malete records. Subfield identifiers in Z39.2 can use more than one
character; however, MARC always uses one.
Initial fixed subfields are dubbed "indicators" by Z39.2;
MARC uses two of length 1. They are not considered "data elements",
as other subfields are. Here, fixed subfields are considered less special.


MIME and *ML (SGML,HTML,XML...) data structures can be converted to records
in a straightforward manner after a parser has resolved entities and the like.


---
$Id: RecStruct.txt,v 1.6 2004/07/26 12:23:34 kripke Exp $