structuring ISIS records using subfields or subrecords



* structures

The means by which an Isis record can be structured into "data elements"
("A defined unit of information", Z39.2 a.k.a. ISO 2709)
fall into one of two broad categories (citing Z39.2):
- subfields
    "A data element considered as a component of a field."
    In *ML (SGML, HTML, XML...), subfields correspond to a node's attributes.
    In MIME, subfields correspond to attributes of a MIME header value.
- subrecords
    "A group of fields within a record that may be treated as a logical entity.
    (When a record describes more than one entity,
    the descriptions of individual entities may be treated as subrecords.)"
    In *ML, subrecords correspond to a node's children.
    In MIME, subrecords correspond to multipart body parts.


* subfields

Since a field value can actually be anything,
including XML text or a serialized (textual or binary) Isis record,
it can be arbitrarily structured according to a regular expression
or some other grammar (machine parseable or not).


The term subfield, however, is used for a range of characters in the value
which is identified by rather simple means:
- fixed
    If all (or all but the last) subfields have a fixed length
    and are neither optional nor repeatable,
    then each subfield can be found at a fixed position.
- delimited with optional identifier
    This is the proper Z39.2 notion of a subfield.

    If a special delimiter character is found in the field,
    it breaks the field into subfields.
    Z39.2, and thus MARC, uses character 31 as delimiter
    (hex 1F, CTRL-_, ASCII "unit separator" US).
    Traditional Isis uses the caret '^'.

    OpenIsis permits any character, including the horizontal TAB and semicolon.
    More precisely, OpenIsis replaces Z39.2's notion that
    "every subfield is INTRODUCED by a delimiter, unless it isn't"
    with the principle that for every data element, it is specified
    how its end is detected, including by fixed length or varying delimiters.


The initial n characters of a subfield are used to identify the subfield.
Z39.2 permits any (small) fixed value for n, including 0, i.e. not identified.
The MARC family of standards uses n=1.
OpenIsis allows for any value, including variable length identifiers,
which are themselves delimited by some character like a '=' (see below).


Z39.2 states that if identifiers are used, each must be preceded
by a delimiter, and every data element, including the first,
must be identified that way. However, an initial range of m characters
(i.e. preceding the first delimiter) in every field may serve as "indicator",
which is not regarded as a "data element". Again, m is a small fixed number;
MARC uses m=2. Traditional Isis has no special support for indicators.
OpenIsis gives access to whatever precedes the first delimiter.
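
The delimiter, identifier length n and leading indicator described above can be illustrated with a minimal Python sketch. The function name `split_subfields` and its parameters are hypothetical and not part of any OpenIsis API; the sketch only assumes a single delimiter character and a fixed identifier length.

```python
def split_subfields(value, delimiter="^", id_len=1):
    """Split a field value into (indicator, [(identifier, data), ...]).

    The indicator is whatever precedes the first delimiter (possibly empty).
    With id_len=0 the subfields are unidentified.
    """
    parts = value.split(delimiter)
    indicator = parts[0]
    subfields = [(part[:id_len], part[id_len:]) for part in parts[1:]]
    return indicator, subfields


# Traditional Isis style: caret delimiter, one-character identifier (n=1).
print(split_subfields("^aHydrology^bSocial problems"))

# MARC style: two indicator characters (m=2), then US-delimited subfields.
print(split_subfields("10\x1faTitle\x1fbSubtitle", delimiter="\x1f"))
```

Keeping the indicator as a separate return value mirrors the rule that it is not regarded as a data element.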


Different subfielding methods can be mixed or nested.
Typical cases are:
- mixed fixed/delimited
    After some initial fixed subfields, the following subfields are delimited.
    This can be used to describe MARC's fixed indicators.
- nested delimited/fixed
    A delimited subfield has itself a fixed substructure.
    Actually the leading identifier in a subfield can be regarded
    as the fixed part of a mixed substructure.
- nested unidentified delimited
    A delimited subfield has itself a delimited structure.
    This can be used to model variable length identifiers.

In other words, identifiers are themselves nothing but subfields
used as keys on some level of nesting.
On the other hand, any subfield could serve as a key for its parent.
This is used e.g. to select a field by a subfield indicating a language
(see below for keyed subrecords).
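
As a rough illustration of mixing and nesting, the following Python sketch cuts leading fixed-length subfields off a value and splits a delimited value whose parts are themselves delimited into a variable length identifier and its data. The helper names, the ';' delimiter and the '=' key delimiter are assumptions chosen for the example.

```python
def split_fixed(value, widths):
    """Cut leading fixed-length subfields; the remainder is returned last."""
    parts, pos = [], 0
    for width in widths:
        parts.append(value[pos:pos + width])
        pos += width
    parts.append(value[pos:])
    return parts


def split_keyed(value, delimiter=";", key_delimiter="="):
    """Nested unidentified delimited: every part is split into (key, data)."""
    pairs = []
    for part in value.split(delimiter):
        key, _, data = part.partition(key_delimiter)
        pairs.append((key, data))
    return pairs


# Mixed fixed/delimited: two one-character indicators, then the delimited rest.
print(split_fixed("10^aTitle", [1, 1]))

# Variable length identifiers modelled as a nested delimited structure.
print(split_keyed("lang=en;title=Water"))
```

In `split_keyed` the identifier is just the first subfield of each nested value, which is the sense in which identifiers are nothing but subfields used as keys.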

If you look at the
> Serialized plaintext representation of an Isis record,
the whole record is actually a newline-delimited value,
the whole database is a blank-line (double newline) delimited value,
and each field has its tag as initial tab-delimited subfield.
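
Read that way, a serialized database can be taken apart with nothing but nested splitting. The Python sketch below assumes exactly the layout just described (blank line between records, newline between fields, tab after the tag) and ignores any escaping rules the real serialized format may define; the function name and the sample tags are hypothetical.

```python
def parse_database(text):
    """Split a serialized database into records, each a list of (tag, value)."""
    records = []
    for block in text.split("\n\n"):          # a blank line delimits records
        fields = []
        for line in block.split("\n"):        # a newline delimits fields
            if not line:
                continue
            tag, _, value = line.partition("\t")   # the tag is the first tab-delimited subfield
            fields.append((int(tag), value))
        if fields:
            records.append(fields)
    return records


sample = "24\tHydrological achievements\n70\tSmith\n\n24\tSecond title\n"
print(parse_database(sample))
```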


In the future, OpenIsis will add support for a wide variety of
subfielding techniques, such as those defined by regular expressions
or MIME headers, or those found in typical "character/comma separated values" files
(optionally using quotes).

Since splitting subfields is mostly done, and can always be done, at the
application level (i.e. a database server rarely needs to care),
"support" essentially boils down to the definition of appropriate metadata.


* subrecords

A subrecord typically consists of a contiguous range of fields within a record,
starting with a field that introduces the subrecord.
Some variants, however, like keyed subrecords,
can be freely scattered and don't need a "header" field.


There are basically four ways to denote the boundaries of structures:
- embraced
    where a special field is used to denote the structure's end.
    This resembles SGML-style notations,
    where each opening tag is matched by a closing tag.
    This is relatively easy and recommended for everyday use.
- marked
    where the fields of the child structure are marked as such.
    This is more or less the opposite of the embracing approach.
    Marking comes in several powerful flavours,
    see below for a more detailed discussion.
- counted
    where the number of fields (not children) belonging to the
    structure is given in (any leading digits of) the initial field.
    This allows for safe embedding regardless of the
    structure's contents and is thus used in contexts where
    full generality is needed, like when embedding result records
    within a server's response.
- implicit
    where the number of children is fixed.
    An example of this is the parse tree of a query,
    where the structure "AND" has exactly two children
    (which in turn might be structures).
    This is used mostly for internal structures like parsed
    queries or formats, which are not meant to be exchanged.

The field introducing a subrecord might have any subfields
just like other fields, similar to the attributes that might
be assigned to a tag in SGML applications like HTML.

However, the first subfield (unidentified initial characters)
of a field opening an embraced or counted subrecord is reserved as indicator:
- a plus sign '+' as first character
    indicates explicitly opening a subrecord
- a minus sign '-' as first character
    indicates an empty subrecord (containing no children)
- an empty value
    indicates explicitly closing a subrecord
    (similar to the closing blank line used in several protocols)
- an initial numeric value
    (of decimal digits) gives the number of fields to follow.
- an initial character @A-Z
    gives the number of children to follow (@=0,A=1,B=2...) (rarely used)

Auxiliary information about the child,
such as an embedded record's row number and type,
is stored in subfields of the parent.
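
The indicator rules above are enough to turn a flat list of fields back into a tree. The following Python sketch is one possible reading of them, not OpenIsis code: it assumes the caller supplies the set of tags that denote structures (`struct_tags`), that an embraced subrecord is closed by a field with the same tag and an empty value, and that a structural field with any other leading character is ordinary data. All names are hypothetical.

```python
import re


def parse_nodes(fields, struct_tags, pos=0, end=None, closing_tag=None, max_children=None):
    """Parse fields[pos:end] into nodes; return (nodes, next_position).

    Plain fields become (tag, value); subrecords become (tag, rest, children),
    where `rest` is the header value after its leading indicator.
    """
    if end is None:
        end = len(fields)
    nodes = []
    while pos < end and (max_children is None or len(nodes) < max_children):
        tag, value = fields[pos]
        pos += 1
        if tag not in struct_tags:
            nodes.append((tag, value))              # plain data field
            continue
        if value == "":                             # empty value closes a subrecord
            if tag != closing_tag:
                raise ValueError(f"unexpected closing field with tag {tag}")
            return nodes, pos
        counted = re.match(r"[0-9]+", value)
        first = value[0]
        if first == "+":                            # embraced: read up to the closing field
            children, pos = parse_nodes(fields, struct_tags, pos, end, closing_tag=tag)
            nodes.append((tag, value[1:], children))
        elif first == "-":                          # empty subrecord
            nodes.append((tag, value[1:], []))
        elif counted:                               # counted: the next n fields belong to it
            stop = min(pos + int(counted.group()), end)
            children, _ = parse_nodes(fields, struct_tags, pos, stop)
            nodes.append((tag, value[counted.end():], children))
            pos = stop
        elif "@" <= first <= "Z":                   # @=0, A=1, B=2 ... children follow
            children, pos = parse_nodes(fields, struct_tags, pos, end,
                                        max_children=ord(first) - ord("@"))
            nodes.append((tag, value[1:], children))
        else:
            nodes.append((tag, value))
    return nodes, pos


fields = [(908, "2\tcds.47"), (24, "Title"), (70, "Author"), (1, "after")]
print(parse_nodes(fields, {908})[0])
```

Because the counted variant never inspects the values of the fields it skips over, it stays safe whatever the embedded record contains, which is the property the text attributes to it.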


* conventions

While the intended usage of subrecords might be specified in
more detail in the
> Meta table metadata
, the scheme can also be used standalone (without referring to metadata),
if some conventions on tag ranges are followed.

The extent of subrecords delimited by length or by braces can be safely
determined if you just know that you want the given field
to be regarded as a subrecord.

For subrecords with a fixed number of children (meant for internal use),
it is necessary to recognize whether a following field is itself a structure.
If they are used at all, the tag range -1..-99 should be reserved for this
purpose.

In this context, typically one of two modes is used:
- the MIME processing mode, for processing list-style content,
    assumes that negative tags denote structures,
    while positive tags contain plain data.
- in XML processing mode, everything but the 0 tag (text node) is a structure.
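
The two modes amount to a simple predicate on the tag. A minimal Python sketch, in which the function name and the mode strings "mime" and "xml" are made up for the example:

```python
def is_structure(tag, mode):
    """Decide from the tag alone whether a field introduces a structure."""
    if mode == "mime":
        return tag < 0        # negative tags denote structures
    if mode == "xml":
        return tag != 0       # everything but the text node (tag 0)
    raise ValueError(f"unknown processing mode: {mode}")


print(is_structure(-20, "mime"), is_structure(24, "mime"))
print(is_structure(102, "xml"), is_structure(0, "xml"))
```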

If a parent has a subfield ^0,
it should contain the child's identity as dbname or mfn or dbname.mfn.
If the parent's indicator is delimited by a tab instead of a ^,
the next tab-delimited subfield is interpreted that way (where applicable).


* marked structures

There is a wide variety of techniques for marking fields as "children"
of other fields. Marking techniques work especially well for a single
level of substructuring; for nested structures, some restrictions apply.

We give some commonly used examples:

- quoting
    is done by prefixing every child field value with a special string,
    which is not used as prefix outside the child fields.
    However, at least for a single level of quoting, it poses no
    problem if the child fields themselves start with the same prefix:
    the original value is still retrieved by stripping the (first) prefix.
    This even works for multiple levels, as long as the record was properly
    constructed, i.e. the quoting prefix is not used outside children.
    Examples are the output of the diff command (which drives the
    RCS/CVS revision control systems very reliably) and the '>' quoting
    used in e-mail replies.
- tagging
    Instead of the field value, the field tag can of course also be used
    as child mark. In some situations it might be possible to choose
    appropriate reserved tags for the children.
    In other situations, where some given child tag must be kept,
    it can be stored as prefix in the field value according to the canonical
    > Serialized
    plain text format.
- keying
    If the mark used is dependent on an attribute of the parent field,
    the children can be determined even if non-contiguous.
    With some more cooperation of the children, the mark might be an
    attribute (subfield) instead of a prefix (indicator).
    That way, children and parents are linked together logically rather
    than "physically", by a common key, just like in relational databases.
    This easily extends to multiple levels using segmented keys
    (consisting of several attributes/subfields).
    While this scheme only works with well behaved children and may waste
    some space by replicating keys, it is simple and robust and gives
    convenient access to the children without inspecting the structure.
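
Quoting and keying can both be sketched in a few lines of Python. The prefix "> ", the caret-delimited subfields with one-character identifiers, and the key subfield ^k are assumptions made for the example; none of the names below come from OpenIsis.

```python
def quote(values, prefix="> "):
    """Mark every value as a child by prefixing it."""
    return [prefix + value for value in values]


def unquote(values, prefix="> "):
    """Return (children, others); one prefix level is stripped from the children."""
    children = [v[len(prefix):] for v in values if v.startswith(prefix)]
    others = [v for v in values if not v.startswith(prefix)]
    return children, others


def select_by_key(fields, key_id, key):
    """Return the fields carrying subfield `key_id` with value `key`, wherever they are."""
    selected = []
    for tag, value in fields:
        subfields = value.split("^")[1:]
        if any(s[:1] == key_id and s[1:] == key for s in subfields):
            selected.append((tag, value))
    return selected


# A child value that itself starts with the prefix survives the round trip.
print(unquote(quote(["> already quoted", "plain"]) + ["parent text"]))

fields = [(10, "^k1^aFirst author"), (20, "^k2^aOther author"), (11, "^k1^bAffiliation")]
print(select_by_key(fields, "k", "1"))
```

The keyed selection never looks at field order, which is why keyed children may be scattered through the record.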


*design children vs. attributes

Any information that can be represented using an attribute
can also be represented using a child.
From that point of view, attributes are a redundant "language" construct
and one might deem a model using only children as the simpler one.
We call such an attributeless model the "canonical verbose" representation.
It's a little bit similar to the "everything is an object"
approach of pure OO languages like Smalltalk.


But then, having a richer language isn't always such a bad thing,
if you know how to use it appropriately.
(This "if" is the core of almost any serious criticism of rich languages,
but for now, let's assume we know what we're doing).
Appropriate use basically boils down to choosing the language construct
that was made just for your situation, i.e. not the most general one,
but, quite the opposite, the most specific (restricted) one.
That way you will not only have the most efficient representation,
but also express additional information about what's going on.


In short, a "canonical compact" modelling can be based upon the principle
"Use attributes wherever possible".

Some logical property of a logical structure can be represented by
means of attributes, if
- it is simple,
    i.e. one single string value.
- or at least flat,
    i.e. itself a structure that can be represented based on attributes
    that do not interfere with the parent's attributes.
    In the latter case, the property will show up as several
    logically interrelated attributes of the parent.
    However, such a flat group of attributes might be a candidate
    for a child under some circumstances.
- it is not repeatable.
    Although OpenIsis supports repeated subfields as used by some MARCs,
    XML/SGML attributes cannot be repeated:
    XML's well-formedness rules forbid the same attribute name appearing twice
    in one start tag, and neither parsers nor the DOM give access to repeated attributes.
    Moreover, traditional CDS/ISIS implementations do not support
    repeated subfields, so it's probably a good idea to not use them
    without a pretty good reason.

Basically, thinking in C terms, a field's attributes take everything
that would go into a simple struct without arrays or pointers.
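
The "simple, flat, not repeatable" test can be sketched in Python as follows. The sketch models properties as plain Python values (a string is simple, a dict of strings is flat, anything else becomes a child); the dotted naming of flattened attributes is a hypothetical convention used only to keep them from interfering with the parent's own attributes.

```python
def fits_attribute(prop):
    """True if a property is simple (one string) or flat (a dict of strings)."""
    if isinstance(prop, str):
        return True
    if isinstance(prop, dict):
        return all(isinstance(value, str) for value in prop.values())
    return False          # lists (repeatable) and nested structures do not fit


def model_compact(properties):
    """Split properties into (attributes, children): attributes wherever possible."""
    attributes, children = {}, {}
    for name, prop in properties.items():
        if isinstance(prop, str):
            attributes[name] = prop
        elif fits_attribute(prop):
            for key, value in prop.items():      # several interrelated attributes
                attributes[f"{name}.{key}"] = value
        else:
            children[name] = prop
    return attributes, children


cell = {"width": "160", "margin": {"top": "0", "left": "0"}, "rows": ["r1", "r2"]}
print(model_compact(cell))
```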


The detailed modelling should also take into account the intended usage.

For example, one might turn some attribute candidates into children, if
- they are likely to be accessed or modified together
    but independently of other properties
- they are candidates to be inherited or overridden as a group in a
    > PatchWork
- the parent would otherwise become very large


*variants variant structures

The C language construct of a "union" is frequently used in bibliographic
databases. The typical form resembles the PASCAL "variant record",
using an initial field as indicator for the usage of the given field.
Sometimes, however, the more liberal C practice is used,
where the intended interpretation is specified somewhere in the record, somehow.

A similar construct is used in ALGOL-derived OO languages like C++ or Java,
where the indicator (what kind of object is this?) is out-of-band data
(i.e. cannot be modified or inspected like any other data).


In Isis records, fields always have a tag
(and subfields commonly have an identifier) indicating the kind of data.
Therefore, there is little need to introduce another level of switches.
A canonically decomposed model
- would not reuse fields or subfields with different structure
- would not contain rules like
    "if subfield a has value b then subfield c must be present"

On the other hand, full decomposition might be tedious and
even hide relationships. Moreover, from a given point of view,
tags and identifiers are just ordinary subfields on some level.


In general, if the same tag is used for variants of a field,
the risk of misinterpreting data should be minimized by
not reusing the same subfields with different structure.
After all, defining another indicator and ignoring an unexpected subfield,
or complaining about the lack of an expected one, is cheaper, more robust and clearer
than verifying an expected structure based on other subfield values.


* examples

A typical HTML table definition starting with
$
<table width="100%" cellpadding="0" cellspacing="0"
marginwidth="0" marginheight="0" topmargin="0" leftmargin="0" border="0">
<tr>
<td valign="top" width="160">
this is the textbody <br/> of the td node
</td>
</tr>
...
$
will be compacted to, say,
$
100 +^w100%^p0^s0^m0^h0^t0^l0^b0
101 +
102 +^vtop^w160
0 this is the textbody
103 -
0 of the td node
102
101
...
$
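
The compaction shown above can be approximated in Python with the standard library's `xml.etree.ElementTree`, provided the input is well-formed XML. This is only a sketch of the idea, not the actual XML-ISIS transformation: the tag numbers and one-letter attribute identifiers in `TAGS` and `IDS` are taken from, or made up to match, the example and are otherwise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings, chosen to reproduce the example above.
TAGS = {"table": 100, "tr": 101, "td": 102, "br": 103}
IDS = {"width": "w", "valign": "v", "cellpadding": "p", "cellspacing": "s", "border": "b"}


def compact(node, out):
    """Append the embraced field representation of `node` to the list `out`."""
    tag = TAGS[node.tag]
    attrs = "".join("^" + IDS[name] + value for name, value in node.attrib.items())
    text = (node.text or "").strip()
    if not text and len(node) == 0:
        out.append((tag, "-" + attrs))        # empty subrecord, no closing field
        return out
    out.append((tag, "+" + attrs))            # opening field carries the attributes
    if text:
        out.append((0, text))                 # text node
    for child in node:
        compact(child, out)
        tail = (child.tail or "").strip()
        if tail:
            out.append((0, tail))             # text following the child
    out.append((tag, ""))                     # empty value closes the subrecord
    return out


html = ('<table width="100%" border="0"><tr><td valign="top" width="160">'
        'this is the textbody <br/> of the td node</td></tr></table>')
for tag, value in compact(ET.fromstring(html), []):
    print(tag, value)
```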

For a detailed description of the transformation, see
> xmlisis the XML-ISIS documentation

A six-field result record might be embedded within a response like
$
908 6 cds.47
24 Hydrological achievements and social problems
...
$
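
Building and reading such a counted header is straightforward. The Python sketch below assumes the count and the child's identity are separated by a tab, following the tab-delimited indicator convention described earlier; the helper names are hypothetical and the tag 908 and identity cds.47 are simply reused from the example.

```python
def embed(tag, identity, record):
    """Embed `record`, a list of (tag, value) fields, as a counted subrecord."""
    header = (tag, f"{len(record)}\t{identity}")
    return [header] + list(record)


def extract(fields, pos):
    """Read the counted subrecord whose header is fields[pos].

    Returns (identity, child_fields, position_after_the_subrecord).
    """
    _, value = fields[pos]
    count_text, _, identity = value.partition("\t")
    count = int(count_text)
    return identity, fields[pos + 1:pos + 1 + count], pos + 1 + count


record = [(24, "Hydrological achievements and social problems"), (70, "Some author")]
response = embed(908, "cds.47", record) + [(1, "next response field")]
print(extract(response, 0))
```

Since the reader only counts fields, the embedded record may contain anything, including fields that look like headers or closing fields themselves.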

Assuming we gave tag -20 to "OR" (and 0 to a literal),
the query "plant OR water" might be parsed to
$
-20 B
0 plant
0 water
$

If -21 is assigned to "AND", "frog AND (plant OR water)" might look like
$
-21 B
0 frog
-20 B
0 plant
0 water
$

For implicit tags, the number of children is redundant
(fixed per tag in a given use) and will typically be omitted.
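
With the child count fixed per tag, the tree can be rebuilt from the tags alone. A minimal Python sketch, assuming the tag assignments from the examples above (-20 for OR, -21 for AND, both with two children) and a hypothetical `ARITY` table:

```python
ARITY = {-20: 2, -21: 2}      # OR and AND each take exactly two children


def parse_implicit(fields, pos=0):
    """Parse one node starting at fields[pos]; return (node, next_position)."""
    tag, value = fields[pos]
    pos += 1
    if tag not in ARITY:
        return (tag, value), pos              # literal
    children = []
    for _ in range(ARITY[tag]):
        child, pos = parse_implicit(fields, pos)
        children.append(child)
    return (tag, children), pos


# "frog AND (plant OR water)" with the child counts omitted.
query = [(-21, ""), (0, "frog"), (-20, ""), (0, "plant"), (0, "water")]
print(parse_implicit(query))
```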

---
$Id: Struct.txt,v 1.8 2003/06/23 14:44:29 kripke Exp $