<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
<title>ISIS charsets and unicode</title>
</head>
<body>

<h2> some notes on the use of charsets with ISIS </h2>

<h3>what are charsets?</h3>
Since computers can store nothing but numbers, but we want them to store
characters, there has to be a table telling which character is stored as which
number, or, vice versa, which number is to be displayed and printed as which character.
Such tables are called <b>charsets</b>.<br>
Since the smallest addressable unit of storage is usually a byte, which can hold 256
different numbers from 0 to 255, many charsets are based on one
byte and thus can hold up to 256 characters. Such charsets are called
<b>one-byte-charsets</b>.<br>
For many scripts, like the various versions of Latin, Greek, Cyrillic, Hebrew
and Arabic, 256 characters are more than enough.<br>
For others, namely the Chinese, Japanese and Korean
(<a href="http://czyborra.com/charsets/cjk.html">CJK</a>)
scripts with several thousand characters, it's not enough. The modern
<a href="http://czyborra.com/charsets/vietnamese.html">Vietnamese</a>
script is based on Latin letters but needs a large number of accented letters,
so 256 isn't enough. Those scripts don't get by with one byte per character,
so they need <b>multi-byte-charsets</b>, where two or more bytes are needed
to encode one character.<br>

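As a rough illustration of the difference between one-byte and multi-byte charsets,
the following Python sketch counts the bytes needed to store single characters.
The sample characters, the choice of Big5 and Shift_JIS as multi-byte examples and
the helper name byte_lengths are assumptions made for this example only; nothing
here is specific to ISIS.<br>

```python
# Sample characters paired with a charset that contains them.
# Escapes are used so the source itself stays plain ASCII.
samples = [
    ("a", "iso-8859-1"),          # Latin letter a
    ("\u00e4", "iso-8859-1"),     # German a-umlaut, still one byte
    ("\u03b4", "iso-8859-7"),     # Greek small delta in the Greek one-byte charset
    ("\u4e2d", "big5"),           # a Chinese character in a multi-byte charset
    ("\u65e5", "shift_jis"),      # a Japanese character in a multi-byte charset
]

def byte_lengths(pairs):
    """Return (character, charset, number_of_bytes) for each pair."""
    return [(ch, cs, len(ch.encode(cs))) for ch, cs in pairs]

for ch, cs, n in byte_lengths(samples):
    print(f"U+{ord(ch):04X} in {cs}: {n} byte(s)")

# A one-byte charset simply has no slot for the Chinese character.
try:
    "\u4e2d".encode("iso-8859-1")
except UnicodeEncodeError:
    print("U+4E2D cannot be stored in iso-8859-1")
```

The first three samples take one byte each, the last two take two bytes each.<br>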
<h3>what is UNICODE?</h3>
<a href="http://czyborra.com/unicode/standard.html">UNICODE</a>
is a big multi-byte-charset designed to include all
<a href="http://czyborra.com/unicode/characters.html">characters</a>
needed in the world (over 40,000 at the time of writing), even for some ancient languages.
The problems with having several charsets are: a) you have to know which charset
is used in a given text, b) computer systems need to be aware of all possible
charsets and c) it's not possible to have a text or database contain characters
which are encoded in different charsets. Having all text in unicode solves
those problems. Check out <a href="http://www.unicode.org/iuc/iuc10/x-utf8.html">this sample page</a>:
with a 21st century browser like Mozilla 5 (Netscape 6) you will see most
or all of the letters.<br>

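Problem a) can be made concrete with a small Python sketch: the same stored byte
means different characters depending on which charset the reader assumes, while in
unicode each character has its own number and both fit into one text. The byte
value 228 and the helper name decode_as are choices made for this illustration.<br>

```python
raw = bytes([0xE4])  # one stored byte with the value 228

def decode_as(data, charset):
    """Interpret the stored bytes according to the given charset."""
    return data.decode(charset)

latin = decode_as(raw, "iso-8859-1")  # read as Western European text
greek = decode_as(raw, "iso-8859-7")  # read as Greek text

# Same byte, two different characters with different unicode numbers.
print(hex(ord(latin)))  # 0xe4  (a-umlaut)
print(hex(ord(greek)))  # 0x3b4 (Greek small delta)

# In unicode both characters can live side by side in one text.
mixed = latin + greek
print(mixed.encode("utf-8"))  # b'\xc3\xa4\xce\xb4'
```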
<h3>ASCII-compatible charsets and encodings</h3>
Many charsets use the numbers 0 to 127 in the same way: to represent the
basic set of Latin characters defined by
<a href="http://czyborra.com/charsets/iso646.html">ASCII</a>.
Whenever there's a byte with a number in that range, this byte has the
meaning of the corresponding ASCII character. For example, the number 43 is always
a plus sign +, which is important if a query expression is scanned for
such characters.<br>
All <a href="http://czyborra.com/charsets/iso8859.html">ISO-8859-x</a>
charsets are ASCII-compatible. Some older
<a href="http://czyborra.com/charsets/cyrillic.html">Cyrillic</a>
charsets are NOT compatible with ASCII. Some of the eastern multi-byte-charsets
are, some are not.<br>
Some of the multi-byte-charsets have different <b>encodings</b>, that is,
there is only one table mapping numbers to letters, but there are distinct ways to use
multiple bytes to express such a number. Some of these use the numbers in
the ASCII range only for ASCII characters, others don't. UNICODE has two widely
used encodings, <a href="http://czyborra.com/utf/#UTF-8">UTF-8</a>
and UTF-16 (an extension of the older UCS-2). <b>UTF-8 is ASCII-compatible</b>, UTF-16 is not.<br>

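A minimal Python sketch of what ASCII compatibility means for a scanner looking
for the plus sign: it counts bytes with the value 43 in encoded text. The helper
name plus_bytes and the sample character (Arabic letter theh, U+062B) are
assumptions picked for the example.<br>

```python
PLUS = 43  # ASCII number of the plus sign

def plus_bytes(text, encoding):
    """Count the bytes with value 43 in the encoded form of text."""
    return text.encode(encoding).count(PLUS)

text = "\u062b"  # Arabic letter theh; the text contains no plus sign

# UTF-8 uses only bytes of 128 and above for non-ASCII characters,
# so a byte-wise scanner finds no false plus sign.
print(plus_bytes(text, "utf-8"))      # 0

# In UTF-16 the character is stored as the bytes 0x2B 0x06,
# and 0x2B is 43: a byte-wise scanner sees a plus sign that isn't there.
print(plus_bytes(text, "utf-16-le"))  # 1

# Plain ASCII text is unchanged in UTF-8 but gets zero bytes in UTF-16.
print("a+b".encode("utf-8"))          # b'a+b'
print("a+b".encode("utf-16-le"))      # b'a\x00+\x00b\x00'
```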
<h3>so what about ISIS?</h3>

<ul>
<li>the ISIS database format itself is capable of storing arbitrary bytes and
thus can store text in <b>any</b> charset/encoding.<br>
Tools like BIREME's mx can store and retrieve (by MFN) text in nearly any
encoding (but depending on how the programming is done, UTF-16 may not work
because it may use bytes with value 0).<br>
</li>
<li>the ISIS query and formatting language depends on certain ASCII characters
having special meaning and therefore requires an <b>ASCII-compatible
encoding</b>. All the ISO-8859-x charsets will do, as will <b>UTF-8 encoded
unicode</b> (although care must be taken that the multiple bytes representing
one character are not cut off in the middle). At least in theory, <b>mx</b> and
<b>wwwisis</b> are able to search for records in any ASCII-compatible
encoding including UTF-8 unicode (given careful web programming).</li>
<li><b>winisis</b> doesn't know about the possibility of one character
having multiple bytes. It will work with any <b>ASCII-compatible</b> <b>one-byte-charset</b>,
as long as it doesn't have to know what the bytes mean. That is, if your computer
has some preferred charset installed, you will see all characters displayed
according to that charset, and a character entered as the German
ä could show up as a Greek delta :). There is no support for multi-byte-charsets,
especially <b>not unicode</b>.</li>
<li>Like any Java software, <b><a href="http://web.tiscali.it/javaisis/">JavaISIS</a></b>
is - in theory - able to handle unicode characters and even to do
the transformation between <b>unicode and most of the other</b> charsets.
Some limitations may result from the underlying wwwisis. In practice, version
3.5 claims to give "Multi-language encoding support", but unfortunately it has
been in beta since March 2001 (sources made available in Feb 2002).</li>
<li><b>openisis</b> supports <b>any charset</b> and, with its Java binding,
<b>especially unicode</b> and all the conversions. openisis alone can
do it on the web, and in combination with JavaISIS (once new sources are
available) also with a winisis-like interface.<br>
</li>

</ul>
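The warning about cutting a UTF-8 character in the middle can be sketched in
Python as follows. This is only an illustration of the idea, not code taken from mx,
wwwisis or any other ISIS tool; the function name truncate_utf8 and the sample
record are hypothetical, and the input is assumed to be valid UTF-8.<br>

```python
def truncate_utf8(data, limit):
    """Cut data to at most limit bytes without splitting a character.

    Assumes data is valid UTF-8 and limit is not negative.
    """
    if limit >= len(data):
        return data
    cut = limit
    # Continuation bytes look like 10xxxxxx. If the first byte we would
    # drop is one of them, the cut is inside a character: step back to
    # the byte that starts that character and drop it completely.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]

record = "K\u00e4se".encode("utf-8")  # b'K\xc3\xa4se', 5 bytes

print(record[:2])                 # b'K\xc3'     naive cut, broken character
print(truncate_utf8(record, 2))   # b'K'         whole characters only
print(truncate_utf8(record, 3))   # b'K\xc3\xa4'
```

The naive slice leaves a lone lead byte that no longer decodes as UTF-8, while the
sketch gives up one byte of length to keep the text valid.<br>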
<br>

<h2> some other resources on unicode </h2>

To see all those characters, you need fonts to tell your display
or printer what they look like.
Here's a
<a href="http://www.hclrss.demon.co.uk/unicode/fonts.html"> very fine page </a>
on how to acquire and install those fonts (and some more advice).
James Kass has a
<a href="http://home.att.net/~jameskass/scriptlinks.htm">long list</a>
of high quality links related to Unicode.


If you for some reason have to waste your time with M$ products,
you may want to check out
<a href="http://www.microsoft.com/typography/fonts/"> this page </a>.
In particular, there's the one-size(23 MB)-fits-all fat font
<a href="http://office.microsoft.com/downloads/2000/aruniupd.aspx">
Arial Unicode MS </a> (TM, (c), ... expect the worst)
containing nearly all unicode glyphs, which is also included
with newer Windoze and/or Ophice versions.

</body>
</html>