<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
       
  <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
  <title>ISIS charsets and unicode</title>
</head>
  <body>
 
<h2> some notes on the use of charsets with ISIS  </h2>
 
<h3>what are charsets?</h3>
 Since computers can store nothing but numbers, but we want them to store 
characters, there has to be a table telling which character is stored as which 
number, or, vice versa, which number is to be displayed and printed as which character.
 Such tables are called <b>charsets</b>.<br>
 Since the smallest unit of number storage is a byte, which can hold 256
different numbers from 0 to 255, many charsets are based on one
byte and thus can hold up to 256 characters. Such charsets are called
<b>one-byte-charsets</b>.<br>
 For many scripts, like the various versions of Latin, Greek, Cyrillic, Hebrew 
and Arabic, 256 characters are more than enough.<br>
 For others, namely the Chinese, Japanese and Korean
(<a href="http://czyborra.com/charsets/cjk.html">CJK</a>)
scripts with several thousand characters each, 256 is not enough. The modern
<a href="http://czyborra.com/charsets/vietnamese.html">Vietnamese</a>
 script is based on Latin letters but needs so many accented letters
that 256 isn't enough either. Such scripts don't get by with one byte per character,
 so they need <b>multi-byte-charsets</b>, where two or more bytes are needed 
to encode one character.<br>
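 As a rough illustration of the difference between one-byte and multi-byte charsets,
the following Python sketch counts the bytes needed for a single character. The sample
characters and the choice of Big5 and UTF-8 as multi-byte examples are assumptions made
for this illustration only; they are not taken from ISIS.<br>

```python
# Sample characters are written as escapes so this source stays plain ASCII.
samples = [
    ("A", "ascii"),            # basic Latin letter
    ("\u00e4", "iso-8859-1"),  # a-umlaut in a one-byte charset
    ("\u03b4", "iso-8859-7"),  # Greek small delta in a one-byte charset
    ("\u4e2d", "big5"),        # a Chinese character in a multi-byte charset
    ("\u1ec7", "utf-8"),       # Vietnamese e with circumflex and dot below
]


def byte_length(char, encoding):
    """Return how many bytes the given encoding needs for one character."""
    return len(char.encode(encoding))


for char, encoding in samples:
    # ascii() keeps the output printable on any terminal
    print(ascii(char), encoding, byte_length(char, encoding))
```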
 
<h3>what is UNICODE?</h3>
 <a href="http://czyborra.com/unicode/standard.html">UNICODE</a>
  is a big multi-byte-charset designed to include all
<a href="http://czyborra.com/unicode/characters.html">characters</a>
  needed in the world (over 40,000 by now), even for some ancient languages. 
The problems with having several charsets are a) you have to know which charset 
is used in a given text, b) computer systems need to be aware of all possible 
charsets and c) a single text or database cannot contain characters 
that are encoded in different charsets. Having all text in unicode solves 
those problems. Check out
<a href="http://www.unicode.org/iuc/iuc10/x-utf8.html">this sample page</a>:
with a 21st century browser like Mozilla 5 (Netscape 6) you will see most
or all of the letters.<br>
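 Problem c) can be made concrete with a small Python sketch. The mixed-script sample
string and the helper name <b>can_encode</b> are made up for this illustration. The
text mixes German, Greek and Chinese characters, so no single one-byte-charset can
hold it, while a unicode encoding can.<br>

```python
# German "Gruesse" with u-umlaut and sharp s, Greek "delta", Chinese "zhongwen",
# written as escapes so this source stays plain ASCII.
mixed = "Gr\u00fc\u00dfe, \u03b4\u03ad\u03bb\u03c4\u03b1, \u4e2d\u6587"


def can_encode(text, encoding):
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for encoding in ("iso-8859-1", "iso-8859-7", "utf-8"):
    print(encoding, can_encode(mixed, encoding))
```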
 
<h3>ASCII-compatible charsets and encodings</h3>
 Many charsets use the numbers 0 to 127 in the same way: to represent the 
basic set of Latin characters defined by
<a href="http://czyborra.com/charsets/iso646.html">ASCII</a>.
Whenever there's a byte with a number in that range, this byte has the 
meaning of the corresponding ASCII character. For example, the number 43 is
always a plus sign +, which is important if a query expression is scanned for
such characters.<br>
 All <a href="http://czyborra.com/charsets/iso8859.html">ISO-8859-x</a>
  charsets are ASCII-compatible. Older <a href="http://czyborra.com/charsets/cyrillic.html">
 Cyrillic</a>
  charsets are NOT compatible with ASCII. Some of the eastern multi-byte-charsets 
are, some are not.<br>
Some of the multi-byte-charsets have different <b>encodings</b>. That is, 
there is only one table mapping numbers to letters, but there are distinct ways to use
multiple bytes to express such a number. Some encodings use the numbers in
the ASCII range only for ASCII characters, others don't. UNICODE has two widely
used encodings, <a href="http://czyborra.com/utf/#UTF-8">UTF-8</a>
  and UTF-16 (an extension of the older UCS-2). <b>UTF-8 is ASCII-compatible</b>, UTF-16 is not.<br>
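 What ASCII-compatibility means for a scanner looking for the plus sign can be sketched
in Python. The helper <b>plus_positions</b> and the sample texts are hypothetical; the
sketch only assumes a naive scanner that looks for the byte value 43. In UTF-8 that byte
occurs only where there really is a plus sign, while in UTF-16 it can be one half of an
entirely different character, and bytes with value 0 appear as well.<br>

```python
def plus_positions(data):
    """Return the byte offsets that hold the value 43 (the ASCII plus sign)."""
    return [i for i, b in enumerate(data) if b == 43]


# Hypothetical query text: two words with e-acute joined by a plus sign.
query = "caf\u00e9 + th\u00e9"
# Cyrillic capital letter yeru, U+042B: its UTF-16 form contains the byte 0x2B (43).
yeru = "\u042b"

print(plus_positions(query.encode("iso-8859-1")))  # one real plus sign
print(plus_positions(query.encode("utf-8")))       # still exactly one
print(plus_positions(yeru.encode("utf-8")))        # no false hit in UTF-8
print(plus_positions(yeru.encode("utf-16-be")))    # false hit in UTF-16
print(0 in query.encode("utf-16-be"))              # UTF-16 uses bytes with value 0
```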
 
<h3>so what about ISIS?</h3>
 
<ul>
   <li>The ISIS database format itself is capable of storing anything and 
thus can store text in <b>any</b> charset/encoding.<br>
 Tools like BIREME's mx can store and retrieve (by MFN) text in nearly any 
encoding (but depending on how the programming is done, UTF-16 may not work 
because it may use bytes with value 0).<br>
   </li>
   <li>The ISIS query and formatting language depends on certain ASCII characters 
 having special meaning and therefore requires an <b>ASCII-compatible
encoding</b>. All the ISO-8859-x charsets will do, as will <b>UTF-8 encoded
unicode</b> (although some care must be taken that the multiple bytes representing
one character are not cut off in the middle). At least in theory, <b>mx</b> and
    <b>wwwisis</b> are able to search for records in any&nbsp;ASCII-compatible
encoding including UTF-8 unicode (given careful web programming).</li>
   <li><b>winisis</b> doesn't know about the possibility of one character
having multiple bytes. It will work with any <b>ASCII-compatible</b>
<b>one-byte-charset</b>, as long as it doesn't have to know what the bytes mean. That is, if your computer
has some preferred charset installed, you will see all characters displayed
according to that charset, and a character possibly entered as the German
&auml; could show up as Greek delta :). There is no support for multi-byte-charsets,
and especially <b>not for unicode</b>.</li>
   <li>Like any Java software, <b><a href="http://web.tiscali.it/javaisis/">JavaISIS</a></b>
is, in theory, able to handle unicode characters and even to do
the transformation between <b>unicode and most of the other</b> charsets.
 Some limitations may result from the underlying wwwisis. In practice, version
3.5 claims to give "Multi-language encoding support", but unfortunately it has
been in beta since March 2001 (sources made available in Feb 2002).</li>
   <li><b>openisis</b> supports <b>any charset</b> and, with its Java binding,
    <b>especially unicode</b> and all the conversions. openisis alone can
do it on the web, and in combination with JavaISIS (once new sources are
available) also with a winisis-like interface.<br>
   </li>
 
</ul>
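 The care needed when cutting UTF-8 text, as mentioned in the list above, can be sketched
in Python as follows. The function <b>truncate_utf8</b> is a hypothetical helper, not part
of mx, wwwisis or openisis. It assumes valid UTF-8 input and a non-negative byte limit, and
relies on the fact that every continuation byte in UTF-8 has the bit pattern 10xxxxxx.<br>

```python
def truncate_utf8(data, max_bytes):
    """Cut UTF-8 bytes to at most max_bytes without splitting a character.

    Assumes data is valid UTF-8 and max_bytes is not negative.
    """
    if max_bytes >= len(data):
        return data
    end = max_bytes
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut would fall inside a character,
    # so step back to the start byte of that character.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


# "naive" with i-diaeresis, which takes two bytes in UTF-8
word = "na\u00efve".encode("utf-8")
print(truncate_utf8(word, 3))  # cut would split the two-byte character
print(truncate_utf8(word, 4))  # cut falls on a character boundary
```

 Cutting the raw bytes at position 3 instead would leave a lone start byte at the end,
which is no longer valid UTF-8.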
 <br>
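 The effect of displaying a byte under the wrong one-byte-charset, and the kind of conversion
through unicode that Java-based tools can perform, can both be sketched in Python. The helper
names <b>reinterpret</b> and <b>convert</b> are made up for this illustration, and the sketch
assumes the text was stored as ISO-8859-1 and displayed on a system set up for ISO-8859-7 (Greek).<br>

```python
def reinterpret(text, stored_as, displayed_as):
    """Show what text stored in one charset looks like when read as another."""
    return text.encode(stored_as).decode(displayed_as)


def convert(data, source, target):
    """Convert bytes between charsets by going through unicode."""
    return data.decode(source).encode(target)


# German a-umlaut is byte 0xE4 in ISO-8859-1; the same byte is the
# Greek small delta in ISO-8859-7.
shown = reinterpret("\u00e4", "iso-8859-1", "iso-8859-7")
print(ascii(shown))

# A proper conversion keeps the character and changes the bytes instead.
print(convert(b"\xe4", "iso-8859-1", "utf-8"))
```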
 
<h2> some other resources on unicode </h2>

To see all those characters, you need fonts to tell your display
or printer what they look like.
Here's a
<a href="http://www.hclrss.demon.co.uk/unicode/fonts.html"> very fine page </a>
on how to acquire and install those fonts (and some more advice).
James Kass has a
<a href="http://home.att.net/~jameskass/scriptlinks.htm">long list</a>
of high-quality links related to Unicode.


If for some reason you have to waste your time with M$ products,
you may want to check out
<a href="http://www.microsoft.com/typography/fonts/"> this page </a>.
In particular, there's the one-size(23 MB)-fits-all fat font
<a href="http://office.microsoft.com/downloads/2000/aruniupd.aspx">
Arial Unicode MS </a> (TM, (c), ... expect the worst)
containing nearly all unicode glyphs, which is also included
with newer Windoze and/or Ophice versions.

</body>
</html>