rserv/doc/ErServer_internals.txt

This article discuss ErServer 1.2, new version which uses Java replication
engine. URL: http://www.varlena.com/GeneralBits/Tidbits/erserver_works.html


                      Quick Outline of ErServer Internals

   by Andrew Sullivan
   23-Oct-2003

Abstract

   This is a quick conceptual outline of what erserver does once you have
   it working, if you ever get it there. That is, this is not an outline
   of how the setup scripts, etc., work (or don't), but how a tuple on
   the master ends up on the slave(s).

I. Required support

  A. On the master

   The master must have the following tables set up:
       _rserv_log_1_
       _rserv_log_2_
       _rserv_sync_
       _rserv_tables_

   Any other tables are not strictly required. There is a table that
   normally gets set up by the system, _rserv_servers_, but it is never
   actually used. It was apparently a stub for some planned features.
   There is a server number in _rserv_sync_, and you could therefore in
   principle use _rserv_servers_ for convenient lookups, but it won't get
   used by anything.

   The master must also have a couple of installed sequences:
       _rserv_active_log_id_
       _rserv_old_log_status_
       _rserv_sync_seq_

   Again, there is a sequence which exists for _rserv_servers_. It's not
   actually used.

   The function for the replication log trigger must be created. It's
   generally called _rserv_log_, but the important detail is that it is
   what is called by the trigger created on each replicated table, and
   that it in turn is a C function which is compiled from src/erserver.c;
   it should compile to a function called erserver.so, which will be in
   the directory where you installed erserver. (In case you're wondering,
   by the way, yes this breaks the nice $libdir trick in 7.2 and later.)

  B. On the slave

   On the slave, you need exact copies of the tables you are going to
   replicate, plus you need _rserv_slave_tables_ and _rserv_slave_sync_.

  C. General

   All this is set up as a database super user. In fact, you really only
   need to be super user to install the function, because it's a C
   function. In practice, it's easier to set up the relevant bits as
   super user, and then make some limited GRANTs to the relevant tables.

II. Queueing things to be replicated

   A table which is to be replicated must meet several conditions on the
   master.

  A. It must be listed in the table _rserv_tables_:

   Column  Type   Modifiers
   tname  name   
   cname  name   
   reloid oid    
   key    integer

   The tname is the table name. The cname is the unique, not null key of
   the table for replication purposes (the setup script creates
   _ers_uniq, but there's nothing requiring that; if you had
   single-column primary keys on all your tables, you could use them).
   The reloid is the oid in pg_class for that table (i.e. select a.oid
   from pg_class a, _rserv_tables_ b where a.relname = b.tname and
   b.tname = [target table]), and the key is the attnum from pg_attribute
   for that field (i.e. select attnum from pg_attribute a, _rserv_tables_
   b where a.attname = b.cname and a.attrelid = b.reloid and b.tname =
   [target table]; note that you'd have to have reloid set correctly for
   this to work).

  B. Trigger Installed

   On the master, it must have the replication trigger installed. It is
   normally called _rserv_trigger_t_, but the name is not important. What
   _is_ important is that the trigger calls the _rserv_log_ C function
   (the actual relevant bit is that the function which is called is the
   function, distributed with the erserver source, which compiles to
   rserv.so).

  C. Permissions

   Users writing to a replicated table must have permissions to read and
   write on _rserv_log_1_ and _rserv_log_2_, and must be able to select
   from _rserv_active_log_id_.

  D. Trigger action

   Suppose we are replicating target_table, then. We have the relevant
   entries and the trigger is set up. When you insert a record into
   target_table, you will cause the trigger to fire. It will insert into
   the current log table (see below about cleanlog, but if this is the
   first time, it should be _rserv_log_1_) a record with the reloid (see
   _rserv_tables_), the logid (which is the xid from the transaction),
   the timestamp, whether the record is deleted (-1 is an insert, 0 is an
   update, 1 a delete), and the value of the key (that is, the actual
   value in the key field for that row).

III. Replicating

  A. Things on the slave

   In order to replicate to a slave, you must have set up
   _rserv_slave_tables_ on the slave analogously to how it is set up on
   the master. Note that the oids are unlikely to be the same, so you
   can't just copy the values from the master. Also, your slave tables
   must have _exactly_ the same structure as they do on the master. In
   particular, the key field must be available on the slave. Some
   versions of the setup scripts apparently have done things in the wrong
   order, and the _ers_uniq field is not correctly set up on the slave.
   You'll be wanting to fix that. If you don't have the slave completely
   set up before the sync starts, you will lose data.

  B. Configuration file.

   If people ask for an annotation of the config file, I'll provide one.
   I think the comments are not too bad in there, though. It should be
   relatively obvious if you have a misconfiguration there. (Send mail to
   the [1]erserver-general list if not.)

  C. Run the replication engine

   You _must_ use the Sun JDK to run this software: apparently there are
   bits of it which depend on special Sun bits. Java is only actually
   portable if it's written that way, and this isn't. Patches are
   welcome. The logs normally end up /path/to/erserver/logs.

  D. Worth noting

   If you look at the query generated by the replication engine, you will
   note that it joins _rserv_log_1_ (or _2_) to the replicated tables,
   and replicates the version of the row in the table. That means that
   _old versions of the row are not replicated_. This is efficient, but
   can surprise you if you're not expecting it.

  E. What the engine does

   The engine looks at _rserv_sync_ for the master and the current
   thread's target slave, and goes through the current log (either
   _rserv_log_1_ or _2_, depending on the value in _rserv_active_log_id_
   and the state of _rserv_old_log_status_) and gets everything that is
   affected _after_ the most recent successful sync for the target
   server, and which is not currently active. It joins that result to the
   tables in question (using LEFT JOIN so that it gets null rows for the
   missing ones), and then applies all the affected rows in the result
   set for each table to the slave. All this is done in one transaction
   on the master and one on the slave. The master commits last (which
   means it is possible that the master crashes without knowing a slave
   has been updated, but not that the slave crashes when the master
   thinks rows have been committed, but haven't been).

IV. CleanLog

  A. What is a CleanLog?

   The erserver engine uses more than one log into which it writes in
   order to keep track of the rows to be replicated. That could be
   inefficient, or it could cause maintenance to have to be done
   regularly. Instead, a strategy of log rotation is used.

  B. How does it work?

   The _rserv_log_ trigger inserts its results into whichever table is
   dictated by the current value of _rserv_active_log_id_. So, when it is
   time to switch logs, the clean log portion of the program just selects
   nextval('_rserv_active_log_id_'). The sequence cycles, so it returns
   to 1 after 2.

   Of course, immediately after the switch, there is a period in which
   old transactions still are using _rserv_log_1_ while _rserv_log_2_ is
   filling up. The replication engine knows to use both logs by the value
   of _rserv_old_log_status_. Once it has caught up all the slaves with
   all the items in _rserv_log_1_, it sets the _rserv_old_log_status_
   back to normal, and then clean log knows it can proceed with issuing a
   TRUNCATE on _rserv_log_1_.

   The interval between clean log events is controlled in the
   configuration file by replic.server.CleanLogInterval. The default is
   once every 24 hours. You'll find that is too infrequent on a busy
   database.

   That's everything I can think of for now. I hope these notes are
   useful. Feel free to ask questions.

   Andrew Sullivan
   andrew at libertyrms.info
   Formatting by [2]A. Elein Mustain 

References

   1. mailto:erserver-general@postgresql
   2. mailto:elein@varlena.com
1	dpavlin	1.1	This article discuss ErServer 1.2, new version which uses Java replication
2			engine. URL: http://www.varlena.com/GeneralBits/Tidbits/erserver_works.html
3
4
5			Quick Outline of ErServer Internals
6
7			by Andrew Sullivan
8			23-Oct-2003
9
10			Abstract
11
12			This is a quick conceptual outline of what erserver does once you have
13			it working, if you ever get it there. That is, this is not an outline
14			of how the setup scripts, etc., work (or don't), but how a tuple on
15			the master ends up on the slave(s).
16
17			I. Required support
18
19			A. On the master
20
21			The master must have the following tables set up:
22			_rserv_log_1_
23			_rserv_log_2_
24			_rserv_sync_
25			_rserv_tables_
26
27			Any other tables are not strictly required. There is a table that
28			normally gets set up by the system, _rserv_servers_, but it is never
29			actually used. It was apparently a stub for some planned features.
30			There is a server number in _rserv_sync_, and you could therefore in
31			principle use _rserv_servers_ for convenient lookups, but it won't get
32			used by anything.
33
34			The master must also have a couple of installed sequences:
35			_rserv_active_log_id_
36			_rserv_old_log_status_
37			_rserv_sync_seq_
38
39			Again, there is a sequence which exists for _rserv_servers_. It's not
40			actually used.
41
42			The function for the replication log trigger must be created. It's
43			generally called _rserv_log_, but the important detail is that it is
44			what is called by the trigger created on each replicated table, and
45			that it in turn is a C function which is compiled from src/erserver.c;
46			it should compile to a function called erserver.so, which will be in
47			the directory where you installed erserver. (In case you're wondering,
48			by the way, yes this breaks the nice $libdir trick in 7.2 and later.)
49
50			B. On the slave
51
52			On the slave, you need exact copies of the tables you are going to
53			replicate, plus you need _rserv_slave_tables_ and _rserv_slave_sync_.
54
55			C. General
56
57			All this is set up as a database super user. In fact, you really only
58			need to be super user to install the function, because it's a C
59			function. In practice, it's easier to set up the relevant bits as
60			super user, and then make some limited GRANTs to the relevant tables.
61
62			II. Queueing things to be replicated
63
64			A table which is to be replicated must meet several conditions on the
65			master.
66
67			A. It must be listed in the table _rserv_tables_:
68
69			Column Type Modifiers
70			tname name
71			cname name
72			reloid oid
73			key integer
74
75			The tname is the table name. The cname is the unique, not null key of
76			the table for replication purposes (the setup script creates
77			_ers_uniq, but there's nothing requiring that; if you had
78			single-column primary keys on all your tables, you could use them).
79			The reloid is the oid in pg_class for that table (i.e. select a.oid
80			from pg_class a, _rserv_tables_ b where a.relname = b.tname and
81			b.tname = [target table]), and the key is the attnum from pg_attribute
82			for that field (i.e. select attnum from pg_attribute a, _rserv_tables_
83			b where a.attname = b.cname and a.attrelid = b.reloid and b.tname =
84			[target table]; note that you'd have to have reloid set correctly for
85			this to work).
86
87			B. Trigger Installed
88
89			On the master, it must have the replication trigger installed. It is
90			normally called _rserv_trigger_t_, but the name is not important. What
91			_is_ important is that the trigger calls the _rserv_log_ C function
92			(the actual relevant bit is that the function which is called is the
93			function, distributed with the erserver source, which compiles to
94			rserv.so).
95
96			C. Permissions
97
98			Users writing to a replicated table must have permissions to read and
99			write on _rserv_log_1_ and _rserv_log_2_, and must be able to select
100			from _rserv_active_log_id_.
101
102			D. Trigger action
103
104			Suppose we are replicating target_table, then. We have the relevant
105			entries and the trigger is set up. When you insert a record into
106			target_table, you will cause the trigger to fire. It will insert into
107			the current log table (see below about cleanlog, but if this is the
108			first time, it should be _rserv_log_1_) a record with the reloid (see
109			_rserv_tables_), the logid (which is the xid from the transaction),
110			the timestamp, whether the record is deleted (-1 is an insert, 0 is an
111			update, 1 a delete), and the value of the key (that is, the actual
112			value in the key field for that row).
113
114			III. Replicating
115
116			A. Things on the slave
117
118			In order to replicate to a slave, you must have set up
119			_rserv_slave_tables_ on the slave analogously to how it is set up on
120			the master. Note that the oids are unlikely to be the same, so you
121			can't just copy the values from the master. Also, your slave tables
122			must have _exactly_ the same structure as they do on the master. In
123			particular, the key field must be available on the slave. Some
124			versions of the setup scripts apparently have done things in the wrong
125			order, and the _ers_uniq field is not correctly set up on the slave.
126			You'll be wanting to fix that. If you don't have the slave completely
127			set up before the sync starts, you will lose data.
128
129			B. Configuration file.
130
131			If people ask for an annotation of the config file, I'll provide one.
132			I think the comments are not too bad in there, though. It should be
133			relatively obvious if you have a misconfiguration there. (Send mail to
134			the [1]erserver-general list if not.)
135
136			C. Run the replication engine
137
138			You _must_ use the Sun JDK to run this software: apparently there are
139			bits of it which depend on special Sun bits. Java is only actually
140			portable if it's written that way, and this isn't. Patches are
141			welcome. The logs normally end up /path/to/erserver/logs.
142
143			D. Worth noting
144
145			If you look at the query generated by the replication engine, you will
146			note that it joins _rserv_log_1_ (or _2_) to the replicated tables,
147			and replicates the version of the row in the table. That means that
148			_old versions of the row are not replicated_. This is efficient, but
149			can surprise you if you're not expecting it.
150
151			E. What the engine does
152
153			The engine looks at _rserv_sync_ for the master and the current
154			thread's target slave, and goes through the current log (either
155			_rserv_log_1_ or _2_, depending on the value in _rserv_active_log_id_
156			and the state of _rserv_old_log_status_) and gets everything that is
157			affected _after_ the most recent successful sync for the target
158			server, and which is not currently active. It joins that result to the
159			tables in question (using LEFT JOIN so that it gets null rows for the
160			missing ones), and then applies all the affected rows in the result
161			set for each table to the slave. All this is done in one transaction
162			on the master and one on the slave. The master commits last (which
163			means it is possible that the master crashes without knowing a slave
164			has been updated, but not that the slave crashes when the master
165			thinks rows have been committed, but haven't been).
166
167			IV. CleanLog
168
169			A. What is a CleanLog?
170
171			The erserver engine uses more than one log into which it writes in
172			order to keep track of the rows to be replicated. That could be
173			inefficient, or it could cause maintenance to have to be done
174			regularly. Instead, a strategy of log rotation is used.
175
176			B. How does it work?
177
178			The _rserv_log_ trigger inserts its results into whichever table is
179			dictated by the current value of _rserv_active_log_id_. So, when it is
180			time to switch logs, the clean log portion of the program just selects
181			nextval('_rserv_active_log_id_'). The sequence cycles, so it returns
182			to 1 after 2.
183
184			Of course, immediately after the switch, there is a period in which
185			old transactions still are using _rserv_log_1_ while _rserv_log_2_ is
186			filling up. The replication engine knows to use both logs by the value
187			of _rserv_old_log_status_. Once it has caught up all the slaves with
188			all the items in _rserv_log_1_, it sets the _rserv_old_log_status_
189			back to normal, and then clean log knows it can proceed with issuing a
190			TRUNCATE on _rserv_log_1_.
191
192			The interval between clean log events is controlled in the
193			configuration file by replic.server.CleanLogInterval. The default is
194			once every 24 hours. You'll find that is too infrequent on a busy
195			database.
196
197			That's everything I can think of for now. I hope these notes are
198			useful. Feel free to ask questions.
199
200			Andrew Sullivan
201			andrew at libertyrms.info
202			Formatting by [2]A. Elein Mustain
203
204			References
205
206			1. mailto:erserver-general@postgresql
207			2. mailto:elein@varlena.com