1 |
This article discuss ErServer 1.2, new version which uses Java replication |
2 |
engine. URL: http://www.varlena.com/GeneralBits/Tidbits/erserver_works.html |
3 |
|
4 |
|
5 |
Quick Outline of ErServer Internals |
6 |
|
7 |
by Andrew Sullivan |
8 |
23-Oct-2003 |
9 |
|
10 |
Abstract |
11 |
|
12 |
This is a quick conceptual outline of what erserver does once you have |
13 |
it working, if you ever get it there. That is, this is not an outline |
14 |
of how the setup scripts, etc., work (or don't), but how a tuple on |
15 |
the master ends up on the slave(s). |
16 |
|
17 |
I. Required support |
18 |
|
19 |
A. On the master |
20 |
|
21 |
The master must have the following tables set up: |
22 |
_rserv_log_1_ |
23 |
_rserv_log_2_ |
24 |
_rserv_sync_ |
25 |
_rserv_tables_ |
26 |
|
27 |
Any other tables are not strictly required. There is a table that |
28 |
normally gets set up by the system, _rserv_servers_, but it is never |
29 |
actually used. It was apparently a stub for some planned features. |
30 |
There is a server number in _rserv_sync_, and you could therefore in |
31 |
principle use _rserv_servers_ for convenient lookups, but it won't get |
32 |
used by anything. |
33 |
|
34 |
The master must also have a couple of installed sequences: |
35 |
_rserv_active_log_id_ |
36 |
_rserv_old_log_status_ |
37 |
_rserv_sync_seq_ |
38 |
|
39 |
Again, there is a sequence which exists for _rserv_servers_. It's not |
40 |
actually used. |
41 |
|
42 |
The function for the replication log trigger must be created. It's |
43 |
generally called _rserv_log_, but the important detail is that it is |
44 |
what is called by the trigger created on each replicated table, and |
45 |
that it in turn is a C function which is compiled from src/erserver.c; |
46 |
it should compile to a function called erserver.so, which will be in |
47 |
the directory where you installed erserver. (In case you're wondering, |
48 |
by the way, yes this breaks the nice $libdir trick in 7.2 and later.) |
49 |
|
50 |
B. On the slave |
51 |
|
52 |
On the slave, you need exact copies of the tables you are going to |
53 |
replicate, plus you need _rserv_slave_tables_ and _rserv_slave_sync_. |
54 |
|
55 |
C. General |
56 |
|
57 |
All this is set up as a database super user. In fact, you really only |
58 |
need to be super user to install the function, because it's a C |
59 |
function. In practice, it's easier to set up the relevant bits as |
60 |
super user, and then make some limited GRANTs to the relevant tables. |
61 |
|
62 |
II. Queueing things to be replicated |
63 |
|
64 |
A table which is to be replicated must meet several conditions on the |
65 |
master. |
66 |
|
67 |
A. It must be listed in the table _rserv_tables_: |
68 |
|
69 |
Column Type Modifiers |
70 |
tname name |
71 |
cname name |
72 |
reloid oid |
73 |
key integer |
74 |
|
75 |
The tname is the table name. The cname is the unique, not null key of |
76 |
the table for replication purposes (the setup script creates |
77 |
_ers_uniq, but there's nothing requiring that; if you had |
78 |
single-column primary keys on all your tables, you could use them). |
79 |
The reloid is the oid in pg_class for that table (i.e. select a.oid |
80 |
from pg_class a, _rserv_tables_ b where a.relname = b.tname and |
81 |
b.tname = [target table]), and the key is the attnum from pg_attribute |
82 |
for that field (i.e. select attnum from pg_attribute a, _rserv_tables_ |
83 |
b where a.attname = b.cname and a.attrelid = b.reloid and b.tname = |
84 |
[target table]; note that you'd have to have reloid set correctly for |
85 |
this to work). |
86 |
|
87 |
B. Trigger Installed |
88 |
|
89 |
On the master, it must have the replication trigger installed. It is |
90 |
normally called _rserv_trigger_t_, but the name is not important. What |
91 |
_is_ important is that the trigger calls the _rserv_log_ C function |
92 |
(the actual relevant bit is that the function which is called is the |
93 |
function, distributed with the erserver source, which compiles to |
94 |
rserv.so). |
95 |
|
96 |
C. Permissions |
97 |
|
98 |
Users writing to a replicated table must have permissions to read and |
99 |
write on _rserv_log_1_ and _rserv_log_2_, and must be able to select |
100 |
from _rserv_active_log_id_. |
101 |
|
102 |
D. Trigger action |
103 |
|
104 |
Suppose we are replicating target_table, then. We have the relevant |
105 |
entries and the trigger is set up. When you insert a record into |
106 |
target_table, you will cause the trigger to fire. It will insert into |
107 |
the current log table (see below about cleanlog, but if this is the |
108 |
first time, it should be _rserv_log_1_) a record with the reloid (see |
109 |
_rserv_tables_), the logid (which is the xid from the transaction), |
110 |
the timestamp, whether the record is deleted (-1 is an insert, 0 is an |
111 |
update, 1 a delete), and the value of the key (that is, the actual |
112 |
value in the key field for that row). |
113 |
|
114 |
III. Replicating |
115 |
|
116 |
A. Things on the slave |
117 |
|
118 |
In order to replicate to a slave, you must have set up |
119 |
_rserv_slave_tables_ on the slave analogously to how it is set up on |
120 |
the master. Note that the oids are unlikely to be the same, so you |
121 |
can't just copy the values from the master. Also, your slave tables |
122 |
must have _exactly_ the same structure as they do on the master. In |
123 |
particular, the key field must be available on the slave. Some |
124 |
versions of the setup scripts apparently have done things in the wrong |
125 |
order, and the _ers_uniq field is not correctly set up on the slave. |
126 |
You'll be wanting to fix that. If you don't have the slave completely |
127 |
set up before the sync starts, you will lose data. |
128 |
|
129 |
B. Configuration file. |
130 |
|
131 |
If people ask for an annotation of the config file, I'll provide one. |
132 |
I think the comments are not too bad in there, though. It should be |
133 |
relatively obvious if you have a misconfiguration there. (Send mail to |
134 |
the [1]erserver-general list if not.) |
135 |
|
136 |
C. Run the replication engine |
137 |
|
138 |
You _must_ use the Sun JDK to run this software: apparently there are |
139 |
bits of it which depend on special Sun bits. Java is only actually |
140 |
portable if it's written that way, and this isn't. Patches are |
141 |
welcome. The logs normally end up /path/to/erserver/logs. |
142 |
|
143 |
D. Worth noting |
144 |
|
145 |
If you look at the query generated by the replication engine, you will |
146 |
note that it joins _rserv_log_1_ (or _2_) to the replicated tables, |
147 |
and replicates the version of the row in the table. That means that |
148 |
_old versions of the row are not replicated_. This is efficient, but |
149 |
can surprise you if you're not expecting it. |
150 |
|
151 |
E. What the engine does |
152 |
|
153 |
The engine looks at _rserv_sync_ for the master and the current |
154 |
thread's target slave, and goes through the current log (either |
155 |
_rserv_log_1_ or _2_, depending on the value in _rserv_active_log_id_ |
156 |
and the state of _rserv_old_log_status_) and gets everything that is |
157 |
affected _after_ the most recent successful sync for the target |
158 |
server, and which is not currently active. It joins that result to the |
159 |
tables in question (using LEFT JOIN so that it gets null rows for the |
160 |
missing ones), and then applies all the affected rows in the result |
161 |
set for each table to the slave. All this is done in one transaction |
162 |
on the master and one on the slave. The master commits last (which |
163 |
means it is possible that the master crashes without knowing a slave |
164 |
has been updated, but not that the slave crashes when the master |
165 |
thinks rows have been committed, but haven't been). |
166 |
|
167 |
IV. CleanLog |
168 |
|
169 |
A. What is a CleanLog? |
170 |
|
171 |
The erserver engine uses more than one log into which it writes in |
172 |
order to keep track of the rows to be replicated. That could be |
173 |
inefficient, or it could cause maintenance to have to be done |
174 |
regularly. Instead, a strategy of log rotation is used. |
175 |
|
176 |
B. How does it work? |
177 |
|
178 |
The _rserv_log_ trigger inserts its results into whichever table is |
179 |
dictated by the current value of _rserv_active_log_id_. So, when it is |
180 |
time to switch logs, the clean log portion of the program just selects |
181 |
nextval('_rserv_active_log_id_'). The sequence cycles, so it returns |
182 |
to 1 after 2. |
183 |
|
184 |
Of course, immediately after the switch, there is a period in which |
185 |
old transactions still are using _rserv_log_1_ while _rserv_log_2_ is |
186 |
filling up. The replication engine knows to use both logs by the value |
187 |
of _rserv_old_log_status_. Once it has caught up all the slaves with |
188 |
all the items in _rserv_log_1_, it sets the _rserv_old_log_status_ |
189 |
back to normal, and then clean log knows it can proceed with issuing a |
190 |
TRUNCATE on _rserv_log_1_. |
191 |
|
192 |
The interval between clean log events is controlled in the |
193 |
configuration file by replic.server.CleanLogInterval. The default is |
194 |
once every 24 hours. You'll find that is too infrequent on a busy |
195 |
database. |
196 |
|
197 |
That's everything I can think of for now. I hope these notes are |
198 |
useful. Feel free to ask questions. |
199 |
|
200 |
Andrew Sullivan |
201 |
andrew at libertyrms.info |
202 |
Formatting by [2]A. Elein Mustain |
203 |
|
204 |
References |
205 |
|
206 |
1. mailto:erserver-general@postgresql |
207 |
2. mailto:elein@varlena.com |