# lib/WebPAC/Input.pm

package WebPAC::Input;

use warnings;
use strict;

use WebPAC::Common;
use base qw/WebPAC::Common/;
use Text::Iconv;

=head1 NAME

WebPAC::Input - read different file formats into WebPAC

=head1 VERSION

Version 0.03

=cut

our $VERSION = '0.03';

=head1 SYNOPSIS

This module implements input as a database which has a fixed and known
I<size> while indexing and a single unique numeric identifier for each
database position, ranging from 1 to I<size>.

Simply put, something that is indexed by a number from 1 .. I<size>.

Examples of such databases are CDS/ISIS files, MARC files, lines in
a text file, and so on.

Specific file formats are implemented using low-level interface modules,
located in the C<WebPAC::Input::*> namespace, which export C<open_db>,
C<fetch_rec> and an optional C<init> function.

A typical session looks like this:

    use WebPAC::Input;

    my $db = WebPAC::Input->new(
        module => 'WebPAC::Input::ISIS',
        config => $config,
        lookup => $lookup_obj,
        low_mem => 1,
    );

    $db->open( path => '/path/to/database' );
    print "database size: ",$db->size,"\n";
    while (my $rec = $db->fetch) {
        # process $rec here
    }

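As a rough illustration of the low-level interface described above, here is a
minimal sketch of a format module for lines in a text file. The package name
C<My::Input::Lines>, the record layout and the temporary file are assumptions
made for this example only; just the calling conventions of C<open_db> and
C<fetch_rec> follow what C<open> in this module does.

```perl
package My::Input::Lines;   # hypothetical name, not part of WebPAC

use strict;
use warnings;

# open_db receives the input object followed by named arguments and
# returns a database handle plus the number of records.
sub open_db {
    my ($input, %arg) = @_;
    open(my $fh, '<', $arg{path}) or return;
    chomp(my @lines = <$fh>);
    close($fh);
    return (\@lines, scalar @lines);
}

# fetch_rec receives the input object, the handle returned by open_db
# and a position counted from 1.
sub fetch_rec {
    my ($input, $db, $pos) = @_;
    my $line = $db->[$pos - 1];
    return unless defined $line;
    return { line => $line };   # record layout is an assumption
}

package main;

use File::Temp qw/tempfile/;

# create a small three-line "database" to read
my ($fh, $path) = tempfile(UNLINK => 1);
print $fh "first\nsecond\nthird\n";
close($fh);

my ($db, $size) = My::Input::Lines::open_db(undef, path => $path);
my $rec = My::Input::Lines::fetch_rec(undef, $db, 2);
print "size: $size, record 2: $rec->{line}\n";
```
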

=head1 FUNCTIONS

=head2 new

Create new input database object.

  my $db = WebPAC::Input->new(
        module => 'WebPAC::Input::MARC',
        code_page => 'ISO-8859-2',
        low_mem => 1,
  );

C<module> is the low-level file format module. See L<WebPAC::Input::Isis> and
L<WebPAC::Input::MARC>.

The optional parameter C<code_page> specifies the application code page (which
will be used internally). This should probably be your terminal encoding; the
default is C<ISO-8859-2>.

By default, C<low_mem> is not used (see L</MEMORY USAGE> below).

This function will also call the low-level C<init>, if it exists, with the
same parameters.

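The module loading and function lookup that C<new> performs can be sketched
as follows. This is a simplified stand-alone illustration: the helper name
C<load_functions> is hypothetical, and the core module C<File::Basename>
stands in for a real C<WebPAC::Input::*> module.

```perl
use strict;
use warnings;

# Load a module by name and return a hash of code references for the
# requested function names; functions the module lacks are left out.
sub load_functions {
    my ($module, @names) = @_;

    (my $file = $module) =~ s#::#/#g;   # Foo::Bar becomes Foo/Bar
    $file .= '.pm';
    require $file;

    my %found;
    foreach my $name (@names) {
        my $full = $module . '::' . $name;
        # defined &{...} and \&{...} on a name are permitted under strict refs
        $found{$name} = \&{ $full } if defined &{ $full };
    }
    return \%found;
}

my $f = load_functions('File::Basename', qw/basename no_such_function/);
print "basename found\n" if $f->{basename};
print "no_such_function missing\n" unless $f->{no_such_function};
```
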
=cut

sub new {
        my $class = shift;
        my $self = {@_};
        bless($self, $class);

        my $log = $self->_get_logger;

        $log->logconfess("specify low-level file format module") unless ($self->{module});
        my $module = $self->{module};
        $module =~ s#::#/#g;
        $module .= '.pm';
        $log->debug("require low-level module $self->{module} from $module");

        require $module;
        #eval $self->{module} .'->import';

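        # check which low-level functions are implemented;
        # open_db and fetch_rec are required, init is optional
        foreach my $subclass (qw/open_db fetch_rec init/) {
                my $n = $self->{module} . '::' . $subclass;
                if (defined &{ $n }) {
                        $self->{$subclass} = \&{ $n };
                } elsif ($subclass eq 'init') {
                        # no init in low-level module, so nothing to call later
                        delete $self->{init};
                } else {
                        my $missing = "missing $subclass in $self->{module}";
                        $self->{$subclass} = sub { $log->logwarn($missing) };
                }
        }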

        if ($self->{init}) {
                $log->debug("calling init");
                $self->{init}->($self, @_);
        }

        $self->{'code_page'} ||= 'ISO-8859-2';

        # running with low_mem flag? well, use DBM::Deep then.
        if ($self->{'low_mem'}) {
                $log->info("running with low_mem which impacts performance (<32 Mb memory usage)");

                my $db_file = "data.db";

                if (-e $db_file) {
                        unlink $db_file or $log->logdie("can't remove '$db_file' from last run");
                        $log->debug("removed '$db_file' from last run");
                }

                require DBM::Deep;

                my $db = DBM::Deep->new($db_file);

                $log->logdie("DBM::Deep error: $!") unless ($db);

                if ($db->error()) {
                        $log->logdie("can't open '$db_file' under low_mem: ",$db->error());
                } else {
                        $log->debug("using file '$db_file' for DBM::Deep");
                }

                $self->{'db'} = $db;
        }

        return $self;
}

=head2 open

This function will read the whole database into memory and produce lookups.

 $input->open(
        path => '/path/to/database/file',
        code_page => '852',
        limit => 500,
        offset => 6000,
 );

By default, C<code_page> is assumed to be C<852>.

C<offset> is an optional parameter to position at some offset before reading from the database.

C<limit> is an optional parameter to read just C<limit> records from the database.

Lookups are produced only if a C<lookup> object was passed to C<new>;
a C<lookup> argument given to C<open> is not used.

Returns the size of the database, regardless of the C<offset> and C<limit>
parameters; see also C<size>.

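The way C<offset> and C<limit> select a window of records can be sketched in
isolation. The helper C<window> below is hypothetical and only mirrors the
arithmetic; it does not read any database.

```perl
use strict;
use warnings;

# Return (first, last, count) for the records that would be read from a
# database of $size records, given optional $offset and $limit.
sub window {
    my ($size, $offset, $limit) = @_;

    my $first = $offset || 1;      # positions start at 1
    my $last  = $size;
    if ($limit) {
        $last = $first + $limit - 1;
        $last = $size if $last > $size;   # never read past the end
    }
    my $count = ($last >= $first) ? ($last - $first + 1) : 0;
    return ($first, $last, $count);
}

printf "first=%d last=%d count=%d\n", window(10000, 6000, 500);
```
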
=cut

sub open {
        my $self = shift;
        my $arg = {@_};

        my $log = $self->_get_logger();

        $log->logcroak("need path") if (! $arg->{'path'});
        my $code_page = $arg->{'code_page'} || '852';

        # store data in object
        $self->{'input_code_page'} = $code_page;
        foreach my $v (qw/path offset limit/) {
                $self->{$v} = $arg->{$v} if ($arg->{$v});
        }

        # create Text::Iconv object
        $self->{iconv} = Text::Iconv->new($code_page,$self->{'code_page'});

        my ($db, $size) = $self->{open_db}->( $self, 
                path => $arg->{path},
        );

        unless ($db) {
                $log->logwarn("can't open database $arg->{path}, skipping...");
                return;
        }

        unless ($size) {
                $log->logwarn("no records in database $arg->{path}, skipping...");
                return;
        }

        my $offset = 1;
        my $limit = $size;

        if (my $s = $self->{offset}) {
                $log->info("skipping to MFN $s");
                $offset = $s;
        } else {
                $self->{offset} = $offset;
        }

        if ($self->{limit}) {
                $log->debug("limiting to ",$self->{limit}," records");
                $limit = $offset + $self->{limit} - 1;
                $limit = $size if ($limit > $size);
        }

        # store size for later (a window of a single record has size 1)
        $self->{size} = ($limit >= $offset) ? ($limit - $offset + 1) : 0;

        $log->info("processing $self->{size} records in $code_page, convert to $self->{code_page}");

        # read database
        for (my $pos = $offset; $pos <= $limit; $pos++) {

                $log->debug("position: $pos\n");

                my $rec = $self->{fetch_rec}->($self, $db, $pos );

                if (! $rec) {
                        $log->warn("record $pos empty? skipping...");
                        next;
                }

                # store
                if ($self->{low_mem}) {
                        $self->{db}->put($pos, $rec);
                } else {
                        $self->{data}->{$pos} = $rec;
                }

                # create lookup
                $self->{'lookup'}->add( $rec ) if ($rec && $self->{'lookup'});

                $self->progress_bar($pos,$limit);

        }

        $self->{pos} = -1;
        $self->{last_pcnt} = 0;

        # store max mfn; the return value is the full database size
        $self->{max_pos} = $limit;
        $log->debug("max_pos: $limit");

        return $size;
}

=head2 fetch

Fetch the next record from the database. It will also display a progress bar.

 my $rec = $isis->fetch;

The record returned by this function should probably be passed to
C<data_structure> for normalisation.

=cut

sub fetch {
        my $self = shift;

        my $log = $self->_get_logger();

        $log->logconfess("it seems that you didn't load database!") unless ($self->{pos});

        if ($self->{pos} == -1) {
                $self->{pos} = $self->{offset};
        } else {
                $self->{pos}++;
        }

        my $mfn = $self->{pos};

        if ($mfn > $self->{max_pos}) {
                $self->{pos} = $self->{max_pos};
                $log->debug("at EOF");
                return;
        }

        $self->progress_bar($mfn,$self->{max_pos});

        my $rec;

        if ($self->{low_mem}) {
                $rec = $self->{db}->get($mfn);
        } else {
                $rec = $self->{data}->{$mfn};
        }

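        # return the true string '0E0' ("zero but true") for an empty record,
        # so that a while loop around fetch does not stop early
        return $rec || '0E0';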
}

=head2 pos

Returns current record number (MFN).

 print $isis->pos;

First record in database has position 1.

=cut

sub pos {
        my $self = shift;
        return $self->{pos};
}


=head2 size

Returns the number of records in the database.

 print $isis->size;

The result of this function can be used to loop through all records

 foreach my $mfn ( 1 .. $isis->size ) { ... }

because it takes C<offset> and C<limit> into account.

=cut

sub size {
        my $self = shift;
        return $self->{size};
}

=head2 seek

Seek to specified MFN in file.

 $isis->seek(42);

First record in database has position 1.

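How an out-of-range position is handled can be sketched on its own. The
helper C<clamp_pos> is a hypothetical name used only for this illustration,
and it assumes a positive maximum position; it does not cover a position of
0 or C<undef>, for which C<seek> returns without moving.

```perl
use strict;
use warnings;

# Clamp a requested position into the range 1 .. $max_pos.
sub clamp_pos {
    my ($pos, $max_pos) = @_;
    return 1        if $pos < 1;          # before the first record
    return $max_pos if $pos > $max_pos;   # beyond the last record
    return $pos;
}

print clamp_pos(42, 100), "\n";
print clamp_pos(500, 100), "\n";
```
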
=cut

sub seek {
        my $self = shift;
        my $pos = shift || return;

        my $log = $self->_get_logger();

        if ($pos < 1) {
                $log->warn("seek before first record");
                $pos = 1;
        } elsif ($pos > $self->{max_pos}) {
                $log->warn("seek beyond last record");
                $pos = $self->{max_pos};
        }

        return $self->{pos} = (($pos - 1) || -1);
}


=head1 MEMORY USAGE

The C<low_mem> option is a double-edged sword. If enabled, WebPAC
will run on memory-constrained machines (which don't have enough
physical RAM to hold the in-memory structure for the whole source database).

If your machine has 512Mb or more of RAM and the database is around 10000 records,
memory shouldn't be an issue. If you don't have enough physical RAM, you
might consider using virtual memory (if your operating system handles it
well, like FreeBSD or Linux) instead of dropping to L<DBM::Deep> to handle
the parsed structure of the ISIS database (this is what the C<low_mem> option does).

Hitting swap at the end of reading the source database is probably o.k. However,
hitting swap before 90% will dramatically decrease performance and you will
be better off with C<low_mem>, using the rest of the available memory for the
operating system disk cache (Linux is particularly good about this).
However, every access to a database record will require disk access, so
the generation phase will be 10-100 times slower.

Parsed structures are essential - you just have the option to trade RAM
(which is fast) for disk space (which is slow). Be sure to have plenty of
disk space if you are using C<low_mem> and thus L<DBM::Deep>.

However, when WebPAC is running on desktop machines (or laptops :-), it's
highly undesirable for the system to start swapping. Using the C<low_mem> option can
reduce WebPAC memory usage to around 64Mb for the same database, with lookup
fields and sorted indexes which stay in RAM. Performance will suffer, but
memory usage will really be minimal. It might also be more comfortable to
run WebPAC reniced on those machines.

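The trade of RAM for disk space can be illustrated with a small stand-alone
sketch. It deliberately does not use L<DBM::Deep>: as a stand-in it writes one
file per record with the core modules C<Storable> and C<File::Temp>, and the
helper name C<make_store> is hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw/tempdir/;
use Storable qw/store retrieve/;

# Return a (put, get) pair of code references. With a true $low_mem the
# records live on disk, otherwise they live in an in-memory hash.
sub make_store {
    my ($low_mem) = @_;

    if ($low_mem) {
        my $dir = tempdir(CLEANUP => 1);
        return (
            sub { my ($pos, $rec) = @_; store($rec, "$dir/$pos.rec") },
            sub {
                my ($pos) = @_;
                my $file = "$dir/$pos.rec";
                return (-e $file) ? retrieve($file) : undef;
            },
        );
    }

    my %data;
    return (
        sub { my ($pos, $rec) = @_; $data{$pos} = $rec },
        sub { my ($pos) = @_; return $data{$pos} },
    );
}

my ($put, $get) = make_store(1);   # disk-backed variant
$put->(1, { title => 'first record' });
my $rec = $get->(1);
print "title: $rec->{title}\n";
```

Every C<get> in the disk-backed variant reads a file, which is the same kind
of cost that C<low_mem> pays for each record access.
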

=head1 AUTHOR

Dobrica Pavlinusic, C<< <dpavlin@rot13.org> >>

=head1 COPYRIGHT & LICENSE

Copyright 2005 Dobrica Pavlinusic, All Rights Reserved.

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=cut

1; # End of WebPAC::Input