lib/WebPAC/Input.pm

package WebPAC::Input;

use warnings;
use strict;

use blib;

use WebPAC::Common;
use base qw/WebPAC::Common/;
use Data::Dumper;
use Encode qw/from_to/;

=head1 NAME

WebPAC::Input - read different file formats into WebPAC

=head1 VERSION

Version 0.11

=cut

our $VERSION = '0.11';

=head1 SYNOPSIS

This module implements input as a database which has a fixed and known
I<size> while indexing and a single unique numeric identifier for each
database position, ranging from 1 to I<size>.

Simply put, something that is indexed by a number from 1 .. I<size>.

Examples of such databases are CDS/ISIS files, MARC files, lines in
a text file, and so on.

Specific file formats are implemented using low-level interface modules,
located in the C<WebPAC::Input::*> namespace, which export C<open_db>,
C<fetch_rec> and optional C<init> functions.
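
A minimal sketch of what such a low-level module might look like, assuming the
calling conventions visible in C<new> and C<open> below: C<open_db> receives the
input object plus named arguments and returns C<($db, $size)>, and C<fetch_rec>
receives the input object, C<$db>, a position and an optional filter closure. The
module name C<WebPAC::Input::Lines> and the use of field number C<1> are
hypothetical and only serve the illustration.

```perl
package WebPAC::Input::Lines;   # hypothetical module name, not shipped with WebPAC

use strict;
use warnings;

# open_db($input, path => '/some/file') returns ($db, $size);
# here $db is simply an array reference holding all lines of the file
sub open_db {
    my ($input, %arg) = @_;
    open(my $fh, '<', $arg{path}) or return;
    chomp(my @lines = <$fh>);
    close($fh);
    return (\@lines, scalar @lines);
}

# fetch_rec($input, $db, $pos, $filter) returns { field => [ values ] };
# positions start at 1, and the optional filter is called as $filter->($value, $field_nr)
sub fetch_rec {
    my ($input, $db, $pos, $filter) = @_;
    return if $pos < 1 || $pos > @$db;
    my $line = $db->[$pos - 1];
    $line = $filter->($line, 1) if $filter;   # field number 1 is an assumption
    return { 1 => [ $line ] };
}

1;
```

Each line of the text file becomes one record, so the position passed to
C<fetch_rec> maps directly to a line number.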

Perhaps a little code snippet.

        use WebPAC::Input;

        my $db = WebPAC::Input->new(
                module => 'WebPAC::Input::ISIS',
                low_mem => 1,
        );

        $db->open( path => '/path/to/database' );
        print "database size: ",$db->size,"\n";
        while (my $rec = $db->fetch) {
                # do something with $rec
        }


=head1 FUNCTIONS

=head2 new

Create new input database object.

  my $db = new WebPAC::Input(
        module => 'WebPAC::Input::MARC',
        encoding => 'ISO-8859-2',
        low_mem => 1,
        recode => 'char pairs',
        no_progress_bar => 1,
  );

C<module> is the low-level file format module. See L<WebPAC::Input::ISIS> and
L<WebPAC::Input::MARC>.

Optional parameter C<encoding> specifies the application code page (which will be
used internally). This should probably be your terminal encoding, and by
default it is C<ISO-8859-2>.

By default, the C<low_mem> option is not used (see L<MEMORY USAGE> below).

C<recode> is an optional string consisting of character or word pairs that
should be replaced in the input stream.

C<no_progress_bar> disables progress bar output on C<STDOUT>.

This function will also call the low-level C<init>, if it exists, with the same
parameters.
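
A small sketch of how a C<recode> string of pairs can be turned into a
replacement map and applied to text. The helper name C<build_recode> and the
sample pairs are hypothetical. Unlike the code in C<open>, this sketch splits on
runs of whitespace and escapes the keys with C<quotemeta>, so treat it as an
illustration of the idea rather than the module's exact behaviour.

```perl
use strict;
use warnings;

# build_recode is a hypothetical helper, not part of WebPAC::Input
sub build_recode {
    my ($recode) = @_;
    my @r = split(' ', $recode);
    return if @r % 2;                     # needs complete from/to pairs
    my %map = @r;
    # alternation of all "from" strings, escaped so they match literally
    my $regex = join '|', map { quotemeta } sort keys %map;
    return (\%map, $regex);
}

my ($map, $regex) = build_recode('ae a oe o');

my $text = 'aether oeuvre';
$text =~ s/($regex)/$map->{$1}/g;         # replace every "from" with its "to"
print "$text\n";
```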

=cut

sub new {
        my $class = shift;
        my $self = {@_};
        bless($self, $class);

        my $log = $self->_get_logger;

        $log->logconfess("code_page argument is not supported any more. change it to encoding") if ($self->{code_page});
        $log->logconfess("lookup argument is not supported any more. rewrite call to lookup_coderef") if ($self->{lookup});

        $log->logconfess("specify low-level file format module") unless ($self->{module});
        my $module = $self->{module};
        $module =~ s#::#/#g;
        $module .= '.pm';
        $log->debug("require low-level module $self->{module} from $module");

        require $module;
        #eval $self->{module} .'->import';

        # check if required subclasses are implemented
        foreach my $subclass (qw/open_db fetch_rec init/) {
                my $n = $self->{module} . '::' . $subclass;
                if (! defined &{ $n }) {
                        my $missing = "missing $subclass in $self->{module}";
                        $self->{$subclass} = sub { $log->logwarn($missing) };
                } else {
                        $self->{$subclass} = \&{ $n };
                }
        }

        if ($self->{init}) {
                $log->debug("calling init");
                $self->{init}->($self, @_);
        }

        $self->{'encoding'} ||= 'ISO-8859-2';

        # running with low_mem flag? well, use DBM::Deep then.
        if ($self->{'low_mem'}) {
                $log->info("running with low_mem which impacts performance (<32 Mb memory usage)");

                my $db_file = "data.db";

                if (-e $db_file) {
                        unlink $db_file or $log->logdie("can't remove '$db_file' from last run");
                        $log->debug("removed '$db_file' from last run");
                }

                require DBM::Deep;

                my $db = new DBM::Deep $db_file;

                $log->logdie("DBM::Deep error: $!") unless ($db);

                if ($db->error()) {
                        $log->logdie("can't open '$db_file' under low_mem: ",$db->error());
                } else {
                        $log->debug("using file '$db_file' for DBM::Deep");
                }

                $self->{'db'} = $db;
        }

        return $self;
}

=head2 open

This function will read whole database in memory and produce lookups.

 $input->open(
        path => '/path/to/database/file',
        code_page => 'cp852',
        limit => 500,
        offset => 6000,
        stats => 1,
        lookup_coderef => sub {
                my $rec = shift;
                # add values from $rec to lookup
        },
        modify_records => {
                900 => { '^a' => { ' : ' => '^b' } },
                901 => { '*' => { '^b' => ' ; ' } },
        },
 );

By default, C<code_page> is assumed to be C<cp852>.

C<offset> is an optional parameter to position at some offset before reading from the database.

C<limit> is an optional parameter to read just C<limit> records from the database.

C<stats> creates an optional report about usage of fields and subfields.

C<lookup_coderef> is a closure called with each record read; use it to add
C<< key => 'value' >> combinations to lookup.

C<modify_records> specifies mapping from subfields to delimiters or from
delimiters to subfields, as well as operations on fields (if the subfield is
defined as C<*>).

Returns the size of the database, regardless of the C<offset> and C<limit>
parameters; see also C<size>.
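
The way C<offset> and C<limit> select a range of records can be sketched as
follows. The helper C<record_range> is hypothetical and only mirrors the
arithmetic used by this function: reading starts at C<offset> (or 1), and
C<limit> caps the last position, which never exceeds the database size.

```perl
use strict;
use warnings;

# record_range is a hypothetical helper, not part of WebPAC::Input
sub record_range {
    my ($size, %arg) = @_;
    my $from = $arg{offset} || 1;          # first position is 1
    my $to   = $size;
    if ($arg{limit}) {
        $to = $from + $arg{limit} - 1;
        $to = $size if $to > $size;        # never read past the last record
    }
    my $count = $to >= $from ? $to - $from + 1 : 0;
    return ($from, $to, $count);
}

my ($from, $to, $count) = record_range(10_000, offset => 6000, limit => 500);
print "records $from-$to ($count of 10000)\n";
```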

=cut

sub open {
        my $self = shift;
        my $arg = {@_};

        my $log = $self->_get_logger();

        $log->logconfess("lookup argument is not supported any more. rewrite call to lookup_coderef") if ($arg->{lookup});
        $log->logconfess("lookup_coderef must be CODE, not ",ref($arg->{lookup_coderef}))
                if ($arg->{lookup_coderef} && ref($arg->{lookup_coderef}) ne 'CODE');

        $log->logcroak("need path") if (! $arg->{'path'});
        my $code_page = $arg->{'code_page'} || 'cp852';

        # store data in object
        $self->{'input_code_page'} = $code_page;
        foreach my $v (qw/path offset limit/) {
                $self->{$v} = $arg->{$v} if ($arg->{$v});
        }

        my $filter_ref;
        my $recode_regex;
        my $recode_map;

        if ($self->{recode}) {
                my @r = split(/\s/, $self->{recode});
                if ($#r % 2 != 1) {
                        $log->logwarn("recode needs even number of elements (some number of valid pairs)");
                } else {
                        while (@r) {
                                my $from = shift @r;
                                my $to = shift @r;
                                $recode_map->{$from} = $to;
                        }

                        $recode_regex = join '|' => keys %{ $recode_map };

                        $log->debug("using recode regex: $recode_regex");
                }

        }

        my $rec_regex = $self->modify_record_regexps(%{ $arg->{modify_records} });
        $log->debug("rec_regex: ", Dumper($rec_regex));

        my ($db, $size) = $self->{open_db}->( $self, 
                path => $arg->{path},
#               filter => sub {
#                       my ($l,$f_nr) = @_;
#                       return unless defined($l);
#                       from_to($l, $code_page, $self->{'encoding'});
#                       $l =~ s/($recode_regex)/$recode_map->{$1}/g if ($recode_regex && $recode_map);
#                       return $l;
#               },
                %{ $arg },
        );

        unless (defined($db)) {
                $log->logwarn("can't open database $arg->{path}, skipping...");
                return;
        }

        unless ($size) {
                $log->logwarn("no records in database $arg->{path}, skipping...");
                return;
        }

        my $from_rec = 1;
        my $to_rec = $size;

        if (my $s = $self->{offset}) {
                $log->debug("skipping to MFN $s");
                $from_rec = $s;
        } else {
                $self->{offset} = $from_rec;
        }

        if ($self->{limit}) {
                $log->debug("limiting to ",$self->{limit}," records");
                $to_rec = $from_rec + $self->{limit} - 1;
                $to_rec = $size if ($to_rec > $size);
        }

        # store size for later
        # a range of exactly one record ($to_rec == $from_rec) has size 1, not 0
        $self->{size} = ($to_rec >= $from_rec) ? ($to_rec - $from_rec + 1) : 0;

        $log->info("processing $self->{size}/$size records [$from_rec-$to_rec] convert $code_page -> $self->{encoding}", $self->{stats} ? ' [stats]' : '');

        # read database
        for (my $pos = $from_rec; $pos <= $to_rec; $pos++) {

                $log->debug("position: $pos\n");

                my $rec = $self->{fetch_rec}->($self, $db, $pos, sub {
                                my ($l,$f_nr) = @_;
#                               return unless defined($l);
#                               return $l unless ($rec_regex && $f_nr);

                                $log->debug("-=> $f_nr ## $l");

                                # codepage conversion and recode_regex
#                               from_to($l, $code_page, $self->{'encoding'});
                                from_to($l, $code_page, 'utf-8');
                                $l =~ s/($recode_regex)/$recode_map->{$1}/g if ($recode_regex && $recode_map);

                                # apply regexps
                                if ($rec_regex && defined($rec_regex->{$f_nr})) {
                                        $log->logconfess("regexps->{$f_nr} must be ARRAY") if (ref($rec_regex->{$f_nr}) ne 'ARRAY');
                                        my $c = 0;
                                        foreach my $r (@{ $rec_regex->{$f_nr} }) {
                                                #$log->debug("\$l = $l\neval \$l =~ $r");
                                                eval '$l =~ ' . $r;
                                                $log->error("error applying regex: $r") if ($@);
                                        }
                                }

                                $log->debug("<=- $f_nr ## $l");
                                return $l;
                });

                $log->debug(sub { Dumper($rec) });

                if (! $rec) {
                        $log->warn("record $pos empty? skipping...");
                        next;
                }

                # store
                if ($self->{low_mem}) {
                        $self->{db}->put($pos, $rec);
                } else {
                        $self->{data}->{$pos} = $rec;
                }

                # create lookup
                $arg->{'lookup_coderef'}->( $rec ) if ($rec && $arg->{'lookup_coderef'});

                # update counters for statistics
                if ($self->{stats}) {

                        # fetch clean record with regexpes applied for statistics
                        my $rec = $self->{fetch_rec}->($self, $db, $pos);

                        foreach my $fld (keys %{ $rec }) {
                                $self->{_stats}->{fld}->{ $fld }++;

                                $log->logdie("invalid record field $fld, not ARRAY")
                                        unless (ref($rec->{ $fld }) eq 'ARRAY');
        
                                foreach my $row (@{ $rec->{$fld} }) {

                                        if (ref($row) eq 'HASH') {

                                                foreach my $sf (keys %{ $row }) {
                                                        next if ($sf eq 'subfields');
                                                        $self->{_stats}->{sf}->{ $fld }->{ $sf }->{count}++;
                                                        $self->{_stats}->{sf}->{ $fld }->{ $sf }->{repeatable}++
                                                                        if (ref($row->{$sf}) eq 'ARRAY');
                                                }

                                        } else {
                                                $self->{_stats}->{repeatable}->{ $fld }++;
                                        }
                                }
                        }
                }

                $self->progress_bar($pos,$to_rec) unless ($self->{no_progress_bar});

        }

        $self->{pos} = -1;
        $self->{last_pcnt} = 0;

        # store max mfn and return it.
        $self->{max_pos} = $to_rec;
        $log->debug("max_pos: $to_rec");

        return $size;
}

=head2 fetch

Fetch next record from database. It will also display a progress bar.

 my $rec = $isis->fetch;

Record from this function should probably go to C<data_structure> for
normalisation.

=cut

sub fetch {
        my $self = shift;

        my $log = $self->_get_logger();

        $log->logconfess("it seems that you didn't load database!") unless ($self->{pos});

        if ($self->{pos} == -1) {
                $self->{pos} = $self->{offset};
        } else {
                $self->{pos}++;
        }

        my $mfn = $self->{pos};

        if ($mfn > $self->{max_pos}) {
                $self->{pos} = $self->{max_pos};
                $log->debug("at EOF");
                return;
        }

        $self->progress_bar($mfn,$self->{max_pos}) unless ($self->{no_progress_bar});

        my $rec;

        if ($self->{low_mem}) {
                $rec = $self->{db}->get($mfn);
        } else {
                $rec = $self->{data}->{$mfn};
        }

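        # the string '0E0' is "zero but true", so a missing record does not end
        # a while (my $rec = $db->fetch) loop; the bare literal 0E0 is numeric 0 (false)
        return $rec || '0E0';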
}

=head2 pos

Returns current record number (MFN).

 print $isis->pos;

First record in database has position 1.

=cut

sub pos {
        my $self = shift;
        return $self->{pos};
}


=head2 size

Returns the number of records in the database.

 print $isis->size;

Result from this function can be used to loop through all records

 foreach my $mfn ( 1 .. $isis->size ) { ... }

because it takes into account C<offset> and C<limit>.

=cut

sub size {
        my $self = shift;
        return $self->{size};
}

=head2 seek

Seek to specified MFN in file.

 $isis->seek(42);

First record in database has position 1.

=cut

sub seek {
        my $self = shift;
        my $pos = shift || return;

        my $log = $self->_get_logger();

        if ($pos < 1) {
                $log->warn("seek before first record");
                $pos = 1;
        } elsif ($pos > $self->{max_pos}) {
                $log->warn("seek beyond last record");
                $pos = $self->{max_pos};
        }

        return $self->{pos} = (($pos - 1) || -1);
}

=head2 stats

Dump statistics about field and subfield usage.

  print $input->stats;
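
To give an idea of the report format, here is a sketch that formats one field
line the same way as the C<sprintf> calls in this function. The counters are
invented for illustration and C<format_field> is a hypothetical helper. A
trailing C<*> marks a subfield that was seen as repeatable.

```perl
use strict;
use warnings;

# hand-made counters in the shape collected by open( stats => 1 ); values are invented
my $s = {
    fld => { 200 => 3 },
    sf  => { 200 => { a => { count => 3 }, b => { count => 2, repeatable => 1 } } },
};

# format_field is a hypothetical helper, not part of WebPAC::Input
sub format_field {
    my ($stats, $f) = @_;
    # field number, then how many records used it
    my $o = sprintf("%4s %d ~", $f, $stats->{fld}->{$f});
    # one "subfield:count" entry per subfield, in sorted order
    foreach my $sf (sort keys %{ $stats->{sf}->{$f} || {} }) {
        $o .= sprintf(" %s:%d%s", $sf,
            $stats->{sf}->{$f}->{$sf}->{count},
            $stats->{sf}->{$f}->{$sf}->{repeatable} ? '*' : '',
        );
    }
    return $o;
}

print format_field($s, 200), "\n";
```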

=cut

sub stats {
        my $self = shift;

        my $log = $self->_get_logger();

        my $s = $self->{_stats};
        if (! $s) {
                $log->warn("called stats, but no statistics were collected");
                return;
        }

        my $max_fld = 0;

        my $out = join("\n",
                map {
                        my $f = $_ || die "no field";
                        my $v = $s->{fld}->{$f} || die "no s->{fld}->{$f}";
                        $max_fld = $v if ($v > $max_fld);

                        my $o = sprintf("%4s %d ~", $f, $v);

                        if (defined($s->{sf}->{$f})) {
                                map {
                                        $o .= sprintf(" %s:%d%s", $_, 
                                                $s->{sf}->{$f}->{$_}->{count},
                                                $s->{sf}->{$f}->{$_}->{repeatable} ? '*' : '',
                                        );
                                } sort keys %{ $s->{sf}->{$f} };
                        }

                        if (my $v_r = $s->{repeatable}->{$f}) {
                                $o .= " ($v_r)" if ($v_r != $v);
                        }

                        $o;
                } sort { $a cmp $b } keys %{ $s->{fld} }
        );

        $log->debug( sub { Dumper($s) } );

        return $out;
}

=head2 modify_record_regexps

Generate a hash of regexps to be applied by the filter in C<open>.

  my $regexpes = $input->modify_record_regexps(
                900 => { '^a' => { ' : ' => '^b' } },
                901 => { '*' => { '^b' => ' ; ' } },
  );
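
As a sketch of the effect, the mapping above yields substitution strings of the
following form, which C<open> applies to field values with a string C<eval>. The
sample field values and the helper C<apply_regexps> are hypothetical.

```perl
use strict;
use warnings;

# substitution strings in the form generated for the mapping above
my %regexps = (
    900 => [ 's/\Q^a\E([^\^]+)\Q : \E([^\^]+)/^a$1^b$2/g' ],   # delimiter -> subfield
    901 => [ 's/\Q^b\E/ ; /g' ],                               # subfield -> delimiter
);

# apply_regexps is a hypothetical helper, not part of WebPAC::Input
sub apply_regexps {
    my ($f_nr, $l) = @_;
    foreach my $r (@{ $regexps{$f_nr} || [] }) {
        eval '$l =~ ' . $r;                 # same string-eval approach as open()
        warn "error applying regex: $r" if $@;
    }
    return $l;
}

print apply_regexps(900, '^aTitle : subtitle'), "\n";
print apply_regexps(901, '^aOne^bTwo'), "\n";
```

The first rule splits the text after C< : > inside subfield C<^a> into a new
subfield C<^b>; the second replaces every C<^b> marker in the field with the
delimiter C< ; >.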

=cut

sub modify_record_regexps {
        my $self = shift;
        my $modify_record = {@_};

        my $regexpes;

        foreach my $f (keys %$modify_record) {
warn "--- f: $f\n";
                foreach my $sf (keys %{ $modify_record->{$f} }) {
warn "---- sf: $sf\n";
                        foreach my $from (keys %{ $modify_record->{$f}->{$sf} }) {
                                my $to = $modify_record->{$f}->{$sf}->{$from};
                                #die "no field?" unless defined($to);
warn "----- transform: |$from| -> |$to|\n";

                                if ($sf =~ /^\^/) {
                                        my $regex = 
                                                's/\Q'. $sf .'\E([^\^]+)\Q'. $from .'\E([^\^]+)/'. $sf .'$1'. $to .'$2/g';
                                        push @{ $regexpes->{$f} }, $regex;
warn ">>>>> $regex [sf]\n";
                                } else {
                                        my $regex =
                                                's/\Q'. $from .'\E/'. $to .'/g';
                                        push @{ $regexpes->{$f} }, $regex;
warn ">>>>> $regex [global]\n";
                                }

                        }
                }
        }

        return $regexpes;
}

=head1 MEMORY USAGE

The C<low_mem> option is a double-edged sword. If enabled, WebPAC
will run on memory-constrained machines (which don't have enough
physical RAM to create the memory structure for the whole source database).

If your machine has 512Mb or more of RAM and the database is around 10000 records,
memory shouldn't be an issue. If you don't have enough physical RAM, you
might consider using virtual memory (if your operating system handles it
well, like FreeBSD or Linux) instead of dropping to L<DBM::Deep> to handle
the parsed structure of the ISIS database (this is what the C<low_mem> option does).

Hitting swap at the end of reading the source database is probably o.k. However,
hitting swap before 90% will dramatically decrease performance and you will
be better off with C<low_mem> and using the rest of available memory for
the operating system disk cache (Linux is particularly good about this).
However, every access to a database record will require disk access, so
the generation phase will be 10-100 times slower.

Parsed structures are essential - you just have the option to trade RAM
(which is fast) for disk space (which is slow). Be sure to have plenty of
disk space if you are using C<low_mem> and thus L<DBM::Deep>.

However, when WebPAC is running on desktop machines (or laptops :-), it's
highly undesirable for the system to start swapping. Using the C<low_mem> option can
reduce WebPAC memory usage to around 64Mb for the same database with lookup
fields and sorted indexes which stay in RAM. Performance will suffer, but
memory usage will really be minimal. It might also be more comfortable to
run WebPAC reniced on those machines.


=head1 AUTHOR

Dobrica Pavlinusic, C<< <dpavlin@rot13.org> >>

=head1 COPYRIGHT & LICENSE

Copyright 2005-2006 Dobrica Pavlinusic, All Rights Reserved.

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=cut

1; # End of WebPAC::Input
1	dpavlin	1	package WebPAC::Input;
2
3			use warnings;
4			use strict;
5
6	dpavlin	507	use blib;
7
8	dpavlin	487	use WebPAC::Common;
9	dpavlin	285	use base qw/WebPAC::Common/;
10	dpavlin	308	use Data::Dumper;
11	dpavlin	624	use Encode qw/from_to/;
12	dpavlin	285
13	dpavlin	1	=head1 NAME
14
15	dpavlin	286	WebPAC::Input - read different file formats into WebPAC
16	dpavlin	1
17			=head1 VERSION
18
19	dpavlin	619	Version 0.11
20	dpavlin	1
21			=cut
22
23	dpavlin	619	our $VERSION = '0.11';
24	dpavlin	1
25			=head1 SYNOPSIS
26
27	dpavlin	286	This module implements input as database which have fixed and known
28			I<size> while indexing and single unique numeric identifier for database
29			position ranging from 1 to I<size>.
30	dpavlin	1
31	dpavlin	286	Simply, something that is indexed by unmber from 1 .. I<size>.
32
33			Examples of such databases are CDS/ISIS files, MARC files, lines in
34			text file, and so on.
35
36			Specific file formats are implemented using low-level interface modules,
37			located in C<WebPAC::Input::*> namespace which export C<open_db>,
38			C<fetch_rec> and optional C<init> functions.
39
40	dpavlin	1	Perhaps a little code snippet.
41
42	dpavlin	597	use WebPAC::Input;
43	dpavlin	1
44	dpavlin	597	my $db = WebPAC::Input->new(
45			module => 'WebPAC::Input::ISIS',
46	dpavlin	286	low_mem => 1,
47	dpavlin	597	);
48	dpavlin	1
49	dpavlin	597	$db->open( path => '/path/to/database' );
50	dpavlin	416	print "database size: ",$db->size,"\n";
51			while (my $rec = $db->fetch) {
52			# do something with $rec
53			}
54	dpavlin	1
55	dpavlin	286
56
57	dpavlin	1	=head1 FUNCTIONS
58
59	dpavlin	3	=head2 new
60	dpavlin	1
61	dpavlin	3	Create new input database object.
62
63	dpavlin	9	my $db = new WebPAC::Input(
64	dpavlin	286	module => 'WebPAC::Input::MARC',
65	dpavlin	585	encoding => 'ISO-8859-2',
66	dpavlin	10	low_mem => 1,
67	dpavlin	416	recode => 'char pairs',
68	dpavlin	483	no_progress_bar => 1,
69	dpavlin	9	);
70	dpavlin	3
71	dpavlin	597	C<module> is low-level file format module. See L<WebPAC::Input::ISIS> and
72	dpavlin	286	L<WebPAC::Input::MARC>.
73
74	dpavlin	585	Optional parametar C<encoding> specify application code page (which will be
75	dpavlin	9	used internally). This should probably be your terminal encoding, and by
76			default, it C<ISO-8859-2>.
77
78	dpavlin	10	Default is not to use C<low_mem> options (see L<MEMORY USAGE> below).
79
80	dpavlin	483	C<recode> is optional string constisting of character or words pairs that
81			should be replaced in input stream.
82
83			C<no_progress_bar> disables progress bar output on C<STDOUT>
84
85	dpavlin	285	This function will also call low-level C<init> if it exists with same
86			parametars.
87
88	dpavlin	1	=cut
89
90	dpavlin	3	sub new {
91	dpavlin	285	my $class = shift;
92			my $self = {@_};
93	dpavlin	3	bless($self, $class);
94
95	dpavlin	285	my $log = $self->_get_logger;
96
97	dpavlin	585	$log->logconfess("code_page argument is not suppored any more. change it to encoding") if ($self->{lookup});
98			$log->logconfess("lookup argument is not suppored any more. rewrite call to lookup_ref") if ($self->{lookup});
99
100	dpavlin	286	$log->logconfess("specify low-level file format module") unless ($self->{module});
101			my $module = $self->{module};
102			$module =~ s#::#/#g;
103			$module .= '.pm';
104			$log->debug("require low-level module $self->{module} from $module");
105	dpavlin	289
106	dpavlin	286	require $module;
107	dpavlin	289	#eval $self->{module} .'->import';
108	dpavlin	286
109	dpavlin	285	# check if required subclasses are implemented
110	dpavlin	289	foreach my $subclass (qw/open_db fetch_rec init/) {
111			my $n = $self->{module} . '::' . $subclass;
112			if (! defined &{ $n }) {
113	dpavlin	290	my $missing = "missing $subclass in $self->{module}";
114	dpavlin	301	$self->{$subclass} = sub { $log->logwarn($missing) };
115	dpavlin	286	} else {
116	dpavlin	289	$self->{$subclass} = \&{ $n };
117	dpavlin	286	}
118	dpavlin	285	}
119
120	dpavlin	289	if ($self->{init}) {
121	dpavlin	285	$log->debug("calling init");
122	dpavlin	289	$self->{init}->($self, @_);
123	dpavlin	285	}
124
125	dpavlin	585	$self->{'encoding'} \|\|= 'ISO-8859-2';
126	dpavlin	9
127	dpavlin	10	# running with low_mem flag? well, use DBM::Deep then.
128			if ($self->{'low_mem'}) {
129			$log->info("running with low_mem which impacts performance (<32 Mb memory usage)");
130
131			my $db_file = "data.db";
132
133			if (-e $db_file) {
134			unlink $db_file or $log->logdie("can't remove '$db_file' from last run");
135			$log->debug("removed '$db_file' from last run");
136			}
137
138			require DBM::Deep;
139
140			my $db = new DBM::Deep $db_file;
141
142			$log->logdie("DBM::Deep error: $!") unless ($db);
143
144			if ($db->error()) {
145			$log->logdie("can't open '$db_file' under low_mem: ",$db->error());
146			} else {
147			$log->debug("using file '$db_file' for DBM::Deep");
148			}
149
150			$self->{'db'} = $db;
151			}
152
153	dpavlin	3	$self ? return $self : return undef;
154	dpavlin	1	}
155
156	dpavlin	285	=head2 open
157
158			This function will read whole database in memory and produce lookups.
159
160	dpavlin	286	$input->open(
161	dpavlin	285	path => '/path/to/database/file',
162	dpavlin	624	code_page => 'cp852',
163	dpavlin	286	limit => 500,
164			offset => 6000,
165	dpavlin	285	lookup => $lookup_obj,
166	dpavlin	506	stats => 1,
167	dpavlin	585	lookup_ref => sub {
168			my ($k,$v) = @_;
169			# store lookup $k => $v
170			},
171	dpavlin	597	modify_records => {
172			900 => { '^a' => { ' : ' => '^b' } },
173			901 => { '*' => { '^b' => ' ; ' } },
174			},
175	dpavlin	285	);
176
177	dpavlin	624	By default, C<code_page> is assumed to be C<cp852>.
178	dpavlin	285
179	dpavlin	286	C<offset> is optional parametar to position at some offset before reading from database.
180	dpavlin	285
181	dpavlin	286	C<limit> is optional parametar to read just C<limit> records from database
182	dpavlin	285
183	dpavlin	506	C<stats> create optional report about usage of fields and subfields
184
185	dpavlin	597	C<lookup_coderef> is closure to call when adding C<< key => 'value' >> combinations to
186	dpavlin	585	lookup.
187
188	dpavlin	597	C<modify_records> specify mapping from subfields to delimiters or from
189			delimiters to subfields, as well as oprations on fields (if subfield is
190			defined as C<*>.
191
192	dpavlin	286	Returns size of database, regardless of C<offset> and C<limit>
193			parametars, see also C<size>.
194	dpavlin	285
195			=cut
196
197			sub open {
198			my $self = shift;
199			my $arg = {@_};
200
201			my $log = $self->_get_logger();
202
203	dpavlin	585	$log->logconfess("lookup argument is not suppored any more. rewrite call to lookup_coderef") if ($arg->{lookup});
204			$log->logconfess("lookup_coderef must be CODE, not ",ref($arg->{lookup_coderef}))
205			if ($arg->{lookup_coderef} && ref($arg->{lookup_coderef}) ne 'CODE');
206
207	dpavlin	285	$log->logcroak("need path") if (! $arg->{'path'});
208	dpavlin	624	my $code_page = $arg->{'code_page'} \|\| 'cp852';
209	dpavlin	285
210			# store data in object
211	dpavlin	292	$self->{'input_code_page'} = $code_page;
212	dpavlin	286	foreach my $v (qw/path offset limit/) {
213	dpavlin	285	$self->{$v} = $arg->{$v} if ($arg->{$v});
214			}
215
216	dpavlin	416	my $filter_ref;
217	dpavlin	597	my $recode_regex;
218			my $recode_map;
219	dpavlin	416
220			if ($self->{recode}) {
221			my @r = split(/\s/, $self->{recode});
222			if ($#r % 2 != 1) {
223			$log->logwarn("recode needs even number of elements (some number of valid pairs)");
224			} else {
225			while (@r) {
226			my $from = shift @r;
227			my $to = shift @r;
228	dpavlin	597	$recode_map->{$from} = $to;
229	dpavlin	416	}
230
231	dpavlin	597	$recode_regex = join '\|' => keys %{ $recode_map };
232	dpavlin	416
233	dpavlin	597	$log->debug("using recode regex: $recode_regex");
234	dpavlin	416	}
235
236			}
237
238	dpavlin	597	my $rec_regex = $self->modify_record_regexps(%{ $arg->{modify_records} });
239			$log->debug("rec_regex: ", Dumper($rec_regex));
240
241	dpavlin	289	my ($db, $size) = $self->{open_db}->( $self,
242	dpavlin	285	path => $arg->{path},
243	dpavlin	624	# filter => sub {
244			# my ($l,$f_nr) = @_;
245			# return unless defined($l);
246			# from_to($l, $code_page, $self->{'encoding'});
247			# $l =~ s/($recode_regex)/$recode_map->{$1}/g if ($recode_regex && $recode_map);
248			# return $l;
249			# },
250	dpavlin	523	%{ $arg },
251	dpavlin	285	);
252
253	dpavlin	496	unless (defined($db)) {
254	dpavlin	285	$log->logwarn("can't open database $arg->{path}, skipping...");
255			return;
256			}
257
258			unless ($size) {
259			$log->logwarn("no records in database $arg->{path}, skipping...");
260			return;
261			}
262
263	dpavlin	339	my $from_rec = 1;
264			my $to_rec = $size;
265	dpavlin	285
266	dpavlin	286	if (my $s = $self->{offset}) {
267	dpavlin	513	$log->debug("skipping to MFN $s");
268	dpavlin	339	$from_rec = $s;
269	dpavlin	285	} else {
270	dpavlin	339	$self->{offset} = $from_rec;
271	dpavlin	285	}
272
273	dpavlin	286	if ($self->{limit}) {
274	dpavlin	301	$log->debug("limiting to ",$self->{limit}," records");
275	dpavlin	339	$to_rec = $from_rec + $self->{limit} - 1;
276			$to_rec = $size if ($to_rec > $size);
277	dpavlin	285	}
278
279			# store size for later
280	dpavlin	339	$self->{size} = ($to_rec - $from_rec) ? ($to_rec - $from_rec + 1) : 0;
281	dpavlin	285
282	dpavlin	585	$log->info("processing $self->{size}/$size records [$from_rec-$to_rec] convert $code_page -> $self->{encoding}", $self->{stats} ? ' [stats]' : '');
283	dpavlin	285
284			# read database
285	dpavlin	339	for (my $pos = $from_rec; $pos <= $to_rec; $pos++) {
286	dpavlin	285
287	dpavlin	286	$log->debug("position: $pos\n");
288	dpavlin	285
289	dpavlin	619	my $rec = $self->{fetch_rec}->($self, $db, $pos, sub {
290			my ($l,$f_nr) = @_;
291	dpavlin	624	# return unless defined($l);
292			# return $l unless ($rec_regex && $f_nr);
293	dpavlin	285
294	dpavlin	625	$log->debug("-=> $f_nr ## $l");
295
296	dpavlin	624	# codepage conversion and recode_regex
297			# from_to($l, $code_page, $self->{'encoding'});
298			from_to($l, $code_page, 'utf-8');
299			$l =~ s/($recode_regex)/$recode_map->{$1}/g if ($recode_regex && $recode_map);
300
301	dpavlin	619	# apply regexps
302			if ($rec_regex && defined($rec_regex->{$f_nr})) {
					$log->logconfess("regexps->{$f_nr} must be ARRAY") if (ref($rec_regex->{$f_nr}) ne 'ARRAY');
					my $c = 0;
					foreach my $r (@{ $rec_regex->{$f_nr} }) {
						#$log->debug("\$l = $l\neval \$l =~ $r");
						eval '$l =~ ' . $r;
						$log->error("error applying regex: $r") if ($@);
					}
				}

				$log->debug("<=- $f_nr ## $l");
				return $l;
		});

		$log->debug(sub { Dumper($rec) });

		if (! $rec) {
			$log->warn("record $pos empty? skipping...");
			next;
		}

		# store
		if ($self->{low_mem}) {
			$self->{db}->put($pos, $rec);
		} else {
			$self->{data}->{$pos} = $rec;
		}

		# create lookup
		$arg->{'lookup_coderef'}->( $rec ) if ($rec && $arg->{'lookup_coderef'});

		# update counters for statistics
		if ($self->{stats}) {

			# fetch clean record with regexps applied for statistics
			my $rec = $self->{fetch_rec}->($self, $db, $pos);

			foreach my $fld (keys %{ $rec }) {
				$self->{_stats}->{fld}->{ $fld }++;

				$log->logdie("invalid record field $fld, not ARRAY")
					unless (ref($rec->{ $fld }) eq 'ARRAY');

				foreach my $row (@{ $rec->{$fld} }) {

					if (ref($row) eq 'HASH') {

						foreach my $sf (keys %{ $row }) {
							next if ($sf eq 'subfields');
							$self->{_stats}->{sf}->{ $fld }->{ $sf }->{count}++;
							$self->{_stats}->{sf}->{ $fld }->{ $sf }->{repeatable}++
								if (ref($row->{$sf}) eq 'ARRAY');
						}

					} else {
						$self->{_stats}->{repeatable}->{ $fld }++;
					}
				}
			}
		}

		$self->progress_bar($pos,$to_rec) unless ($self->{no_progress_bar});

	}

	$self->{pos} = -1;
	$self->{last_pcnt} = 0;

	# store max mfn and return database size
	$self->{max_pos} = $to_rec;
	$log->debug("max_pos: $to_rec");

	return $size;
}

=head2 fetch

Fetch the next record from the database. It also displays a progress bar.

  my $rec = $isis->fetch;

The record returned by this function should probably go to C<data_structure> for
normalisation.
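
As written, C<fetch> returns nothing at the end of the database and a false
but defined value (numeric 0) for a position that holds no record. The sketch
below shows one way a caller could tell the two apart. It is a stand-alone
illustration: C<fetch_sketch>, C<%data>, C<$pos> and C<$max_pos> are made-up
stand-ins and not part of this module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a loaded database in which record 2 is missing.
my %data    = ( 1 => { 200 => ['first'] }, 3 => { 200 => ['third'] } );
my $max_pos = 3;
my $pos     = 0;

# Mirrors the return values of fetch: nothing at EOF, 0 for a missing record.
sub fetch_sketch {
    return if ++$pos > $max_pos;
    return $data{$pos} || 0E0;
}

my @seen;
# defined() keeps the loop running over the false value of a missing record.
while ( defined( my $rec = fetch_sketch() ) ) {
    push @seen, $rec ? "record:$pos" : "empty:$pos";
}
print join( ',', @seen ), "\n";
```

A plain C<while (my $rec = ...)> loop would stop at position 2 in this sketch,
because the bare literal C<0E0> is just the number zero.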

=cut

sub fetch {
	my $self = shift;

	my $log = $self->_get_logger();

	$log->logconfess("it seems that you didn't load database!") unless ($self->{pos});

	if ($self->{pos} == -1) {
		$self->{pos} = $self->{offset};
	} else {
		$self->{pos}++;
	}

	my $mfn = $self->{pos};

	if ($mfn > $self->{max_pos}) {
		$self->{pos} = $self->{max_pos};
		$log->debug("at EOF");
		return;
	}

	$self->progress_bar($mfn,$self->{max_pos}) unless ($self->{no_progress_bar});

	my $rec;

	if ($self->{low_mem}) {
		$rec = $self->{db}->get($mfn);
	} else {
		$rec = $self->{data}->{$mfn};
	}

	# NOTE: bare 0E0 is the number 0 (false); only the quoted string
	# '0E0' is "zero but true", so a missing record yields a false value
	$rec ||= 0E0;
}

=head2 pos

Returns current record number (MFN).

  print $isis->pos;

First record in database has position 1.

=cut

sub pos {
	my $self = shift;
	return $self->{pos};
}


=head2 size

Returns the number of records in the database.

  print $isis->size;

The result of this function can be used to loop through all records

  foreach my $mfn ( 1 .. $isis->size ) { ... }

because it takes into account C<offset> and C<limit>.

=cut

sub size {
	my $self = shift;
	return $self->{size};
}

=head2 seek

Seek to specified MFN in file.

  $isis->seek(42);

First record in database has position 1.
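
The stored position is one less than the requested MFN, because C<fetch>
increments the position before reading. The following stand-alone sketch
reproduces only that arithmetic. C<seek_sketch> is a hypothetical helper
written for this illustration, not a function of this module.

```perl
use strict;
use warnings;

# Hypothetical helper reproducing only the position arithmetic of seek.
sub seek_sketch {
    my ( $pos, $max_pos ) = @_;
    return unless $pos;                     # seek(0) or seek() does nothing
    $pos = 1        if $pos < 1;            # clamp before first record
    $pos = $max_pos if $pos > $max_pos;     # clamp beyond last record
    return ( $pos - 1 ) || -1;              # -1 marks "nothing fetched yet"
}

# Requested MFN => stored position, for a database with 100 records.
printf "%d => %d\n", $_, seek_sketch( $_, 100 ) for ( 42, 1, 500, -3 );
```

Note that seeking to record 1 stores C<-1>, so the next C<fetch> restarts at
the configured C<offset> rather than at a literal position 1.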

=cut

sub seek {
	my $self = shift;
	my $pos = shift || return;

	my $log = $self->_get_logger();

	if ($pos < 1) {
		$log->warn("seek before first record");
		$pos = 1;
	} elsif ($pos > $self->{max_pos}) {
		$log->warn("seek beyond last record");
		$pos = $self->{max_pos};
	}

	return $self->{pos} = (($pos - 1) || -1);
}

=head2 stats

Dump statistics about field and subfield usage.

  print $input->stats;
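
To make the output format concrete, here is a self-contained sketch that
gathers the same kind of counters from two made-up records and formats them
the way C<stats> does. The sample records and the C<%s> hash are assumptions
for illustration only; real counters are collected while the database is
loaded with statistics enabled.

```perl
use strict;
use warnings;

# Hypothetical sample records in the parsed shape the counters expect:
# field => [ rows ], where a row is a hash of subfields or a plain scalar.
my @records = (
    {
        200 => [ { a => 'Title', b => 'Subtitle' } ],
        700 => [ { a => 'Author 1' }, { a => 'Author 2' } ],
        900 => [ 'plain value' ],
    },
    {
        200 => [ { a => 'Other title' } ],
        900 => [ 'x', 'y' ],
    },
);

# Collect counters in the same way as the loading loop.
my %s;
foreach my $rec (@records) {
    foreach my $fld ( keys %$rec ) {
        $s{fld}{$fld}++;    # number of records containing the field
        foreach my $row ( @{ $rec->{$fld} } ) {
            if ( ref($row) eq 'HASH' ) {
                foreach my $sf ( keys %$row ) {
                    next if $sf eq 'subfields';
                    $s{sf}{$fld}{$sf}{count}++;
                    $s{sf}{$fld}{$sf}{repeatable}++
                        if ref( $row->{$sf} ) eq 'ARRAY';
                }
            } else {
                $s{repeatable}{$fld}++;    # occurrences of a plain field
            }
        }
    }
}

# Format one line per field.
my $out = join( "\n", map {
    my $f = $_;
    my $o = sprintf( "%4s %d ~", $f, $s{fld}{$f} );
    foreach my $sf ( sort keys %{ $s{sf}{$f} || {} } ) {
        $o .= sprintf( " %s:%d%s", $sf, $s{sf}{$f}{$sf}{count},
            $s{sf}{$f}{$sf}{repeatable} ? '*' : '' );
    }
    my $v_r = $s{repeatable}{$f};
    $o .= " ($v_r)" if ( $v_r && $v_r != $s{fld}{$f} );
    $o;
} sort keys %{ $s{fld} } );

print "$out\n";
```

Each line reads: field, number of records containing it, then
C<subfield:count> pairs (followed by C<*> when a subfield value is itself a
list) and, in parentheses, the number of plain occurrences when that differs
from the record count.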

=cut

sub stats {
	my $self = shift;

	my $log = $self->_get_logger();

	my $s = $self->{_stats};
	if (! $s) {
		$log->warn("called stats, but no statistics were collected");
		return;
	}

	my $max_fld = 0;

	my $out = join("\n",
		map {
			my $f = $_ || die "no field";
			my $v = $s->{fld}->{$f} || die "no s->{fld}->{$f}";
			$max_fld = $v if ($v > $max_fld);

			my $o = sprintf("%4s %d ~", $f, $v);

			if (defined($s->{sf}->{$f})) {
				# append "subfield:count", marking repeatable subfields with *
				foreach my $sf (sort keys %{ $s->{sf}->{$f} }) {
					$o .= sprintf(" %s:%d%s", $sf,
						$s->{sf}->{$f}->{$sf}->{count},
						$s->{sf}->{$f}->{$sf}->{repeatable} ? '*' : '',
					);
				}
			}

			if (my $v_r = $s->{repeatable}->{$f}) {
				$o .= " ($v_r)" if ($v_r != $v);
			}

			$o;
		} sort { $a cmp $b } keys %{ $s->{fld} }
	);

	$log->debug( sub { Dumper($s) } );

	return $out;
}

=head2 modify_record_regexps

Generate a hash of regexps to be applied using L<filter>.

  my $regexpes = $input->modify_record_regexps(
	900 => { '^a' => { ' : ' => '^b' } },
	901 => { '*' => { '^b' => ' ; ' } },
  );
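
As a rough illustration of what the generated strings look like and how they
act on a field, the stand-alone sketch below rebuilds them for the two rules
above and applies them with a string C<eval>, as the input filter does.
C<build_regexps>, C<apply_regexps> and the sample field values are
hypothetical and exist only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring how the substitution strings are assembled.
sub build_regexps {
    my %modify = @_;
    my %out;
    foreach my $f ( sort keys %modify ) {
        foreach my $sf ( sort keys %{ $modify{$f} } ) {
            foreach my $from ( sort keys %{ $modify{$f}{$sf} } ) {
                my $to = $modify{$f}{$sf}{$from};
                my $regex;
                if ( $sf =~ /^\^/ ) {
                    # key is a subfield marker: only touch text that follows it
                    $regex = 's/\Q' . $sf . '\E([^\^]+)\Q' . $from
                           . '\E([^\^]+)/' . $sf . '$1' . $to . '$2/g';
                } else {
                    # any other key: replace everywhere in the field
                    $regex = 's/\Q' . $from . '\E/' . $to . '/g';
                }
                push @{ $out{$f} }, $regex;
            }
        }
    }
    return \%out;
}

# Hypothetical helper applying the strings to one field value via string eval.
sub apply_regexps {
    my ( $l, @regexps ) = @_;
    foreach my $r (@regexps) {
        eval '$l =~ ' . $r;
        warn "error applying regex: $r" if $@;
    }
    return $l;
}

my $regexps = build_regexps(
    900 => { '^a' => { ' : ' => '^b' } },
    901 => { '*'  => { '^b'  => ' ; ' } },
);

my $f900 = apply_regexps( '^aTitle : subtitle^cSomeone', @{ $regexps->{900} } );
my $f901 = apply_regexps( '^aOne^bTwo^bThree',           @{ $regexps->{901} } );
print "$_\n" for ( @{ $regexps->{900} }, @{ $regexps->{901} }, $f900, $f901 );
```

In this sketch the rule for field 900 turns the C<' : '> separator inside
subfield C<^a> into a new C<^b> subfield, while the rule for field 901 turns
every C<^b> marker into C<' ; '>.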

=cut

sub modify_record_regexps {
	my $self = shift;
	my $modify_record = {@_};

	my $regexpes;

	foreach my $f (keys %$modify_record) {
		warn "--- f: $f\n";
		foreach my $sf (keys %{ $modify_record->{$f} }) {
			warn "---- sf: $sf\n";
			foreach my $from (keys %{ $modify_record->{$f}->{$sf} }) {
				my $to = $modify_record->{$f}->{$sf}->{$from};
				#die "no field?" unless defined($to);
				warn "----- transform: |$from| -> |$to|\n";

				if ($sf =~ /^\^/) {
					my $regex =
						's/\Q'. $sf .'\E([^\^]+)\Q'. $from .'\E([^\^]+)/'. $sf .'$1'. $to .'$2/g';
					push @{ $regexpes->{$f} }, $regex;
					warn ">>>>> $regex [sf]\n";
				} else {
					my $regex =
						's/\Q'. $from .'\E/'. $to .'/g';
					push @{ $regexpes->{$f} }, $regex;
					warn ">>>>> $regex [global]\n";
				}

			}
		}
	}

	return $regexpes;
}

=head1 MEMORY USAGE

The C<low_mem> option is a double-edged sword. If enabled, WebPAC
will run on memory-constrained machines (which don't have enough
physical RAM to hold the in-memory structure for the whole source database).

If your machine has 512 MB or more of RAM and the database is around 10000 records,
memory shouldn't be an issue. If you don't have enough physical RAM, you
might consider using virtual memory (if your operating system handles it
well, as FreeBSD and Linux do) instead of dropping to L<DBM::Deep> to handle
the parsed structure of the ISIS database (this is what the C<low_mem> option does).

Hitting swap at the end of reading the source database is probably o.k. However,
hitting swap before 90% will dramatically decrease performance and you will
be better off with C<low_mem> and using the rest of the available memory for
the operating system disk cache (Linux is particularly good at this).
However, every access to a database record will then require disk access, so
the generation phase will be 10-100 times slower.

Parsed structures are essential - you just have the option to trade RAM
(which is fast) for disk space (which is slow). Be sure to have plenty of
disk space if you are using C<low_mem> and thus L<DBM::Deep>.

However, when WebPAC is running on desktop machines (or laptops :-), it's
highly undesirable for the system to start swapping. Using the C<low_mem> option can
reduce WebPAC memory usage to around 64 MB for the same database with lookup
fields and sorted indexes which stay in RAM. Performance will suffer, but
memory usage will really be minimal. It might also be more comfortable to
run WebPAC reniced on those machines.


=head1 AUTHOR

Dobrica Pavlinusic, C<< <dpavlin@rot13.org> >>

=head1 COPYRIGHT & LICENSE

Copyright 2005-2006 Dobrica Pavlinusic, All Rights Reserved.

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=cut

1; # End of WebPAC::Input