/[webpac2]/trunk/lib/WebPAC/Input.pm
This is repository of my old source code which isn't updated any more. Go to git.rot13.org for current projects!
ViewVC logotype

Annotation of /trunk/lib/WebPAC/Input.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 760 - (hide annotations)
Wed Oct 25 15:56:44 2006 UTC (17 years, 6 months ago) by dpavlin
File size: 17026 byte(s)
Turn on option low_mem (which need rewrite to use db/row) if there
are more than 10000 rows (hardcoded, but should go away).

This prevents webpac from running out of memory with databases
of about 300000 records on 4Gb of (virtual) memory.

1 dpavlin 1 package WebPAC::Input;
2    
3     use warnings;
4     use strict;
5    
6 dpavlin 507 use blib;
7    
8 dpavlin 487 use WebPAC::Common;
9 dpavlin 285 use base qw/WebPAC::Common/;
10 dpavlin 308 use Data::Dumper;
11 dpavlin 624 use Encode qw/from_to/;
12 dpavlin 285
13 dpavlin 1 =head1 NAME
14    
15 dpavlin 286 WebPAC::Input - read different file formats into WebPAC
16 dpavlin 1
17     =head1 VERSION
18    
19 dpavlin 707 Version 0.13
20 dpavlin 1
21     =cut
22    
23 dpavlin 707 our $VERSION = '0.13';
24 dpavlin 1
25     =head1 SYNOPSIS
26    
27 dpavlin 286 This module implements input as database which have fixed and known
28     I<size> while indexing and single unique numeric identifier for database
29     position ranging from 1 to I<size>.
30 dpavlin 1
31 dpavlin 286 Simply, something that is indexed by unmber from 1 .. I<size>.
32    
33     Examples of such databases are CDS/ISIS files, MARC files, lines in
34     text file, and so on.
35    
36     Specific file formats are implemented using low-level interface modules,
37     located in C<WebPAC::Input::*> namespace which export C<open_db>,
38     C<fetch_rec> and optional C<init> functions.
39    
40 dpavlin 1 Perhaps a little code snippet.
41    
42 dpavlin 597 use WebPAC::Input;
43 dpavlin 1
44 dpavlin 597 my $db = WebPAC::Input->new(
45     module => 'WebPAC::Input::ISIS',
46 dpavlin 286 low_mem => 1,
47 dpavlin 597 );
48 dpavlin 1
49 dpavlin 597 $db->open( path => '/path/to/database' );
50 dpavlin 416 print "database size: ",$db->size,"\n";
51     while (my $rec = $db->fetch) {
52     # do something with $rec
53     }
54 dpavlin 1
55 dpavlin 286
56    
57 dpavlin 1 =head1 FUNCTIONS
58    
59 dpavlin 3 =head2 new
60 dpavlin 1
61 dpavlin 3 Create new input database object.
62    
63 dpavlin 9 my $db = new WebPAC::Input(
64 dpavlin 286 module => 'WebPAC::Input::MARC',
65 dpavlin 585 encoding => 'ISO-8859-2',
66 dpavlin 10 low_mem => 1,
67 dpavlin 416 recode => 'char pairs',
68 dpavlin 483 no_progress_bar => 1,
69 dpavlin 9 );
70 dpavlin 3
71 dpavlin 597 C<module> is low-level file format module. See L<WebPAC::Input::ISIS> and
72 dpavlin 286 L<WebPAC::Input::MARC>.
73    
74 dpavlin 585 Optional parametar C<encoding> specify application code page (which will be
75 dpavlin 9 used internally). This should probably be your terminal encoding, and by
76     default, it C<ISO-8859-2>.
77    
78 dpavlin 10 Default is not to use C<low_mem> options (see L<MEMORY USAGE> below).
79    
80 dpavlin 483 C<recode> is optional string constisting of character or words pairs that
81     should be replaced in input stream.
82    
83     C<no_progress_bar> disables progress bar output on C<STDOUT>
84    
85 dpavlin 285 This function will also call low-level C<init> if it exists with same
86     parametars.
87    
88 dpavlin 1 =cut
89    
90 dpavlin 3 sub new {
91 dpavlin 285 my $class = shift;
92     my $self = {@_};
93 dpavlin 3 bless($self, $class);
94    
95 dpavlin 285 my $log = $self->_get_logger;
96    
97 dpavlin 585 $log->logconfess("code_page argument is not suppored any more. change it to encoding") if ($self->{lookup});
98     $log->logconfess("lookup argument is not suppored any more. rewrite call to lookup_ref") if ($self->{lookup});
99    
100 dpavlin 286 $log->logconfess("specify low-level file format module") unless ($self->{module});
101 dpavlin 726 my $module_path = $self->{module};
102     $module_path =~ s#::#/#g;
103     $module_path .= '.pm';
104     $log->debug("require low-level module $self->{module} from $module_path");
105 dpavlin 289
106 dpavlin 726 require $module_path;
107 dpavlin 286
108 dpavlin 285 # check if required subclasses are implemented
109 dpavlin 652 foreach my $subclass (qw/open_db fetch_rec init dump_rec/) {
110 dpavlin 726 # FIXME
111 dpavlin 285 }
112    
113 dpavlin 585 $self->{'encoding'} ||= 'ISO-8859-2';
114 dpavlin 9
115 dpavlin 3 $self ? return $self : return undef;
116 dpavlin 1 }
117    
118 dpavlin 285 =head2 open
119    
120     This function will read whole database in memory and produce lookups.
121    
122 dpavlin 286 $input->open(
123 dpavlin 285 path => '/path/to/database/file',
124 dpavlin 624 code_page => 'cp852',
125 dpavlin 286 limit => 500,
126     offset => 6000,
127 dpavlin 506 stats => 1,
128 dpavlin 707 lookup_coderef => sub {
129     my $rec = shift;
130     # store lookups
131 dpavlin 585 },
132 dpavlin 597 modify_records => {
133     900 => { '^a' => { ' : ' => '^b' } },
134     901 => { '*' => { '^b' => ' ; ' } },
135     },
136 dpavlin 636 modify_file => 'conf/modify/mapping.map',
137 dpavlin 285 );
138    
139 dpavlin 624 By default, C<code_page> is assumed to be C<cp852>.
140 dpavlin 285
141 dpavlin 286 C<offset> is optional parametar to position at some offset before reading from database.
142 dpavlin 285
143 dpavlin 286 C<limit> is optional parametar to read just C<limit> records from database
144 dpavlin 285
145 dpavlin 506 C<stats> create optional report about usage of fields and subfields
146    
147 dpavlin 707 C<lookup_coderef> is closure to called to save data into lookups
148 dpavlin 585
149 dpavlin 597 C<modify_records> specify mapping from subfields to delimiters or from
150     delimiters to subfields, as well as oprations on fields (if subfield is
151     defined as C<*>.
152    
153 dpavlin 636 C<modify_file> is alternative for C<modify_records> above which preserves order and offers
154     (hopefully) simplier sintax than YAML or perl (see L</modify_file_regex>). This option
155     overrides C<modify_records> if both exists for same input.
156    
157 dpavlin 286 Returns size of database, regardless of C<offset> and C<limit>
158     parametars, see also C<size>.
159 dpavlin 285
160     =cut
161    
162     sub open {
163     my $self = shift;
164     my $arg = {@_};
165    
166     my $log = $self->_get_logger();
167    
168 dpavlin 585 $log->logconfess("lookup argument is not suppored any more. rewrite call to lookup_coderef") if ($arg->{lookup});
169     $log->logconfess("lookup_coderef must be CODE, not ",ref($arg->{lookup_coderef}))
170     if ($arg->{lookup_coderef} && ref($arg->{lookup_coderef}) ne 'CODE');
171    
172 dpavlin 707 $log->debug( $arg->{lookup_coderef} ? '' : 'not ', "using lookup_coderef");
173    
174 dpavlin 285 $log->logcroak("need path") if (! $arg->{'path'});
175 dpavlin 624 my $code_page = $arg->{'code_page'} || 'cp852';
176 dpavlin 285
177     # store data in object
178 dpavlin 292 $self->{'input_code_page'} = $code_page;
179 dpavlin 286 foreach my $v (qw/path offset limit/) {
180 dpavlin 285 $self->{$v} = $arg->{$v} if ($arg->{$v});
181     }
182    
183 dpavlin 416 my $filter_ref;
184 dpavlin 597 my $recode_regex;
185     my $recode_map;
186 dpavlin 416
187     if ($self->{recode}) {
188     my @r = split(/\s/, $self->{recode});
189     if ($#r % 2 != 1) {
190     $log->logwarn("recode needs even number of elements (some number of valid pairs)");
191     } else {
192     while (@r) {
193     my $from = shift @r;
194     my $to = shift @r;
195 dpavlin 597 $recode_map->{$from} = $to;
196 dpavlin 416 }
197    
198 dpavlin 597 $recode_regex = join '|' => keys %{ $recode_map };
199 dpavlin 416
200 dpavlin 597 $log->debug("using recode regex: $recode_regex");
201 dpavlin 416 }
202    
203     }
204    
205 dpavlin 636 my $rec_regex;
206     if (my $p = $arg->{modify_file}) {
207     $log->debug("using modify_file $p");
208     $rec_regex = $self->modify_file_regexps( $p );
209     } elsif (my $h = $arg->{modify_records}) {
210     $log->debug("using modify_records ", Dumper( $h ));
211     $rec_regex = $self->modify_record_regexps(%{ $h });
212     }
213     $log->debug("rec_regex: ", Dumper($rec_regex)) if ($rec_regex);
214 dpavlin 597
215 dpavlin 726 my $class = $self->{module} || $log->logconfess("can't get low-level module name!");
216    
217     my $ll_db = $class->new(
218 dpavlin 285 path => $arg->{path},
219 dpavlin 624 # filter => sub {
220     # my ($l,$f_nr) = @_;
221     # return unless defined($l);
222     # from_to($l, $code_page, $self->{'encoding'});
223     # $l =~ s/($recode_regex)/$recode_map->{$1}/g if ($recode_regex && $recode_map);
224     # return $l;
225     # },
226 dpavlin 523 %{ $arg },
227 dpavlin 285 );
228    
229 dpavlin 726 unless (defined($ll_db)) {
230 dpavlin 285 $log->logwarn("can't open database $arg->{path}, skipping...");
231     return;
232     }
233    
234 dpavlin 726 my $size = $ll_db->size;
235    
236 dpavlin 285 unless ($size) {
237     $log->logwarn("no records in database $arg->{path}, skipping...");
238     return;
239     }
240    
241 dpavlin 339 my $from_rec = 1;
242     my $to_rec = $size;
243 dpavlin 285
244 dpavlin 286 if (my $s = $self->{offset}) {
245 dpavlin 513 $log->debug("skipping to MFN $s");
246 dpavlin 339 $from_rec = $s;
247 dpavlin 285 } else {
248 dpavlin 339 $self->{offset} = $from_rec;
249 dpavlin 285 }
250    
251 dpavlin 286 if ($self->{limit}) {
252 dpavlin 301 $log->debug("limiting to ",$self->{limit}," records");
253 dpavlin 339 $to_rec = $from_rec + $self->{limit} - 1;
254     $to_rec = $size if ($to_rec > $size);
255 dpavlin 285 }
256    
257     # store size for later
258 dpavlin 339 $self->{size} = ($to_rec - $from_rec) ? ($to_rec - $from_rec + 1) : 0;
259 dpavlin 285
260 dpavlin 585 $log->info("processing $self->{size}/$size records [$from_rec-$to_rec] convert $code_page -> $self->{encoding}", $self->{stats} ? ' [stats]' : '');
261 dpavlin 285
262 dpavlin 760 # turn on low_mem for databases with more than 100000 records!
263     if (! $self->{low_mem} && $size > 100000) {
264     $log->warn("Using on-disk storage instead of memory for input data. This will affect performance.");
265     $self->{low_mem}++;
266     }
267    
268     # running with low_mem flag? well, use DBM::Deep then.
269     if ($self->{'low_mem'}) {
270     $log->info("running with low_mem which impacts performance (<32 Mb memory usage)");
271    
272     my $db_file = "data.db";
273    
274     if (-e $db_file) {
275     unlink $db_file or $log->logdie("can't remove '$db_file' from last run");
276     $log->debug("removed '$db_file' from last run");
277     }
278    
279     require DBM::Deep;
280    
281     my $db = new DBM::Deep $db_file;
282    
283     $log->logdie("DBM::Deep error: $!") unless ($db);
284    
285     if ($db->error()) {
286     $log->logdie("can't open '$db_file' under low_mem: ",$db->error());
287     } else {
288     $log->debug("using file '$db_file' for DBM::Deep");
289     }
290    
291     $self->{'db'} = $db;
292     }
293    
294 dpavlin 285 # read database
295 dpavlin 339 for (my $pos = $from_rec; $pos <= $to_rec; $pos++) {
296 dpavlin 285
297 dpavlin 286 $log->debug("position: $pos\n");
298 dpavlin 285
299 dpavlin 726 my $rec = $ll_db->fetch_rec($pos, sub {
300 dpavlin 619 my ($l,$f_nr) = @_;
301 dpavlin 624 # return unless defined($l);
302     # return $l unless ($rec_regex && $f_nr);
303 dpavlin 285
304 dpavlin 625 $log->debug("-=> $f_nr ## $l");
305    
306 dpavlin 624 # codepage conversion and recode_regex
307 dpavlin 626 from_to($l, $code_page, $self->{'encoding'});
308 dpavlin 624 $l =~ s/($recode_regex)/$recode_map->{$1}/g if ($recode_regex && $recode_map);
309    
310 dpavlin 619 # apply regexps
311     if ($rec_regex && defined($rec_regex->{$f_nr})) {
312     $log->logconfess("regexps->{$f_nr} must be ARRAY") if (ref($rec_regex->{$f_nr}) ne 'ARRAY');
313     my $c = 0;
314     foreach my $r (@{ $rec_regex->{$f_nr} }) {
315 dpavlin 626 my $old_l = $l;
316 dpavlin 619 eval '$l =~ ' . $r;
317 dpavlin 626 if ($old_l ne $l) {
318     $log->debug("REGEX on $f_nr eval \$l =~ $r\n## old l: [$old_l]\n## new l: [$l]");
319     }
320 dpavlin 619 $log->error("error applying regex: $r") if ($@);
321     }
322     }
323    
324 dpavlin 625 $log->debug("<=- $f_nr ## $l");
325 dpavlin 619 return $l;
326     });
327    
328 dpavlin 308 $log->debug(sub { Dumper($rec) });
329    
330 dpavlin 285 if (! $rec) {
331 dpavlin 286 $log->warn("record $pos empty? skipping...");
332 dpavlin 285 next;
333     }
334    
335     # store
336 dpavlin 286 if ($self->{low_mem}) {
337     $self->{db}->put($pos, $rec);
338 dpavlin 285 } else {
339 dpavlin 286 $self->{data}->{$pos} = $rec;
340 dpavlin 285 }
341    
342     # create lookup
343 dpavlin 585 $arg->{'lookup_coderef'}->( $rec ) if ($rec && $arg->{'lookup_coderef'});
344 dpavlin 285
345 dpavlin 506 # update counters for statistics
346     if ($self->{stats}) {
347 dpavlin 593
348 dpavlin 619 # fetch clean record with regexpes applied for statistics
349 dpavlin 726 my $rec = $ll_db->fetch_rec($pos);
350 dpavlin 619
351 dpavlin 593 foreach my $fld (keys %{ $rec }) {
352 dpavlin 506 $self->{_stats}->{fld}->{ $fld }++;
353 dpavlin 593
354     $log->logdie("invalid record fild $fld, not ARRAY")
355     unless (ref($rec->{ $fld }) eq 'ARRAY');
356    
357     foreach my $row (@{ $rec->{$fld} }) {
358    
359     if (ref($row) eq 'HASH') {
360    
361     foreach my $sf (keys %{ $row }) {
362 dpavlin 619 next if ($sf eq 'subfields');
363 dpavlin 593 $self->{_stats}->{sf}->{ $fld }->{ $sf }->{count}++;
364     $self->{_stats}->{sf}->{ $fld }->{ $sf }->{repeatable}++
365     if (ref($row->{$sf}) eq 'ARRAY');
366 dpavlin 506 }
367 dpavlin 593
368     } else {
369     $self->{_stats}->{repeatable}->{ $fld }++;
370     }
371 dpavlin 506 }
372 dpavlin 593 }
373 dpavlin 506 }
374    
375 dpavlin 483 $self->progress_bar($pos,$to_rec) unless ($self->{no_progress_bar});
376 dpavlin 285
377     }
378    
379 dpavlin 286 $self->{pos} = -1;
380     $self->{last_pcnt} = 0;
381 dpavlin 285
382     # store max mfn and return it.
383 dpavlin 339 $self->{max_pos} = $to_rec;
384     $log->debug("max_pos: $to_rec");
385 dpavlin 285
386 dpavlin 757 # save for dump
387     $self->{ll_db} = $ll_db;
388    
389 dpavlin 285 return $size;
390     }
391    
392     =head2 fetch
393    
394     Fetch next record from database. It will also displays progress bar.
395    
396     my $rec = $isis->fetch;
397    
398     Record from this function should probably go to C<data_structure> for
399     normalisation.
400    
401     =cut
402    
403     sub fetch {
404     my $self = shift;
405    
406     my $log = $self->_get_logger();
407    
408 dpavlin 286 $log->logconfess("it seems that you didn't load database!") unless ($self->{pos});
409 dpavlin 285
410 dpavlin 286 if ($self->{pos} == -1) {
411     $self->{pos} = $self->{offset};
412 dpavlin 285 } else {
413 dpavlin 286 $self->{pos}++;
414 dpavlin 285 }
415    
416 dpavlin 286 my $mfn = $self->{pos};
417 dpavlin 285
418 dpavlin 286 if ($mfn > $self->{max_pos}) {
419     $self->{pos} = $self->{max_pos};
420 dpavlin 285 $log->debug("at EOF");
421     return;
422     }
423    
424 dpavlin 483 $self->progress_bar($mfn,$self->{max_pos}) unless ($self->{no_progress_bar});
425 dpavlin 285
426     my $rec;
427    
428 dpavlin 286 if ($self->{low_mem}) {
429     $rec = $self->{db}->get($mfn);
430 dpavlin 285 } else {
431 dpavlin 286 $rec = $self->{data}->{$mfn};
432 dpavlin 285 }
433    
434     $rec ||= 0E0;
435     }
436    
437     =head2 pos
438    
439     Returns current record number (MFN).
440    
441     print $isis->pos;
442    
443     First record in database has position 1.
444    
445     =cut
446    
447     sub pos {
448     my $self = shift;
449 dpavlin 286 return $self->{pos};
450 dpavlin 285 }
451    
452    
453     =head2 size
454    
455     Returns number of records in database
456    
457     print $isis->size;
458    
459     Result from this function can be used to loop through all records
460    
461     foreach my $mfn ( 1 ... $isis->size ) { ... }
462    
463 dpavlin 286 because it takes into account C<offset> and C<limit>.
464 dpavlin 285
465     =cut
466    
467     sub size {
468     my $self = shift;
469 dpavlin 286 return $self->{size};
470 dpavlin 285 }
471    
472     =head2 seek
473    
474     Seek to specified MFN in file.
475    
476     $isis->seek(42);
477    
478     First record in database has position 1.
479    
480     =cut
481    
482     sub seek {
483     my $self = shift;
484     my $pos = shift || return;
485    
486     my $log = $self->_get_logger();
487    
488     if ($pos < 1) {
489     $log->warn("seek before first record");
490     $pos = 1;
491 dpavlin 286 } elsif ($pos > $self->{max_pos}) {
492 dpavlin 285 $log->warn("seek beyond last record");
493 dpavlin 286 $pos = $self->{max_pos};
494 dpavlin 285 }
495    
496 dpavlin 286 return $self->{pos} = (($pos - 1) || -1);
497 dpavlin 285 }
498    
499 dpavlin 506 =head2 stats
500 dpavlin 285
501 dpavlin 506 Dump statistics about field and subfield usage
502    
503 dpavlin 507 print $input->stats;
504 dpavlin 506
505     =cut
506    
507     sub stats {
508     my $self = shift;
509 dpavlin 507
510     my $log = $self->_get_logger();
511    
512     my $s = $self->{_stats};
513     if (! $s) {
514     $log->warn("called stats, but there is no statistics collected");
515     return;
516     }
517    
518     my $max_fld = 0;
519    
520     my $out = join("\n",
521     map {
522     my $f = $_ || die "no field";
523     my $v = $s->{fld}->{$f} || die "no s->{fld}->{$f}";
524     $max_fld = $v if ($v > $max_fld);
525    
526 dpavlin 519 my $o = sprintf("%4s %d ~", $f, $v);
527 dpavlin 507
528     if (defined($s->{sf}->{$f})) {
529     map {
530 dpavlin 593 $o .= sprintf(" %s:%d%s", $_,
531     $s->{sf}->{$f}->{$_}->{count},
532     $s->{sf}->{$f}->{$_}->{repeatable} ? '*' : '',
533     );
534 dpavlin 507 } sort keys %{ $s->{sf}->{$f} };
535     }
536    
537     if (my $v_r = $s->{repeatable}->{$f}) {
538     $o .= " ($v_r)" if ($v_r != $v);
539     }
540    
541     $o;
542 dpavlin 519 } sort { $a cmp $b } keys %{ $s->{fld} }
543 dpavlin 507 );
544    
545     $log->debug( sub { Dumper($s) } );
546    
547     return $out;
548 dpavlin 506 }
549    
550 dpavlin 652 =head2 dump
551    
552     Display humanly readable dump of record
553    
554     =cut
555    
556     sub dump {
557     my $self = shift;
558    
559 dpavlin 757 return $self->{ll_db}->dump_rec( $self->{pos} );
560 dpavlin 652
561     }
562    
563 dpavlin 598 =head2 modify_record_regexps
564 dpavlin 597
565 dpavlin 636 Generate hash with regexpes to be applied using l<filter>.
566 dpavlin 597
567     my $regexpes = $input->modify_record_regexps(
568     900 => { '^a' => { ' : ' => '^b' } },
569     901 => { '*' => { '^b' => ' ; ' } },
570     );
571    
572     =cut
573    
574 dpavlin 636 sub _get_regex {
575     my ($sf,$from,$to) = @_;
576     if ($sf =~ /^\^/) {
577     return
578     's/\Q'. $sf .'\E([^\^]*?)\Q'. $from .'\E([^\^]*?)/'. $sf .'$1'. $to .'$2/';
579     } else {
580     return
581     's/\Q'. $from .'\E/'. $to .'/g';
582     }
583     }
584    
585 dpavlin 597 sub modify_record_regexps {
586     my $self = shift;
587     my $modify_record = {@_};
588    
589     my $regexpes;
590    
591 dpavlin 634 my $log = $self->_get_logger();
592    
593 dpavlin 597 foreach my $f (keys %$modify_record) {
594 dpavlin 634 $log->debug("field: $f");
595    
596 dpavlin 597 foreach my $sf (keys %{ $modify_record->{$f} }) {
597 dpavlin 634 $log->debug("subfield: $sf");
598    
599 dpavlin 597 foreach my $from (keys %{ $modify_record->{$f}->{$sf} }) {
600     my $to = $modify_record->{$f}->{$sf}->{$from};
601     #die "no field?" unless defined($to);
602 dpavlin 634 $log->debug("transform: |$from| -> |$to|");
603 dpavlin 597
604 dpavlin 636 my $regex = _get_regex($sf,$from,$to);
605     push @{ $regexpes->{$f} }, $regex;
606     $log->debug("regex: $regex");
607 dpavlin 597 }
608     }
609     }
610    
611     return $regexpes;
612     }
613    
614 dpavlin 636 =head2 modify_file_regexps
615    
616     Generate hash with regexpes to be applied using l<filter> from
617     pseudo hash/yaml format for regex mappings.
618    
619     It should be obvious:
620    
621     200
622     '^a'
623     ' : ' => '^e'
624     ' = ' => '^d'
625    
626     In field I<200> find C<'^a'> and then C<' : '>, and replace it with C<'^e'>.
627     In field I<200> find C<'^a'> and then C<' = '>, and replace it with C<'^d'>.
628    
629     my $regexpes = $input->modify_file_regexps( 'conf/modify/common.pl' );
630    
631     On undef path it will just return.
632    
633     =cut
634    
635     sub modify_file_regexps {
636     my $self = shift;
637    
638     my $modify_path = shift || return;
639    
640     my $log = $self->_get_logger();
641    
642     my $regexpes;
643    
644 dpavlin 697 CORE::open(my $fh, $modify_path) || $log->logdie("can't open modify file $modify_path: $!");
645 dpavlin 636
646     my ($f,$sf);
647    
648     while(<$fh>) {
649     chomp;
650     next if (/^#/ || /^\s*$/);
651    
652     if (/^\s*(\d+)\s*$/) {
653     $f = $1;
654     $log->debug("field: $f");
655     next;
656     } elsif (/^\s*'([^']*)'\s*$/) {
657     $sf = $1;
658     $log->die("can't define subfiled before field in: $_") unless ($f);
659     $log->debug("subfield: $sf");
660     } elsif (/^\s*'([^']*)'\s*=>\s*'([^']*)'\s*$/) {
661     my ($from,$to) = ($1, $2);
662    
663     $log->debug("transform: |$from| -> |$to|");
664    
665     my $regex = _get_regex($sf,$from,$to);
666     push @{ $regexpes->{$f} }, $regex;
667     $log->debug("regex: $regex");
668     }
669     }
670    
671     return $regexpes;
672     }
673    
674 dpavlin 3 =head1 MEMORY USAGE
675 dpavlin 1
676 dpavlin 3 C<low_mem> options is double-edged sword. If enabled, WebPAC
677     will run on memory constraint machines (which doesn't have enough
678     physical RAM to create memory structure for whole source database).
679 dpavlin 1
680 dpavlin 3 If your machine has 512Mb or more of RAM and database is around 10000 records,
681     memory shouldn't be an issue. If you don't have enough physical RAM, you
682     might consider using virtual memory (if your operating system is handling it
683     well, like on FreeBSD or Linux) instead of dropping to L<DBM::Deep> to handle
684     parsed structure of ISIS database (this is what C<low_mem> option does).
685 dpavlin 1
686 dpavlin 3 Hitting swap at end of reading source database is probably o.k. However,
687     hitting swap before 90% will dramatically decrease performance and you will
688     be better off with C<low_mem> and using rest of availble memory for
689     operating system disk cache (Linux is particuallary good about this).
690     However, every access to database record will require disk access, so
691     generation phase will be slower 10-100 times.
692    
693     Parsed structures are essential - you just have option to trade RAM memory
694     (which is fast) for disk space (which is slow). Be sure to have planty of
695     disk space if you are using C<low_mem> and thus L<DBM::Deep>.
696    
697     However, when WebPAC is running on desktop machines (or laptops :-), it's
698     highly undesireable for system to start swapping. Using C<low_mem> option can
699     reduce WecPAC memory usage to around 64Mb for same database with lookup
700     fields and sorted indexes which stay in RAM. Performance will suffer, but
701     memory usage will really be minimal. It might be also more confortable to
702     run WebPAC reniced on those machines.
703    
704    
705     =head1 AUTHOR
706    
707     Dobrica Pavlinusic, C<< <dpavlin@rot13.org> >>
708    
709 dpavlin 1 =head1 COPYRIGHT & LICENSE
710    
711 dpavlin 599 Copyright 2005-2006 Dobrica Pavlinusic, All Rights Reserved.
712 dpavlin 1
713     This program is free software; you can redistribute it and/or modify it
714     under the same terms as Perl itself.
715    
716     =cut
717    
718     1; # End of WebPAC::Input

  ViewVC Help
Powered by ViewVC 1.1.26