
=head1 WebPAC - Search engine or data-warehouse manual

It's quite hard to explain concisely what WebPAC is. It's a mix between a
search engine and a data warehousing application. Let's see that in detail...

WebPAC was originally written to search CDS/ISIS records using C<swish-e>.
Since then, however, it has adopted various other input formats and added
support for alphabetical lists (earlier described as indexes).

As this concept evolved, we settled on the following workflow for your
data:

  step

   source data      CDS/ISIS, MARC, Excel, robots, ...
      |
  0   | apply lookup rules (optional)
  1   | apply input normalisation rules (xml or yaml)
      V
   intermediate     this data is re-formatted source data converted
     data           to chunks based on tag names from config/input/
      |
  2   | optionally apply output filter (TT2)
      V
     data           search engine, HTML, OAI, RDBMS
      |
  3   | filter using query in REST format
  4   | apply output filter (TT2)
      V
    client          Web browser (html), JSON

=head2 Source data

WebPAC supports various input formats:

=over 2

=item L<WebPAC::Input::ISIS> for CDS/ISIS data

=item L<WebPAC::Input::MARC> for MARC records

=item L<WebPAC::Input::Excel> for Microsoft Excel C<.xls> files

=item L<WebPAC::Input::DBF> for legacy DBF tables (e.g. Clipper)

=item L<WebPAC::Input::Gutenberg> for RDF catalog data from Project Gutenberg

=back

=head2 Create data lookups

Before you can begin normalisation, you might want to create lookups which store
C<< key -> value(s) >> pair(s). Lookups are especially useful if you want to,
I<well>, look up a value from some other record using some sort of identifier.

Lookups are described in more detail in L<WebPAC::Lookup>.
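
As a rough illustration of the C<< key -> value(s) >> idea, the sketch below
builds two lookups from a handful of made-up records in Python. It is not how
L<WebPAC::Lookup> is implemented: the record layout, the field names and the
C<build_lookup> helper are all assumptions made only for this example.

```python
from collections import defaultdict


def build_lookup(records, key_field, value_field):
    """Map each key to the list of values seen for it (key -> value(s))."""
    lookup = defaultdict(list)
    for record in records:
        key = record.get(key_field)
        value = record.get(value_field)
        if key is None or value is None:
            continue  # skip records that cannot take part in this lookup
        lookup[key].append(value)
    return dict(lookup)


# Hypothetical records: child records point at a parent through "parent_id".
records = [
    {"id": "001", "title": "Collected works"},
    {"id": "002", "title": "Volume 1", "parent_id": "001"},
    {"id": "003", "title": "Volume 2", "parent_id": "001"},
]

titles_by_id = build_lookup(records, "id", "title")
children_by_parent = build_lookup(records, "parent_id", "id")
```

With both lookups in place, a record can find the title of its parent, or a
parent can list its children, using nothing but an identifier.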

=head2 Normalisation to intermediate data

Intermediate data is the internal representation of data on which WebPAC
operates.

You are creating one-to-one mappings from source data records to documents
in WebPAC. You can split or merge data from input records, apply regexes,
use lookups within the same source file, and use conditions, branches and/or
simple evaluations while producing intermediate data.
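
To make the kinds of operations listed above more concrete, here is a minimal
Python sketch that splits, merges and pattern-matches fields of one source
record to produce one intermediate document. The field names (C<author>,
C<imprint>, and so on) and the C<normalise> function are hypothetical; real
rules are written in WebPAC's own configuration format, not in Python.

```python
import re


def normalise(record):
    """Turn one source record into one intermediate document."""
    doc = {}

    # Split: "Doe, Joe" becomes separate last and first name chunks.
    last, _, first = record.get("author", "").partition(",")
    if last.strip():
        doc["last_name"] = last.strip()
    if first.strip():
        doc["first_name"] = first.strip()

    # Merge: join title and subtitle into a single chunk.
    parts = [record.get("title"), record.get("subtitle")]
    doc["full_title"] = " : ".join(p for p in parts if p)

    # Regex plus condition: store a year only if a four-digit one is found.
    match = re.search(r"\b(\d{4})\b", record.get("imprint", ""))
    if match:
        doc["year"] = int(match.group(1))

    return doc


# Hypothetical source record, one-to-one with the resulting document.
source = {
    "author": "Doe, Joe",
    "title": "Cataloguing",
    "subtitle": "a short guide",
    "imprint": "Zagreb, 1994.",
}
document = normalise(source)
```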

All that is controlled with the C<config/config.yml> configuration file.
This file is in human-readable YAML format, and it describes all configuration
of WebPAC and its front-end Webpacus.

The normalisation rules themselves are kept in C<config/input/> configuration
files. You will want to create fine-grained chunks of data (like separate
first and last name), which will later be used to produce output. You can
think of the conversion process as the application of a C<config/input/>
recipe to every input record.

Each tag within the recipe creates one new record as long as there are
fields in the input format (which can be repeatable) that satisfy at least one
field within the tag.

Users of older WebPAC versions should note that this file no longer contains
formatting or output type specification, and that the granularity of each tag
has increased.

B<this document should really be updated to reflect Webpacus front-end from
this point...>

=head2 Output filter

Now that we have a normalised record, we can create some output. You can
create HTML from it, produce data files for a search engine, or insert it
into an RDBMS.

The twist is that the application of output filters can be recursive, allowing
you to query data generated in a previous step. This enables you to represent
lists or trees from source data that has structure. It also requires you to
produce structured data in step 2, which can then be filtered and queried in
steps 3 and 4 to produce the final output.

Note that you can also query intermediate data in step 4, not just data
produced in step 2.

Output filters use Template Toolkit 2, so you have the full power of a simple
procedural language (loops, conditions) and handy built-in functions to
produce output.

=head2 REST Query Format

The design decision is to use a REST query format. This has the benefit of
simplicity and the ability to create unique URLs for all content within
WebPAC. A simple query looks like this:

  http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995

This REST query can be broken down into:

=over

=item http://webpac

Hostname on which the service is running. It is needed only for browser
usage, not when doing lookups.

=item search

Name of the output filtering method. This specifies the search engine.

=item html

Specifies the template that will be used to produce output.

=item personal_name/Joe%20Doe...

URL-encoded query string. It is specific to the filtering method used.

=back
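
The breakdown above can be sketched in a few lines of Python. This is only an
illustration of how such a URL decomposes, not WebPAC's actual dispatcher; the
C<parse_rest_query> function and the keys of the returned dictionary are
assumptions made for this example.

```python
from urllib.parse import urlsplit, unquote


def parse_rest_query(url):
    """Split a REST query into filter method, template and query terms."""
    # Split on "/" first and decode afterwards, so that an encoded slash
    # inside a term is not mistaken for a path separator.
    segments = [unquote(s) for s in urlsplit(url).path.split("/") if s]
    method, template, *query = segments  # ValueError if either is missing
    return {"method": method, "template": template, "query": query}


parsed = parse_rest_query(
    "http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995"
)
```

Because only the path is inspected, the same function accepts a query without
the hostname part, such as C</search/html/year/LT%201995>.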

You can easily produce an RSS feed for the same query using the following
REST URL:

  http://webpac/search/rss/personal_name/Joe%20Doe/AND/year/LT%201995

Yes, it really is that simple. As it should be.
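
Since the template is just one path segment, switching output formats comes
down to swapping that segment. The Python sketch below builds both URLs from
the same query terms; C<build_rest_url> is a hypothetical helper written for
this example and is not part of WebPAC.

```python
from urllib.parse import quote


def build_rest_url(base, method, template, query):
    """Join filter method, template and query terms into one REST URL."""
    segments = [method, template, *query]
    # safe="" also encodes "/" so that a term can never add a path segment.
    return base.rstrip("/") + "/" + "/".join(quote(s, safe="") for s in segments)


query = ["personal_name", "Joe Doe", "AND", "year", "LT 1995"]
html_url = build_rest_url("http://webpac", "search", "html", query)
rss_url = build_rest_url("http://webpac", "search", "rss", query)
```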

=head1 Technical stuff

The following text covers more hard-core technical details about how WebPAC
is implemented and why.

=head2 Search Engine

We use the Hyper Estraier search engine through the pgestraier PostgreSQL
bindings for it.

It should be relatively easy to plug in another one if the need arises.

=head2 Data Warehouse

In a nutshell, WebPAC has evolved to support hybrid data as input. That
means it has become a kind of data-warehouse application. It doesn't directly
support roll-up and roll-down operations, but they can be emulated using
the intermediate data step or the output step.
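
As a hedged illustration of what emulating a roll-up over intermediate data
could look like, the Python sketch below counts documents per value of one
field. The documents, their field names and the C<roll_up> function are
invented for this example; WebPAC itself does not provide such a function.

```python
from collections import Counter


def roll_up(documents, field):
    """Count documents per value of one field (a simple roll-up)."""
    return Counter(doc[field] for doc in documents if field in doc)


# Hypothetical intermediate documents.
documents = [
    {"title": "A", "year": 1994, "language": "hr"},
    {"title": "B", "year": 1994, "language": "en"},
    {"title": "C", "year": 1995, "language": "hr"},
]

per_year = roll_up(documents, "year")
per_language = roll_up(documents, "language")
```

Going in the other direction means filtering the same documents on one of the
resulting values, for example keeping only those from a single year.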
