=head1 WebPAC - Search engine or data-warehouse manual

It's quite hard to explain concisely what WebPAC is. It's a mix between a
search engine and a data warehousing application. Let's look at that in detail...

WebPAC was originally written to search CDS/ISIS records using C<swish-e>.
Since then it has adopted several other input formats and added
support for alphabetical lists (earlier described as indexes).

As this concept evolved, we settled on the following work-flow
for your data:

    step

            source data     CDS/ISIS, MARC, Excel, robots, ...
              |
      0       | apply lookup rules (optional)
      1       | apply input normalisation rules (xml or yaml)
              V
            intermediate    this data is re-formatted source data converted
            data            to chunks based on tag names from config/input/
              |
      2       | optionally apply output filter (TT2)
              V
            data            search engine, HTML, OAI, RDBMS
              |
      3       | filter using query in REST format
      4       | apply output filter (TT2)
              V
            client          Web browser (html), JSON

=head2 Source data

WebPAC supports various input formats:

=over 2

=item L<WebPAC::Input::ISIS> for CDS/ISIS data

=item L<WebPAC::Input::MARC> for MARC records

=item L<WebPAC::Input::Excel> for Microsoft Excel C<.xls> files

=item L<WebPAC::Input::DBF> for legacy DBF tables (e.g. Clipper)

=item L<WebPAC::Input::Gutenberg> for RDF catalog data from Project Gutenberg

=back

=head2 Create data lookups

Before you can begin normalisation, you might want to create lookups which
store C<< key -> value(s) >> pairs. Lookups are especially useful if you want
to, I<well>, look up a value in some other record using some sort of identifier.

Lookups are described in more detail in L<WebPAC::Lookup>.

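The C<< key -> value(s) >> idea can be illustrated with a minimal sketch. It
is written in Python purely for illustration and is not the
L<WebPAC::Lookup> API; the record layout, the field names and the
C<build_lookup> helper are all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical source records; the field names are made up for illustration.
records = [
    {"id": "a1", "title": "Perl basics", "parent": None},
    {"id": "a2", "title": "Chapter 1", "parent": "a1"},
    {"id": "a3", "title": "Chapter 2", "parent": "a1"},
]


def build_lookup(records, key_field, value_field):
    """Return a mapping key -> list of values, skipping records without a key."""
    lookup = defaultdict(list)
    for rec in records:
        key = rec.get(key_field)
        if key is not None:
            lookup[key].append(rec[value_field])
    return dict(lookup)


# key -> value(s): record identifier -> title(s)
title_by_id = build_lookup(records, "id", "title")
# key -> value(s): parent identifier -> titles of the child records
children_by_parent = build_lookup(records, "parent", "title")

# While normalising record "a2", resolve the title of the record it points to.
parent_title = title_by_id[records[1]["parent"]][0]
```

Because one key may map to several values, every lookup result is a list,
even when it holds a single entry.
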
=head2 Normalisation to intermediate data

Intermediate data is the internal representation of data on which WebPAC
operates.

You create mappings, one-to-one, from source data records to documents
in WebPAC. You can split or merge data from input records, apply regexes,
use lookups within the same source file, and use conditions, branches and/or
simple evaluations while producing intermediate data.

All of this is controlled by the C<config/config.yml> configuration file.
This file is in human-readable YAML format, and it describes the whole
configuration of WebPAC and its front-end, Webpacus.

The normalisation itself is controlled by the C<config/input/> configuration
files. You will want to create fine-grained chunks of data (like separate
first and last name), which will later be used to produce output. You can
think of the conversion process as applying the C<config/input/> recipe to
every input record.

Each tag within the recipe creates a new record as long as there are
fields in the input format (which can be repeatable) that satisfy at least one
field within the tag.

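The idea of applying a recipe to every input record can be sketched as
follows. This is only an illustration in Python under stated assumptions: the
C<recipe> structure below is B<not> the real C<config/input/> syntax, and the
field numbers, tag names and the C<normalise> function are hypothetical.

```python
# Hypothetical recipe: tag name -> input field numbers that feed it.
# This is NOT the real config/input/ syntax, only an illustration.
recipe = {
    "personal_name": ["700", "701"],
    "year": ["210"],
    "subject": ["610"],
}

# Hypothetical input record: field number -> list of values (fields may repeat).
input_record = {
    "700": ["Doe, Joe"],
    "701": ["Roe, Jane", "Poe, Ann"],
    "210": ["1994"],
}


def normalise(record, recipe):
    """Apply the recipe to one input record and return tag -> list of values."""
    chunks = {}
    for tag, fields in recipe.items():
        values = [v for field in fields for v in record.get(field, [])]
        if values:  # a tag is produced only if at least one of its fields matched
            chunks[tag] = values
    return chunks


intermediate = normalise(input_record, recipe)
```

In this sketch the C<subject> tag is absent from the result because the input
record has no matching field, while the repeatable field C<701> contributes
two values to C<personal_name>.
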
Users of older versions of WebPAC should note that this file no longer
contains any formatting or specification of output type, and that the
granularity of each tag has increased.

B<This document should really be updated to reflect the Webpacus front-end from
this point...>

=head2 Output filter

Now that we have a normalised record, we can create some output. You can
create HTML from it, produce data files for a search engine, or insert it
into an RDBMS.

The twist is that the application of output filters can be recursive, allowing
you to query data generated in a previous step. This enables you to represent
lists or trees from source data that has structure. It also requires you to
produce structured data in step 2, which can then be filtered and queried in
steps 3 and 4 to produce the final output.

Note that you can also query intermediate data in step 4, not
just data produced in step 2.

Output filters use Template Toolkit 2, so you have the full power of a simple
procedural language (loops, conditions) and handy built-in functions to
produce output.

=head2 REST Query Format

The design decision is to use a REST query format. This has the benefit of
simplicity and the ability to create unique URLs for all content within
WebPAC. A simple query looks like this:

    http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995

This REST query can be broken down into:

=over

=item http://webpac

Hostname on which the service is running. Not required when doing lookups,
only for browser usage.

=item search

Name of the output filtering method. This specifies the search engine.

=item html

Specifies the template that will be used to produce output.

=item personal_name/Joe%20Doe...

URL-encoded query string. It is specific to the filtering method used.

=back

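The breakdown above can be sketched in a few lines of Python using only the
standard library. This is an illustration, not WebPAC's own parser; the
C<parse_rest_query> function and the keys of the returned dictionary are
assumptions made for this example.

```python
from urllib.parse import urlsplit, unquote


def parse_rest_query(url):
    """Split a WebPAC-style REST URL into host, filter, template and query terms."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    method, template, *query = segments
    return {
        "host": parts.netloc,
        "filter": method,
        "template": template,
        # Decode each path segment separately so that an encoded "/" inside
        # a value cannot be mistaken for a segment separator.
        "query": [unquote(s) for s in query],
    }


parsed = parse_rest_query(
    "http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995"
)
```

How the decoded terms are interpreted (for example C<AND> or C<LT 1995>) is
left to the filtering method, as described above.
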
You can easily produce an RSS feed for the same query using the following
REST URL:

    http://webpac/search/rss/personal_name/Joe%20Doe/AND/year/LT%201995

Yes, it really is that simple. As it should be.

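Going the other way, the two example URLs differ only in the template
segment, which a small Python sketch makes explicit. The C<build_rest_url>
helper is hypothetical, and encoding C</> inside a term is a choice made for
this sketch rather than documented WebPAC behaviour.

```python
from urllib.parse import quote


def build_rest_url(base, method, template, terms):
    """Join filter, template and percent-encoded query terms into one URL."""
    # safe="" also encodes "/" so a term can never add an extra path segment.
    encoded = [quote(term, safe="") for term in terms]
    return "/".join([base.rstrip("/"), method, template, *encoded])


terms = ["personal_name", "Joe Doe", "AND", "year", "LT 1995"]
html_url = build_rest_url("http://webpac", "search", "html", terms)
rss_url = build_rest_url("http://webpac", "search", "rss", terms)
```

With these inputs, C<html_url> and C<rss_url> reproduce the two URLs shown in
this section.
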
=head1 Technical stuff

The following text is more hard-core technical material about how WebPAC is
implemented and why.

=head2 Search Engine

We are using the Hyper Estraier search engine, through the pgestraier
PostgreSQL bindings for it.

It should be relatively easy to plug in another one if the need arises.

=head2 Data Warehouse

In a nutshell, WebPAC has evolved to support hybrid data as input. That
means it has become a kind of data-warehouse application. It doesn't directly
support roll-up and roll-down operations, but they can be emulated using the
intermediate data step or the output step.