Annotation of /trunk/lib/WebPAC/Manual.pod

Revision 1063
Tue Nov 27 21:01:44 2007 UTC (16 years, 5 months ago) by dpavlin
File size: 5305 byte(s)
pod fixes

=head1 WebPAC - Search engine or data-warehouse manual

It's quite hard to explain concisely what WebPAC is. It's a mix between a
search engine and a data warehousing application. Let's look at that in detail...

WebPAC was originally written to search CDS/ISIS records using C<swish-e>.
Since then it has adopted several other input formats and added
support for alphabetical lists (earlier described as indexes).

As this concept evolved, we decided on the following work-flow
for your data:

  step

          source data      CDS/ISIS, MARC, Excel, robots, ...
               |
    0          |           apply lookup rules (optional)
    1          |           apply input normalisation rules (xml or yaml)
               V
          intermediate     this data is re-formatted source data converted
          data             to chunks based on tag names from config/input/
               |
    2          |           optionally apply output filter (TT2)
               V
          data             search engine, HTML, OAI, RDBMS
               |
    3          |           filter using query in REST format
    4          |           apply output filter (TT2)
               V
          client           Web browser (html), JSON

=head2 Source data

WebPAC supports various input formats:

=over 2

=item L<WebPAC::Input::ISIS> CDS/ISIS data

=item L<WebPAC::Input::MARC> for MARC records

=item L<WebPAC::Input::Excel> Microsoft Excel C<.xls> support

=item L<WebPAC::Input::DBF> support legacy tables (e.g. Clipper)

=item L<WebPAC::Input::Gutenberg> for RDF catalog data from Project Gutenberg

=back

=head2 Create data lookups

Before you can begin normalisation, you might want to create lookups which store
C<< key -> value(s) >> pairs. Lookups are especially useful if you want to,
I<well>, look up a value from some other record using some sort of identifier.

Lookups are described in more detail in L<WebPAC::Lookup>.
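
To make the C<< key -> value(s) >> idea concrete, here is a minimal sketch in
Python. It is only an illustration and not the L<WebPAC::Lookup> API: the
function name C<build_lookup> and the record fields C<id>, C<parent> and
C<title> are assumptions made up for this example.

```python
from collections import defaultdict


def build_lookup(records, key_field, value_field):
    """Collect value(s) for every key; one key may map to several values."""
    lookup = defaultdict(list)
    for record in records:
        key = record.get(key_field)
        value = record.get(value_field)
        # Skip records that lack either side of the pair.
        if key is not None and value is not None:
            lookup[key].append(value)
    return dict(lookup)


# Hypothetical records: volumes point at their parent record by identifier.
records = [
    {"id": "A1", "title": "Collected works"},
    {"id": "A2", "title": "Volume one", "parent": "A1"},
    {"id": "A3", "title": "Volume two", "parent": "A1"},
]

# key -> value(s): record identifier -> title(s)
titles_by_id = build_lookup(records, "id", "title")

# Resolve the parent title of a volume using its identifier.
parent_titles = titles_by_id.get(records[1]["parent"], [])
```

Building the lookup in a separate pass before normalisation is what lets one
record refer to values that live in another record.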

=head2 Normalisation to intermediate data

Intermediate data is the internal representation of data on which WebPAC operates.

You are creating mappings, one-to-one, from source data records to documents
in WebPAC. You can split or merge data from input records, apply regexes,
use lookups within the same source file, and use conditions, branches and/or
simple evaluations while producing intermediate data.

The overall setup is controlled by the C<config/config.yml> configuration file.
This file is in human-readable YAML format, and it describes the whole
configuration of WebPAC and its front-end, Webpacus.

The normalisation itself is controlled by the C<config/input/> configuration
files. You will want to create fine-grained chunks of data (like separate first and
last name), which will later be used to produce output. You can think of the
conversion process as the application of a C<config/input/> recipe to
every input record.
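
As a rough illustration of what "fine-grained chunks" means, the following
Python sketch splits a name into first and last name and extracts a year with
a regex. It is not the C<config/input/> rule syntax; the input fields C<name>
and C<imprint> and the output tag names are assumptions chosen for this
example only.

```python
import re


def normalise(record):
    """Turn one hypothetical input record into fine-grained chunks."""
    # Split "Last, First" into two separate chunks.
    last, _, first = record.get("name", "").partition(",")
    # Pull the first four-digit number out of a free-text imprint field.
    year_match = re.search(r"\d{4}", record.get("imprint", ""))
    return {
        "last_name": last.strip(),
        "first_name": first.strip(),
        "year": year_match.group(0) if year_match else None,
    }


# The same "recipe" is applied to every input record.
source = [
    {"name": "Doe, Joe", "imprint": "Zagreb : 1994."},
    {"name": "Smith", "imprint": "s.l."},
]
intermediate = [normalise(r) for r in source]
```

Keeping first name, last name and year as separate chunks is what later allows
a query or an output template to use each of them on its own.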

Each tag within the recipe creates one new record as long as there are
fields in the input format (which can be repeatable) that satisfy at least one
field within the tag.

Users of older versions of WebPAC should note that this file no longer contains
any formatting or specification of output type, and that the granularity of
each tag has increased.

B<This document should really be updated to reflect the Webpacus front-end from
this point on...>

=head2 Output filter

Now that we have a normalised record, we can create some output. You can create
HTML from it, data files for a search engine, or insert the records into an RDBMS.

The twist is that application of output filters can be recursive, allowing
you to query data generated in a previous step. This enables you to represent
lists or trees from source data that has structure. It also requires you to
produce structured data in step 2, which can then be filtered and queried in
steps 3 and 4 to produce the final output.

Note that you can also query intermediate data in step 4, not
just data produced in step 2.

Output filters use Template Toolkit 2, so you have the full power of a simple
procedural language (loops, conditions) and handy built-in functions to
produce output.

=head2 REST Query Format

The design decision is to use a REST query format. This has the benefit of
simplicity and the ability to create unique URLs for all content within WebPAC.
A simple query looks like this:

  http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995

This REST query can be broken down into:

=over

=item http://webpac

Hostname on which the service is running. Not required when doing lookups, only
for browser usage.

=item search

Name of the output filtering method. This specifies the search engine.

=item html

Specifies the template that will be used to produce output.

=item personal_name/Joe%20Doe...

URL-encoded query string. It is specific to the filtering method used.

=back
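
The breakdown above can be sketched in a few lines of Python. This is only an
illustration of the URL layout, not WebPAC's own request handling; the function
name C<parse_rest_query> and the keys of the returned dictionary are
assumptions made for this example.

```python
from urllib.parse import unquote, urlsplit


def parse_rest_query(url):
    """Split a REST query URL into host, filtering method, template and query."""
    parts = urlsplit(url)
    # Split the path first and decode afterwards, so an encoded slash
    # inside a value cannot be mistaken for a segment separator.
    segments = [unquote(s) for s in parts.path.split("/") if s]
    if len(segments) < 2:
        raise ValueError("expected at least /method/template in the path")
    method, template, *query = segments
    return {
        "host": parts.netloc,   # empty when only a path is given
        "method": method,       # output filtering method, e.g. "search"
        "template": template,   # output template, e.g. "html"
        "query": query,         # decoded query terms, method-specific
    }


parsed = parse_rest_query(
    "http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995"
)
```

The query terms are returned as a plain list because their meaning depends on
the filtering method used.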

You can easily produce an RSS feed for the same query using the following REST URL:

  http://webpac/search/rss/personal_name/Joe%20Doe/AND/year/LT%201995

Yes, it really is that simple. As it should be.
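
Since only the template segment changes between the HTML and RSS variants,
building such URLs can be sketched as follows. The helper C<build_rest_url> is
a hypothetical name used for illustration; it is not part of WebPAC.

```python
from urllib.parse import quote


def build_rest_url(base, method, template, query_terms):
    """Join method, template and query terms into one REST query URL."""
    segments = [method, template, *query_terms]
    # safe="" also encodes "/" inside a term, so every term stays one segment.
    path = "/".join(quote(segment, safe="") for segment in segments)
    return base.rstrip("/") + "/" + path


terms = ["personal_name", "Joe Doe", "AND", "year", "LT 1995"]
html_url = build_rest_url("http://webpac", "search", "html", terms)
rss_url = build_rest_url("http://webpac", "search", "rss", terms)
```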

=head1 Technical stuff

The following text is more hard-core technical material about how WebPAC is
implemented and why.

=head2 Search Engine

We use the Hyper Estraier search engine through pgestraier, its PostgreSQL
bindings.

It should be relatively easy to plug in another one if the need arises.

=head2 Data Warehouse

In a nutshell, WebPAC has evolved to support hybrid data as input. That
means it has become a kind of data-warehouse application. It doesn't directly
support roll-up and roll-down operations, but they can be emulated using
the intermediate data step or the output step.
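
One possible reading of "emulated" is to aggregate already normalised records
at a coarser or finer level. The Python sketch below shows that idea under
stated assumptions: the C<year> field and the C<roll_up> helper are
hypothetical and do not correspond to any WebPAC function.

```python
from collections import Counter


def roll_up(records, key):
    """Count records per group, where key(record) names the group."""
    return dict(Counter(key(record) for record in records))


# Hypothetical intermediate data with a "year" chunk.
records = [{"year": 1991}, {"year": 1994}, {"year": 1994}, {"year": 2003}]

# Fine-grained level: one group per year.
by_year = roll_up(records, lambda r: r["year"])

# Roll-up: the same records grouped by decade.
by_decade = roll_up(records, lambda r: r["year"] // 10 * 10)
```

Moving between the two levels is only a matter of changing the grouping key,
which is why it can be done in either the intermediate data step or the output
step.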