Contents of /trunk/lib/WebPAC/Manual.pod

Revision 8
Sat Jul 16 16:48:35 2005 UTC, by dpavlin
File size: 4197 byte(s)
little cleanup and first cut into WebPAC::Normalize::XML

=head1 WebPAC - Search engine or data-warehouse manual

It's quite hard to explain concisely what webpac is. It's a mix between a
search engine and a data warehousing application. Let's look at that in
detail...

WebPAC was originally written to search CDS/ISIS records using C<swish-e>.
Since then it has, however, adopted several other input formats and added
support for alphabetical lists (earlier described as indexes).

As this concept evolved, we settled on the following workflow for your
data:

 step

        source file     CDS/ISIS, MARC, Excel, robots, ...
            |
  1         | apply import normalisation rules (xml)
            V
        intermediate    this data is re-formatted source data converted
        data            to chunks based on tag names from config/input/*.xml
            |
  2         | apply output filter (TT2)
            V
        data            search engine, HTML, OAI, RDBMS
            |
  3         | filter using query in REST format
  4         | apply output filter (TT2)
            V
        client          Web browser, SOAP

=head2 Normalisation and Intermediate data

This is the first step in working with your data.

You create mappings, one-to-one, from source data records to documents
in webpac. You can split or merge data from input records, apply filters
(perl subroutines), use lookups within the same source file or do simple
evaluations while producing output.

All of that is controlled by the C<config/input/*.xml> configuration file. You
will want to create fine-grained chunks of data (like separate first and
last name), which will later be used to produce output. You can think of
the conversion process as applying the C<config/input/*.xml> recipe to
every input record.

Each tag within the recipe creates one new record as long as there are
fields in the input format (which can be repeatable) that satisfy at least
one field within the tag.

Users of older webpac should note that this file no longer contains any
formatting or specification of output type, and that the granularity of
each tag has increased.
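
The idea of applying a recipe to every input record can be sketched as
follows. This is an illustration only: webpac itself is written in Perl, and
the real rule format of C<config/input/*.xml> is not shown in this manual. The
C<RECIPE> dictionary, the tag names, the field numbers and the C<normalise>
function are all hypothetical stand-ins.

```python
# Hypothetical recipe: tag name -> list of (field, subfield) sources.
# The real rules live in config/input/*.xml; this dict is only a stand-in.
RECIPE = {
    "first_name": [("700", "b")],
    "last_name": [("700", "a")],
    "title": [("200", "a"), ("200", "e")],
}


def normalise(record, recipe=RECIPE):
    """Apply the recipe to one input record.

    `record` maps a field number to a list of occurrences (fields can be
    repeatable); each occurrence maps subfield letters to values.
    Returns a dict of tag name -> list of values (fine-grained chunks).
    """
    chunks = {}
    for tag, sources in recipe.items():
        values = []
        for field, subfield in sources:
            for occurrence in record.get(field, []):
                if subfield in occurrence:
                    values.append(occurrence[subfield])
        # A tag is emitted only if at least one source field matched.
        if values:
            chunks[tag] = values
    return chunks


# Made-up input record with a repeatable field "700".
record = {
    "200": [{"a": "Some title", "e": "a subtitle"}],
    "700": [{"a": "Doe", "b": "Joe"}, {"a": "Roe", "b": "Jane"}],
}
result = normalise(record)
```

Under these assumptions, first and last names end up in separate tags, which
is the kind of fine-grained chunk the output step can later recombine.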

=head2 Output filter

Now that we have a normalized record, we can create some output. You can
create HTML from it, produce data files for a search engine or insert it
into an RDBMS.

The twist is that the application of output filters can be recursive,
allowing you to query data generated in a previous step. This enables you to
represent lists or trees from source data that has structure. It also
requires you to produce structured data in step 2, which can then be
filtered and queried in steps 3 and 4 to produce the final output.

Note that in step 4 you can also query intermediate data, not just data
produced in step 2.

Output filters use Template Toolkit 2, so you have the full power of a simple
procedural language (loops, conditions) and handy built-in functions to
produce output.
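
To make the "one record, several outputs" idea concrete, here is a minimal
sketch. It does not use Template Toolkit 2: plain Python functions stand in
for templates, and the record layout, C<to_html> and C<to_index_lines> are
assumptions made up for this example.

```python
from html import escape

# Hypothetical normalized record, shaped like the result of step 1.
record = {"first_name": ["Joe"], "last_name": ["Doe"], "year": ["1994"]}


def to_html(rec):
    """Render one record as an HTML definition list (stand-in for a template)."""
    rows = []
    for tag, values in rec.items():
        for value in values:  # a loop, as a template would use
            rows.append(f"<dt>{escape(tag)}</dt><dd>{escape(value)}</dd>")
    return "<dl>" + "".join(rows) + "</dl>"


def to_index_lines(rec):
    """Render the same record as tag=value lines for a search-engine import."""
    return [f"{tag}={value}" for tag, values in rec.items() for value in values]


page = to_html(record)
index_lines = to_index_lines(record)
```

Both functions read the same normalized record; only the output format
differs, which is the role the output filter plays in step 2.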

=head2 REST Query Format

The design decision is to use a REST query format. This has the benefit of
simplicity and the ability to create unique URLs for all content within
webpac. A simple query looks like this:

  http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995

This REST query can be broken down into:

=over

=item http://webpac

Hostname on which the service is running. Not required when doing lookups,
only for browser usage.

=item search

Name of the output filtering method. Here it specifies the search engine.

=item html

Template that will be used to produce output.

=item personal_name/Joe%20Doe...

URL-encoded query string. It is specific to the filtering method used.

=back

You can easily produce an RSS feed for the same query using the following
REST URL:

  http://webpac/search/rss/personal_name/Joe%20Doe/AND/year/LT%201995

Yes, it really is that simple. As it should be.
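
The breakdown above can be sketched as a small parser. This is only an
illustration in Python, not webpac's own code: the function name
C<parse_rest_query> and the returned dictionary keys are assumptions. The
query terms are left uninterpreted, since their meaning is specific to the
filtering method.

```python
from urllib.parse import urlsplit, unquote


def parse_rest_query(url):
    """Split a webpac-style REST URL into method, template and query terms."""
    path = urlsplit(url).path  # works with or without the hostname part
    # Split on "/" first, then decode, so encoded characters stay in place.
    parts = [unquote(p) for p in path.split("/") if p]
    method, template, terms = parts[0], parts[1], parts[2:]
    return {"method": method, "template": template, "terms": terms}


parsed = parse_rest_query(
    "http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995"
)
feed = parse_rest_query("/search/rss/personal_name/Joe%20Doe/AND/year/LT%201995")
```

The two calls differ only in the template segment (C<html> versus C<rss>),
and the second shows that the hostname can be left out for lookups.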

=head1 Technical stuff

The following text is more hard-core technical material about how webpac is
implemented and why.

=head2 Search Engine

We use the Hyper Estraier search engine through the pgestraier PostgreSQL
bindings for it.

It should be relatively easy to plug in another one if the need arises.

=head2 Data Warehouse

In a nutshell, webpac has evolved to support hybrid data as input. That
means it has become a kind of data-warehouse application. It doesn't
directly support roll-up and roll-down operations, but they can be emulated
using the intermediate data step or the output step.
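
As a rough idea of what emulating a roll-up over intermediate data could look
like, the sketch below counts records per value of one tag. It is an
assumption-laden illustration in Python, not part of webpac: the record
layout and the C<roll_up> function are hypothetical.

```python
from collections import Counter

# Hypothetical intermediate records (tag -> list of values).
records = [
    {"personal_name": ["Joe Doe"], "year": ["1994"]},
    {"personal_name": ["Jane Roe"], "year": ["1994"]},
    {"personal_name": ["Joe Doe"], "year": ["1996"]},
]


def roll_up(recs, tag):
    """Count records per value of `tag`, a simple roll-up over one dimension."""
    counts = Counter()
    for rec in recs:
        for value in rec.get(tag, []):
            counts[value] += 1
    return dict(counts)


by_year = roll_up(records, "year")
by_name = roll_up(records, "personal_name")
```

Grouping by a different tag gives a different summary of the same records
without touching the source data.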