Matching
Perform the following steps for each algorithm.
1. Create temporary bundles in TEMP_TABLE: job_match.sh RECORD_TABLE MATCH_ALG TEMP_TABLE
2. Create redirects for equivalent bundles in TEMP_TABLE: job_union.sh TEMP_TABLE bundleEqu.xml TEMP_TABLE
3. Move to final BUNDLE_TABLE: job_morphHTable.sh TEMP_TABLE matching/bundleMove.xml table BUNDLE_TABLE
4. Calculate statistics:
   4.1. Overlap statistics: job_matchOverlap.sh BUNDLE_TABLE. Results will be inout/BUNDLE_TABLE/overlap/
   4.2. Bundle sizes, redirects, etc.: job_countInHTable.sh BUNDLE_TABLE bundles.xml. Results will be inout/BUNDLE_TABLE/bundles.xml/
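The steps above can be chained in a small wrapper script. The following is only a sketch under assumptions: it presumes the job_*.sh scripts are on the PATH and can be run unattended one after another, and the names run, match_pipeline and DRY_RUN are hypothetical helpers introduced here, not part of the project.

```shell
#!/bin/sh
# Sketch: run all matching steps for one algorithm, in the documented order.
# Assumption: the job_*.sh scripts are found on the PATH.
set -eu

run() {
    # Print the command; execute it unless DRY_RUN=1 (hypothetical switch).
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

match_pipeline() {
    record_table=$1
    match_alg=$2
    temp_table=$3
    bundle_table=$4

    # 1. Create temporary bundles.
    run job_match.sh "$record_table" "$match_alg" "$temp_table"
    # 2. Create redirects for equivalent bundles.
    run job_union.sh "$temp_table" bundleEqu.xml "$temp_table"
    # 3. Move to the final bundle table.
    run job_morphHTable.sh "$temp_table" matching/bundleMove.xml table "$bundle_table"
    # 4.1 Overlap statistics.
    run job_matchOverlap.sh "$bundle_table"
    # 4.2 Bundle sizes, redirects, etc.
    run job_countInHTable.sh "$bundle_table" bundles.xml
}

# Run only when called with exactly four arguments.
if [ "$#" -eq 4 ]; then
    match_pipeline "$@"
fi
```

Saved under a name of your choice (run_matching.sh is used here purely as an example), a dry run such as DRY_RUN=1 sh run_matching.sh RECORD_TABLE MATCH_ALG TEMP_TABLE BUNDLE_TABLE prints the five commands without executing them, which makes it easy to check the arguments first. Because of set -e, a failing step stops the chain before later steps touch BUNDLE_TABLE.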
The matching algorithm is defined by the match keys generated from a record. An algorithm is thus just a Metamorph script.
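To make the idea of match keys concrete, here is a toy illustration in plain shell and awk. It is not Metamorph and not the project's implementation: the key definition (normalized title plus year), the tab-separated input layout, the function names match_key and bundle_by_key, and the assumption that records with identical keys end up in one bundle are all chosen here for illustration only.

```shell
# Toy illustration only: not Metamorph, not the project's code.
# Input: one record per line as "id<TAB>title<TAB>year" (hypothetical layout).

match_key() {
    # Hypothetical key: lower-cased title without non-alphanumerics, plus year.
    # Emits "key<TAB>id".
    awk -F '\t' '{
        key = tolower($2)
        gsub(/[^a-z0-9]/, "", key)
        print key "_" $3 "\t" $1
    }'
}

bundle_by_key() {
    # Group the ids of all records that share a key onto one line.
    # Emits "key<TAB>id1,id2,...".
    LC_ALL=C sort | awk -F '\t' '
        $1 != prev { if (NR > 1) print prev "\t" ids; prev = $1; ids = $2; next }
        { ids = ids "," $2 }
        END { if (NR > 0) print prev "\t" ids }'
}
```

Piping the three records r1 "Moby Dick" 1851, r2 "Moby-Dick" 1851 and r3 "Walden" 1854 through match_key | bundle_by_key groups r1 and r2 under the key mobydick_1851 and leaves r3 alone under walden_1854. Changing the key definition changes which records are bundled, which is the sense in which the key-generating script is the algorithm.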
There are several scripts to support the development of a new algorithm:
- You may want to test the Metamorph script on selected records before starting the matching machinery. See the folder src/main/flows/matching: match-test-hbase.flow and match-test.flow offer a simple way to check whether your Metamorph script generates the intended match keys.
- Once you have run your new algorithm, you may want to draw a uniform sample from the bundles to perform quality checks. bundles-to-html.flow renders a uniform sample as a stand-alone HTML file.
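For readers who want to see what drawing a uniform sample involves, the following sketch uses reservoir sampling in awk over a line-oriented dump. It is an assumption-laden stand-in, not what bundles-to-html.flow does internally: the function name uniform_sample and the one-bundle-per-line input are hypothetical.

```shell
# Sketch: uniform random sample of k lines from stdin (reservoir sampling).
# Assumption: each input line represents one bundle.
uniform_sample() {
    # $1 = sample size k
    awk -v k="$1" '
        BEGIN { srand() }                      # seeded from the current time
        NR <= k { sample[NR] = $0; next }      # fill the reservoir first
        {
            j = int(rand() * NR) + 1           # uniform in 1..NR
            if (j <= k) sample[j] = $0         # keep line with probability k/NR
        }
        END {
            n = (NR < k) ? NR : k
            for (i = 1; i <= n; i++) print sample[i]
        }'
}
```

The input is read once and only k lines are held in memory, so the approach also works when the number of bundles is not known in advance. Since srand() is seeded from the clock, two runs within the same second may return the same sample.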