This repository has been archived by the owner on Dec 19, 2018. It is now read-only.

Matching

Markus M. Geipel edited this page Jul 29, 2013 · 8 revisions

Running an existing algorithm

Perform the following steps for each algorithm.

  1. Create temporary bundles in TEMP_TABLE: job_match.sh RECORD_TABLE MATCH_ALG TEMP_TABLE

  2. Create redirects for equivalent bundles in TEMP_TABLE: job_union.sh TEMP_TABLE bundleEqu.xml TEMP_TABLE

  3. Move to final BUNDLE_TABLE: job_morphHTable.sh TEMP_TABLE matching/bundleMove.xml table BUNDLE_TABLE

  4. Calculate statistics:

    4.1. Overlap statistics: job_matchOverlap.sh BUNDLE_TABLE. Results are written to out/BUNDLE_TABLE/overlap/

    4.2. Bundle sizes, redirects, etc.: job_countInHTable.sh BUNDLE_TABLE bundles.xml. Results are written to out/BUNDLE_TABLE/bundles.xml/
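The steps above can be chained in a small wrapper. The following is only a sketch: the function names run_step and run_matching and the DRY_RUN switch are illustrative and not part of the repository, and it assumes the job_*.sh scripts are on the PATH. The script names and argument order are copied from the steps above.

```shell
#!/bin/sh
# Sketch of a wrapper around the matching steps (not part of the repository).
# Assumption: the job_*.sh scripts are reachable via PATH.
# Set DRY_RUN=1 to print each command instead of executing it.

run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: run_matching RECORD_TABLE MATCH_ALG TEMP_TABLE BUNDLE_TABLE
run_matching() {
    record_table=$1
    match_alg=$2
    temp_table=$3
    bundle_table=$4

    # Each step runs only if the previous one succeeded.
    # 1. Create temporary bundles.
    run_step job_match.sh "$record_table" "$match_alg" "$temp_table" &&
    # 2. Create redirects for equivalent bundles.
    run_step job_union.sh "$temp_table" bundleEqu.xml "$temp_table" &&
    # 3. Move to the final bundle table.
    run_step job_morphHTable.sh "$temp_table" matching/bundleMove.xml table "$bundle_table" &&
    # 4.1. Overlap statistics.
    run_step job_matchOverlap.sh "$bundle_table" &&
    # 4.2. Bundle sizes, redirects, etc.
    run_step job_countInHTable.sh "$bundle_table" bundles.xml
}
```

With DRY_RUN=1 set, calling run_matching with four table and algorithm names of your choice prints the five commands in order, which is a cheap way to check the arguments before starting the jobs.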

Developing a new algorithm

A matching algorithm is defined by the match keys it generates from a record. An algorithm is thus just a Metamorph script.
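To make the role of match keys concrete, here is a purely conceptual sketch, assuming that records sharing a match key end up in the same bundle. It does not use Metamorph or the job scripts: the function group_by_match_key, the tab-separated input format, and the sample keys are all hypothetical.

```shell
#!/bin/sh
# Conceptual sketch only (not the repository's implementation).
# Input:  one "record_id<TAB>match_key" pair per line (hypothetical format).
# Output: one "match_key<TAB>record ids" line per key, sorted by key.

group_by_match_key() {
    awk -F '\t' '
        {
            # Append the record id to the list collected for its match key.
            if ($2 in bundle) bundle[$2] = bundle[$2] " " $1
            else bundle[$2] = $1
        }
        END { for (k in bundle) print k "\t" bundle[k] }
    ' | sort
}
```

For example, feeding the three lines r1/doe1999, r2/smith2001 and r3/doe1999 (id and key separated by a tab) into group_by_match_key yields one line for doe1999 listing r1 and r3, and one line for smith2001 listing r2. Changing how the key is derived from a record changes which records fall together, which is why the key-generating script is the algorithm.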

Several flows support the development of a new algorithm:

  1. You may want to test the Metamorph script on selected records before starting the matching machinery. See the folder src/main/flows/matching: match-test-hbase.flow and match-test.flow offer a simple way to check whether your Metamorph script generates the intended match keys.

  2. Once you have run your new algorithm, you may want to draw a uniform sample from the bundles to perform quality checks. bundles-to-html.flow renders a uniform sample as a stand-alone HTML file.
