Fixed merge conflict in PassimApp.scala
mutherr committed Mar 16, 2021
2 parents 514bdef + bc603a0 commit adbecc7
Showing 6 changed files with 1,155 additions and 268 deletions.
64 changes: 53 additions & 11 deletions README.md
@@ -58,7 +58,7 @@
of files containing JSON records. The record for a single document
with the required `id` and `text` fields, as well as a `series` field,
would look like:
```
-{"id": "d1", "series": "abc", "text": "This is text that&apos;s interpreted as <code>XML</code>; the tags are ignored by default."}
+{"id": "d1", "series": "abc", "text": "This is text."}
```

Note that this must be a single line in the file. This JSON record
@@ -75,10 +75,10 @@
included in the record for each document will be passed through into
the output. In particular, a `date` field, if present, will be used
to sort passages within each cluster.

-Natural language text is redundant, and adding XML markup and JSON
+Natural language text is redundant, and adding markup and JSON
field names increases the redundancy. Spark and passim support
several compression schemes. For relatively small files, gzip is
-adequate; however, when the input files are large enough that the do
+adequate; however, when the input files are large enough that they do
not comfortably fit in memory, bzip2 is preferable since programs can
split it into blocks before decompressing.
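As a sketch of preparing such input, newline-delimited JSON records can be written directly into a bzip2 stream. The records and filename below are hypothetical:

```python
import bz2
import json

# Two toy records; passim requires each record on its own line.
records = [
    {"id": "d1", "series": "abc", "text": "This is text."},
    {"id": "d2", "series": "xyz", "text": "This is different text."},
]

# bzip2 compresses in independent blocks, so Spark can split a large
# file into pieces before decompressing them.
with bz2.open("input.json.bz2", "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```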

@@ -97,12 +97,12 @@
Multiple input paths should be separated by commas. Files may also be
compressed.

```
-$ passim input.json,directory-of-json-files,some*.json.bz2 output
+$ passim "{input.json,directory-of-json-files,some*.json.bz2}" output
```

-Output is to a directory that, on completion, will contain an
+Output is written to a directory that, on completion, will contain an
`out.json` directory with `part-*` files rather than a single file.
-This allows multiple workers to efficiently write it (and read it back
+This allows multiple workers to write it efficiently (and read it back
in) in parallel. In addition, the output directory should contain the
parameters used to invoke passim in `conf` and the intermediate
cluster membership data in `clusters.parquet`.
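A minimal sketch of reading the sharded `part-*` output back outside of Spark. The directory layout follows the description above, but the record fields shown are placeholders, since real passim records carry many more fields:

```python
import glob
import json
import os

def read_passim_output(out_dir):
    """Collect JSON records from all part-* shards under out_dir."""
    records = []
    for path in sorted(glob.glob(os.path.join(out_dir, "part-*"))):
        with open(path, encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f if line.strip())
    return records

# Demonstrate on a mock output directory with two shards.
os.makedirs("output/out.json", exist_ok=True)
for i, recs in enumerate([[{"id": "d1"}], [{"id": "d2"}]]):
    with open(f"output/out.json/part-{i:05d}", "w", encoding="utf-8") as f:
        for rec in recs:
            f.write(json.dumps(rec) + "\n")

records = read_passim_output("output/out.json")  # → [{'id': 'd1'}, {'id': 'd2'}]
```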
@@ -132,7 +132,7 @@
alignments between all matching passages, invoke passim with the
`--pairwise` flag. These alignments will be in the `align.json` or
`align.parquet`, depending on which output format you choose.

-Some useful parameters are:
+Some other useful parameters are:

Parameter | Default value | Description
--------- | ------------- | -----------
@@ -143,7 +143,7 @@
Parameter | Default value | Description

Pass parameters to the underlying Spark processes using the
`SPARK_SUBMIT_ARGS` environment variable. For example, to run passim
-on a local machine with 10 cores and 200GB of memory, do:
+on a local machine with 10 cores and 200GB of memory, run the following command:

```
$ SPARK_SUBMIT_ARGS='--master local[10] --driver-memory 200G --executor-memory 200G' passim input.json output
```

@@ -153,10 +153,52 @@
See the
[Spark documentation](https://spark.apache.org/docs/latest/index.html)
for further configuration options.

-## <a name="locations"></a> Marking Locations inside Documents
+### Pruning Alignments

The documents input to passim are indexed to determine which pairs should be aligned. Often, document metadata can provide a priori constraints on which documents should be aligned. If there were no constraints, every pair of documents in the input would be aligned, in both directions. By default, however, documents with the same `series` value will not be aligned. These constraints on alignments are expressed by two arguments to passim: `--fields` and `--filterpairs`.

The `--fields` argument tells passim which fields in the input records to index when determining which documents to align. Field specifications use the syntax of expressions in a SQL `SELECT` clause, as implemented by Apache Spark, except that multiple fields are separated by semicolons. By default, the value of the fields argument is:
```
--fields 'hashId(id) as uid;hashId(series) as gid'
```

Since document and series identifiers can be long strings, passim runs more efficiently if they are hashed to long integers by the built-in `hashId` function.
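For illustration only, here is one way such a string-to-long hash could look; passim's actual `hashId` implementation may compute different values:

```python
import hashlib

def hash_id(s):
    """Map a string identifier to a signed 64-bit integer.
    Illustrative only; passim's built-in hashId may differ."""
    digest = hashlib.md5(s.encode("utf-8")).digest()
    # Take the first 8 bytes of the digest as a signed long.
    return int.from_bytes(digest[:8], "big", signed=True)
```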

The `--filterpairs` argument is an SQL expression that specifies which pairs of documents are candidates for comparison. A candidate pair consists of a "left-hand" document, whose field names are identical to those in the input, and a "right-hand" document, whose field names have a "2" appended to them. This is similar to the left- and right-handed sides of a SQL JOIN; in fact, passim is effectively performing a (massively pruned) self-join on the table of input documents. The default value for the filterpairs argument is:
```
--filterpairs 'gid < gid2'
```
This ensures that documents from the same series are not aligned and, further, ensures that any given pair of documents is aligned in only one direction, as determined by the lexicographic ordering of the hashes of their series IDs.

As an example, consider aligning only document pairs where the "left-hand" document predates the "right-hand" document by 0 to 30 days. To perform efficient date arithmetic, we use Apache Spark's built-in `date` function to convert a string `date` field to an integer:
```
--fields 'date(date) as day' --filterpairs 'day <= day2 AND (day2 - day) <= 30 AND uid <> uid2'
```
Since the dates may be equal, we also include the constraint that the hashed document ids (`uid`) be different. Had we not done this, the output would also have included alignments of every document with itself. The `uid` field, a hash of the `id` field, is always available. Note also the SQL inequality operator `<>`.
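The effect of this predicate can be sketched in plain Python on toy data. The field values below are invented, and passim evaluates the expression in Spark SQL rather than in Python:

```python
from datetime import date

# Toy documents with hashed ids (uid) and day numbers, mimicking
# what the --fields expressions above would compute.
docs = [
    {"uid": 1, "day": date(1855, 1, 1).toordinal()},
    {"uid": 2, "day": date(1855, 1, 15).toordinal()},
    {"uid": 3, "day": date(1855, 3, 1).toordinal()},
]

def candidate(left, right):
    """The --filterpairs predicate: the left document predates the
    right by 0 to 30 days, and the documents are distinct."""
    return (left["day"] <= right["day"]
            and right["day"] - left["day"] <= 30
            and left["uid"] != right["uid"])

# Only (1, 2) survives: doc 3 is 59 and 45 days later, respectively.
pairs = [(a["uid"], b["uid"]) for a in docs for b in docs if candidate(a, b)]
```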

### Producing Aligned Output

For the purposes of collating related texts or aligning different transcriptions of the same text, one can produce output using the `--docwise` or `--linewise` flags. Each output record in `out.json` or `out.parquet` will then contain a document or line from the right-hand, "target" text, along with information about corresponding passages in the left-hand, "witness" or "source" text.

In the case of `--docwise` output, each output document contains an array of target line records, and each line record contains zero or more passages from the left-hand side that are aligned to that target line. We speak of "passages" from the left-hand side because the line breaks in the witness, if any, may not correspond to line breaks in the target texts. Both the target line and witness passage data may contain information about image coordinates, when available.

-As mentioned above, the `text` field is interpreted as XML. The
-parser expands character entities and ignores tags.
For `--linewise` output, each output record contains a single newline-delimited line from a target document. This line is identified by the input document `id` and the character offset `begin` into the input text. The corresponding witness passage is identified by `wid` and `wbegin`.

In both of these output variants, target lines and witness passages are presented in their original textual form and in their aligned form, with hyphens to pad insertions and deletions. An example of a target line aligned to a single witness passage is:
```
{
"id": "scheffel_ekkehard_1855#f0261z751",
"begin": 721,
"text": "Grammatik iſt ein hohes Weib, anders erſcheint ſie Holzhackern, an-\n",
"wid": "scheffel_ekkehard_1855/0261",
"wbegin": 739,
"wtext": "Grammatik iſt ein hohes Weib, anders erſcheint ſie HolzhaFern , an=\n",
"talg": "Grammatik iſt ein hohes Weib, anders erſcheint ſie Holzhackern-, an‐\n",
"walg": "Grammatik iſt ein hohes Weib, anders erſcheint ſie Holzha-Fern , an=\n"
}
```
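Because the aligned strings are padded to equal length, corresponding columns can be compared directly. A sketch, assuming `-` is the padding character as in the example above; a robust implementation would need to distinguish padding from literal hyphens in the text:

```python
def alignment_ops(talg, walg, pad="-"):
    """Classify each column of a padded target/witness alignment pair."""
    assert len(talg) == len(walg)
    ops = []
    for t, w in zip(talg, walg):
        if t == pad:
            ops.append("witness-only")  # character only in the witness
        elif w == pad:
            ops.append("target-only")   # character only in the target
        elif t == w:
            ops.append("match")
        else:
            ops.append("sub")           # OCR or transcription difference
    return ops
```

For example, `alignment_ops("ca-t", "c-at")` yields `["match", "target-only", "witness-only", "match"]`.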

## <a name="locations"></a> Marking Locations inside Documents

Documents may record their extent on physical pages with the `pages` field. This field is an array of `Page` regions with the following schema (here written in Scala):
```
```
4 changes: 2 additions & 2 deletions bin/passim
@@ -5,6 +5,6 @@
PASSIM_HOME="$(cd "`dirname "$0"`"/..; pwd)"
SPARK_SUBMIT_ARGS="$SPARK_SUBMIT_ARGS"

spark-submit --class passim.PassimApp \
-  --packages 'com.github.scopt:scopt_2.11:3.5.0,graphframes:graphframes:0.7.0-spark2.4-s_2.11' \
+  --packages 'com.github.scopt:scopt_2.12:3.5.0,graphframes:graphframes:0.8.0-spark3.0-s_2.12' \
$SPARK_SUBMIT_ARGS \
-  "$PASSIM_HOME"/target/scala-2.11/passim_2.11-0.2.0.jar "$@"
+  "$PASSIM_HOME"/target/scala-2.12/passim_2.12-0.2.0.jar "$@"
10 changes: 5 additions & 5 deletions build.sbt
@@ -2,17 +2,17 @@
name := "passim"

version := "0.2.0"

-scalaVersion := "2.11.8"
+scalaVersion := "2.12.10"

resolvers += Resolver.mavenLocal

-libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0"
-libraryDependencies += "org.apache.spark" %% "spark-graphx" % "2.4.0"
-libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"
+libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.0"
+libraryDependencies += "org.apache.spark" %% "spark-graphx" % "3.0.0"
+libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0"

resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"

-libraryDependencies += "graphframes" % "graphframes" % "0.7.0-spark2.4-s_2.11"
+libraryDependencies += "graphframes" % "graphframes" % "0.8.0-spark3.0-s_2.12"

libraryDependencies += "com.github.scopt" %% "scopt" % "3.5.0"

Expand Down
2 changes: 1 addition & 1 deletion project/build.properties
@@ -14,4 +14,4 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
-sbt.version=0.13.18
+sbt.version=1.3.13
