Fixed merge conflict in PassimApp.scala
mutherr committed Mar 16, 2021
2 parents 514bdef + bc603a0 commit adbecc7
Showing 6 changed files with 1,155 additions and 268 deletions.
64 changes: 53 additions & 11 deletions README.md
@@ -58,7 +58,7 @@
of files containing JSON records. The record for a single document
with the required `id` and `text` fields, as well as a `series` field,
would look like:
```
-{"id": "d1", "series": "abc", "text": "This is text that&apos;s interpreted as <code>XML</code>; the tags are ignored by default."}
+{"id": "d1", "series": "abc", "text": "This is text."}
```

Note that this must be a single line in the file. This JSON record
@@ -75,10 +75,10 @@
included in the record for each document will be passed through into
the output. In particular, a `date` field, if present, will be used
to sort passages within each cluster.

-Natural language text is redundant, and adding XML markup and JSON
+Natural language text is redundant, and adding markup and JSON
field names increases the redundancy. Spark and passim support
several compression schemes. For relatively small files, gzip is
-adequate; however, when the input files are large enough that the do
+adequate; however, when the input files are large enough that they do
not comfortably fit in memory, bzip2 is preferable since programs can
split it into blocks before decompressing.
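As a sketch of preparing such input, newline-delimited JSON records can be written directly into a bzip2 stream. The records and filename below are hypothetical:

```python
import bz2
import json

# Two toy records; passim requires each record on its own line.
records = [
    {"id": "d1", "series": "abc", "text": "This is text."},
    {"id": "d2", "series": "xyz", "text": "This is different text."},
]

# bzip2 compresses in independent blocks, so Spark can split a large
# file into pieces before decompressing them.
with bz2.open("input.json.bz2", "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```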

@@ -97,12 +97,12 @@
Multiple input paths should be separated by commas. Files may also be
compressed.

```
-$ passim input.json,directory-of-json-files,some*.json.bz2 output
+$ passim "{input.json,directory-of-json-files,some*.json.bz2}" output
```

-Output is to a directory that, on completion, will contain an
+Output is written to a directory that, on completion, will contain an
`out.json` directory with `part-*` files rather than a single file.
-This allows multiple workers to efficiently write it (and read it back
+This allows multiple workers to write it efficiently (and read it back
in) in parallel. In addition, the output directory should contain the
parameters used to invoke passim in `conf` and the intermediate
cluster membership data in `clusters.parquet`.
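A minimal sketch of reading the sharded `part-*` output back outside of Spark. The directory layout follows the description above, but the record fields shown are placeholders, since real passim records carry many more fields:

```python
import glob
import json
import os

def read_passim_output(out_dir):
    """Collect JSON records from all part-* shards under out_dir."""
    records = []
    for path in sorted(glob.glob(os.path.join(out_dir, "part-*"))):
        with open(path, encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f if line.strip())
    return records

# Demonstrate on a mock output directory with two shards.
os.makedirs("output/out.json", exist_ok=True)
for i, recs in enumerate([[{"id": "d1"}], [{"id": "d2"}]]):
    with open(f"output/out.json/part-{i:05d}", "w", encoding="utf-8") as f:
        for rec in recs:
            f.write(json.dumps(rec) + "\n")

records = read_passim_output("output/out.json")  # → [{'id': 'd1'}, {'id': 'd2'}]
```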
@@ -132,7 +132,7 @@
alignments between all matching passages, invoke passim with the
`--pairwise` flag. These alignments will be in the `align.json` or
`align.parquet`, depending on which output format you choose.

-Some useful parameters are:
+Some other useful parameters are:

Parameter | Default value | Description
--------- | ------------- | -----------
@@ -143,7 +143,7 @@
Parameter | Default value | Description

Pass parameters to the underlying Spark processes using the
`SPARK_SUBMIT_ARGS` environment variable. For example, to run passim
-on a local machine with 10 cores and 200GB of memory, do:
+on a local machine with 10 cores and 200GB of memory, run the following command:

```
$ SPARK_SUBMIT_ARGS='--master local[10] --driver-memory 200G --executor-memory 200G' passim input.json output
```

@@ -153,10 +153,52 @@
See the
[Spark documentation](https://spark.apache.org/docs/latest/index.html)
for further configuration options.

-## <a name="locations"></a> Marking Locations inside Documents
+### Pruning Alignments

The documents input to passim are indexed to determine which pairs should be aligned. Often, document metadata can provide a priori constraints on which documents should be aligned. If there were no constraints, every pair of documents in the input would be aligned, in both directions. By default, however, documents with the same `series` value will not be aligned. These constraints on alignments are expressed by two arguments to passim: `--fields` and `--filterpairs`.

The `--fields` argument tells passim which fields in the input records to index when determining which documents to align. Field specifications use the syntax of expressions in a SQL `SELECT` clause, as implemented by Apache Spark, except that multiple fields are separated by semicolons. By default, the value of the fields argument is:
```
--fields 'hashId(id) as uid;hashId(series) as gid'
```

Since document and series identifiers can be long strings, passim runs more efficiently if they are hashed to long integers by the built-in `hashId` function.
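For illustration only, here is one way such a string-to-long hash could look; passim's actual `hashId` implementation may compute different values:

```python
import hashlib

def hash_id(s):
    """Map a string identifier to a signed 64-bit integer.
    Illustrative only; passim's built-in hashId may differ."""
    digest = hashlib.md5(s.encode("utf-8")).digest()
    # Take the first 8 bytes of the digest as a signed long.
    return int.from_bytes(digest[:8], "big", signed=True)
```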

The `--filterpairs` argument is an SQL expression that specifies which pairs of documents are candidates for comparison. A candidate pair consists of a "left-hand" document, whose field names are identical to those in the input, and a "right-hand" document, whose field names have a "2" appended to them. This is similar to the left- and right-handed sides of a SQL JOIN; in fact, passim is effectively performing a (massively pruned) self-join on the table of input documents. The default value for the filterpairs argument is:
```
--filterpairs 'gid < gid2'
```
This ensures that documents from the same series are not aligned and, further, ensures that any given pair of documents is aligned in only one direction, as determined by the lexicographic ordering of the hashes of their series IDs.

As an example, consider aligning only document pairs where the "left-hand" document predates the "right-hand" document by 0 to 30 days. To perform efficient date arithmetic, we use Apache Spark's built-in `date` function to convert a string `date` field to an integer:
```
--fields 'date(date) as day' --filterpairs 'day <= day2 AND (day2 - day) <= 30 AND uid <> uid2'
```
Since the dates may be equal, we also include the constraint that the hashed document ids (`uid`) be different. Had we not done this, the output would also have included alignments of every document with itself. The `uid` field, a hash of the `id` field, is always available. Note also the SQL inequality operator `<>`.
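The effect of this predicate can be sketched in plain Python on toy data. The field values below are invented, and passim evaluates the expression in Spark SQL rather than in Python:

```python
from datetime import date

# Toy documents with hashed ids (uid) and day numbers, mimicking
# what the --fields expressions above would compute.
docs = [
    {"uid": 1, "day": date(1855, 1, 1).toordinal()},
    {"uid": 2, "day": date(1855, 1, 15).toordinal()},
    {"uid": 3, "day": date(1855, 3, 1).toordinal()},
]

def candidate(left, right):
    """The --filterpairs predicate: the left document predates the
    right by 0 to 30 days, and the documents are distinct."""
    return (left["day"] <= right["day"]
            and right["day"] - left["day"] <= 30
            and left["uid"] != right["uid"])

# Only (1, 2) survives: doc 3 is 59 and 45 days later, respectively.
pairs = [(a["uid"], b["uid"]) for a in docs for b in docs if candidate(a, b)]
```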

### Producing Aligned Output

For the purposes of collating related texts or aligning different transcriptions of the same text, one can produce output using the `--docwise` or `--linewise` flags. Each output record in `out.json` or `out.parquet` will then contain a document or line from the right-hand, "target" text, along with information about corresponding passages in the left-hand, "witness" or "source" text.

In the case of `--docwise` output, each output document contains an array of target line records, and each line record contains zero or more passages from the left-hand side that are aligned to that target line. We speak of "passages" from the left-hand side because the line breaks in the witness, if any, may not correspond to line breaks in the target texts. Both the target line and witness passage data may contain information about image coordinates, when available.

-As mentioned above, the `text` field is interpreted as XML. The
-parser expands character entities and ignores tags.
For `--linewise` output, each output record contains a single newline-delimited line from a target document. This line is identified by the input document `id` and the character offset `begin` into the input text. The corresponding witness passage is identified by `wid` and `wbegin`.

In both of these output variants, target lines and witness passages are presented in their original textual form and in their aligned form, with hyphens to pad insertions and deletions. An example of a target line aligned to a single witness passage is:
```
{
"id": "scheffel_ekkehard_1855#f0261z751",
"begin": 721,
"text": "Grammatik iſt ein hohes Weib, anders erſcheint ſie Holzhackern, an-\n",
"wid": "scheffel_ekkehard_1855/0261",
"wbegin": 739,
"wtext": "Grammatik iſt ein hohes Weib, anders erſcheint ſie HolzhaFern , an=\n",
"talg": "Grammatik iſt ein hohes Weib, anders erſcheint ſie Holzhackern-, an‐\n",
"walg": "Grammatik iſt ein hohes Weib, anders erſcheint ſie Holzha-Fern , an=\n"
}
```
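Because the aligned strings are padded to equal length, corresponding columns can be compared directly. A sketch, assuming `-` is the padding character as in the example above; a robust implementation would need to distinguish padding from literal hyphens in the text:

```python
def alignment_ops(talg, walg, pad="-"):
    """Classify each column of a padded target/witness alignment pair."""
    assert len(talg) == len(walg)
    ops = []
    for t, w in zip(talg, walg):
        if t == pad:
            ops.append("witness-only")  # character only in the witness
        elif w == pad:
            ops.append("target-only")   # character only in the target
        elif t == w:
            ops.append("match")
        else:
            ops.append("sub")           # OCR or transcription difference
    return ops
```

For example, `alignment_ops("ca-t", "c-at")` yields `["match", "target-only", "witness-only", "match"]`.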

## <a name="locations"></a> Marking Locations inside Documents

Documents may record their extent on physical pages with the `pages` field. This field is an array of `Page` regions with the following schema (here written in Scala):
```
```
4 changes: 2 additions & 2 deletions bin/passim
@@ -5,6 +5,6 @@
PASSIM_HOME="$(cd "`dirname "$0"`"/..; pwd)"
SPARK_SUBMIT_ARGS="$SPARK_SUBMIT_ARGS"

spark-submit --class passim.PassimApp \
-  --packages 'com.github.scopt:scopt_2.11:3.5.0,graphframes:graphframes:0.7.0-spark2.4-s_2.11' \
+  --packages 'com.github.scopt:scopt_2.12:3.5.0,graphframes:graphframes:0.8.0-spark3.0-s_2.12' \
$SPARK_SUBMIT_ARGS \
-  "$PASSIM_HOME"/target/scala-2.11/passim_2.11-0.2.0.jar "$@"
+  "$PASSIM_HOME"/target/scala-2.12/passim_2.12-0.2.0.jar "$@"
10 changes: 5 additions & 5 deletions build.sbt
@@ -2,17 +2,17 @@
name := "passim"

version := "0.2.0"

-scalaVersion := "2.11.8"
+scalaVersion := "2.12.10"

resolvers += Resolver.mavenLocal

-libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0"
-libraryDependencies += "org.apache.spark" %% "spark-graphx" % "2.4.0"
-libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"
+libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.0"
+libraryDependencies += "org.apache.spark" %% "spark-graphx" % "3.0.0"
+libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0"

resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"

-libraryDependencies += "graphframes" % "graphframes" % "0.7.0-spark2.4-s_2.11"
+libraryDependencies += "graphframes" % "graphframes" % "0.8.0-spark3.0-s_2.12"

libraryDependencies += "com.github.scopt" %% "scopt" % "3.5.0"

Expand Down
2 changes: 1 addition & 1 deletion project/build.properties
@@ -14,4 +14,4 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
-sbt.version=0.13.18
+sbt.version=1.3.13
