
Dataframe-ify the output of Anserini-Spark #13

Open
wants to merge 41 commits into master
Conversation

mayankanand007
Contributor

  1. PySpark: Converting the PySpark RDD to a DataFrame was successful, and all DataFrame operations work as expected.
  2. Scala Spark: There are issues converting `docs2`, which has the type `org.apache.spark.api.java.JavaRDD[java.util.HashMap[String,String]]`, to a DataFrame in Scala: conventional options such as `toDF()` or `spark.createDataFrame` do not support arguments of this particular type. We still need to figure out how to do the conversion if we want to read the entire document as a HashMap. There is also no direct method that works with a Scala RDD.
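For item 2, one possible workaround is to go through an `RDD[Row]` with an explicit schema: `spark.createDataFrame` does accept an `RDD[Row]` plus a `StructType`, so each `java.util.HashMap` can be converted to an immutable Scala `Map` and wrapped in a `Row` under a single `MapType` column. This is only a sketch, not tested against Anserini-Spark; it assumes `docs2` is the `JavaRDD[java.util.HashMap[String,String]]` described above and `spark` is an active `SparkSession`.

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{MapType, StringType, StructField, StructType}

// Assumption: docs2 is the JavaRDD[java.util.HashMap[String, String]] from this PR,
// and spark is an active SparkSession.
val rowRdd = docs2.rdd.map { m =>
  // Convert the Java HashMap to an immutable Scala Map so Spark can encode it.
  Row(m.asScala.toMap)
}

// One column holding the whole document as a map<string, string>.
val schema = StructType(Seq(StructField("doc", MapType(StringType, StringType))))
val df = spark.createDataFrame(rowRdd, schema)

// Individual fields can then be projected out, e.g.:
// df.select(df("doc").getItem("id").as("id")).show()
```

If per-field columns are ultimately wanted, an alternative is to map each HashMap to a case class (or a fixed tuple of known fields) before calling `toDF()`, which sidesteps the `MapType` column entirely at the cost of fixing the schema up front.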

@lintool, do let me know if this is what you were expecting; I can make changes accordingly.
