Home
Welcome to the hadoopoffice wiki!
hadoopoffice is a library for reading and writing Office documents, such as MS Excel spreadsheets, on Hadoop and its ecosystem components (e.g. Spark/Hive). The data in the documents can be combined with any other data you have on Hadoop.
It contains the following components:
- Hadoop File Format to enable any MapReduce/Tez/Spark application to read rows of data from Office documents stored in HDFS (or any other supported file system). It also supports writing Office documents to HDFS (or any other supported file system). This format supports both the original MapReduce API (mapred.*) and the newer MapReduce API (mapreduce.*)
- Hive Serde to use Hive and SQL queries to read/write Office documents from files in HDFS (or any other supported file system)
- Flink DataSource/DataSink to use Apache Flink to read/write Office documents (recommended if you want more control, e.g. defining formulas in spreadsheets, comments etc.)
- Spark Datasource to read/write Office documents with the HadoopOffice library via the Spark DataSource API
- DEPRECATED: Flink TableSource/TableSink to use the Table API of Apache Flink to read/write Office documents using SQL
Supported formats:
- MS Excel (*.xls, *.xlsx) based on parsers and writers of Apache POI
How-To Guides:
- MapReduce: Convert the rows in an Excel document to CSV
- MapReduce: Convert rows in CSV to an Excel document
- Spark: Read Excel document using Spark 1.x
- Spark: Write Excel document using Spark 1.x
- Spark2 Datasource: Read an Excel document using the Spark2 datasource API
- Spark2 Datasource: Write an Excel document using the Spark2 datasource API
- Hive: Query and Write Excel documents using SQL in Apache Hive
- Flink: Using Apache Flink to read/write Excel documents
- Working with templates to include advanced Excel features, such as diagrams
- Saving CPU/memory resources with low footprint mode
- Encrypting your files for improved security
- Digitally signing files to enable non-repudiation
- Improve performance when processing many small Office files with Hadoop Archives (HAR)
- Convert spreadsheet cells to simple datatypes and vice versa (e.g. convert Excel cells to data as String, byte, short, BigDecimal and many more)
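To give a rough idea of what converting cells to simple datatypes means, here is a minimal standalone sketch in plain Java. It does not use the hadoopoffice API: the class `CellTypeSketch` and the method `toSimpleType` are hypothetical names invented for this illustration, and the sketch ignores locale-specific number and date formats, which a real spreadsheet conversion has to handle. It picks the narrowest type that can hold the text of a cell.

```java
import java.math.BigDecimal;

/** Hypothetical helper for illustration only; not part of the hadoopoffice API. */
public class CellTypeSketch {

    /** Returns the cell text converted to the narrowest simple type that fits, or null for an empty cell. */
    public static Object toSimpleType(String cellText) {
        if (cellText == null || cellText.trim().isEmpty()) {
            return null; // empty cell
        }
        String text = cellText.trim();
        if (text.equalsIgnoreCase("true") || text.equalsIgnoreCase("false")) {
            return Boolean.valueOf(Boolean.parseBoolean(text));
        }
        try {
            // Whole numbers: choose byte, short, int or long depending on the range.
            long value = Long.parseLong(text);
            if (value >= Byte.MIN_VALUE && value <= Byte.MAX_VALUE) {
                return Byte.valueOf((byte) value);
            }
            if (value >= Short.MIN_VALUE && value <= Short.MAX_VALUE) {
                return Short.valueOf((short) value);
            }
            if (value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE) {
                return Integer.valueOf((int) value);
            }
            return Long.valueOf(value);
        } catch (NumberFormatException e) {
            // Not a whole number (or out of long range); try a decimal next.
        }
        try {
            return new BigDecimal(text);
        } catch (NumberFormatException e) {
            return text; // fall back to the plain string
        }
    }

    public static void main(String[] args) {
        for (String cell : new String[] {"42", "1000", "3.14", "true", "hello"}) {
            Object value = toSimpleType(cell);
            System.out.println(cell + " -> " + value.getClass().getSimpleName());
        }
    }
}
```

The how-to guide linked above describes the conversion that the library itself provides; the sketch only shows the general idea of mapping cell text to a simple type.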
Find here the status from the continuous integration (CI) platform:
Find here the status from the static code analysis platforms:
- Sonarqube: https://sonarcloud.io/dashboard?id=ZuInnoTe%3Ahadoopoffice
- Codacy (also covers Scala): https://www.codacy.com/app/jornfranke/hadoopoffice
Find here the OpenHub report.
Join us on Gitter.im