This repository provides examples demonstrating how to use Oracle Cloud Infrastructure Data Flow, a service that lets you run any Apache Spark application at any scale with no infrastructure to deploy or manage.
Oracle Cloud Infrastructure (OCI) Data Flow is a cloud-based serverless platform with a rich user interface. It allows Spark developers and data scientists to create, edit, and run Spark jobs at any scale without the need for clusters, an operations team, or highly specialized Spark knowledge. Being serverless means there is no infrastructure for you to deploy or manage. It is entirely driven by REST APIs, giving you easy integration with applications or workflows. You can:
- Connect to Apache Spark data sources.
- Create reusable Apache Spark applications.
- Launch Apache Spark jobs in seconds.
- Manage all Apache Spark applications from a single platform.
- Process data in the Cloud or on-premises in your data center.
- Create Big Data building blocks that you can easily assemble into advanced Big Data applications.
Before you begin, you must have set up your tenancy and be able to access Data Flow:
- Set Up Your Tenancy: Before Data Flow can run, you must grant permissions that allow effective log capture and run management. See the Set Up Administration section of the Data Flow Service Guide and follow the instructions given there.
- Access Data Flow: Refer to the section on how to Access Data Flow.
Example | Description | Python | Java | Scala |
---|---|---|---|---|
CSV to Parquet | This application shows how to use PySpark to convert CSV data stored in OCI Object Store to Apache Parquet format, which is then written back to Object Store. | CSV to Parquet | CSV to Parquet | CSV to Parquet |
Load to ADW | This application shows how to read a file from OCI Object Store, perform some transformations, and write the results to an Autonomous Data Warehouse instance. | Load to ADW | Load to ADW | Load to ADW |
Structured Streaming Kafka Word Count | This Structured Streaming application shows how to read a Kafka stream and calculate word frequencies over a one-minute window. | Structured Kafka Word Count | Structured Kafka Word Count | |
Random Forest Regression | This application shows how to build a model and make predictions using Random Forest Regression. | Random Forest Regression | | |
Oracle NoSQL Database cloud service | This application shows how to interface with Oracle NoSQL Database cloud service. | Oracle NoSQL Database cloud service | | |
For step-by-step instructions, see the README files included with each sample.
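
As a rough illustration of the CSV to Parquet pattern listed in the table above, the following PySpark sketch reads a CSV file and writes it back out as Parquet. It is not the sample's actual code: the helper names `oci_uri` and `csv_to_parquet`, the header and schema-inference options, and the overwrite mode are assumptions made for this example. The URI helper assumes the `oci://<bucket>@<namespace>/<path>` form for Object Storage locations.

```python
import sys


def oci_uri(bucket: str, namespace: str, path: str) -> str:
    """Build an Object Storage URI of the form oci://<bucket>@<namespace>/<path>."""
    return f"oci://{bucket}@{namespace}/{path.lstrip('/')}"


def csv_to_parquet(spark, input_uri: str, output_uri: str) -> None:
    """Read a CSV file that has a header row and write it out as Parquet."""
    df = (
        spark.read
        .option("header", "true")       # assumption: first row holds column names
        .option("inferSchema", "true")  # assumption: let Spark guess column types
        .csv(input_uri)
    )
    # Assumption: replace any previous output at the same location.
    df.write.mode("overwrite").parquet(output_uri)


# Run the conversion only when both URIs are given on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-parquet-sketch").getOrCreate()
    csv_to_parquet(spark, sys.argv[1], sys.argv[2])
    spark.stop()
```

The input and output arguments can be local paths for a test run or `oci://` URIs built with `oci_uri` when the data lives in Object Store; the bucket and namespace values are yours to supply.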
These samples show how to use the OCI Data Flow service and are meant to be deployed to and run from Oracle Cloud. You can optionally test these applications locally before you deploy them. When they are ready, you can deploy them to Data Flow without any need to reconfigure them, make code changes, or apply deployment profiles. To test these applications locally, Apache Spark must be installed. Refer to the Setup locally section for the prerequisites to complete before you run an application locally.
To set up an MLflow Tracking Server, refer to dataflow-mlflow-integration.
To install Spark, visit spark.apache.org and pick the installation path that best suits your environment.
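
One way to keep a single script usable both locally and on Data Flow is to choose Spark settings from the environment at startup. The sketch below is a minimal illustration under stated assumptions, not a documented API: it assumes that a `HOME` value of `/home/dataflow` indicates the Data Flow runtime (a heuristic you should verify against the sample code), and the names `in_dataflow`, `builder_settings`, and `get_spark_session` are hypothetical helpers for this example.

```python
import os


def in_dataflow(environ=None) -> bool:
    """Heuristic (assumption): HOME=/home/dataflow means we run inside Data Flow."""
    env = os.environ if environ is None else environ
    return env.get("HOME") == "/home/dataflow"


def builder_settings(app_name: str, environ=None) -> dict:
    """Return the Spark properties to apply for the current environment."""
    settings = {"spark.app.name": app_name}
    if not in_dataflow(environ):
        # Local test run: use every core on this machine.
        settings["spark.master"] = "local[*]"
    return settings


def get_spark_session(app_name: str):
    """Create (or reuse) a SparkSession configured for the current environment."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in builder_settings(app_name).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With this approach the application code calls `get_spark_session("my-app")` in both environments, and only the local run sets a master explicitly.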
You can find the online documentation for Oracle Cloud Infrastructure Data Flow at docs.oracle.com.
- Open a GitHub issue for bug reports, questions, or requests for enhancements.
- Post your question on the OCI Data Flow Community.
Please consult the security guide for our responsible security vulnerability disclosure process.
This project welcomes contributions from the community. Before submitting a pull request, please review our contribution guide.
See LICENSE