Add Flink-Hive use case in Jupyter notebook (#91) #98
base: main
Changes from all commits: 440ffd2, df2fc17, a6f4ac4, 2e96e61
.gitignore
@@ -1,2 +1,3 @@
**/.idea
**/.DS_Store
**/packages/**
docker-compose.yaml
@@ -145,6 +145,12 @@ services:
volumes:
- ./init/jupyter:/tmp/gravitino
entrypoint: /bin/bash /tmp/gravitino/init.sh
environment:
- HADOOP_CLASSPATH=/tmp/gravitino/packages/hadoop-2.7.3/etc/hadoop:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/yarn/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/yarn/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/mapreduce/*:/tmp/gravitino/packages/contrib/capacity-scheduler/*.jar
Review comment: Is it necessary to add the YARN and MapReduce jars to the CLASSPATH? It seems only HDFS is needed. (A hedged sketch of an HDFS-only classpath follows this hunk.)
- NB_USER=my-username
- GRANT_SUDO=yes
Review comment: Why add the environment variables below and run as root?

Reply: The GRANT_SUDO=yes and CHOWN_HOME=yes settings were added to allow users to install the JDK and PyFlink directly within the Jupyter notebook environment.
- CHOWN_HOME=yes
user: root
depends_on:
hive:
condition: service_healthy
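In response to the classpath question above, here is a minimal sketch, not part of this PR, of what an HDFS-only HADOOP_CLASSPATH could look like if the YARN and MapReduce entries are indeed unnecessary; it assumes this PR's hadoop-2.7.3 layout and is written in Python only for readability:

import os

# Hypothetical HDFS-only classpath, assuming this PR's layout under
# /tmp/gravitino/packages; the YARN and MapReduce entries are omitted.
HADOOP_HOME = "/tmp/gravitino/packages/hadoop-2.7.3"
os.environ["HADOOP_CLASSPATH"] = ":".join([
    f"{HADOOP_HOME}/etc/hadoop",                 # Hadoop configuration
    f"{HADOOP_HOME}/share/hadoop/common/lib/*",  # common dependencies
    f"{HADOOP_HOME}/share/hadoop/common/*",      # hadoop-common jars
    f"{HADOOP_HOME}/share/hadoop/hdfs",          # HDFS resources
    f"{HADOOP_HOME}/share/hadoop/hdfs/lib/*",    # HDFS dependencies
    f"{HADOOP_HOME}/share/hadoop/hdfs/*",        # HDFS client jars
])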
New file: Jupyter notebook (Gravitino Flink-Hive example)
@@ -0,0 +1,225 @@

Review comment: This case looks good to me. Maybe we can add more operations like ALTER TABLE and DROP TABLE later. (A hedged sketch of these operations follows the notebook below.)

Reply: I tested the operations and got Py4JJavaError: An error occurred while calling o31.executeSql. I find it quite strange because other commands work fine. I'm wondering if the issue could be related to the …

Reply: @TungYuChiang Could you replace the …

Reply: @TungYuChiang I created an issue: apache/gravitino#5534. Could you submit a patch to fix it?

Reply: @coolderli Thank you for your guidance!

{
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "27e1e593-bf96-4977-a272-c6884d46b9e3", | ||
"metadata": {}, | ||
"source": [ | ||
"# Gravitino Flink-Hive Example" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e604cee4-43d6-472c-8227-75e339b92c9f", | ||
"metadata": {}, | ||
"source": [ | ||
"## Setting Up PyFlink with Hive and Gravitino Connectors" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "b13b0b0b-6aca-4cbb-8771-a10f4c79a017", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!sudo apt-get update && sudo apt-get install -y openjdk-17-jdk" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "6ef94f47-5718-4c35-82ce-90bd2c00927a", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import os\n", | ||
"\n", | ||
"os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-17-openjdk-arm64\"\n", | ||
"os.environ[\"PATH\"] = f\"{os.environ['JAVA_HOME']}/bin:\" + os.environ[\"PATH\"]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "a3c975dc-afa1-4057-9990-6d4b8c06749b", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!python3 -m pip install apache-flink" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "f0cf8f3e-14f9-4209-8103-a3a0c598a21a", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from pyflink.table import EnvironmentSettings, TableEnvironment\n", | ||
"from pyflink.common import Configuration\n", | ||
"from pyflink.table.expressions import col\n", | ||
"from pyflink.table import DataTypes\n", | ||
"\n", | ||
"configuration = Configuration()\n", | ||
" \n", | ||
"configuration.set_string(\n", | ||
" \"pipeline.jars\",\n", | ||
" \"file:///tmp/gravitino/packages/gravitino-flink-connector-runtime-1.18_2.12-0.6.1-incubating.jar;\"\n", | ||
" \"file:///tmp/gravitino/packages/flink-sql-connector-hive-2.3.10_2.12-1.20.0.jar\"\n", | ||
" )\n", | ||
"configuration.set_string(\"table.catalog-store.kind\", \"gravitino\")\n", | ||
"configuration.set_string(\"table.catalog-store.gravitino.gravitino.uri\", \"http://gravitino:8090\")\n", | ||
"configuration.set_string(\"table.catalog-store.gravitino.gravitino.metalake\", \"metalake_demo\")\n", | ||
"\n", | ||
"env_settings = EnvironmentSettings.new_instance().with_configuration(configuration)\n", | ||
"table_env = TableEnvironment.create(env_settings.in_batch_mode().build())\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "d2f87a66-6bd7-4d54-8912-7c03e0661b0f", | ||
"metadata": {}, | ||
"source": [ | ||
"## Write Queries " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "f1037708-56a3-4b7a-80a1-b1015e928a03", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"table_env.use_catalog(\"catalog_hive\")\n", | ||
"table_env.execute_sql(\"CREATE DATABASE IF NOT EXISTS Reading_System\")\n", | ||
"table_env.execute_sql(\"USE Reading_System\")\n", | ||
"table_env.execute_sql(\"\"\"\n", | ||
" CREATE TABLE IF NOT EXISTS books (\n", | ||
" id INT,\n", | ||
" title STRING,\n", | ||
" author STRING,\n", | ||
" publish_date STRING\n", | ||
" ) \n", | ||
"\"\"\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "0996b4d2-35dc-456c-9b08-52b8beb8fe86", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"result = table_env.execute_sql(\"SHOW DATABASES\")\n", | ||
"with result.collect() as results:\n", | ||
" for row in results:\n", | ||
" print(row)\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "b860f1dc-cb63-46a4-aff1-bbbe8d47fee8", | ||
"metadata": {}, | ||
"source": [ | ||
"### Write Table API Queries" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "69eea5be-73c9-489a-b294-74bdea0f6bf7", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"new_books = table_env.from_elements(\n", | ||
" [\n", | ||
" (4, 'The Great Gatsby', 'F. Scott Fitzgerald', '1925-04-10'),\n", | ||
" (5, 'Moby Dick', 'Herman Melville', '1851-11-14')\n", | ||
" ],\n", | ||
" DataTypes.ROW([\n", | ||
" DataTypes.FIELD(\"id\", DataTypes.INT()),\n", | ||
" DataTypes.FIELD(\"title\", DataTypes.STRING()),\n", | ||
" DataTypes.FIELD(\"author\", DataTypes.STRING()),\n", | ||
" DataTypes.FIELD(\"publish_date\", DataTypes.STRING())\n", | ||
" ])\n", | ||
")\n", | ||
"\n", | ||
"\n", | ||
"new_books.execute_insert('books').wait()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "46c223b3-a08b-4ffc-b6b0-1e73451036d6", | ||
"metadata": {}, | ||
"source": [ | ||
"### Write SQL Queries" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "228d0ca3-8ad2-4b53-ae99-c430713aeb02", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"table_env.execute_sql(\"\"\"\n", | ||
" INSERT INTO books VALUES \n", | ||
" (6, 'Pride and Prejudice', 'Jane Austen', '1813-01-28'),\n", | ||
" (7, 'The Catcher in the Rye', 'J.D. Salinger', '1951-07-16')\n", | ||
"\"\"\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "d27ed14a-4b77-42c6-8c7f-b3e75cb92519", | ||
"metadata": {}, | ||
"source": [ | ||
"### Result" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "5232e493-a699-4d50-b489-4de4652bf344", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"result = table_env.execute_sql(\"SELECT * FROM books\")\n", | ||
"with result.collect() as results:\n", | ||
" for row in results:\n", | ||
" print(row)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "2f90cdec-a989-4e3f-b7cb-262a82e88e4f", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.5" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
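As suggested in the review thread above, here is a minimal sketch of the follow-up ALTER TABLE and DROP TABLE operations, reusing the table_env from the notebook; the thread reports a Py4JJavaError on executeSql in some cases, so treat this as illustrative rather than verified:

# Hedged sketch of the extra operations suggested in review, reusing the
# notebook's table_env (catalog "catalog_hive", database Reading_System).
table_env.execute_sql("ALTER TABLE books RENAME TO books_archive")  # rename the table
table_env.execute_sql("DROP TABLE IF EXISTS books_archive")         # drop the renamed table
table_env.execute_sql("DROP DATABASE IF EXISTS Reading_System")     # drop the now-empty database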
init/jupyter/init.sh
@@ -33,3 +33,26 @@ fi
ls "${jupyter_dir}/packages/" | xargs -I {} rm "${jupyter_dir}/packages/"{} | ||
find "${jupyter_dir}/../spark/packages/" | grep jar | xargs -I {} ln {} "${jupyter_dir}/packages/" | ||
|
||
FLINK_HIVE_CONNECTOR_JAR="https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-hive-2.3.10_2.12/1.20.0/flink-sql-connector-hive-2.3.10_2.12-1.20.0.jar" | ||
Review comment: @FANNG1 Should we package the flink-sql-connector into the gravitino-flink-runtime? The kyuubi-spark-connector-hive is packaged into the spark-runtime. Do we need to maintain consistency?

Reply: I've removed the redundant Gravitino-Flink package as discussed. Thanks for the suggestion!
FLINK_HIVE_CONNECTOR_MD5="${FLINK_HIVE_CONNECTOR_JAR}.md5"
download_and_verify "${FLINK_HIVE_CONNECTOR_JAR}" "${FLINK_HIVE_CONNECTOR_MD5}" "${jupyter_dir}"

GRAVITINO_FLINK_CONNECTOR_RUNTIME_JAR="https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-flink-connector-runtime-1.18_2.12/0.6.1-incubating/gravitino-flink-connector-runtime-1.18_2.12-0.6.1-incubating.jar"
GRAVITINO_FLINK_CONNECTOR_RUNTIME_MD5="${GRAVITINO_FLINK_CONNECTOR_RUNTIME_JAR}.md5"
download_and_verify "${GRAVITINO_FLINK_CONNECTOR_RUNTIME_JAR}" "${GRAVITINO_FLINK_CONNECTOR_RUNTIME_MD5}" "${jupyter_dir}"

HADOOP_VERSION="2.7.3"
HADOOP_URL="https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"
Review comment: It may take too much time to download Hadoop in a low-bandwidth environment; is only the HDFS client needed here?

Reply: Thank you for the feedback. I will work on optimizing the dependencies by using only the HDFS client instead of the full Hadoop installation.

Reply: @FANNG1, @coolderli I was wondering if it's possible to download only a subset of Hadoop components instead of the full package. Could you provide some guidance or assistance on how to achieve this? Any help would be greatly appreciated!

Reply: It seems there is no link to download a Hadoop client bundle jar that contains the dependency jars.

Reply: @xunliu WDYT?

Reply: @TungYuChiang …
echo "Downloading Hadoop ${HADOOP_VERSION}..." | ||
|
||
curl -fLo "${jupyter_dir}/packages/hadoop-${HADOOP_VERSION}.tar.gz" "$HADOOP_URL" || { echo "Failed to download Hadoop ${HADOOP_VERSION}"; exit 1; } | ||
echo "Extracting Hadoop ${HADOOP_VERSION}..." | ||
|
||
tar -xzf "${jupyter_dir}/packages/hadoop-${HADOOP_VERSION}.tar.gz" -C "${jupyter_dir}/packages" | ||
rm "${jupyter_dir}/packages/hadoop-${HADOOP_VERSION}.tar.gz" | ||
|
||
echo "Hadoop ${HADOOP_VERSION} downloaded and extracted to ${jupyter_dir}/packages/hadoop-${HADOOP_VERSION}" | ||
|
||
|
||
|
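For context on the helper used above, here is a hedged Python rendering of the download-and-verify pattern that init.sh's download_and_verify shell function implements (fetch the artifact and its .md5 sidecar, then compare digests); the real helper is a shell function defined elsewhere in the script, so the details here are illustrative:

import hashlib
import os
import urllib.request

# Illustrative Python equivalent of init.sh's download_and_verify helper:
# download a jar plus its ".md5" sidecar and fail if the digests differ.
def download_and_verify(jar_url: str, md5_url: str, dest_dir: str) -> str:
    dest = os.path.join(dest_dir, os.path.basename(jar_url))
    urllib.request.urlretrieve(jar_url, dest)                              # fetch the jar
    expected = urllib.request.urlopen(md5_url).read().decode().split()[0]  # published digest
    with open(dest, "rb") as f:
        actual = hashlib.md5(f.read()).hexdigest()                         # local digest
    if actual != expected:
        raise RuntimeError(f"MD5 mismatch for {dest}: {actual} != {expected}")
    return dest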
Review comment (on the .gitignore change): It seems a little odd to ignore packages here.

Reply: I added this line following the suggestion from @xunliu. Perhaps it would be better to only include init/jupyter/packages instead? Let me know if that works.

Reply: @FANNG1 Because we don't need to commit these downloaded jar files to the git repo.