Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read Change Data Feed from a Table returns "change data was not recorded for version *" error #433

Open
LazyRen opened this issue Nov 9, 2023 · 0 comments

Comments

@LazyRen
Copy link

LazyRen commented Nov 9, 2023

Hello, I am working on a standalone Delta Sharing server with S3 storage ATM.

Below is the PySpark Script I've used to create & update table.

from pyspark.sql import SparkSession
from delta import *

s3_url = "<my_url>"

builder = SparkSession.builder \
    .appName("quickstart") \
    .master("local[*]") \
    .config("spark.hadoop.fs.s3a.path.style.access", True) \
    .config(
      "spark.hadoop.fs.s3a.impl",
      "org.apache.hadoop.fs.s3a.S3AFileSystem"
    ) \
    .config("spark.jars", "/aws-java-sdk-1.12.424/lib/aws-java-sdk-1.12.424.jar") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.catalog.spark_catalog.type", "hadoop") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config('spark.hadoop.fs.s3a.access.key', "<access_key>") \
    .config('spark.hadoop.fs.s3a.secret.key', "<secret_key>") \
    .config('spark.jars', "/hadoop-aws-3.3.2.jar, /aws-java-sdk-bundle-1.12.425.jar") \
    .config("spark.sql.warehouse.dir", s3_url)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

spark.sql('''CREATE OR REPLACE TABLE `courses`(
    cid                 Integer,
    department          String,
    cap                 Integer,
    instructor          String,
    easy                Double,
    useful              Double
)
USING DELTA
TBLPROPERTIES (delta.enableChangeDataFeed = true)
''')

# Version 0 - insert
spark.sql('''INSERT INTO `courses` VALUES
(341, 'CS', 80, 'Armin Jamshidpey', 0.54, 0.99),
(347, 'PMATH', 65, 'David McKinnon', 0.43, 0.90),
(488, 'CS', 45, 'Toshiya Hachisuka', 0.58, 0.46),
(104, 'PHY', 1200, 'Joseph Smith', 0.92, 0.25),
(130, 'PHY', 250, 'Maria Elrena', 0.79, 0.63),
(140, 'MUSIC', 200, 'Simon Wood', 0.87, 0.58),
(246, 'CS', 90, 'Mark Petrick', 0.59, 0.66),
(250, 'AMATH', 200, 'John Wick', 0.75, 0.86),
(251, 'AMATH', 200, 'OPT 347', 0.75, 0.86),
(370, 'CS', 105, 'Jeff Orchard', 0.61, 0.86)
''')

# Version 1 - delete
spark.sql('''DELETE FROM `courses` WHERE
department = 'AMATH'
''')

# Version 2 - update
spark.sql('''UPDATE `courses`
SET instructor = 'Richard Feynman' WHERE department = 'PHY'
''')

I have enabled CDF by TBLPROPERTIES (delta.enableChangeDataFeed = true).

And executing

spark.read.format("delta") \
  .option("readChangeFeed", "true") \
  .option("startingVersion", 1) \
  .option("endingVersion", 1) \
  .table("`courses`") \
  .show(100)

returns a proper records, while calling Read Change Data Feed from a Table returns

ec2-user:~> curl http://localhost:8080/delta-sharing/shares/demos3select/schemas/school/tables/courses/changes?startingVersion=1&endingVersion=1
[1] 24330
ec2-user:~> {"errorCode":"INVALID_PARAMETER_VALUE","message":"Error getting change data for range [1, 2] as change data was not recorded for version [1]"}
[1]+  Done                    curl http://localhost:8080/delta-sharing/shares/demos3select/schemas/school/tables/courses/changes?startingVersion=1

Both DS 0.7.4 & DS 1.0.2 returns the same error.

I have attached delta table data below. Could you please check what went wrong here?
Thank you.

(By the way, where can I see the documentation on the yaml config file? Took me a good 5 minute to figure out what went wrong after updating DS 0.7.4 to 1.0.2. Turns out config name cdfEnabled was changed to historyShared)

DS server config:

# The format version of this config file
version: 1
# Config shares/schemas/tables to share
shares:
- name: "demos3select"
  schemas:
  - name: "school"
    tables:
    - name: "courses"
      location: "s3a://<s3_url>/delta_lake/poc/courses"
      id: "00000000-0000-0000-0000-000000000000"
      historyShared: true

courses.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant