
feat: support read and write from hive datasource #100

Merged: 3 commits into vesoft-inc:master, Aug 19, 2024

Conversation

awang12345 (Contributor)

What type of PR is this?

  • feature

What problem(s) does this PR solve?

Issue(s) number:

Description:

Add a Hive data source for reading and writing.

How do you solve it?

hive: {
  # algo's data source from Hive
  read: {
    # [Optional] required when Spark and Hive are deployed on different clusters
    metaStoreUris: "thrift://hive-metastore-server-01:9083"
    # Spark SQL used to read the source data
    sql: "select column_1,column_2,column_3 from database_01.table_01"
    # [Optional] column of the SQL result mapped to the graph source vid
    srcId: "column_1"
    # [Optional] column of the SQL result mapped to the graph destination vid
    dstId: "column_2"
    # [Optional] column of the SQL result mapped to the edge weight
    weight: "column_3"
  }

  # algo result sink into Hive
  write: {
    # [Optional] required when Spark and Hive are deployed on different clusters
    metaStoreUris: "thrift://hive-metastore-server-02:9083"
    # Hive table the result is saved to
    dbTableName: "database_02.table_02"
    # [Optional] Spark DataFrame save mode: one of Append, Overwrite, ErrorIfExists, Ignore. Default is Overwrite
    saveMode: "Overwrite"
    # [Optional] whether to auto-create the Hive table. Default is true
    autoCreateTable: true
    # [Optional] mapping from algorithm result columns to Hive table column names.
    # Defaults to the column names of the algorithm result DataFrame
    resultTableColumnMapping: {
      # Note: different algorithms output different fields; the pagerank algorithm is used as an example here:
      _id: "column_1"
      pagerank: "pagerank_value"
    }
  }
}
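
For illustration, here is a minimal sketch of how the hive.read.* settings above could map to Spark code; the object and method names are assumptions for this example, not the PR's actual implementation:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch of the read path: connect to the configured metastore and run the SQL.
object HiveReadSketch {
  def read(metaStoreUris: String, sql: String): DataFrame = {
    val spark = SparkSession.builder()
      .appName("hive-datasource-read")
      // Point Spark at the (possibly remote) Hive metastore.
      .config("hive.metastore.uris", metaStoreUris)
      .enableHiveSupport()
      .getOrCreate()
    // srcId/dstId/weight then select columns from this DataFrame to build the edge list.
    spark.sql(sql)
  }
}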

Special notes for your reviewer, e.g. impact of this fix, design document, etc.:

The scenario where Spark and Hive run on different clusters could not be validated for lack of a test environment; all other cases have been verified.

val autoCreateTable: Boolean = getOrElse(config, "hive.write.autoCreateTable", true)
// Hive metastore address
val writeMetaStoreUris: String = getOrElse(config, "hive.write.metaStoreUris", "")
// Mapping between result fields and table columns, e.g. map _id in the algorithm result to user_id
Contributor:
please update the comment to English~
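
The getOrElse helper in the snippet above presumably reads typed values with defaults from the job's HOCON config. A minimal sketch of such a helper, assuming Typesafe Config; the PR's real implementation may differ:

import com.typesafe.config.Config

object ConfigSketch {
  // Return the value at `path` if present, otherwise the given default.
  def getOrElse[T](config: Config, path: String, default: T): T =
    if (config.hasPath(path)) config.getAnyRef(path).asInstanceOf[T]
    else default
}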

data.repartition(partitionNum)
}

data.show(3)
Contributor:
no need to show.

}

println(s"Save to hive:${config.dbTableName}, saveMode:${saveMode}")
_data.show(3)
Contributor:
ditto

awang12345 (author):
done
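
For context, a minimal sketch of the write path this PR adds (rename result columns per resultTableColumnMapping, then save with the configured mode); the object name and function signature are assumptions for illustration:

import org.apache.spark.sql.{DataFrame, SaveMode}

object HiveWriteSketch {
  def writeToHive(data: DataFrame,
                  dbTableName: String,
                  saveMode: String,
                  columnMapping: Map[String, String]): Unit = {
    // Rename algorithm-result columns to the target Hive column names.
    val renamed = columnMapping.foldLeft(data) {
      case (df, (resultCol, tableCol)) => df.withColumnRenamed(resultCol, tableCol)
    }
    renamed.write
      .mode(SaveMode.valueOf(saveMode)) // Append / Overwrite / ErrorIfExists / Ignore
      .saveAsTable(dbTableName)
  }
}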

@Nicole00 (Contributor) left a comment:
LGTM

@Nicole00 merged commit 4accdfe into vesoft-inc:master on Aug 19, 2024.
2 checks passed