Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement Trino source #585

Merged
merged 12 commits into from
Apr 22, 2024
Merged

implement Trino source #585

merged 12 commits into from
Apr 22, 2024

Conversation

domnikl
Copy link
Contributor

@domnikl domnikl commented Mar 15, 2024

I'd really like to see Trino being implemented as a source in ConnectorX and thus have begun working on it. It uses prusto as a client and currently supports all basic types. Support for tuple, array, row and uuid isn't implemented yet and there's a bug with time columns not being mapped correctly. Also I want to look into fetching results more efficiently and not everything at once.

What do you think, is it worth continuing work on it? Is there anything else I need to consider eg. regarding mapping the results?

@wangxiaoying
Copy link
Contributor

Thanks @domnikl for the PR. I think it in general looks great and complete!

Also I want to look into fetching results more efficiently and not everything at once.

Yes, I think this might be an issue in terms of performance. I don't have experience of using prusto before, but it seems like the get and get_next APIs are working for gradually fetching the results. Not sure whether it is easy to switch from get_all to these two.

time columns not being mapped correctly

I'm not sure what exactly the bug is. Currently we are converting the NaiveTime type in rust to String type in pandas in general (and also I see it in your pr), it is done by the function here. If you want to customize the conversion or mapping it to another type in pandas, it can be done by modifying the mapping in the macro as well as this function.

It would be great to also submit the corresponding seed database script here so others can run the python test_trino locally.



@pytest.fixture(scope="module") # type: ignore
def mysql_url() -> str:
Copy link
Contributor

@wangxiaoying wangxiaoying Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these functions should be rename to test_trino_xxx from test_mysql_xxx?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy & paste error, I'll rename it 👍🏻

@domnikl
Copy link
Contributor Author

domnikl commented Mar 22, 2024

@wangxiaoying thanks for looking into it!

I'm not sure what exactly the bug is

The problem is with reading timestamps with time zone types, prusto does not support it (yet) by the looks of it. Instead it returns EmptyData and it appears the result is just empty. But its not that important anyway I think to get starting. I'll create an issue in their project.

It would be great to also submit the corresponding seed database script

Will do in the next days and also rework fetching from the database with the get/get_next calls if possible.

@domnikl domnikl changed the title Draft: implement Trino source implement Trino source Apr 19, 2024
@domnikl
Copy link
Contributor Author

domnikl commented Apr 19, 2024

@wangxiaoying I implemented the missing partitioning as well as switched to get/get_next for prusto and fixed the tests and added test data and squashed some bugs along the way. Could you have a look at it again please?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we probably are not going to make trino in our CI on github workflow. Can you make all these tests skipped if TRINO_URL is not set? Something similar to oracle here.


#[throws(TrinoSourceError)]
fn get_total_rows(rt: Arc<Runtime>, client: Arc<Client>, query: &CXQuery<String>) -> usize {
rt.block_on(client.get_all::<Row>(query.to_string()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this will execute the entire query and get all the results, which may be a bottleneck in terms of performance (since the query will be ran twice). Can we make it to using the SELECT COUNT(*) query to get the total rows? We have a util function named count_query. An example can be found here.

@wangxiaoying
Copy link
Contributor

Hi @domnikl , thanks so much for completing the PR!

I've tested the test cases locally and it seems works well. I have two minor comments above but in general I think the code looks good and complete. We probably also need to add a documentation page for trino here.

I think after fixing the above minor issues, we can merge it and make it into our next release!

@domnikl
Copy link
Contributor Author

domnikl commented Apr 22, 2024

@wangxiaoying thanks for having a look! Added the missing pieces and a bit of documentation for it.

Copy link
Contributor

@wangxiaoying wangxiaoying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @domnikl , the update looks good to me! We can merge this PR after the CI is passed : )

@wangxiaoying wangxiaoying merged commit d20ddc5 into sfu-db:main Apr 22, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants