Skip to content

HDFS URI support does not support network location #162

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
vvaten opened this issue Dec 22, 2017 · 3 comments
Open

HDFS URI support does not support network location #162

vvaten opened this issue Dec 22, 2017 · 3 comments

Comments

@vvaten
Copy link

vvaten commented Dec 22, 2017

It seems that "hdfs://host:port/path/file.txt" url scheme is not supported by smart open currently. Therefore, with smart_open it is only possible to access the local HDFS filesystem with the hdfs:// protocol. The URI is parsed correctly by the urlsplit, but the network location is always ignored.

The hdfs dfs command supports the host:port in the URI out of the box, so adding this would be very easy. However, then the url scheme "hdfs://tmp/test.txt" would no longer work as the 'tmp' here would be interpreted as the network location, which it really is if you read the URI literally according to specification.

For me it would be expected result that:

  • ParseUri('hdfs://host:port/path/file.txt', 'wb') ==> ["hdfs", "dfs", "-text", "hdfs://host:port/path/test.txt"]
  • ParseUri ('hdfs:///path/file.txt', 'wb') ==> ["hdfs", "dfs", "-text", "/path/test.txt"]
  • ParseUri ('hdfs://host/path/file.txt', 'wb') ==> ["hdfs", "dfs", "-text", "hdfs://host/path/test.txt"]

Let me know what you think.

@menshikh-iv
Copy link
Contributor

Looks good for me (only /// in second example slightly confused).
"port" support will be a good addition, I agree @vvaten.

@vvaten
Copy link
Author

vvaten commented Jan 2, 2018

Great!

For the second example there would be two alternative implementations:

  • ParseUri ('hdfs:///path/file.txt', 'wb') ==> ["hdfs", "dfs", "-text", "/path/test.txt"]
  • ParseUri ('hdfs:///path/file.txt', 'wb') ==> ["hdfs", "dfs", "-text", "hdfs:///path/test.txt"]

I.e. the 'hdfs://' is optional in the URI if there is no host or port specified. It defaults to the localhost and the default port. For simplicity, I'd suggest that we always keep the protocol there.

I'll make a PR for this.

@vvaten
Copy link
Author

vvaten commented Jan 2, 2018

Here's the PR: #168

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants