generating row number while mapping from CSV source #5

Open
seralf opened this issue Oct 13, 2016 · 3 comments

seralf commented Oct 13, 2016

Hi

I need to implement a prototype mapping on a CSV source that has no cell containing an identifier I can use to construct URIs. A possible workaround in my case is to generate a simple id column (with its values) during an ETL pre-processing phase, but I wonder whether there is a more proper solution within the specification.
For example, in CSVW it's possible to describe data using a _row variable designed to expose the row number during processing. Is it possible to use this approach with RML over CSV? If so, could someone provide a simple example?
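
For reference, the ETL pre-processing workaround mentioned above could look roughly like the following Python sketch. It is only an illustration, not part of RML or of any RML processor: the function name `add_row_ids` and the column name `row_id` are made up here, and the sample data is invented. The ids are plain 1-based positions, so they shift whenever a row is inserted or removed.

```python
import csv
import io


def add_row_ids(src, dst, id_column="row_id"):
    """Copy CSV rows from src to dst, prepending a 1-based row number column.

    `add_row_ids` and `row_id` are hypothetical names used for illustration.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to copy
    writer.writerow([id_column] + header)
    for number, row in enumerate(reader, start=1):
        writer.writerow([number] + row)


# Invented sample data; in practice src and dst would be open files.
source = io.StringIO("name,city\nAda,London\nAlan,Manchester\n")
target = io.StringIO()
add_row_ids(source, target)
print(target.getvalue())
```

The mapping could then reference the generated `row_id` column like any other CSV column when building subject URIs.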

Otherwise, is it possible to provide this functionality as an extension in Java? Which interfaces are involved?

Thank you in advance for any directions or suggestions.

andimou commented Oct 13, 2016

Hm, my concern in this case would be how persistent your URIs would be. Row numbers are a workaround, but are you sure they would solve your problem without creating another one? What if a new row is inserted? Wouldn't that affect your whole dataset?

I'm against using row numbers as identifiers, particularly for CSV files, but of course this is a personal opinion. Why not blank nodes? They behave the same as row-based generated URIs without creating the expectation that they are persistent. That's at least what I do in these cases, and you can do it in RML by setting the subject map's term type to blank node:

[ ] rml:subjectMap [ rr:termType rr:BlankNode ] .

seralf commented Oct 13, 2016

Hi, thank you for the reply and the suggestion! :-)

I'm against row numbers as well, but in this particular case they were a requirement for this prototyping phase: later we will be able to adopt a more proper solution, but at this stage I need to generate something easily on a per-row basis in order to proceed. Ideally I'd like to use something like a hash value, which is a bit opaque but still more useful than a blank node. I'd like to avoid blank nodes... Anyway, I like your suggestion, because skolemization will probably generate a better id than my hash (or, worse, the numbers, as you said). I hadn't thought about blank nodes in this particular case, I don't know why. Thanks for suggesting it, I'll definitely adopt your solution for this phase :-)
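
To make the hash idea concrete, here is a minimal Python sketch of a content-based row id, again as a pre-processing step outside RML. The names `row_hash_id` and `ids_for`, the choice of SHA-256, the 16-character truncation, and the sample data are all assumptions made for illustration. Unlike row numbers, these ids do not shift when a row is inserted elsewhere in the file.

```python
import csv
import hashlib
import io


def row_hash_id(row, length=16):
    """Return a hex digest prefix computed from the row's cell values.

    Hypothetical helper: SHA-256 and the 16-character prefix are arbitrary choices.
    """
    # Join cells with the ASCII unit separator so ["a", "b"] and ["ab", ""] differ.
    payload = "\x1f".join(row).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:length]


def ids_for(csv_text):
    """Compute one hash id per data row, skipping the header line."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip header
    return [row_hash_id(row) for row in reader]


# Invented sample data: the second file has one extra row inserted at the top.
before = ids_for("name,city\nAda,London\nAlan,Manchester\n")
after = ids_for("name,city\nGrace,New York\nAda,London\nAlan,Manchester\n")

# The ids of the unchanged rows are the same in both files.
print(before == after[1:])  # True
```

Two caveats apply to this kind of id: fully identical rows get the same id, and editing any cell of a row changes its id.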

Anyway, this leads me to expand on one of my previous questions: is there any way to extend the framework, for example by writing a custom component that generates a particular skolemization?

seralf commented Dec 7, 2016

Hi, a good default option seems to be the ability to adopt well-known URIs to replace blank nodes automatically.
For example, a blank node _:6CjhEG5FkH could be rewritten as something like: http://example/.well-known/genid/6CjhEG5FkH

as suggested in https://www.w3.org/TR/rdf11-concepts/#rdf-documents (section "3.5 Replacing Blank Nodes with IRIs")

However, the possible adoption of user-made functions to produce IDs/URIs/IRIs using custom criteria would be very useful in several practical scenarios.
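
The blank node rewriting described above can be sketched in a few lines of Python as a post-processing step over generated labels. This is only an illustration under stated assumptions: `skolemize` is a hypothetical helper, not a function of any RML processor, the `http://example` authority is taken from the example above, and the label is assumed to contain only characters that are safe in an IRI path.

```python
def skolemize(bnode_label, authority="http://example"):
    """Map a blank node label such as '_:6CjhEG5FkH' to a genid IRI.

    Hypothetical helper; assumes the label needs no percent-encoding.
    """
    if not bnode_label.startswith("_:"):
        raise ValueError(f"not a blank node label: {bnode_label!r}")
    local = bnode_label[2:]  # drop the '_:' prefix
    return f"{authority.rstrip('/')}/.well-known/genid/{local}"


print(skolemize("_:6CjhEG5FkH"))
# http://example/.well-known/genid/6CjhEG5FkH
```

A custom id function, for instance one based on a hash of the row content, could be plugged in at the same place by replacing the `local` part.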
