Encoded characters not uploaded correctly #1374

Open
simonrihm opened this issue Oct 9, 2024 · 1 comment
Labels: bug (Something isn't working), python-wrapper (Issues relating to py4jps and pyderivationagent)

Comments

@simonrihm
Contributor

When string properties contain certain special characters and the data is uploaded to the knowledge graph, the stored literals do not reflect those characters correctly. For example, a species with rdfs:label "Cu(NO3)2·2.5H2O" is uploaded as "Cu(NO3)2Â·2.5H2O" (note the additional Â character).

This seems to be an encoding issue, as it only affects characters that are encoded differently in UTF-8 and Windows-1252; see https://www.w3schools.com/tags/ref_urlencode.ASP
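The suspected mechanism can be reproduced in plain Python without any knowledge graph. This is only a sketch of the general effect (UTF-8 bytes read back with a single-byte code page); it does not confirm where in twa or py4j the mis-decoding actually occurs, and the variable names are illustrative.

```python
# Illustration only: UTF-8 bytes decoded with a single-byte code page.
original = "Cu(NO3)2·2.5H2O"

# '·' (U+00B7) is encoded in UTF-8 as the two bytes 0xC2 0xB7.
utf8_bytes = original.encode("utf-8")

# Windows-1252 maps every byte to its own character,
# so 0xC2 becomes 'Â' and 0xB7 becomes '·'.
garbled = utf8_bytes.decode("cp1252")
print(garbled)

# Pure-ASCII text is identical in both encodings and is unaffected.
ascii_roundtrip = "Cu(NO3)2".encode("utf-8").decode("cp1252")
print(ascii_roundtrip)
```

The garbled string is one character longer than the original, which matches the extra character seen in the uploaded label.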
I assume it happens when the rdflib.Graph is serialized to build an update query here:

update = f"""DELETE {{ {g_to_delete.serialize(format='nt')} }}

The serialized string is then executed as a query here:

Here is a minimal example to reproduce this issue:

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import RDFS
from twa.kg_operations import PySparqlClient

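# Build a one-triple graph whose label contains non-ASCII characters
g = Graph()
g.add((URIRef('http://www.theworldavatar.com/test'), RDFS.label, Literal("X·Y’")))
# Wrap the N-Triples serialization in a SPARQL INSERT DATA update
update = f"""INSERT DATA {{ {g.serialize(format='nt')} }}"""
print(update)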

kg_client = PySparqlClient(
    "http://00.000.000.00:0000/blazegraph/namespace/test/sparql",
    "http://00.000.000.00:0000/blazegraph/namespace/test/sparql"
)
kg_client.perform_update(update)
simonrihm added the bug label on Oct 9, 2024
simonrihm added the python-wrapper label on Oct 18, 2024
@jb2197
Contributor

jb2197 commented Nov 19, 2024

As discussed, the encoding and decoding of special characters when interacting with Blazegraph through twa (previously py4jps) is a known issue. It requires investigating how py4j handles encoding and decoding; see #667

Given the time constraints of various projects, a workaround can be used for the time being (assuming you queried the literal from Blazegraph and saved it to a Python str object your_string):

your_string.encode('ISO-8859-1').decode('utf-8')

In this specific case, "XÂ·Yâ\x80\x99".encode('ISO-8859-1').decode('utf-8') returns "X·Y’"
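If the workaround has to be applied to many queried literals, it could be wrapped in a small helper. The sketch below is an assumption-laden illustration, not part of twa: the name repair_mojibake is hypothetical, and falling back to the unchanged input on any encoding error is a design choice made here.

```python
def repair_mojibake(text: str) -> str:
    """Hypothetical helper: undo UTF-8 bytes that were decoded as ISO-8859-1.

    Returns the input unchanged when the round trip is not possible,
    on the assumption that such a string was not garbled in this way.
    """
    try:
        return text.encode("ISO-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside ISO-8859-1, or the
        # resulting bytes are not valid UTF-8: leave it as it is.
        return text


# Garbled literal as it might come back from a query
print(repair_mojibake("XÂ·Yâ\x80\x99"))

# An already-correct literal is passed through untouched
print(repair_mojibake("X·Y’"))
```

One caveat: a correctly stored string that happens to look like mojibake (for example, a literal that genuinely contains "Ã©") would also be converted, so the helper is only safe where the data is known to be affected.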
