Uploading a document with non-ASCII characters breaks citations #1978

Open
ntabernacle opened this issue Sep 17, 2024 · 2 comments
Labels
bug: Something isn't working
open issue: A validated issue that should be tackled. Comment if you'd like it assigned to you.

Comments

@ntabernacle

ntabernacle commented Sep 17, 2024

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Upload a document to the AI index that contains a non-ASCII character, e.g. ä or ü. Ask a question, then open the citations tab.

Any log messages given by the failure

You'll receive an HTML "not found" error when loading the citation PDF.

Expected/desired behavior

#334 added support for non-ASCII characters, but this doesn't work with citations. An automatic fix is difficult, as the user needs to decide which characters to use in place of the special characters. I believe this is also part of the cause of #1798. A fix can be achieved by renaming the files so they contain only ASCII characters.
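
As a rough illustration of the manual rename, here is a minimal Python sketch that could be run on filenames before upload. It is not part of this project: `to_ascii_filename` and `GERMAN_MAP` are hypothetical names, and the mapping reflects one common German convention (ü to ue), which is exactly the kind of choice the user has to make.

```python
import unicodedata
from typing import Dict, Optional

# Hypothetical mapping: one common German convention. Other languages
# (or other users) may prefer different replacements.
GERMAN_MAP = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def to_ascii_filename(name: str, mapping: Optional[Dict[str, str]] = None) -> str:
    """Return an ASCII-only version of a filename (hypothetical helper)."""
    # Compose first so decomposed input (u + combining diaeresis)
    # matches the single-character keys in the mapping.
    name = unicodedata.normalize("NFC", name)
    if mapping:
        name = "".join(mapping.get(ch, ch) for ch in name)
    # Decompose what is left and drop the combining marks (é becomes e).
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII is replaced with an underscore.
    return "".join(ch if ord(ch) < 128 else "_" for ch in stripped)
```

Calling `to_ascii_filename("Prüfung.pdf", GERMAN_MAP)` transliterates the umlaut, while omitting the mapping simply strips the accent. The two results differ, which is why the choice is left to the user.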

OS and Version?

All.

azd version?

run azd version and copy paste here.

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

@pamelafox
Collaborator

Okay, so are you suggesting that we always rename files to ASCII as part of the data ingestion process? That does seem like it'd be the safest, though I don't know if there are situations where folks have two files that differ only by an accent and will be surprised when the filename change results in a collision. That would probably be rare.
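
The collision concern can be checked before any renaming happens. The following is a minimal sketch, not code from this repository: `strip_accents` and `find_collisions` are hypothetical helpers, and accent stripping stands in for whatever ASCII conversion would actually be used.

```python
import unicodedata
from collections import defaultdict


def strip_accents(name):
    """Drop combining marks after decomposition (hypothetical stand-in)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_collisions(filenames, normalize=strip_accents):
    """Group filenames by their converted form and keep groups with more than one entry."""
    groups = defaultdict(list)
    for name in filenames:
        groups[normalize(name)].append(name)
    return {target: names for target, names in groups.items() if len(names) > 1}
```

For example, `find_collisions(["resume.pdf", "résumé.pdf", "notes.pdf"])` reports that the first two names would both become `resume.pdf`, so an ingestion step could warn instead of silently overwriting.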

pamelafox added the bug and open issue labels Sep 26, 2024
@ntabernacle
Author

@pamelafox No, I meant it as a manual fix: rename the file before uploading. I agree we shouldn't rename the files during ingestion, mainly because it's not obvious what you'd change a ü to. It's a fairly rare edge case; I ran into it while working with a German client.

Perhaps just add a line to the docs for the citation not found error (#1798) noting that the filename could be a cause.
