Retrieving data from a ShiftJIS Interbase instance #18
The � is expected if the stored data has ordinal values greater than what ANSI supports, which SJIS can have. While it comes over the wire in bytes as SJIS characters, the client side doesn't know how to convert those bytes to SJIS since SJIS is not in the connection string. So it converts the bytes over the wire using the ANSI encoding to UTF16 for Windows display, and any code points outside the ANSI range get the � in UTF16. Just reading the code, I do not see any issues with how SJIS is set up when used as the character set in the connection string.

The error message usually means that the data in the column is too big to be placed in the buffer set aside for the SJIS-converted equivalent. This is often difficult to track down with a variable multibyte character set like SJIS. When data gets stored as if it is ANSI, you can fool IB into storing more bytes than the column definition allows. This trick was often used when older versions had a much smaller total index column size limit, but that limit has been gone for over a decade.

For example, a VarChar(10) character set SJIS column can hold up to 20 bytes of data (10 times the maximum size of a character, in this case 2). Since ANSI characters are all 1 byte, if you make an ANSI (no character set) connection to IB you could actually store the string '12345678901234567890' into that column, because IB only tests whether the passed bytes fit in the maximum allotted space, not whether the passed characters match the definition. But when you try to pull it back using SJIS as the character set on the connection string, those 20 bytes get converted to SJIS on the server side, and the conversion overflows the 20-byte buffer.

While I have a lot of UTF8 unit tests and mostly ANSI ones, there are none dedicated to SJIS alone. I'll see about trying to write some this weekend and verify things are working right, but what I described above is often what is going on with that error message.

One thing you can try (since you have the data and I don't, and this is somehow a data issue) is to make an ANSI connection, then create an SJIS Encoding object and convert the bytes of the field to an SJIS string and look at that. That should eliminate those �, and you can count the number of characters in the string to check whether it exceeds the column definition.

In general, my recommendation when using a variable multibyte character set like SJIS in a DB is always to set the character set; don't let things go in as ANSI. While letting it go in as ANSI works on machines whose default local character set is Japanese, you run into issues when using a non-Japanese machine without setting the SJIS character set.
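For reference, here is roughly what that byte-inspection diagnostic looks like in C#. This is only a sketch: DbDataReader stands in for the driver's concrete reader type, the ordinal is a placeholder, and whether GetBytes works on a character column is provider-dependent.

```csharp
using System;
using System.Data.Common;
using System.Text;

static class SjisDiagnostics
{
    // Sketch of the diagnostic described above: over an ANSI (no character
    // set) connection, grab the field's raw bytes and decode them as SJIS
    // by hand instead of letting the driver guess.
    public static string DecodeFieldAsSjis(DbDataReader reader, int ordinal)
    {
        // Call once at startup in a real app; on .NET Core the SJIS encoding
        // comes from the System.Text.Encoding.CodePages package.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding sjis = Encoding.GetEncoding("shift_jis"); // code page 932

        // Read the bytes exactly as they came over the wire.
        long size = reader.GetBytes(ordinal, 0, null, 0, 0);
        var raw = new byte[size];
        reader.GetBytes(ordinal, 0, raw, 0, (int)size);

        string text = sjis.GetString(raw);

        // Compare the character count to the column definition: more than 10
        // characters in a VarChar(10) SJIS column means over-stuffed ANSI data.
        Console.WriteLine($"{raw.Length} bytes -> {text.Length} characters");
        return text;
    }
}
```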
We have a legacy Delphi application which connected to Interbase (dialect 1), and I'm now writing a newer application in C# which is meant to connect to that same database. I can confirm that the setup for the legacy application is/was
For my own personal development environment, I initially had my machine set to Windows/English with the beta UTF8 feature turned on. In an effort to try to increase compatibility, I changed my machine to Japanese with the UTF8 feature turned off. It doesn't seem like that has had any effect on what Entity Framework is doing.
Thanks for that thorough, helpful explanation, by the way. That does seem to be consistent with the legacy application's configuration, and is likely the cause of my current problem.
I've been working on this for over a day trying various things, and I haven't made any progress on it. It seems like the string I'm getting back in the first place is already corrupted in such a way that it is unusable. I'm attaching a file with the received string and a second file with what it ought to be. Analyzing the files on a byte level, the received string is about 3 times longer than it ought to be. Interestingly, you can see the 71, 6A, and 61 in both strings, which map to q, j, and a in ANSI. I tried using the SJIS Encoding object as suggested above, but doing so has only ever resulted in the same corrupted output.
So, I've been trying this across several environment combinations. ✅ means it displayed correctly without me doing anything extra.

So it seems like there are two potential problems: something to do with using .NET Core, and something to do with using the beta Unicode UTF-8 locale setting on Windows.
Can you attach either the table structure or just a small DB with data in it, and a sample app for me? Using just the one line of data found in the sjis.txt file, I am not able to reproduce the problem you are seeing on the .NET Framework with EN/US/UTF16 (my default setup). It displays fine with no changes using the simple sample grid demo that is included in the installer for the driver. Having at least your sample DB eliminates guesswork on my side as to how to set this problem up. Your results for .NET Core are worrisome, because they suggest .NET Core pulls the data in bytes and not as strings, losing the encoding of those bytes along the way.
The table in question has contact information (names, addresses, phone numbers) for a large variety of companies. I'm currently talking to coworkers about whether or not I can pass that over, if it needs to be masked, etc. I'll try to get you something as soon as possible.
I don't need that table itself if you can give me something that demonstrates the problem so I can work from there. I understand the concerns with private information.
https://github.com/mcaubrey/masked-kaikei-backup/blob/main/link

You should be able to reproduce the problem using the backup I made and running the following:
It should come back as the expected Japanese text. Also, I failed to mention this earlier, but I systematically tested both the SJIS and SJIS_0208 charsets.
Thanks, I have doctor's appointments all day tomorrow but will look at this on Friday or the weekend. This should help a lot.
No worries. I think for now we're going to move forward with a different approach, so there's no rush to get this figured out on our part. We certainly might want to come back to this (this being creating a new .NET project which connects to our old InterBase instance) in the future though, and I will certainly cooperate with trying to figure out what is going on here. Please take care of your health 👍
Ok, I have not ignored this, but I just haven't had a lot to report on it.

Note: I use lc_ctype and character set interchangeably. The character set gets mapped to the lc_ctype in the connection data, which is more how I think of it, so where you see lc_ctype that is the equivalent of setting the character set in the connection string.

Here is the problem, and maybe a solution. The Fb code we based everything on assumed that if there was no code page (a 0 sql_sub_type on the XSQLVar), what was stored was UTF8. In my experience that is a wrong assumption. Most databases will have ANSI data that is associated with a corresponding locale (your situation). UTF8 and ASCII (the first 127 code points) overlap perfectly, so there are no issues when it comes to things like system table data, because they are all ASCII-limited (technically). But when working with ANSI data, the code page matters. The key is how the NONE (0 sql_sub_type) is created in the CharSet class.
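You can see why that assumption breaks with nothing but the BCL encodings. This standalone demo (no driver involved) round-trips SJIS bytes the way a Japanese-locale client would store them in a NONE column:

```csharp
using System;
using System.Text;

class NoneCharsetDemo
{
    static void Main()
    {
        // SJIS is not built into .NET Core; it comes from the
        // System.Text.Encoding.CodePages package.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding sjis = Encoding.GetEncoding(932); // Shift-JIS

        // Bytes the way a Japanese-locale client writes them into a
        // NONE (sql_sub_type 0) column.
        byte[] stored = sjis.GetBytes("会計"); // "accounting"

        // Old assumption: NONE means UTF8. SJIS byte sequences are not valid
        // UTF-8, so the decoder substitutes U+FFFD replacement characters.
        Console.WriteLine(Encoding.UTF8.GetString(stored)); // mostly �

        // Decoding with the locale ANSI encoding it was written in works.
        Console.WriteLine(sjis.GetString(stored)); // 会計
    }
}
```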
What if I gave you a way to indicate you want a different default encoding? Unset, it would continue to be UTF8, but if set, it would decode your ANSI data in the encoding you know it is. This solution would end up being all-or-nothing at the data-reading level, so it could potentially allow you to change between executions of reads.

A deeper dive into this code over the past weeks has shown that the lc_ctype (character set) in the connection string will not impact anything other than whether you can set it to utf8, because setting the character set at the connection level does not change the sql_sub_type; reads will still use the default encoding, which is always UTF8 in .NET. I am hoping to update how it decodes data to take the connection's lc_ctype into consideration, but for now that is not an immediate priority. UTF8 as a character set only works because the encoding for NONE is assumed to be UTF8. That assumption is hard-coded, so using other character sets will not really work in the case where you have no character set defined for the column but have non-ASCII data in that column.

Part of the trouble, though, is that IB can throw arithmetic overflow errors on conversions to a different character set. In these cases we know we want Latin1 (iso-8859-1) or SJIS etc., but the server will choke if we force it to do the conversion through the lc_ctype (this is the case with the other person running into this problem). So if there were an option to not set the character set, retrieve the data as is, but define what we want the Encoding to be for string data that has the NONE character set on the server side, would that help you?

If we go that route, I am thinking of both a new connection string property and a property on the IBConnection itself. That gives the flexibility of doing it once in the connection string as a "set and forget" type of situation, while still allowing you to switch it up as needed without reconnecting. Sorry for the wall of text; hopefully I wasn't too confusing.
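To make the idea concrete, usage might look something like the sketch below. Everything here is hypothetical: "NoneEncoding" is an invented placeholder for whatever the option ends up being called, and the driver namespace is assumed.

```csharp
using System.Text;
using InterBaseSql.Data.InterBaseClient; // driver namespace assumed here

// Hypothetical sketch: "NoneEncoding" is a placeholder name for the
// proposed option, not an existing property of the driver.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var conn = new IBConnection(
    @"User=USERNAME;Password=password;Database=c:\DATABASE.GDB;" +
    @"DataSource=999.999.999.999;NoneEncoding=shift_jis"); // set-and-forget form

conn.Open();

// ...or flip it between reads without reconnecting:
conn.NoneEncoding = Encoding.GetEncoding("shift_jis");
```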
Ok, an update on this. I have changed the default sql_sub_type handling from assuming 0 is UTF8 to assuming 0 is your locale's ANSI encoding. Reminder: sql_sub_type 0 means no encoding (character set) is associated with the data. UTF8 works as long as your data is all in the ASCII range (0-127); above that it doesn't know how to convert from the locale ANSI to UTF8 (hence your weird characters). I am now loading the additional encodings from System.Text.Encoding.CodePages, which is needed for SJIS, Latin1, and other potential ANSI encodings, and I made sure it supports .NET Core 3.1 (note that 3.1 is no longer in support by MS and at some point will be dropped from the driver).

I think this is going to resolve most of your conditions, but I am afraid I might need to give you an additional encoding override at the DBField level, because you have mixed locales (EN/JP, JA/JP, EN/US). If that is needed I know what I'll do, but if not I don't want to muddy the waters more than necessary.

If you want, message me and I'll give you instructions on how to pull the latest from my ProGet server so you can verify how it is working for you. I have another user who was having this issue with Latin1 characters showing as that � diamond, and their issues are all resolved, but I'd like to also make sure your issue is resolved before we release anything.
- Interbase version
- .NET Core 6
- ASP.NET Project
- Entity Framework/Interbase related packages:
I have Interbase 2022 installed on my local machine including the ODBC driver.
My Windows machine's encoding is set to ShiftJIS (as it is the default Japanese encoding).
The remote Interbase instance does not have an encoding set, which to my understanding means that it should default to the host machine's encoding. The host machine of that remote instance is also set to ShiftJIS.
If I connect to the instance using isql, I can properly retrieve Japanese-encoded text.

Since I am having trouble using Entity Framework's ORM functions as described here, I've set up a different endpoint to execute SQL directly, which looks like this:
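In essence it is just a thin pass-through to the driver's ADO.NET objects. A minimal sketch of it (IBCommand and the reader type are assumed names here, by analogy with other ADO.NET providers; only IBConnection is confirmed elsewhere in this thread):

```csharp
// Minimal sketch of the endpoint (inside an ASP.NET controller).
[HttpGet("sql")]
public IActionResult ExecuteSql(string query)
{
    using var connection = new IBConnection(_connectionString);
    connection.Open();

    using var command = new IBCommand(query, connection);
    using var reader = command.ExecuteReader();

    var rows = new List<string>();
    while (reader.Read())          // the call that later throws with Charset=SJIS_0208
        rows.Add(reader.GetString(0));

    return Ok(rows);
}
```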
If I set the connection string to

@"User=USERNAME;Password=password;Database=c:\DATABASE.GDB;DataSource=999.999.999.999"

and execute it, the Japanese text in the result comes back as � characters. It seems like I'm having some kind of problem with the encoding. I thought that I should perhaps try to specify the encoding inside the connection string; however, if I set the connection string to

@"User=USERNAME;Password=password;Database=c:\DATABASE.GDB;DataSource=999.999.999.999;Charset=SJIS_0208"

and execute it, I get an error instead. I tried setting breakpoints and debugging this, and it seems like the error is coming from the reader.Read() call. It seems like perhaps the reader is unable to handle ShiftJIS?