Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mssql] Non-ascii strings corrupted with use_bulk if columns contain quotes/newlines #518

Open
nickolay opened this issue Mar 5, 2025 · 1 comment

Comments

@nickolay
Copy link

nickolay commented Mar 5, 2025

Issue Description

When a character column contains both

  • characters like ", \n that trigger additional processing
  • and non-latin characters that require the MSSQL data type to be NVARCHAR in order to be procesed correctly

..the post-bcp UPDATE step erroneously converts the data to VARCHAR and causes data corruption: non-english characters are replaced with question marks.

For example ex.csv:

foo,id
"a""b",1
йцук,2

loads as foo=???? for id=2.

But if you remove the double "", it loads corectly.

(As an aside, I still think that a better solution would be to let the user avoid post-processing altogether utilizing the bcp's ability to specify Row and Field terminators as I suggested a while ago.)

  • Sling version (sling --version): 1.4.3

  • Operating System (linux, mac, windows): windows

  • Replication Configuration:

sling run --src-stream 'file://C:/temp/ex.csv' --tgt-conn DEV     --tgt-object 'dbo.ex' \
     --mode truncate     --tgt-options '{"use_bulk": true}' -d
  • Log Output (please run command with -d):
``` 2025-03-05 14:45:36 INF Sling CLI | https://slingdata.io 2025-03-05 14:45:36 DBG opened "file" connection (conn-file-8Nb) 2025-03-05 14:45:36 DBG Sling version: 1.4.3 (windows amd64) 2025-03-05 14:45:36 DBG type is file-db 2025-03-05 14:45:36 DBG using: {"columns":null,"mode":"truncate","select":null,"transforms":null} 2025-03-05 14:45:36 DBG using source options: {"empty_as_null":true,"header":true,"fields_per_rec":-1,"compression":"auto","null_if":"NULL","datetime_format":"AUTO","skip_blank_lines":false,"max_decimals":-1} 2025-03-05 14:45:36 DBG using target options: {"datetime_format":"auto","file_max_rows":0,"max_decimals":-1,"use_bulk":true,"add_new_columns":true,"adjust_column_type":false,"column_casing":"source"} 2025-03-05 14:45:36 INF connecting to target database (sqlserver) 2025-03-05 14:45:36 DBG opened "sqlserver" connection (conn-sqlserver-ZC7) 2025-03-05 14:45:37 INF reading from source file system (file) 2025-03-05 14:45:37 DBG opened "file" connection (conn-file-7gW) 2025-03-05 14:45:37 DBG reading single datastream from file://C:/temp/ex.csv [format=csv] 2025-03-05 14:45:37 DBG merging csv readers of 1 files [concurrency=3] from file://C:/temp/ex.csv 2025-03-05 14:45:37 DBG processing reader from file://C:/temp/ex.csv 2025-03-05 14:45:37 DBG auto-decoding UTF-16 (LE) 2025-03-05 14:45:37 DBG delimiter auto-detected: "," 2025-03-05 14:45:37 INF writing to target database [mode: truncate] 2025-03-05 14:45:37 DBG IF OBJECT_ID(N'"dbo"."ex_tmp"', N'U') IS NOT NULL DROP TABLE "dbo"."ex_tmp" 2025-03-05 14:45:37 DBG table "dbo"."ex_tmp" dropped 2025-03-05 14:45:37 DBG create table "dbo"."ex_tmp" ("foo" nvarchar(max), "id" bigint, "_sling_loaded_at" integer) 2025-03-05 14:45:37 INF created table "dbo"."ex_tmp" 2025-03-05 14:45:37 INF streaming data 2025-03-05 14:45:37 DBG bcp version is 11 2025-03-05 14:45:37 WRN bcp version 11 is old. This may give issues with sling, consider upgrading. 2025-03-05 14:45:37 DBG C:\Program Files\Microsoft SQL Server\110\Tools\Binn\bcp.exe 'dbo.ex_tmp' in 'C:/Users/Nickolay/AppData/Local/Temp/sqlserver.dbo.ex_tmp.1741175137103.5661.csv' -S '****' -d 'Jira DWH' -t ',' -m '1' '-w' -q -b '50000' -F '2' -e 'C:/Users/Nickolay/AppData/Local/Temp/sqlserver.dbo.ex_tmp.1741175137135.X2c.error' -U 'etl' -P '****' 2025-03-05 14:45:37 DBG select count(*) cnt from "dbo"."ex_tmp" 2025-03-05 14:45:37 DBG truncate table "dbo"."ex" 2025-03-05 14:45:37 INF truncated table "dbo"."ex" 2025-03-05 14:45:37 DBG inserting id [bigint] into id [int] 2025-03-05 14:45:37 DBG inserting _sling_loaded_at [int] into _sling_loaded_at [bigint] 2025-03-05 14:45:37 DBG insert into "dbo"."ex" ("foo", "id", "_sling_loaded_at") select "foo" as "foo", "id" as "id", "_sling_loaded_at" as "_sling_loaded_at" from "dbo"."ex_tmp" 2025-03-05 14:45:37 DBG inserted rows into "dbo"."ex" from temp table "dbo"."ex_tmp" 2025-03-05 14:45:37 INF inserted 2 rows into "dbo"."ex" in 0 secs [6 r/s] 2025-03-05 14:45:37 DBG IF OBJECT_ID(N'"dbo"."ex_tmp"', N'U') IS NOT NULL DROP TABLE "dbo"."ex_tmp" 2025-03-05 14:45:37 DBG table "dbo"."ex_tmp" dropped 2025-03-05 14:45:37 DBG closed "sqlserver" connection (conn-sqlserver-ZC7) 2025-03-05 14:45:37 INF execution succeeded ```
@flarco
Copy link
Collaborator

flarco commented Mar 8, 2025

Thanks for reporting @nickolay , I had briefly looked into the different terminators and encountered some issues, don't remember though.
Can you share what terminator you would use? Actually, can you share the BCP command you would use with the flags?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants