How to anonymise data within CSV files using Anonimatron? #122

MaryamBasereh · 2021-01-11T15:21:32Z

Hi
I need to anonymise CSV files, consistently (same identifiers, such as names, have the same code). I would like to know about the correct jdbc URL for CSV files.
Any help is greatly appreciated.
Thank you.

realrolfje · 2021-01-17T20:48:47Z

Hello Maryam192, Sorry for the late reply.

Anonymizing CSV files is possible by configuring an input and output file instead of a database. When you run Anonimatron with the option --configexample, you will find the CSV example at the end. If you just leave the database stuff out, your configuration file could look something like this:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <file inFile="default_types.in.csv"
        reader="com.rolfje.anonimatron.file.CsvFileReader"
        outFile="default_types.out.csv" writer="com.rolfje.anonimatron.file.CsvFileWriter">
        <column name="A_JAVA_LANG_STRING_COLUMN" type="java.lang.String" size="-1"/>
        <column name="A_JAVA_SQL_DATE_COLUMN" type="java.sql.Date" size="-1"/>
    </file>
</configuration>

You can have multiple csv files in one configuration file, and you can re-use the synonym file between runs.
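
To illustrate what "consistent" anonymisation with a reusable synonym file means, here is a minimal sketch in Python. It is not Anonimatron's code, and the JSON file format and the function names (`load_synonyms`, `pseudonym`, `save_synonyms`) are assumptions made for this example only. The idea is that each original value gets a code once, and the saved mapping is loaded again on the next run so the same value keeps the same code.

```python
import json
from pathlib import Path


def load_synonyms(path):
    """Return the saved mapping, or an empty one on the first run."""
    p = Path(path)
    return json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}


def pseudonym(value, synonyms, prefix="NAME_"):
    """Return the code already assigned to value, or assign the next free one."""
    if value not in synonyms:
        synonyms[value] = f"{prefix}{len(synonyms) + 1:04d}"
    return synonyms[value]


def save_synonyms(path, synonyms):
    """Persist the mapping so later runs reuse the same codes."""
    Path(path).write_text(json.dumps(synonyms, indent=2), encoding="utf-8")
```

Because the mapping links original values to their codes, a file like this is as sensitive as the original data and should be stored accordingly.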

MaryamBasereh · 2021-01-24T09:45:04Z

Hello Rolf,

Thank you so much for your response and information. I did as you suggested, but the output file is not readable (it contains a series of meaningless characters). I am using Anonimatron on a MacBook, and I tried to detect the output file's encoding using chardet and other methods, but the encoding is not recognisable. Also, a synonyms file was built during the first few runs, but the tool no longer builds one.

I would be very grateful if you could help me with this issue; I wonder if I have missed something. Please find attached the config, the input CSV, and the output files.

Thank you very much.
Kind regards,
Maryam


realrolfje · 2021-01-24T11:16:28Z

Hello Maryam, the files you attach to an email are not forwarded by GitHub and are not attached to the GitHub issue. Can you please attach the examples to the issue? I'll be glad to have a look at them. (Don't forget to remove passwords and personal info first.)

MaryamBasereh · 2021-01-24T17:47:22Z

Hello Rolf,
Thank you again for your consideration. I have attached the files (CSV attachments are not supported, so I had to convert them to xlsx files; XML is not supported either, so I converted the config to a txt file).
Thank you.
Regards,
Maryam
out.xlsx

config.txt
IncidentOverview1.xlsx

realrolfje · 2021-01-24T22:54:14Z

Hello Maryam, I see that the CSV reader implementation is not that robust. It cannot handle "header rows", as your file has. It treats all the rows the same, and although the configuration file uses "name" for the columns, it is actually a column number that needs to be filled in. This needs to be fixed to make it usable. It also cannot handle commas or semicolons inside a field, I just noticed.

For now, a workaround to get it running in your case is:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <file 
      inFile="IncidentOverview1.csv" reader="com.rolfje.anonimatron.file.CsvFileReader"
      outFile="out.csv"              writer="com.rolfje.anonimatron.file.CsvFileWriter">
      <column name="1" type="ROMAN_NAME"   size="100"/>
      <column name="5" type="ROMAN_NAME"   size="100"/>
      <column name="6" type="ROMAN_NAME"   size="100"/>
      <column name="7" type="RANDOMDIGITS" size="20"/>
  </file>
</configuration>

This is not ideal, as the 7th (last) column in your file contains a comma in the data, so it will be treated as columns 7 and 8. I'll see what I can do about this, but I need a bit of time to fix it (and also keep it backward compatible).

I hope this helps you a bit. Thanks for your patience, the examples, and the config.
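
Until the reader handles header rows and separators inside fields, one possible stopgap is to pre-process the CSV before handing it to Anonimatron. The sketch below is an assumption-laden example, not part of Anonimatron: the function name `flatten_csv` and the choice of a space as the replacement character are hypothetical. It parses the file with Python's standard `csv` module (which understands quoted fields), drops the first row, and replaces commas, semicolons and line breaks inside each field, so that a reader which simply splits on commas sees the expected number of columns.

```python
import csv
import io


def flatten_csv(text, replacement=" ", skip_header=True):
    """Re-emit quoted CSV as plain comma-separated rows without embedded separators."""
    # Characters that would confuse a reader that only splits on separators.
    table = str.maketrans({c: replacement for c in ",;\r\n"})
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if skip_header and rows:
        rows = rows[1:]  # drop the header row; it has to be restored by hand afterwards
    lines = [",".join(field.translate(table) for field in row) for row in rows]
    return "\n".join(lines) + ("\n" if lines else "")
```

This is lossy: the replaced characters are gone from the output, so it only suits columns where that does not matter.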

MaryamBasereh · 2021-01-25T11:10:26Z

Hello Rolf,

Thank you so much for your help and time. I removed the column titles and used column numbers instead, but the output is still illegible: for me it is a series of strange characters. I am working on a Mac and I tried different encodings, but I have not found the solution yet. I would be grateful if you could help.

Thank you again.
Kind regards,
Maryam


realrolfje · 2021-01-25T11:16:51Z

I think the garbled output has to do with file encoding. Both the input and output files should be encoded as UTF-8. Is there a way you can check that?
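
One way to check this is a small Python script. This is only a sketch: the helper names `describe_encoding` and `to_utf8` are made up for this example, and the check only looks at byte-order marks and at whether the bytes decode as UTF-8, so it cannot tell which non-UTF-8 encoding a file actually uses.

```python
import codecs


def describe_encoding(data: bytes) -> str:
    """Roughly classify raw file bytes by BOM and UTF-8 validity."""
    # UTF-32 BOMs are checked first because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8"


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")
```

For example, `describe_encoding(open("IncidentOverview1.csv", "rb").read())` reports on the input file from this thread. If the result is not plain `utf-8`, the file can be re-encoded with `to_utf8` once its real encoding is known, for instance from the export settings of the spreadsheet program that wrote it.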
