How to anonymise data within CSV files using Anonimatron? #122

Open
MaryamBasereh opened this issue Jan 11, 2021 · 7 comments

Comments

@MaryamBasereh

MaryamBasereh commented Jan 11, 2021

Hi
I need to anonymise CSV files consistently (the same identifier, such as a name, should always get the same code). I would like to know the correct JDBC URL for CSV files.
Any help is greatly appreciated.
Thank you.

@realrolfje
Owner

Hello Maryam192, sorry for the late reply.

Anonymizing CSV files is possible by configuring an input and output file instead of a database. When you run anonimatron with the option --configexample, you will find the CSV example at the end. If you just leave the database stuff out, your config file could look something like this:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <file inFile="default_types.in.csv"
        reader="com.rolfje.anonimatron.file.CsvFileReader"
        outFile="default_types.out.csv" writer="com.rolfje.anonimatron.file.CsvFileWriter">
        <column name="A_JAVA_LANG_STRING_COLUMN" type="java.lang.String" size="-1"/>
        <column name="A_JAVA_SQL_DATE_COLUMN" type="java.sql.Date" size="-1"/>
    </file>
</configuration>

You can have multiple csv files in one configuration file, and you can re-use the synonym file between runs.
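
To illustrate what re-using a synonym file between runs buys you, here is a minimal Python sketch of the general idea. It is not Anonimatron's code or file format: the JSON file, the `ID_` prefix and the function names `load_synonyms`, `pseudonym_for` and `save_synonyms` are all assumptions made up for this example.

```python
import json
import os
import secrets


def load_synonyms(path):
    """Load a previously saved original -> pseudonym map, or start empty."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    return {}


def pseudonym_for(value, synonyms):
    """Return the stored pseudonym for value, creating one on first sight."""
    if value not in synonyms:
        taken = set(synonyms.values())
        candidate = "ID_" + secrets.token_hex(4)  # hypothetical code format
        while candidate in taken:  # avoid handing out the same code twice
            candidate = "ID_" + secrets.token_hex(4)
        synonyms[value] = candidate
    return synonyms[value]


def save_synonyms(path, synonyms):
    """Persist the map so a later run can reuse the same pseudonyms."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(synonyms, handle, indent=2)
```

Because the map is loaded before a run and saved after it, a name that appears in several files, or in a later run, ends up with the same code each time.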

@MaryamBasereh
Author

MaryamBasereh commented Jan 24, 2021 via email

@realrolfje
Owner

Hello Maryam, the files you attach to a mail are not forwarded by GitHub, and are not attached to the GitHub issue. Can you please attach the examples to the issue? I'll be glad to have a look at them. (Don't forget to remove passwords and personal info first.)

@MaryamBasereh
Author

Hello Rolf,
Thank you again for your consideration. I have attached the files (csv attachments are not supported and I had to convert them to xlsx files. Also xml is not supported and I had to convert it to a txt file).
Thank you.
Regards,
Maryam
out.xlsx

config.txt
IncidentOverview1.xlsx

@realrolfje
Owner

Hello Maryam, I see that the CSV reader implementation is not that robust. It cannot handle "header rows", as your file has. It treats all the rows the same, and although the configuration file uses "name" for the columns, it is actually a column number that needs to be filled in. This needs to be fixed to make it usable. It also cannot handle commas or semicolons inside a field, I just noticed.

For now, a workaround to get it running in your case is:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <file
      inFile="IncidentOverview1.csv" reader="com.rolfje.anonimatron.file.CsvFileReader"
      outFile="out.csv"              writer="com.rolfje.anonimatron.file.CsvFileWriter">
      <!-- "name" holds the column number here, not a header name -->
      <column name="1" type="ROMAN_NAME"   size="100"/>
      <column name="5" type="ROMAN_NAME"   size="100"/>
      <column name="6" type="ROMAN_NAME"   size="100"/>
      <column name="7" type="RANDOMDIGITS" size="20"/>
  </file>
</configuration>

This is not ideal, as the 7th (last) column in your file contains a comma in the data, so it will be treated as columns 7 and 8. I'll see what I can do about this, but I need a bit of time to fix it (and also keep it backwards compatible).
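
Until the reader is fixed, one possible interim step is to pre-process the CSV outside Anonimatron so that it has no header row and no separators inside fields. The Python sketch below is only an illustration under assumptions: it assumes the input is comma-delimited, UTF-8 encoded, and that fields containing commas are wrapped in double quotes. The function name `flatten_csv` and its parameters are made up for this example.

```python
import csv


def flatten_csv(in_path, out_path, has_header=True, replacement=" "):
    """Copy a CSV, dropping the header row and removing separators inside fields."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)  # understands quoted fields such as "a, b"
        if has_header:
            next(reader, None)  # skip the header row, if there is one
        for row in reader:
            cleaned = []
            for field in row:
                # Replace characters a naive line/comma splitter would trip over.
                for char in ",;\r\n":
                    field = field.replace(char, replacement)
                cleaned.append(field)
            # No field contains a comma any more, so a plain join is safe.
            dst.write(",".join(cleaned) + "\n")
```

Calling `flatten_csv("IncidentOverview1.csv", "IncidentOverview1.flat.csv")` and pointing `inFile` at the flattened file should keep the last column in one piece. The header row is dropped, so it would have to be added back to the output by hand if it is needed.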

I hope this helps you a bit. Thanks for your patience, examples and config.

@MaryamBasereh
Author

MaryamBasereh commented Jan 25, 2021 via email

@realrolfje
Owner

I think the garbled output has to do with file encoding. Both the input and the output file should be encoded as UTF-8. Is there a way you can check that?
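
One way to check this, sketched in Python rather than with any Anonimatron feature: try to decode the file as UTF-8 and, if that fails, re-encode it from whatever encoding it is really in. The function names `is_utf8` and `convert_to_utf8` are made up for this example, and the default source encoding `cp1252` is only a guess (it is common for CSV files exported on Windows); the real encoding of the file may differ.

```python
def is_utf8(path):
    """Return True if the whole file decodes as UTF-8."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def convert_to_utf8(in_path, out_path, source_encoding="cp1252"):
    """Re-encode a file to UTF-8, assuming it is currently in source_encoding."""
    with open(in_path, "rb") as src:
        text = src.read().decode(source_encoding)
    with open(out_path, "wb") as dst:
        dst.write(text.encode("utf-8"))
```

A file that contains only plain ASCII passes the check as well, since ASCII is a subset of UTF-8. If `is_utf8` returns `False` for the input file, converting it first and anonymising the converted copy may remove the garbling.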
