How to anonymise data within CSV files using Anonimatron? #122
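Hello Maryam192, sorry for the late reply. Anonymizing CSV files is possible by configuring an input and output file instead of a database. When you run Anonimatron with the option --configexample, you will find the CSV example at the end. If you just leave the database stuff out, your configuration file could look something like this:
<?xml version="1.0" encoding="UTF-8"?>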
<configuration>
<file inFile="default_types.in.csv"
reader="com.rolfje.anonimatron.file.CsvFileReader"
outFile="default_types.out.csv" writer="com.rolfje.anonimatron.file.CsvFileWriter">
<column name="A_JAVA_LANG_STRING_COLUMN" type="java.lang.String" size="-1"/>
<column name="A_JAVA_SQL_DATE_COLUMN" type="java.sql.Date" size="-1"/>
</file>
</configuration>
You can have multiple CSV files in one configuration file, and you can reuse the synonym file between runs.
Hello Rolf,
Thank you so much for your response and information.
I have done the same as you said, but the problem is that the output file is
not readable (it contains only meaningless characters). I am using
Anonimatron on a MacBook, and I tried to detect the output file encoding
using chardet and other tools, but it is not recognisable. Besides, for the
first few runs a synonyms file was built, but now the tool no longer
builds a synonyms file.
I would be very grateful if you could help me with this issue. I wonder if
I have missed something. So, please find attached the config, the input
csv, and the output files.
Your help is much appreciated.
Thank you very much. Looking forward to hearing from you.
Kind regards,
Maryam
Hello Maryam, the files you attach to a mail are not forwarded by GitHub and are not attached to the GitHub issue. Can you please attach the examples to the issue? I'll be glad to have a look at it. (Don't forget to remove passwords and personal info first.)
Hello Rolf,
Hello Maryam, I see that the CSV reader implementation is not that robust. It cannot handle header rows, which your file has. It treats all rows the same, and although the configuration file uses "name" for the columns, the value that needs to be filled in is actually the column number. This needs to be fixed to make it usable. I also just noticed that it cannot handle commas or semicolons inside a field. For now, a workaround to get it running in your case is:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<file
inFile="IncidentOverview1.csv" reader="com.rolfje.anonimatron.file.CsvFileReader"
outFile="out.csv" writer="com.rolfje.anonimatron.file.CsvFileWriter">
<column name="1" type="ROMAN_NAME" size="100"/>
<column name="5" type="ROMAN_NAME" size="100" />
<column name="6" type="ROMAN_NAME" size="100" />
<column name="7" type="RANDOMDIGITS" size="20"/>
</file>
</configuration>
This is not ideal, as the 7th (last) column in your file contains a comma in the data, so it will be treated as columns 7 and 8. I'll see what I can do about this, but I need a bit of time to fix it (and also keep it backward compatible). I hope this helps you a bit. Thanks for the patience, examples and config.
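Until the reader is fixed, one way to sidestep both limitations described above is to pre-process the CSV before handing it to Anonimatron. The following is only a rough sketch in Python, not part of Anonimatron: the function name `flatten_csv` and the choice of a space as the replacement character are assumptions made for illustration. It drops the header row and replaces commas and semicolons inside fields, so that a reader which simply splits on separators sees the expected number of columns.

```python
import csv
import io


def flatten_csv(text, placeholder=" ", has_header=True):
    """Return CSV text without a header row and without separators inside fields.

    Hypothetical helper for illustration; it does not handle line breaks
    inside quoted fields.
    """
    # csv.reader understands quoted fields, so "hello, world" stays one field.
    rows = list(csv.reader(io.StringIO(text)))
    if has_header and rows:
        rows = rows[1:]  # drop the header row

    out_lines = []
    for row in rows:
        # Replace in-field commas and semicolons so a naive split works.
        cleaned = [
            field.replace(",", placeholder).replace(";", placeholder)
            for field in row
        ]
        out_lines.append(",".join(cleaned))
    return "\n".join(out_lines) + ("\n" if out_lines else "")
```

The result can be written to a new file (encoded as UTF-8) and used as the inFile in the configuration. Note that the replaced characters are not restored afterwards, so this only fits cases where losing them in the affected column is acceptable.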
Hello Rolf,
Thank you so much for your help and time. I removed the column titles and
used column numbers instead, but the problem with the output being illegible
still persists. The output for me is a series of strange characters. I am
working on a Mac and I tried different encodings, but I have not found the
solution yet. I would be grateful if you could help.
Thank you again. Looking forward to hearing from you.
Kind regards,
Maryam
I think the garbled output has to do with file encoding. Both the input and the output file should be encoded as UTF-8. Is there a way you can check that?
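One possible way to check this is a strict decode of the raw bytes. The snippet below is a minimal Python sketch, not part of Anonimatron, and the function name `check_utf8` is made up for illustration. It looks for a byte order mark first and then tries to decode the content as UTF-8, reporting where decoding fails.

```python
import codecs


def check_utf8(data: bytes) -> str:
    """Describe whether the given bytes look like UTF-8 (illustrative helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    # A UTF-32 LE BOM also begins with the UTF-16 LE BOM bytes.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16/32 BOM found, not utf-8"
    try:
        data.decode("utf-8")  # strict decoding raises on invalid byte sequences
    except UnicodeDecodeError as exc:
        return f"not valid utf-8 (first bad byte at offset {exc.start})"
    return "valid utf-8"
```

For a file, read it in binary mode and pass the bytes, for example `check_utf8(open("out.csv", "rb").read())`, and do the same for the input file. Plain ASCII content also passes as valid UTF-8, so a passing result on the input does not rule out problems elsewhere.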
Hi
I need to anonymise CSV files consistently (the same identifiers, such as names, should always get the same code). I would like to know the correct JDBC URL for CSV files.
Any help is greatly appreciated.
Thank you.