-
Notifications
You must be signed in to change notification settings - Fork 5
fix: Perform the last action with the column name modified. #9
base: main
Are you sure you want to change the base?
Conversation
…ng types Signed-off-by: tashiro akira <[email protected]>
…ng types Signed-off-by: tashiro akira <[email protected]>
…ng types Signed-off-by: tashiro akira <[email protected]>
Signed-off-by: tashiro akira <[email protected]>
Signed-off-by: tashiro akira <[email protected]>
Signed-off-by: tashiro akira <[email protected]>
Signed-off-by: tashiro akira <[email protected]>
Signed-off-by: tashiro akira <[email protected]>
Signed-off-by: tashiro akira <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the following process is easy to implement. What do you think?
rename_symbol_cols = {col: inhibited_symbol_pattern.sub("", col) in cols_has_symbols }
train_dataset = train_dataset.rename(columns=rename_symbol_cols)
...
rename_symbol_cols = {v: k for k, v in rename_symbol_cols.items()}
... # restore column names by ".rename(columns=rename_symbol_cols)"
An issue would be that renamed names are the same as the original ones or renamed names of two columns are the same.
Signed-off-by: tashiro akira <[email protected]>
An error occurred during the process of outputting a csv when the review results were applied. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you consider my comment?
{% if training %} | ||
rename_symbol_cols = {col: inhibited_symbol_pattern.sub("", col) if col in cols_has_symbols else col in cols_has_symbols for col in cols_has_symbols } | ||
rename_symbol_cols = {v: k for k, v in rename_symbol_cols.items()} | ||
train_dataset = train_dataset.rename(columns=lambda col: inhibited_symbol_pattern.sub("", col) if col in cols_has_symbols else col) | ||
{% endif %} | ||
{% if test %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about the following code?
rename_symbol_cols = {col: inhibited_symbol_pattern.sub("", col) if col in cols_has_symbols else col in cols_has_symbols for col in cols_has_symbols }
{% if training %}
train_dataset = train_dataset.rename(columns=rename_symbol_cols)
{% endif %}
{% if test %}
test_dataset = test_dataset.rename(columns=rename_symbol_cols)
{% endif %}
rename_symbol_cols = {v: k for k, v in rename_symbol_cols.items()}
Signed-off-by: tashiro akira <[email protected]>
Reflected the content of the review. |
Signed-off-by: tashiro akira <[email protected]>
sapientml_preprocess/generator.py
Outdated
@@ -31,7 +32,7 @@ | |||
|
|||
logger = setup_logger() | |||
|
|||
INHIBITED_SYMBOL_PATTERN = re.compile(r"[\{\}\[\]\",:<'\\]+") | |||
INHIBITED_SYMBOL_PATTERN = re.compile(r"[\{\}\[\]\",:<'\\\+]+") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you add \+
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was checking if "+" is treated as a symbol in the test string.
I put it back.
sapientml_preprocess/generator.py
Outdated
org_df_column = df.columns.values | ||
org_target_column = task.target_columns | ||
df = df.rename(columns=lambda col: remove_symbols(col) if col in cols_has_symbols else col) | ||
task.target_columns = [ | ||
remove_symbols(col) if col in cols_has_symbols else col for col in task.target_columns | ||
] | ||
same_column = {k: v for k, v in collections.Counter(list(df.columns.values)).items() if v > 1} | ||
rename_dict = {} | ||
if len(same_column) != 0: | ||
for target in same_column.keys(): | ||
rename_dict = {} | ||
rename_target_col = [] | ||
df_cols = list(df.columns.values) | ||
i = 1 | ||
for col in df_cols: | ||
if target in col: | ||
rename_dict[org_df_column[len(rename_dict)]] = str(col + str(i)) | ||
i = i + 1 | ||
else: | ||
rename_dict[org_df_column[len(rename_dict)]] = col | ||
df = df.set_axis(list(rename_dict.values()), axis=1) | ||
i = 1 | ||
for col in org_target_column: | ||
rename_target_col.append(rename_dict[col]) | ||
|
||
task.target_columns = rename_target_col | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I think the code may rename column names which originally don't contain symbols, right? If so, could you keep the names when the column names contain symbols.
- This code is hard to read and can have unexpected issues. Could you rewrite it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The review has been applied.
if len(rename_dict) == 0 : | ||
rename_symbol_cols = {col: inhibited_symbol_pattern.sub("", col) if col in cols_has_symbols else col in cols_has_symbols for col in cols_has_symbols } | ||
else: | ||
rename_symbol_cols = rename_dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need show this conditional branch to users. Could you move it to "template's if statement"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed conditional branching in template if statement.
Signed-off-by: tashiro-akira <[email protected]>
Signed-off-by: tashiro-akira <[email protected]>
Signed-off-by: tashiro-akira <[email protected]>
Signed-off-by: tashiro-akira <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you consider my comment?
same_column[target] = same_column[target] - 1 | ||
else: | ||
rename_dict[org_column] = target | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current method fails when the renamed names are the same as original names.
For example, there are original columns Age
, Age{}
and Age1
.
In the case, Age
-> Age1
, Age{}
-> Age0
, so there are two Age1
columns.
Could you consider the case?
@AkiraUra @kimusaku
Fix target column names to revert after removing special characters.
Please check.