File widget confuses numerical and categorical data #5196

Closed
3 tasks
kaimikael opened this issue Jan 18, 2021 · 4 comments · Fixed by #5295

@kaimikael

  • What's wrong?

On importing CSV and XLSX files, the File widget sometimes gets confused about what data type the columns are and sets columns that contain only numerical values as categorical. One then has to manually set these to numeric, one column at a time.

  • How can we reproduce the problem?

Add a File to the canvas, use it to read a data file, check the column data type assignments.

I attach a sample data file that triggers the problem. As the attached screenshot shows, several of the columns are interpreted as categorical even though the data in them are clearly numeric. This may be due to the data being sparse, but funnily enough entirely empty columns are correctly parsed as numeric. I think that if all present data items in a column are numeric, the column should be interpreted as numeric; it should be quite rare to have floating point values in a categorical variable. It’s also clear that all values in a column are read, as they are shown in the table, so proper parsing should not require extra reading passes.
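To make the proposed rule concrete, here is a minimal sketch in plain Python. It is not Orange's actual reader code: the function names, the `MISSING` marker set, and the `"numeric"`/`"categorical"` labels are assumptions made for illustration. It ignores missing cells and calls a column numeric when every remaining value parses as a number.

```python
# Hypothetical set of strings treated as "missing"; real readers may differ.
MISSING = {"", "?", "NA", "NaN", "nan"}


def is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def guess_type(values):
    """Guess a column type from its raw string values.

    Missing cells are skipped, so a sparse column is judged only by the
    values that are actually present.
    """
    present = [v.strip() for v in values if v.strip() not in MISSING]
    # all() of an empty list is True, so an entirely empty column comes out
    # as numeric, matching the behaviour observed for empty columns above.
    if all(is_number(v) for v in present):
        return "numeric"
    return "categorical"


print(guess_type(["1.5", "", "", "2.25", ""]))  # sparse but numeric
print(guess_type(["1.5", "low", ""]))           # mixed content
```

The rule needs only one pass over values that have already been read, which is the point made above about not requiring extra reading passes.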

It would also be quite convenient if one could select multiple columns at once and re-encode them as the appropriate type, as this would be faster than having to go through each column individually. (This might of course require extensive surgery in the user interface.)

  • What's your environment?

se_adt_1524_lt_zs.xlsx
Screenshot 2021-01-18 at 17 45 27

kaimikael added the "bug report" label (Bug is reported by user, not yet confirmed by the core team) Jan 18, 2021
@irgolic
Member

irgolic commented Jan 18, 2021

You're absolutely right.

I've been working on getting all of File's functionality into CSV File Import #5077, which can then replace File entirely. That one uses pandas.read_csv, and column types are probably inferred as you're suggesting. You can also set ranges of columns to a type.

The PR isn't finished yet, but it's far enough along to see whether this issue applies to it. Check it out if you like. Let me know if you need help installing Orange locally to view the pull request (message me (rafael) on Discord).

janezd self-assigned this Jan 22, 2021
@janezd
Contributor

janezd commented Jan 22, 2021

A summary of today's internal discussion.

Integer-like values are assumed to be categorical if the total number of unique values is small enough. Values like 1, 2, 3 or 0, 1 are often codes for categorical values, like "survival" or "gender". This behaviour will remain.

If any of a variable's values includes a dot, the variable is currently also assumed to be categorical (like in 1.1, 1.2, 1.3, 2.1, 2.2, 3.1, 3.2, 3.3). Today we discussed that this is probably undesirable, so if any of a variable's values contains a decimal dot, the variable should be treated as numeric. @irgolic will include this in his PR. (But, @irgolic, also check readers that are not based on pandas, or keep this PR open.)
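The heuristic summarized in this comment could look roughly like the sketch below. It is an illustration only, not the code in Orange or in the PR: the function name, the return labels, and especially the `MAX_CATEGORIES` threshold are hypothetical, since the discussion only says "small enough".

```python
MAX_CATEGORIES = 5  # hypothetical cut-off; the thread does not give a number


def infer_type(values, max_categories=MAX_CATEGORIES):
    """Apply the discussed heuristic to a column of raw strings."""
    present = [v.strip() for v in values if v.strip()]  # drop empty cells
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        # Non-numeric content; lumped into "categorical" to keep the sketch short.
        return "categorical"
    # Any decimal dot means the column is treated as numeric.
    if any("." in v for v in present):
        return "numeric"
    # Integer-like values with few distinct levels are assumed to be codes.
    if (numbers
            and len(set(numbers)) <= max_categories
            and all(n.is_integer() for n in numbers)):
        return "categorical"
    return "numeric"


print(infer_type(["0", "1", "1", "0", ""]))       # coded categories
print(infer_type(["1.1", "1.2", "", "2.1"]))      # decimal dot present
print(infer_type([str(i) for i in range(100)]))   # too many distinct integers
```

Checking for the dot before counting unique values is what keeps a sparse column such as 1.1, 1.2, 2.1 from being mistaken for a set of category codes.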

janezd removed their assignment Jan 22, 2021
janezd added the "wish" label and removed the "bug report" label Jan 22, 2021
@kaimikael
Author

That sounds like a reasonable solution. And, if the heuristic is explained in the docs, it shouldn’t lead to confusion.

@janezd
Contributor

janezd commented Feb 25, 2021

Actually, I was wrong: we have already changed this long ago. What you encountered was just a bug. :)

irgolic self-assigned this Feb 25, 2021
irgolic reopened this Feb 25, 2021
irgolic closed this as completed Feb 25, 2021