File widget confuses numerical and categorical data #5196
Comments
You're absolutely right. I've been working on getting all of File's functionality into CSV File Import #5077, which can then replace File entirely. That one uses The PR isn't finished yet, but it's far enough along to see whether this issue applies to it; check it out if you like. Let me know if you need help installing Orange locally to view the pull request (message me (rafael) on Discord).
A summary of today's internal discussion. Integer-like values are assumed to be categorical if the total number of unique values is small enough. Values like 1, 2, 3 or 0, 1 are often coded categorical values, like "survival" or "gender". This behaviour will remain. If any of a variable's values includes a dot, the variable is currently also assumed to be categorical (as in 1.1, 1.2, 1.3, 2.1, 2.2, 3.1, 3.2, 3.3). Today we discussed that this is probably undesirable, so if any of a variable's values contains a decimal dot, the variable should be treated as numeric. @irgolic will include this in his PR. (But, @irgolic, also check readers that are not based on pandas, or keep this PR open.)
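To make the rule discussed above concrete, here is a minimal sketch in plain Python. It is not Orange's actual implementation: the function name `guess_type`, the cutoff `MAX_DISCRETE_VALUES` and its value are assumptions made for illustration (the discussion only says "small enough"), and treating an entirely empty column as numeric simply mirrors the behaviour reported in this issue.

```python
from typing import Iterable, Optional

# Hypothetical cutoff; the thread only says "small enough".
MAX_DISCRETE_VALUES = 5


def _parses_as(kind, text: str) -> bool:
    """Return True if kind(text) succeeds, e.g. kind=int or kind=float."""
    try:
        kind(text)
    except ValueError:
        return False
    return True


def guess_type(values: Iterable[Optional[str]],
               max_discrete: int = MAX_DISCRETE_VALUES) -> str:
    """Guess "numeric" or "categorical" for a column of raw cell strings."""
    # Ignore missing cells (None or blank) so sparse columns are judged
    # only on the values that are actually present.
    present = [v.strip() for v in values if v is not None and v.strip()]
    if not present:
        return "numeric"        # assumption: empty columns stay numeric
    if not all(_parses_as(float, v) for v in present):
        return "categorical"    # at least one non-numeric value
    if not all(_parses_as(int, v) for v in present):
        return "numeric"        # some value is non-integer, e.g. "1.1"
    # Integer-like values: categorical only if there are few distinct ones.
    if len(set(present)) <= max_discrete:
        return "categorical"
    return "numeric"
```

Under these assumptions, `guess_type(["0", "1", "1", "0"])` gives `"categorical"`, while `guess_type(["1.1", "1.2", "2.1"])` gives `"numeric"`.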
That sounds like a reasonable solution. And, if the heuristic is explained in the docs, it shouldn’t lead to confusion.
Actually, I was wrong: we already changed this long ago. What you encountered was just a bug. :)
When importing CSV and XLSX files, the File widget sometimes gets confused about the data types of the columns and sets columns that contain only numerical values as categorical. One then has to manually set these to numeric, and this has to be done for each column individually.
Add a File to the canvas, use it to read a data file, check the column data type assignments.
I attach a sample data file that triggers the problem. As can be seen in the attached screenshot, several of the columns are interpreted as categorical even though the data in them are clearly numeric. This may be due to the data being sparse, but funnily enough, entirely empty columns are correctly parsed as numeric. I think that if all present data items in a column are numeric, the column should be interpreted as numeric; it should be quite rare to have floating point values in a categorical variable. It’s also clear that all values in a column are read, as they are shown in the table, so proper parsing should not require extra reading passes.
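As a rough illustration of the rule proposed above (a column is numeric if every present value is numeric, with missing cells ignored), here is a small sketch using only Python's standard `csv` module. The function name `numeric_columns` and the sample data are made up for this example; this is not how the File widget is implemented, only a demonstration that a single pass over the rows is enough.

```python
import csv
import io


def numeric_columns(text: str) -> dict:
    """Map each column name to True if every non-empty cell parses as a float."""
    reader = csv.DictReader(io.StringIO(text))
    # Start optimistic: a column counts as numeric until a cell disproves it.
    all_numeric = {name: True for name in reader.fieldnames or []}
    for row in reader:  # one pass over the data, no re-reading
        for name in all_numeric:
            cell = (row.get(name) or "").strip()
            if not cell:
                continue  # a missing cell is no evidence either way
            try:
                float(cell)
            except ValueError:
                all_numeric[name] = False
    return all_numeric


# Made-up sparse sample: "a" is sparse but numeric, "b" is empty, "c" is text.
sample = "a,b,c\n1.5,,x\n,,y\n2.25,,\n"
print(numeric_columns(sample))
```

With this rule, the sparse column `a` and the entirely empty column `b` both come out as numeric, and only `c` is flagged as non-numeric.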
It would also be quite convenient if one could select multiple columns at once and re-encode them as the appropriate type, as this would be faster than having to go through each column individually. (This might of course require extensive surgery in the user interface.)
se_adt_1524_lt_zs.xlsx