File widget confuses numerical and categorical data #5196

kaimikael · 2021-01-18T17:09:43Z

What's wrong?

On importing CSV and XLSX files, the File widget sometimes gets confused about the data types of columns, marking columns that contain only numerical values as categorical. One then has to set these to numeric manually, and this has to be done for each column individually.

How can we reproduce the problem?

Add a File to the canvas, use it to read a data file, check the column data type assignments.

I attach a sample data file that triggers the problem. As can be seen in the attached screenshot, several of the columns are interpreted as categorical even though the data in them are clearly numeric. This may be due to the data being sparse, but funnily enough, entirely empty columns are correctly parsed as numeric. I think that if all present data items in a column are numeric, the column should be interpreted as numeric; it should be quite rare to have floating point values in a categorical variable. It’s also clear that all values in a column are read, as they are shown in the table, so proper parsing should not require extra reading passes.
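The rule proposed above (a column is numeric if every value that is present parses as a number) can be sketched as follows. This is only an illustration, not Orange's code: the `MISSING` marker set and the helper names `is_float` and `looks_numeric` are assumptions made up for this example.

```python
# Hypothetical missing-value markers; the set Orange actually uses may differ.
MISSING = {"", "?", "NA"}


def is_float(token):
    """Return True if the string can be parsed as a floating point number."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def looks_numeric(column):
    """Return True if every non-missing cell in the column parses as a number.

    Missing cells are skipped, so a sparse column is judged only by the
    values that are actually present. An entirely empty column also
    counts as numeric, because no present value contradicts it.
    """
    present = [cell.strip() for cell in column if cell.strip() not in MISSING]
    return all(is_float(cell) for cell in present)


sparse = ["", "0.25", "", "", "1.5", "?"]   # sparse, but all present values numeric
mixed = ["", "0.25", "low", ""]             # one non-numeric value
empty = ["", "", ""]                        # nothing present at all

print(looks_numeric(sparse), looks_numeric(mixed), looks_numeric(empty))
```

Because the check runs over values that have already been read, it needs no extra pass over the file.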

It would also be quite convenient if one could select multiple columns at once and re-encode them as the appropriate type, as this would be faster than having to go through each column individually. (This might of course require extensive surgery in the user interface.)

What's your environment?

Operating system: macOS 11.1
Orange version: 3.27.1
How you installed Orange: Disk image at https://orangedatamining.com/download/#macos

se_adt_1524_lt_zs.xlsx

irgolic · 2021-01-18T17:14:13Z

You're absolutely right.

I've been working on getting all of File's functionality into CSV File Import #5077, which can then replace File entirely. That one uses pandas.read_csv, and column types are probably inferred as you're suggesting. You can also set ranges of columns to a type.

The PR isn't finished yet, but it's far enough along to see whether this issue applies to it; check it out if you like. Let me know if you need help installing Orange locally to view the pull request (message me (rafael) on Discord).

janezd · 2021-01-22T13:16:13Z

A summary of today's internal discussion.

Integer-like values are assumed to be categorical if the total number of unique values is small enough. Values like 1, 2, 3 or 0, 1 are often codes for categorical values, like "survival" or "gender". This behaviour will remain.

If any of a variable's values includes a dot, the variable is currently also assumed to be categorical (as in 1.1, 1.2, 1.3, 2.1, 2.2, 3.1, 3.2, 3.3). Today we discussed that this is probably undesirable, so if any of a variable's values contains a decimal dot, the variable should be treated as numeric. @irgolic will include this in his PR. (But, @irgolic, also check readers that are not based on pandas, or keep this PR open.)
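A minimal sketch of the heuristic as summarized in this comment, for readers who want to see the two rules side by side. It is not taken from Orange's source: the function name `guess_kind`, the `"non-numeric"` return value, and the threshold `MAX_CATEGORIES = 5` are assumptions chosen for illustration.

```python
# Hypothetical cut-off for "small enough"; the real threshold is not stated here.
MAX_CATEGORIES = 5


def guess_kind(values, max_categories=MAX_CATEGORIES):
    """Guess the kind of a column from its non-missing string values.

    Rules, in order:
    1. If any value does not parse as a number, report "non-numeric".
    2. If any value contains a decimal dot, the column is numeric.
    3. Integer-like values with few distinct values are treated as
       coded categories (for example 0/1 for "survival").
    4. Everything else is numeric.
    """
    try:
        for value in values:
            float(value)
    except ValueError:
        return "non-numeric"

    if any("." in value for value in values):
        return "numeric"
    if len(set(values)) <= max_categories:
        return "categorical"
    return "numeric"


print(guess_kind(["0", "1", "1", "0"]))             # few integer codes
print(guess_kind(["1.1", "1.2", "2.1", "3.3"]))     # decimal dots present
print(guess_kind([str(i) for i in range(100)]))     # many distinct integers
```

Checking for the decimal dot before counting distinct values is what keeps a short column such as 1.1, 1.2, 2.1 from being treated as categorical.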

kaimikael · 2021-01-22T13:36:22Z

That sounds like a reasonable solution. And, if the heuristic is explained in the docs, it shouldn’t lead to confusion.

janezd · 2021-02-25T19:57:22Z

Actually, I was wrong: we have already changed this long ago. What you encountered was just a bug. :)

kaimikael added the "bug report" label (Bug is reported by user, not yet confirmed by the core team) Jan 18, 2021

janezd self-assigned this Jan 22, 2021

janezd removed their assignment Jan 22, 2021

janezd added the "wish" label and removed the "bug report" label (Bug is reported by user, not yet confirmed by the core team) Jan 22, 2021

janezd self-assigned this Feb 25, 2021

janezd mentioned this issue Feb 25, 2021

guess_data_type: Ignore missing values for numeric variables #5295

Merged

2 tasks

irgolic self-assigned this Feb 25, 2021

markotoplak closed this as completed in #5295 Feb 25, 2021

irgolic reopened this Feb 25, 2021

irgolic closed this as completed Feb 25, 2021
