Add standardize and normalize #339

nmaarnio · 2024-03-02T08:37:54Z

No description provided.

nmaarnio · 2024-03-04T07:58:22Z

@em-t or @lehtonenp, would you have time to review this? Or perhaps @msmiyels or @nialov, if you are not too busy.

lehtonenp · 2024-03-04T09:03:55Z

> @em-t or @lehtonenp, would you have time to review this? Or perhaps @msmiyels or @nialov, if you are not too busy.

@nmaarnio, I have time to review, so I'll take this one.

lehtonenp

I ran the tests and CLI functions successfully.

I was wondering whether we should have private functions for both normalize and standardize, as in other tools. Are private functions needed here? I understand that the functions are short, but moving the somewhat lengthy if/else statements into private functions would make the public functions less cluttered than they are now.

nmaarnio · 2024-03-04T09:49:30Z

Initially we implemented private functions for all tools, but this aspect of our style guide has loosened to the point where they are no longer required. I personally started shifting towards implementing one or more private functions only when I felt they were useful, that is, when the function had more content than just checks and a simple call to a library function. Additionally, the gh-pages documentation site generated by mkdocs shows the source code of the functions, and having only one function means it shows the full code of a tool. This could be considered a pro, but I think few people will use it; most will open the source code directly in their code editor when they wish to inspect our implementations.

Here I felt private functions would not add much, and I was not sure which part(s) to separate. However, I am open to suggestions and not against refactoring this! Do you have a specific suggestion for dividing the code?

lehtonenp · 2024-03-04T09:54:25Z

I see that the private functions would not add much value to normalize and standardize for the given reasons. It is a plus that mkdocs reveals the implementation of the algorithms. Let's keep the code as you have implemented it. Approving.

lehtonenp

Tests and CLI functions work as expected. Private functions are not required.

Approving.

msmiyels · 2024-03-04T10:48:31Z

@nmaarnio sorry for being late on this. We actually already have normalization and standardization in the transformation module. So it looks like, at this point, we have multiple functions for the same thing.

What we do not have there yet is handling of tabular data (and I don't know the CLI status of those), since this was decided during the review process of the functions.

lehtonenp · 2024-03-04T10:55:20Z

> @nmaarnio sorry for being late on this. We actually already have normalization and standardization in the transformation module. So it looks like, at this point, we have multiple functions for the same thing.
>
> What we do not have there yet is handling of tabular data (and I don't know the CLI status of those), since this was decided during the review process of the functions.

Could you point out the normalization and standardization functions? As the reviewer, I was not aware that these existed. I have mostly worked on raster and vector processing so far.

nmaarnio · 2024-03-04T10:59:28Z

No worries. I was aware that these operations can be performed with the existing transformation functions, but I thought it would be a good idea to create public functions that are actually called normalization and standardization (right now I think they are a bit "hidden" in the transformation functions). In retrospect, I realize it's a bit silly to re-implement them. One option would have been to create public functions called standardize and normalize that simply call z_score_normalization and min_max_scaling with locked parameter values. EDIT: I just noticed that z_score_normalization does not take parameters and always performs the same operation as standardization. I had not encountered this name before. Is it commonly used, @msmiyels? In contrast, I have seen standardization widely used as the term for this rescaling operation.

There are some differences in the inputs and their handling. I prefer that, for the plugin and CLI, operations are defined for both vector and raster data where applicable, while the core operation is defined on the data itself (NumPy array or pandas DataFrame) where that makes sense. Maybe we should give this a little thought now: should we merge this or not, and how should we proceed?

msmiyels · 2024-03-04T12:39:42Z

@nmaarnio I think it is quite easy to get confused here, since the differences between these terms and their purposes can be very subtle.

Generally:

- scaling and normalize refer to a transformation of data into a new specific range, commonly [0, 1]
- standardize refers to the transformation of data based on statistical measures; data will not be forced into a specific range

Explicitly:

- normalization is commonly the linear transformation into [0, 1]
- standardization is commonly the transformation based on the z-score
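To make the distinction concrete, here is a minimal sketch of the two operations in plain NumPy. It is not the toolkit's implementation, and the array values are made up for illustration. The standardization uses the population standard deviation (ddof=0), as mentioned in the commit messages of this PR.

```python
import numpy as np

# Illustrative values only.
data = np.array([2.0, 4.0, 6.0, 8.0])

# Normalization: linear min-max transformation into [0, 1].
normalized = (data - data.min()) / (data.max() - data.min())

# Standardization: z-score based on mean and population std (ddof=0).
standardized = (data - data.mean()) / data.std(ddof=0)

print(normalized)    # values are bounded to [0, 1]
print(standardized)  # mean 0 and std 1, but no fixed range
```

Note that both expressions divide by zero for constant input, so a real implementation would need to handle that case.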

So it looks like there are some inconsistencies in the naming of these functions, but all of them are transformations.

What about naming the existing functions f(x)_transform and providing higher-level public functions normalize and standardize, which call them with the respective defaults for the [0, 1] min/max linear transformation and the z-score transformation? You could use the existing private functions, which only need a np.ndarray and parameters as input. "Tabularity" could be integrated in these new public functions as well.

Alternatively, we could change the existing public functions to accept tabular data. I know it's repetition, but this was explicitly excluded during the review process of the transformation functions 😵
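A rough sketch of how such a layering could look is shown below. All names and signatures here (`_min_max_scaling`, `_z_score`, `_apply`, the `columns` parameter) are hypothetical stand-ins written for this example, not the toolkit's actual private functions.

```python
from typing import Callable, Optional, Sequence, Tuple, Union

import numpy as np
import pandas as pd


def _min_max_scaling(array: np.ndarray, new_range: Tuple[float, float]) -> np.ndarray:
    """Hypothetical stand-in for an existing array-level private function."""
    low, high = new_range
    scaled = (array - array.min()) / (array.max() - array.min())
    return scaled * (high - low) + low


def _z_score(array: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in; uses the population std (ddof=0)."""
    return (array - array.mean()) / array.std(ddof=0)


def _apply(
    data: Union[np.ndarray, pd.DataFrame],
    func: Callable[[np.ndarray], np.ndarray],
    columns: Optional[Sequence[str]] = None,
) -> Union[np.ndarray, pd.DataFrame]:
    """Apply an array-level function to an array or to DataFrame columns."""
    if isinstance(data, pd.DataFrame):
        out = data.copy()  # never modify the caller's DataFrame
        if columns is None:
            # Default to all numeric columns.
            columns = list(out.select_dtypes(include="number").columns)
        for col in columns:
            out[col] = func(out[col].to_numpy(dtype=float))
        return out
    return func(np.asarray(data, dtype=float))


def normalize(data, columns=None):
    """Min-max transformation with the range locked to [0, 1]."""
    return _apply(data, lambda a: _min_max_scaling(a, (0.0, 1.0)), columns)


def standardize(data, columns=None):
    """Z-score transformation."""
    return _apply(data, _z_score, columns)


df = pd.DataFrame({"a": [1, 2, 3], "name": ["x", "y", "z"]})
print(normalize(df))                         # "a" is rescaled, "name" is left untouched
print(standardize(np.array([1.0, 2.0, 3.0])))
```

With this split, the array-level functions stay free of any tabular logic, and the public wrappers only decide which columns to transform and which defaults to lock.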

Another thought that just came in 🤯: if it's only for the Plugin 📺 or even CLI ⚙️, we could just name it "stan...", "norm..." and call the existing functions under the hood.

What do you think?

nmaarnio added 7 commits March 1, 2024 10:04

feat(transformations): add standardize and normalize tools

89393ae

fix(transformations): standardization and normalization now return copy of data, have invalid col check, force float dtype have better docs, standardization uses ddof=0

0ad6f0b

tests(transformations): added tests for normalization and standardization

283db13

docs: added docs for normalization and standardization

efb357a

docs: added docs for normalization and standardization

839b338

fix(transformations): add numeric data filtering and empty list check for DF inputs

1dc1a38

cli(transformations): add CLI functions for standardize and normalize (separate functions for raster and vector data)

76e6fb7

nmaarnio linked an issue Mar 2, 2024 that may be closed by this pull request

Add standardize and normalize tools #340

Open

10 tasks

nmaarnio marked this pull request as ready for review March 4, 2024 06:35

lehtonenp self-requested a review March 4, 2024 09:04

lehtonenp reviewed Mar 4, 2024

lehtonenp approved these changes Mar 4, 2024
