Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add standardize and normalize #339

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Conversation

nmaarnio
Copy link
Collaborator

@nmaarnio nmaarnio commented Mar 2, 2024

No description provided.

@nmaarnio nmaarnio linked an issue Mar 2, 2024 that may be closed by this pull request
10 tasks
@nmaarnio nmaarnio marked this pull request as ready for review March 4, 2024 06:35
@nmaarnio
Copy link
Collaborator Author

nmaarnio commented Mar 4, 2024

@em-t or @lehtonenp , would you have time to review this? Or perhaps @msmiyels or @nialov if you are not too busy.

@lehtonenp
Copy link
Collaborator

@em-t or @lehtonenp , would you have time to review this? Or perhaps @msmiyels or @nialov if you are not too busy.

@nmaarnio, I got time to review. So, I'll review this one.

@lehtonenp lehtonenp self-requested a review March 4, 2024 09:04
Copy link
Collaborator

@lehtonenp lehtonenp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran tests and cli functions successfully.

I was wondering whether we should have private functions for both normalize and standardize just like in other functions. Are private functions needed? I understand that the functions are short. But the somewhat lengthy if else statements would not be so cluttered as now.

@nmaarnio
Copy link
Collaborator Author

nmaarnio commented Mar 4, 2024

Initially we implemented private functions for all tools, but this aspect our style guide has loosened to the point where it is not required anymore. I personally started shifting towards implementing one or more private functions only when I felt they were useful / the function had more contents than just checks and simple call to a library function. Additionally, the gh-pages documentation site generated by mkdocs shows the source code of the functions and having only one function makes it show the full code of a function. This could be considered a pro, but I think few people will use it and directly open the source code in their code editor when they wish to inspect our implementations.

Here I felt they would not add much and was not sure which part(s) to separate. However, I am open to suggestions and not against refactoring this! Do you have a specific suggestion for diving the code?

@lehtonenp
Copy link
Collaborator

I see that the private functions would not add much value to normalize and standardize for the given reasons. It is a plus that mkdocs reveals the implementation of the algorithms. Let's keep the code as you have implemented it. Approving.

Copy link
Collaborator

@lehtonenp lehtonenp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests and cli functions work as expected. Private function is not required.

Approving.

@msmiyels
Copy link
Collaborator

msmiyels commented Mar 4, 2024

@nmaarnio sorry for beeing late on this. We actually have the normalization and standardization in the transformation module. So it looks that at this point, we do have multiple functions for the same thing.

What we do not have there yet is handling of tabular data (and I don´t know the CLI status of those), since this was decided during the review process of the functions.

@lehtonenp
Copy link
Collaborator

@nmaarnio sorry for beeing late on this. We actually have the normalization and standardization in the transformation module. So it looks that at this point, we do have multiple functions for the same thing.

What we do not have there yet is handling of tabular data (and I don´t know the CLI status of those), since this was decided during the review process of the functions.

Could you point out the normalization and standardization functions? I as the reviewer was not aware of these existing. I have mostly worked on raster and vector processing so far.

@nmaarnio
Copy link
Collaborator Author

nmaarnio commented Mar 4, 2024

No worries. I was aware that they can be performed with the existing transformations functions, but I thought it a good idea to create public functions that are called normalization and standardization (right now I think they are a bit "hidden" in the transformation functions). But in retrospect I now realize it's a bit silly to re-implement them – one option would have been to create public functions called standardize and normalize that simply call z_score_normalization and min_max_scaling with locked parameter values. EDIT: I noticed now that z_score_normalization does not take parameters but performs same operation as standardization always. I had not encountered this name before, is it commonly used @msmiyels ? In contrast, I have seen standardization widely used as a term referring to this rescaling operation.

There are some differences with the inputs and handling. I myself prefer that operations are defined for both vector and raster data if applicable for the plugin and CLI, but defining the core operation on data (Numpy array or Pandas DF) if it makes sense. Maybe we should give this a little thought now, should we merge this or not and how to proceed.

@msmiyels
Copy link
Collaborator

msmiyels commented Mar 4, 2024

@nmaarnio I think it is quite easy to bring in confusion here, since there can be only very subtle differences between those terms and their purpose.

Generally:

  • scaling and normalize refer to a transformation of data into a new specific range, commonly [0,1]
  • standardize refers to the transformation of data based on statistical measures and data will not be forced into a specific range

Explicitly:

  • normalization is commonly the linear transformation into [0, 1]
  • standardization is commonly the transformation based on the z-score

So it looks that there are some inconsistencies among the naming of these functions, but all of these are transformations.

What about naming these functions f(x)_transform and provide kind of higher level public functions normalize and standardize which call the respective defaults for the [0, 1] min/max linear and the z-score transformations? You could use the exisiting private functions which only need a np.ndarray and parameters as input. "Tabularity" could be integrated in these new public functions as well.

Alternatively, change the existing public functions to accept tabular data? I know it's repetition, but this was explicitely excluded during the review process of the transformation functions 😵

Another thought that just came in 🤯: if it's only for the Plugin 📺 or even CLI ⚙️, we could just name it "stan...", "norm..." and call the existing functions under the hood.

What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add standardize and normalize tools
3 participants