Added notebook which modifies the preprocessing step for data normalisation #30
base: main
Conversation
Tested performance with three training-data normalisation approaches: min-max scaling, standard scaling, and robust scaling. Results showed no improvement with any of these approaches.
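For context, a minimal standalone sketch of what the three scalers do, shown on toy data rather than the actual storybook features:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
import numpy as np

# One feature column with an outlier, to show how the scalers differ.
x = np.array([[120.0], [340.0], [560.0], [9800.0]])

print(MinMaxScaler().fit_transform(x).ravel())    # rescale to [0, 1]
print(StandardScaler().fit_transform(x).ravel())  # zero mean, unit variance
print(RobustScaler().fit_transform(x).ravel())    # centre on median, scale by IQR
```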
@eve-b612 Your commit contains 20,457 new lines, so I'm not sure where to look. Would you mind pointing me to the line number where you did the 3 tests?
I think I found it now. It's this part, is that correct?

```json
{
  "cell_type": "code",
  "source": [
    "## MIN MAX SCALER\n",
    "\n",
    "with open(\"step1_2_preprocess_data.py\", \"r\") as file:\n",
    "    data = file.readlines()\n",
    "\n",
    "# insert min-max Scaler after unnecessary columns are dropped\n",
    "for i, line in enumerate(data):\n",
    "    if \"storybooks_dataframe = storybooks_dataframe[['id', 'reading_level',\" in line:\n",
    "        data.insert(i + 1, \"\"\"\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "scaler = MinMaxScaler()\n",
    "storybooks_dataframe[['chapter_count', 'paragraph_count', 'word_count', 'avg_word_length']] = scaler.fit_transform(\n",
    "    storybooks_dataframe[['chapter_count', 'paragraph_count', 'word_count', 'avg_word_length']])\n",
    "print(storybooks_dataframe[['chapter_count', 'paragraph_count', 'word_count', 'avg_word_length']].head())\n",
    "\n",
    "\"\"\")\n",
    "\n",
    "# write to file\n",
    "with open(\"step1_2_preprocess_data.py\", \"w\") as file:\n",
    "    file.writelines(data)\n"
  ],
  "metadata": {
    "id": "wnqmawHJQoGq"
  },
  "execution_count": 26,
  "outputs": []
}
```
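Unescaped, the Python that this cell splices into `step1_2_preprocess_data.py` boils down to the following (note that `storybooks_dataframe` is defined earlier in that script, so the snippet is not standalone):

```python
from sklearn.preprocessing import MinMaxScaler

# Scale the four numeric feature columns to the [0, 1] range.
scaler = MinMaxScaler()
storybooks_dataframe[['chapter_count', 'paragraph_count', 'word_count', 'avg_word_length']] = scaler.fit_transform(
    storybooks_dataframe[['chapter_count', 'paragraph_count', 'word_count', 'avg_word_length']])
print(storybooks_dataframe[['chapter_count', 'paragraph_count', 'word_count', 'avg_word_length']].head())
```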
Yes! So sorry, I'm new to this and should have specified. It's line 5205 for min-max, line 10282 for standard scaler, and line 15382 for robust scaler. I think it's so long because the notebook outputs are there... I would appreciate any feedback on this approach. Clear outputs? Submit changes in a different format than notebooks? Thanks :)
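On the "clear outputs?" question, stripping stored outputs before committing keeps the diff reviewable. A sketch using nbformat; the filename is a placeholder, not the actual notebook in this PR:

```python
import nbformat

path = "normalisation_experiments.ipynb"  # placeholder filename

# Remove stored outputs and execution counts from every code cell.
nb = nbformat.read(path, as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []
        cell.execution_count = None
nbformat.write(nb, path)
```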
@eve-b612 Well, if you want to be able to easily see how your code changes affect the data further down in the machine learning pipeline, it would probably be easier to add your changes to a Python script instead of a Jupyter notebook (for example, by adding your code changes to …)
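For illustration, a sketch of what the normalisation step could look like if written directly into the preprocessing script; the helper name and the optional `method` switch are assumptions, not existing project code:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

NUMERIC_COLS = ['chapter_count', 'paragraph_count', 'word_count', 'avg_word_length']
SCALERS = {'minmax': MinMaxScaler, 'standard': StandardScaler, 'robust': RobustScaler}

def normalise(storybooks_dataframe, method=None):
    """Scale the numeric columns; pass method=None to leave the data unscaled."""
    if method is not None:
        scaler = SCALERS[method]()
        storybooks_dataframe[NUMERIC_COLS] = scaler.fit_transform(
            storybooks_dataframe[NUMERIC_COLS])
    return storybooks_dataframe
```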
Ok, I see. Would you like me to do that? You mentioned that if normalisation did not improve the results then you wouldn't include it, and in fact it did not improve them.
@eve-b612 If the normalization didn't improve the accuracy, then no need to add any more code 👍 But maybe you could add a few words about the experiments you did at the bottom of the step1_prepare README. Thank you for running the tests! 🙂
Added description of normalisation experiment, did not improve model accuracy.
No problem! I've just updated the step1_prepare README file.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
pmml/step1_prepare/README.md (1)
Lines 26-28: Approve the new section with suggestions for improvement

The addition of this section on data normalization testing is valuable and aligns well with the PR objectives. It provides a clear summary of the experiments conducted and their results. However, there are a few suggestions to enhance its clarity and usefulness:
- Consider improving the formatting for better readability. For example:
```markdown
## Testing Data Normalization

Three variations of training data normalization were tested to improve model accuracy:

- Min-max scaling
- Standard scaling (z-score)
- Robust scaling

Findings: The model results did not improve with these normalization techniques.
```
- It would be beneficial to add more details about the implementation, such as:
  - Brief descriptions of each normalization technique
  - The specific metrics used to evaluate model accuracy
  - Any notable observations during the testing process
- Include references to the specific code or notebooks where these tests were performed. This will help future contributors understand the exact implementation and potentially reproduce the results.
Would you like assistance in expanding this section with the suggested improvements?