diff --git a/.nojekyll b/.nojekyll index 7026e8527..6db0034e0 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -7cdd8dcf \ No newline at end of file +c8cf8f89 \ No newline at end of file diff --git a/all.html b/all.html index 0d05fe133..6e26c0d0e 100644 --- a/all.html +++ b/all.html @@ -278,7 +278,7 @@
+
Categories
A/B testing
Abrahamic Covenant
Gemini
YouTube clip
adventure
adversity
agency
ai
ai benchmarks
ai guardrails
ai mishaps
ai strategy
ai tools
analysis
analytics
apocalyptic
app review
art
ask gpt
atonement
automation
baptism
beach
beginners
big tech
biography
blogging
book of mormon
book review
book-review
build in public
business
business strategy
buying things
career
career advice
cars
challenges
charity
chart critique
chatgpt
christianity
christmas
classics
closed source
coding
come follow me
come follow me lesson plan
commandments
commitment
communication
configuration
consecration
covenant
creativity
culture
data engineering
data manipulation
data pipeline
data science
dataviz
deathbed meditation
decisions
design
dev ops
dev tools
digital minimalism
doctrine and covenants
doubts
dystopian
economics
edge device
education
effort
emotional intelligence
emotional resilience
entrepreneurship
epistemology
experimentation
faith
family
family bonding
family history
fast sunday
fatherhood
favorite scriptures
featured
fiction
food
forecasting
foundation models
friendship
futurism
gathering of israel
ggplot
github
golf
gratitude
gratitude-ThingsIHaveAtAnothersSacrifice
grit
growth
hand of the Lord
hiking
hindsight
historical
history
home
html
humanity
ibis
ideation
incentives
individual covenant
influence
innovation
insights from books
internet article bookmark
interviewing
investing
jekyll
kids books
land of promise
laws of human nature
lds culture
leadership
learning
legal-ai
let god prevail
libby
life hacks
life lessons
life musings
lists
llm
llm benchmarks
llm wars
local ai
logging
love
love of God
machine learning
machine learning platforms
marketing
meditation
memories
mental health
mindfulness
mindset
minimalism
miracles
ml pipeline
ml-tools
modern living
monitoring
mortality
movies
murder mystery
music
musings
my principles
my testimony
nature
news
non-fiction
obedience
observable
observations
obsidian
old testament
open source
opinion
organization
pandas
parenting
peace
people I meet
personal development
personal experiments
personal stories
personality
perspective
philosophy
pioneers
pkm
plotly
poetry
politics
posts
prayer
pricing
priesthood
principles
priorities
privacy
productivity
promised blessings
prophets
psychology
purpose of life
python
quarto
r
raw notes
reactions
reading
recommendation systems
redemption
regression
relationships
retrospective
revelation
saas
sales
sci-fi
scripture of the day
scriptures
self-help
service
shiny
signs
social justice
social media
software
software engineering
sports
startups
statistics
storytelling
strategy
strength
success
sunset
surveillance
survival
team building
teamwork
tech
testimony
the family a proclamation to the world
theology
therapy
ticktick
time series
tool
tool comparison
trust in the Lord
tutorial
ui generators
upskilling
water
webscraping
work
work life balance
world war 2
writing
ww2
@@ -320,7 +320,106 @@

Recent Posts

-
+
+
+

+

+

+
+
+

+Why you should log with Aimstack +

+
+ +
+
+
+tech +
+
+logging +
+
+machine learning +
+
+data pipeline +
+
+ml pipeline +
+
+data science +
+
+monitoring +
+
+ml-tools +
+
+ +
+ +
+
+
+

+

+

+
+ + +
+
@@ -2165,7 +2264,7 @@


 
diff --git a/all.xml b/all.xml index 2ad47ef78..9e88b6058 100644 --- a/all.xml +++ b/all.xml @@ -10,7 +10,133 @@ quarto-1.4.515 -Mon, 01 Apr 2024 20:46:08 GMT +Tue, 02 Apr 2024 22:15:37 GMT + + Why you should log with Aimstack + https://www.bryanwhiting.com/tech/why-you-should-log-with-aimstack.html + I’ve long idolized
Home | AimStack. It’s a tool that allows you to track metrics and hyperparameters and a whole bunch of stuff. It lets you compare across experiments.

+
+
+

+
Aim landing page
+
+
+

I first built my own version of this in RShiny back in 2017 so I could compare AUC across experiments.

+

I was using h2o.ai at the time which had this great flow for monitoring an individual experiment but made it impossible to compare across experiments.

+

Then my company Capital One built rubicon: GitHub - capitalone/rubicon-ml: Capture all information throughout your model’s development in a reproducible way and tie results directly to the model code!. This was cool and they open sourced it. It tracks parameters.

+

But it pales in comparison to aim.

+

Aim can do all these things:

+
    +
  • Track hyperparameters
  • +
  • Track learning curves (a metric over time, such as watching the error decrease with each epoch while you train DL models)
  • +
  • Track any plotly plot, which means you can put any EDA charts nicely organized in one place
  • +
  • It captures all your logging.info calls, so you don’t need a remote logging service like CloudWatch to monitor long runs
  • +
  • It has its own loggers if you want to differentiate
  • +
  • It lets you take notes on an experiment
  • +
  • It lets you compare across experiments
  • +
  • It lets you use it locally or remotely as a remote API
  • +
  • It tracks images, etc.
  • +
  • It connects with ML packages like xgboost to auto log.
  • +
  • It can even convert MLFlow data.
  • +
+
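
As a rough illustration of the first two bullets, here is a minimal sketch of tracking hyperparameters and a learning curve with Aim's Python SDK. The experiment name, the `fake_learning_curve` helper, and the hyperparameter values are made up for the example, and the import is guarded so the file still loads where Aim is not installed.

```python
try:
    from aim import Run  # third-party: pip install aim
except ImportError:
    Run = None  # lets the sketch load without Aim installed


def fake_learning_curve(epochs, lr):
    """Hypothetical stand-in for a training loop: the loss shrinks each epoch."""
    return [1.0 / (1.0 + lr * epoch) for epoch in range(epochs)]


def track_experiment(hparams, losses, experiment="aim-sketch"):
    """Log hyperparameters and a per-epoch loss to Aim; return True if logged."""
    if Run is None:
        return False
    run = Run(experiment=experiment)
    run["hparams"] = hparams  # stored as run parameters
    for epoch, loss in enumerate(losses):
        # One point on the learning curve per epoch.
        run.track(loss, name="loss", epoch=epoch, context={"subset": "train"})
    return True
```

Calling `track_experiment({"lr": 0.5}, fake_learning_curve(5, 0.5))` and then running `aim up` in the same directory should show the run and its loss curve in the UI.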

It’s incredible. Use it. It’s free.

+ + + +

_________________________

Bryan lives somewhere at the intersection of faith, fatherhood, and futurism and writes about tech, books, Christianity, gratitude, and whatever’s on his mind. If you liked reading, perhaps you’ll also like subscribing:

]]> + tech + logging + machine learning + data pipeline + ml pipeline + data science + monitoring + ml-tools + https://www.bryanwhiting.com/tech/why-you-should-log-with-aimstack.html + Tue, 02 Apr 2024 22:15:37 GMT + + + Configurations for ML Pipelines + https://www.bryanwhiting.com/tech/configurations-for-ml-pipelines.html + Configuring an ML pipeline means you have 15 different things that could change at any time and you create a way to easily change those 15 things. Those 15 things could be file paths, data filtering steps, models you want to use, etc.

+

Any researcher constantly asks themselves: but what if I switch X?

+

And so the researcher starts to configure a pipeline.

+
+

Config Methods I’ve Used

+

First time I built a config it was in VBA. I had a text file I loaded in that could be overwritten based on the settings someone chose.

+

Second time I config’d something was in Python. I used ConfigParser per my manager David Mantilla’s suggestion. It was pretty good. But unwieldy. Don’t use this.
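
For reference, a small sketch of what the ConfigParser approach looks like. The section and key names are invented for illustration. It also hints at why this gets unwieldy: every value is stored as a string, so each read needs its own converter.

```python
import configparser

# Hypothetical INI content; in practice this would live in a .ini file.
INI_TEXT = """
[paths]
train_data = data/train.csv

[model]
name = xgboost
max_depth = 6
learning_rate = 0.1
use_gpu = no
"""


def load_config(text):
    """Parse INI text into a ConfigParser object."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


config = load_config(INI_TEXT)

# Values come back as strings unless you ask for a specific type.
raw_depth = config["model"]["max_depth"]
max_depth = config.getint("model", "max_depth")
learning_rate = config.getfloat("model", "learning_rate")
use_gpu = config.getboolean("model", "use_gpu")
```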

+

Third thing I saw was using a Python constants.py file. Just import Python variables from another module. This is nice because you can import model objects or such. Dicts. Whatever. Seems great. But it’s Python code. Config files shouldn’t be code. They should be configs. Every great software follows this, like k8s helm charts or whatever. Google loves using Protobufs. Configs shouldn’t be code, because if they’re code they’re dangerous. You start doing crazy things with them.

+

Fourth thing I did was to use YAML which is very clean. Lots of people like YAML. This gets unwieldy if you need 100 configs for different customers, for example. Can you imagine managing 100 yaml files? What if you need to update one param? Then you need to update 100 files. Rough.

+

I also used Pydantic to read in the YAML file and validate types. Gotta validate types. What’s an int vs a string? Well, this meant that we needed to design the pipeline to rely on some config class. We had to pass this config object around everywhere. Not super ideal but gets the job done.
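
The idea of a typed config object can be sketched without extra installs. This is not the Pydantic code from that project: it swaps in a standard-library dataclass for the Pydantic model and a plain dict for the parsed YAML, and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import get_type_hints


@dataclass(frozen=True)
class PipelineConfig:
    train_path: str
    model_name: str
    max_depth: int
    learning_rate: float

    def __post_init__(self):
        # Reject any value whose type does not match its annotation.
        for name, expected in get_type_hints(type(self)).items():
            value = getattr(self, name)
            if not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, got {type(value).__name__}"
                )


def parse_config(raw):
    """Build a validated config from a dict (standing in for loaded YAML)."""
    return PipelineConfig(**raw)


raw = {
    "train_path": "data/train.csv",
    "model_name": "xgboost",
    "max_depth": 6,
    "learning_rate": 0.1,
}
config = parse_config(raw)
```

A string where an int is expected, such as `"max_depth": "6"`, fails at load time instead of deep inside the pipeline. The cost is the one described above: every step now depends on this one class.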

+

Fifth thing was to use one default YAML. This solved the issue of redundancy across all the 100 YAML files. (If you only have one model, you probably need only 1 YAML file so this may not be your problem.) But this still kinda stinks. It’s in a file.
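
A minimal sketch of the one-default-plus-overrides idea, using plain dicts in place of YAML files. The customer names and keys are made up; the point is that each customer only stores what differs from the default.

```python
import copy

# Hypothetical default config shared by every customer.
DEFAULTS = {
    "data": {"path": "data/train.csv", "min_rows": 1000},
    "model": {"name": "xgboost", "max_depth": 6},
}

# Each customer lists only the values that differ from the default.
CUSTOMER_OVERRIDES = {
    "acme": {"model": {"max_depth": 3}},
    "globex": {"data": {"path": "data/globex.csv"}},
}


def deep_merge(base, override):
    """Return a new dict with override applied on top of base, recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def config_for(customer):
    """Resolve the full config for one customer."""
    return deep_merge(DEFAULTS, CUSTOMER_OVERRIDES.get(customer, {}))
```

Changing one shared parameter now means editing `DEFAULTS` once instead of touching 100 files.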

+

Sixth thing was some engineers on my team saw a better way and built a CRUD database. This made it so we didn’t have to do a code change to update a config. That means that people outside the team could edit a config. Awesome. But we still have 100 web pages that could change. Kinda sux. And we still had a default setting - essential.

+

Problem with 100 YAML or 100 web pages is that if you want to change things or run experiments, you need to literally clone the config file. Now you have 15 versions of the same config file with slight modifications and you can’t remember what’s going on. So you delete them all eventually and just pick one.

+
+
+

Hydra?

+

I’m writing this because I just learned about hydra . Remember: I use this site for note taking. Here’s what I just learned.

+
+
+

+
Hydra!
+
+
+

Hydra | Hydra

+
    +
  • Hydra is an open source Python package maintained by Facebook
  • +
  • It’s built to configure pipelines, in particular ML pipelines, but could be used for anything.
  • +
  • It uses dataclasses and yaml files, so I’m thinking I was smart for what I did with my fourth option.
  • +
  • But it quickly allows you to override config files from the command line or from editing the yaml file directly.
  • +
+
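
To make the override idea concrete, here is a standard-library imitation of `key=value` overrides applied to a nested config. This is not Hydra's implementation or API, only a sketch in the same spirit; the `apply_overrides` helper and the config keys are hypothetical.

```python
import ast
import copy


def apply_overrides(config, overrides):
    """Apply 'a.b=value' style overrides to a nested dict, returning a copy."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        try:
            value = ast.literal_eval(raw_value)  # "3" -> 3, "0.1" -> 0.1
        except (ValueError, SyntaxError):
            value = raw_value  # keep bare words such as lightgbm as strings
        node = result
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


base = {"model": {"name": "xgboost", "max_depth": 6}, "seed": 0}
cfg = apply_overrides(base, ["model.max_depth=3", "model.name=lightgbm", "seed=42"])
```

The base config stays untouched, so one file can serve as the starting point for many runs.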

This demo is slick:

+
+
+
+

Why I like Hydra’s design

+
    +
  • This seems really nice because it avoids the headache of changing code.
  • +
  • Also, you can create a simple bash loop to execute 5 different experiments - while retaining only one config file
  • +
  • Also, I used to think that having a system of record is pretty important: I need to save the configs that were used for this run. That tells me how the pipeline or experiment was configured. I still think that’s true: but I believe that should be done via logging instead of managing 15 config files.
  • +
  • My new belief is that experiments should be ephemeral to keep the code clean. Have one prod yaml file and then everything else is ephemeral. Log everything: log the created yaml file with all defaults filled in so you can recreate it if necessary.
  • +
+
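
The "one prod config, ephemeral experiments, log everything" idea above can be sketched roughly as follows. The sweep values, the `run_experiment` placeholder, and the logger name are assumptions for illustration; a real run would train a model where the placeholder only logs.

```python
import itertools
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("experiments")

# The single config that lives in the repo.
PROD_CONFIG = {"model": "xgboost", "max_depth": 6, "learning_rate": 0.1}

# Ephemeral variations: never saved as separate config files.
SWEEP = {"max_depth": [3, 6], "learning_rate": [0.05, 0.1, 0.3]}


def expand_sweep(base, sweep):
    """Yield one fully resolved config per combination of sweep values."""
    keys = sorted(sweep)
    for values in itertools.product(*(sweep[k] for k in keys)):
        yield {**base, **dict(zip(keys, values))}


def run_experiment(config):
    """Placeholder for training: log the resolved config as the system of record."""
    resolved = json.dumps(config, sort_keys=True)
    logger.info("resolved config: %s", resolved)
    return resolved


logged = [run_experiment(cfg) for cfg in expand_sweep(PROD_CONFIG, SWEEP)]
```

Because every run logs its fully resolved config, any experiment can be recreated from the log without keeping 15 near-identical files around.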
+
+

My Ideal World

+

Hydra doesn’t solve the “you shouldn’t have to do a git push to update prod” problem. If everything is a yaml file then to update prod you need to do a code change.

+

Google doesn’t seem to mind using code changes because everything is a protobuf. Code changes are nice because they’re reviewed.

+

But code changes are slow. And non-coders can’t do them.

+

So I believe one prod config should live in a UI with a database backend. But then that should be serialized to yaml and loaded via something like hydra.
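
A rough sketch of that shape: a tiny database-backed store that a UI could write to, plus an export step that serializes the current prod config to a file format. Everything here is an assumption for illustration. It uses an in-memory SQLite table with made-up keys, and it exports JSON rather than YAML because the standard library has no YAML writer.

```python
import json
import sqlite3


def create_store():
    """Create an in-memory key/value table standing in for the UI's backend."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE prod_config (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    return conn


def set_value(conn, key, value):
    """Insert or update one setting, as an edit from the UI would."""
    conn.execute(
        "INSERT OR REPLACE INTO prod_config (key, value) VALUES (?, ?)",
        (key, json.dumps(value)),
    )


def export_config(conn):
    """Serialize the current prod config so a config loader can read it."""
    rows = conn.execute("SELECT key, value FROM prod_config ORDER BY key").fetchall()
    config = {key: json.loads(value) for key, value in rows}
    return json.dumps(config, indent=2, sort_keys=True)


conn = create_store()
set_value(conn, "max_depth", 6)
set_value(conn, "model", "xgboost")
set_value(conn, "max_depth", 4)  # an edit made without a code change
exported = export_config(conn)
```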

+

I also believe experiments should be launched programmatically. Meaning I should be able to kick off 10 experiments training 10 models using a bash script. I can then log this experiment using Why you should log with Aimstack and I can log the full config there.

+

Also, I’d throw hydra in with metaflow .

+

How do you configure? # Appendix

+ + + +
+ +

_________________________

Bryan lives somewhere at the intersection of faith, fatherhood, and futurism and writes about tech, books, Christianity, gratitude, and whatever’s on his mind. If you liked reading, perhaps you’ll also like subscribing:

]]>
+ tech + configuration + software engineering + ml pipeline + ml-tools + https://www.bryanwhiting.com/tech/configurations-for-ml-pipelines.html + Tue, 02 Apr 2024 21:50:43 GMT +
Brave New World Aldous Huxley @@ -96,7 +222,7 @@

Polars makes this much cleaner with the with_columns operator, for example, which is also very similar to PySpark. But polars is Rust backend, not Java. Game, Set, Match polars.

I’m not the only one who loves dplyr . There have been several Python attempts to build dplyr in the Python ecosystem. There were great packages like GitHub - coursera/pandas-ply: functional data manipulation for pandas (9 years since last commit), and GitHub - dodger487/dplython: dplyr for python, not updated in 7 years. Then there’s the dfply package that hasn’t been maintained in 5 years (see tutorial).

The siuba package is the latest Python dplyr incantation that is actively maintained: GitHub - machow/siuba: Python library for using dplyr like syntax with pandas and SQL and can also execute against a SQL backend, but it can’t execute a polars backend. # Ibis to solve my problems?

-

I just came across ibis however, and it seems really promising. Turns out it was created in 2015 by Wes McKinney to solve the “10 Things I Hate About pandas”.

+

I just came across ibis however, and it seems really promising. Turns out it was created in 2015 by Wes McKinney, who created the pandas pyarrow backend to solve the “10 Things I Hate About pandas”. More on the