AI vs Advent of Code 2024: Results

Based on how these eight models performed across the 25 Advent of Code 2024 puzzles, we can tease out practical hints about which models might fit common real-world programming tasks. Below is a synthesis of what typical programmers often need (parsing logs, automating workflow scripts, writing or refactoring code with complex logic, and so on) and how each model's puzzle-solving skills might, or might not, translate to those needs. Most people focus on the obvious points (e.g., which model has the highest solve rate), but there are subtler implications:


1. o1 – “The Heavy-Duty Problem Solver”

  1. Notable Puzzle Strengths

    • Consistently solves tough part2 puzzles that stump others (Days 15, 20, 22, etc.). Many of these revolve around multi-step BFS or 2D grid logic, plus complicated constraints that naive solutions miss.
    • Achieves a 92% success rate on part1 tasks and still leads the pack (66.7%) on part2 tasks.
  2. Real-World Mapping

    • Complex Workflow Orchestration: If you're automating a pipeline (CI/CD, DevOps scripts, large multi-step tasks) or diagnosing multi-phase processes, you often need a system that can handle "if this step fails, do X; otherwise proceed to Y" logic, similar to the advanced pathfinding and stateful reasoning that o1 repeatedly nailed (a minimal search sketch appears at the end of this section).
    • Large Code Refactoring: Big refactors have "knock-on" effects across many files. o1's strong performance on global or iterative puzzles suggests it can keep track of many interacting conditions across a large codebase.
    • Branching/Tree-Filled Logic: Day 20–style “race condition cheats” reflect scouring large solution spaces for shortcuts—akin to searching for an optimal algorithmic path in a labyrinth of code.

Insight People Might Miss:
Many see a top performer as only for “hard tasks,” but everyday dev chores can surprise you with chain-of-thought complexity. If you often do tricky merges or test coverage logic, o1’s puzzle performance suggests it’s good for code tasks that can’t be solved with a quick straightforward script.
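
To make the "multi-step BFS with stateful constraints" idea concrete, here is a minimal sketch of the search pattern these puzzles (and many dependency-tracing or pipeline-diagnosis chores) reduce to. The maze and coordinates below are invented for illustration; none of this is taken from the repository's actual solutions.

```python
from collections import deque

def shortest_path(grid, start, goal):
    """Plain BFS over a 2D grid of '#' walls and '.' floor.

    Returns the number of steps from start to goal, or None if the
    goal is unreachable. This is the core of many AoC-style maze
    puzzles and, loosely, of dependency- or pipeline-tracing chores.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None

# Toy input, purely illustrative:
maze = ["....#",
        ".##.#",
        "....."]
print(shortest_path(maze, (0, 0), (2, 4)))  # -> 6
```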


2. o1-mini – “High Value, Less Raw Power”

  1. Notable Puzzle Strengths

    • Second overall (61.2%). Succeeds on a variety of BFS/logic puzzles but fails part2 more often than o1.
    • Still 72% on part1 tasks—meaning it’s quite solid for “medium-complexity” logic.
  2. Real-World Mapping

    • Moderately Complex Problems: If you want a lightweight model to handle daily tasks like “generate a small script,” “debug that function,” or “walk me through code changes,” o1-mini’s decent track record on BFS-like puzzle logic suggests it’s comfortable with multi-step reasoning—but not at the extreme scale.
    • Everyday Scripting: Because it’s strong on part1 tasks (the “straightforward half” of many AoC puzzles), it might be good at typical tasks that don’t require advanced backtracking or major architectural leaps.

Insight People Might Miss:
Smaller or “mini” models sometimes get dismissed, but a large portion of day-to-day coding is part1-level complexity: analyzing logs, rewriting small classes, etc. o1-mini’s puzzle performance indicates it can handle a surprising range of tasks short of the truly labyrinthine ones.


3. Claude-3.5-Sonnet – “Balanced Text & Logic”

  1. Puzzle Performance

    • 55% overall. Ties with gemini-1206.
    • Good on certain BFS or part2 puzzles (Day 7 and Day 16 with 2/2 success). Strong on text-manipulation tasks like Day 4 and Day 19.
  2. Real-World Mapping

    • Text-Heavy But Some Complexity: For typical code tasks that blend a ton of text manipulation (like processing logs or building queries from user input) with mild logic, claude-3.5-sonnet does well.
    • Documentation & Explanation: Claude models often excel at generating friendly, expanded text—like writing docstrings or user guides. The puzzle data shows it thrives on puzzles that are half textual search, half logic, rather than purely numeric or purely BFS.

Insight People Might Miss:
Day 16’s success is particularly telling: that puzzle involves “score-minimizing path with rotational cost,” i.e. a type of route planning plus cost trade-offs. If your daily job involves scheduling tasks or comparing trade-offs (like “server usage cost vs. speed”), claude-3.5-sonnet has shown it can handle more than just text rewriting.
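
For readers who haven't seen that puzzle shape, the following is a minimal sketch of "shortest path where turning has its own cost", the structure Day 16 exercises. The 1-per-step and 1000-per-90°-turn costs follow the puzzle as described; the toy maze and start/goal positions are invented for illustration.

```python
import heapq

def min_score(grid, start, goal, step_cost=1, turn_cost=1000):
    """Dijkstra over (row, col, facing) states: moving forward costs
    step_cost, rotating 90 degrees in place costs turn_cost."""
    dirs = [(0, 1), (1, 0), (0, -1), (-1, 0)]      # E, S, W, N
    rows, cols = len(grid), len(grid[0])
    pq = [(0, start[0], start[1], 0)]              # start facing East
    best = {}
    while pq:
        cost, r, c, d = heapq.heappop(pq)
        if (r, c) == goal:
            return cost
        if best.get((r, c, d), float("inf")) <= cost:
            continue
        best[(r, c, d)] = cost
        # Either step forward in the current facing...
        nr, nc = r + dirs[d][0], c + dirs[d][1]
        if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
            heapq.heappush(pq, (cost + step_cost, nr, nc, d))
        # ...or rotate left/right without moving.
        for nd in ((d + 1) % 4, (d - 1) % 4):
            heapq.heappush(pq, (cost + turn_cost, r, c, nd))
    return None

# Toy maze: start bottom-left facing East, goal top-right.
maze = ["....",
        ".##.",
        "...."]
print(min_score(maze, (2, 0), (0, 3)))  # -> 1005 (5 steps + one turn)
```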


4. Gemini-1206 – “Arithmetic & Expression Master”

  1. Puzzle Performance

    • Tied at 55%. Standouts: Day 7, Day 11, Day 13, Day 14 part2. Many revolve around numeric transformations, operator insertion, or left-to-right expression parsing.
    • Sometimes bombs entire days with complicated state-based or large puzzle logic (e.g., Day 15).
  2. Real-World Mapping

    • Math-Heavy Code: If your daily tasks include optimizing math expressions, generating numeric data transformations, or building queries that rely on complex arithmetic logic, gemini-1206’s puzzle pattern is relevant.
    • Validation for Combinatorial/Regex Tasks: Day 7 was all about inserting operators with unusual concatenation rules (a short sketch appears at the end of this section). That's reminiscent of rewriting or verifying code that manipulates strings and numbers in lockstep.

Insight People Might Miss:
Even "regular developers" occasionally need to do nontrivial arithmetic, like analyzing performance formulae or quickly writing code that merges or splits numeric fields. Gemini's repeated success on part2 for numeric puzzle days suggests it can go further than one might expect on math-driven tasks, especially when the transformations go beyond what a typical one-off script handles.
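
As a concrete illustration of that operator-insertion pattern, here is a minimal sketch of Day 7's core check: try every combination of +, *, and (in part2) || digit concatenation, evaluated strictly left to right. The numbers below are toy examples, not puzzle input.

```python
from itertools import product

def can_reach(target, nums):
    """Return True if some left-to-right combination of +, * or ||
    (digit concatenation) between nums produces target."""
    ops = ("+", "*", "||")
    for combo in product(ops, repeat=len(nums) - 1):
        total = nums[0]
        for op, n in zip(combo, nums[1:]):
            if op == "+":
                total += n
            elif op == "*":
                total *= n
            else:                       # concatenation: 12 || 3 -> 123
                total = int(str(total) + str(n))
        if total == target:
            return True
    return False

# Toy examples:
print(can_reach(292, [11, 6, 16, 20]))   # True: 11 + 6 * 16 + 20
print(can_reach(7290, [6, 8, 6, 15]))    # True: 6 * 8 || 6 * 15
```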


5. Deepseek – “Occasional Breakthroughs, but Inconsistent”

  1. Puzzle Performance

    • 49% overall, though it had some unique wins (it alone solved Day 6 part2; it also got Day 18 part2, etc.). Often does part1 well but stumbles on part2.
  2. Real-World Mapping

    • Targeted Problem Solving: If you occasionally face tasks where you suspect a single clever trick or short-circuit—like a puzzle that no other approach catches—Deepseek’s uniqueness might help.
    • Log/Regex/Filtering: Some of its puzzle wins revolve around parsing or partial BFS, so it is possibly good for "searching big logs" or "finding anomalies" (a toy sketch appears at the end of this section).

Insight People Might Miss:
Despite the 49% overall, Deepseek's scattered strong solves suggest it can figure out corner-case logic that others miss. That could be valuable if you rely on it for niche debugging scenarios or corner-case detection in your code.
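
To ground the "searching big logs / finding anomalies" chore, here is a purely illustrative sketch. The log format, field names, and the took=...ms pattern are invented assumptions, not anything from the repository.

```python
import re

# Hypothetical "timestamp LEVEL message" log format, purely illustrative.
LINE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")

def find_anomalies(lines, slow_ms=500):
    """Yield ERROR lines and any request slower than slow_ms."""
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        if m["level"] == "ERROR":
            yield line
        else:
            took = re.search(r"took=(\d+)ms", m["msg"])
            if took and int(took.group(1)) > slow_ms:
                yield line

sample = [
    "2024-12-01T10:00:00 INFO GET /health took=12ms",
    "2024-12-01T10:00:01 ERROR upstream timeout",
    "2024-12-01T10:00:02 INFO GET /report took=910ms",
]
print(list(find_anomalies(sample)))   # the ERROR line + the 910ms line
```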


6. Claude-3.5-Haiku – “Text-Focused, but Struggles on Complex Branching”

  1. Puzzle Performance

    • 38.8% overall. Part2 success is only ~20%. So it tends to solve simpler or text-oriented puzzles (Days 1, 2, 3, etc.) but rarely hits home runs on advanced logic.
  2. Real-World Mapping

    • Simpler Code Generation & Summaries: Possibly good for everyday tasks like “generate docstrings,” “convert code comments,” or “draft a short script,” but do not expect it to handle multi-step BFS or big integer logic well.
    • RegEx or Minimal State: Fine for tasks that need quick text filtering or short pipeline logic.

Insight People Might Miss:
Comparing Haiku vs. Sonnet: Haiku repeatedly falls behind on heavy BFS or big numeric expansions. So if your day job involves continuous-integration logic or advanced code transformations, pick something else. But for short "help me rewrite these lines" requests, it suffices.


7. GPT-4o – “Capable in Some Parsing, but Spotty on Large BFS/State”

  1. Puzzle Performance

    • About 41% overall. It has random bright spots (Day 1, Day 3, Day 18, Day 19) but also glaring misses (Days 5, 6, 20, 22, etc.).
    • Part2 success is only 33%.
  2. Real-World Mapping

    • Parsing + Basic Logic: GPT-4o can do well on reading logs or doing medium-level manipulations (like text splitting, searching, or partial BFS). Possibly decent for rewriting code or explaining short code blocks.
    • Fails on Extended Multi-Step Logic: The data suggests it cracks under tasks that need lots of internal state or a tricky final step.

Insight People Might Miss:
Many assume "GPT-x" is best for everything, but these puzzles show GPT-4o is not unstoppable. For daily tasks with complicated multi-step logic or specialized BFS, you might want o1 or at least a more consistent puzzle solver. GPT-4o is still good enough for quick code scaffolding or partial debugging.


8. GPT-4o-mini – “Lightweight, Rarely Surpasses Basic Tasks”

  1. Puzzle Performance

    • Lowest success rate (22.4%). Rarely solves part2 tasks (only 2 out of 24).
    • Surprisingly, it solved Day 25 part1 while the full GPT-4o did not, so it can have the occasional niche success.
  2. Real-World Mapping

    • Very Simple or "One-Shot" Tasks: If you only need quick code snippets, or it's auto-completing small changes in your editor, it might be fine.
    • Don’t rely on it for complicated logic, especially multi-step.

Insight People Might Miss:
If your day-to-day is mostly "type this snippet faster," GPT-4o-mini might be enough. But as soon as your work shifts toward advanced bugfixing or multi-stage logic, it will likely fail. The puzzle data is a big caution sign that it can't handle second parts or extended puzzles.


9. Overlooked Insights for Everyday Devs

  1. Multi-Step BFS/DFS Really Does Show Up in Regular Development.

    • Migrations, code transformations, large-scale reorganizations, dependency resolution (think: partial topological sorts for library upgrades; a minimal sketch follows this list): all of these are BFS-like or search-based tasks. Models that solve tough part2 tasks (like o1) are prime choices.
  2. Arithmetic & Expression Handling Doesn’t Just Mean “Math.”

    • If your backend pipelines do string-to-number merges, or your code frequently modifies values in different formats, a model like gemini might excel at that repetitive transform. Even if you think you’re “just a web dev,” you’d be surprised how often weird numeric logic arises (report generation, analytics scripts, data cleaning, etc.).
  3. Sometimes, Part2–Style Problems Map to “Refactoring” or “Edge Cases.”

    • Part2 puzzles tend to twist the original solution, like handling an exception or a bigger input size. In everyday coding, that's exactly where subtle bugs appear. So a model's ability to handle a puzzle's "expanded scenario" translates to how it might handle edge-case or advanced-scenario code. o1 often nails these expansions.
  4. Unsolved Days (Day 21, Day 17 part2, etc.) Indicate That All Models Have Blind Spots.

    • If you expect these LLMs to solve extremely meta or specialized tasks (like complicated concurrency or generation of self-modifying code), none of them might handle it well. The puzzle data is a reminder that all LLMs have outer limits.
  5. Text Fluency and Logic Are Different Skills.

    • The "Claude Haiku" vs. "Claude Sonnet" difference, and the big gap between GPT-4o and GPT-4o-mini, show that raw text fluency doesn't always translate into solving logic puzzles. If your tasks revolve purely around phrasing and documentation, a text-savvy but logic-limited model will do. If your tasks revolve around deep logic or code constraints, pick a puzzle-savvy model like o1 or Gemini-1206.
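
Below is a minimal sketch of the "partial topological sort for library upgrades" idea from point 1 above, using Kahn's algorithm. The package names and dependency graph are invented for illustration.

```python
from collections import defaultdict, deque

def upgrade_order(deps):
    """Kahn's algorithm. deps maps package -> set of packages it
    depends on. Returns an upgrade order or raises on a cycle."""
    indegree = defaultdict(int)
    dependents = defaultdict(list)
    for pkg, reqs in deps.items():
        indegree.setdefault(pkg, 0)
        for r in reqs:
            indegree[pkg] += 1
            dependents[r].append(pkg)
            indegree.setdefault(r, 0)
    ready = deque(p for p, d in indegree.items() if d == 0)
    order = []
    while ready:
        pkg = ready.popleft()
        order.append(pkg)
        for nxt in dependents[pkg]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(indegree):
        raise ValueError("dependency cycle detected")
    return order

# Invented example graph:
print(upgrade_order({"app": {"lib-a", "lib-b"},
                     "lib-a": {"core"},
                     "lib-b": {"core"},
                     "core": set()}))
# -> ['core', 'lib-a', 'lib-b', 'app']
```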


Final Practical Advice

  • Use o1 if your daily coding involves deep chained logic, tricky migrations, or advanced BFS-like problem solving.
  • Use o1-mini if you want a smaller model that still handles moderate BFS/DFS or step-by-step logic reliably.
  • Use Gemini-1206 for code that has heavy arithmetic or expression-based manipulations—like analyzing cost functions, rewriting formulas, or systematically inserting operators.
  • Use Claude-3.5-Sonnet for a good balance of text understanding (like rewriting docstrings) plus moderate BFS puzzle logic.
  • Use GPT-4o only if you’re sure your tasks are in that mid-range, mostly parsing or straightforward logic. Don’t expect it to handle the toughest expansions.
  • Use GPT-4o-mini for trivial or short tasks, like snippet generation or template expansions. It won’t handle big leaps or multi-stage puzzle logic.

This deeper cross-reference of puzzle performance with typical “developer chores” underscores that puzzle-solving skill does mirror real coding challenges more than many suspect—and that picking the right model can save you huge debugging or refactoring headaches in your day-to-day development workflow.
