can ai win an ml contest?

2026-05-10

Spoiler: i don't know yet. The contest isn't over — about 2 days to go. But over 10 days, a ragtag team of Claude Code, Claude Desktop, eight cloud VMs, and one me clawed up to 322.2 points out of 1000 — which, going by the leaderboard, is top 3. Here's how it happened.

how i ended up there

Someone sent me a link to an internal ML contest at work with the note "rumor is, anyone who skips it gets fired sooner or later." So i signed up, just in case :)

Turned out the task was to build a music recommender.

🗒️ The brief

I thought — "cool, i'll get to play with subgenres, artist intersections, mood. Find something non-obvious but useful for training." But — no. The data is just IDs. No names, titles, genres — everything anonymized. So it isn't a product task at all; it's purely an ML engineer's task.

I decided to try anyway. I don't know anything about ML, but i'm good at automation, and i have my own Claude Code plugin for exactly this — hats. Five sub-agents with hard role separation: Manager writes specs but can't implement, QA writes tests but can't read source, Developer makes the tests pass without ever seeing them. They check each other against the spec, not against assumptions — which is what stops the "developer just edits the test until it passes" failure mode that wrecks most agentic dev loops.

what i did

Used the plugin to walk through the problem with my ai dev team. Roughly like this:

claude .
/hats:init

# (set up the project structure)
/hats:manager

me: Hi. Here's the task (paste). Need to win. Figure out how.
ai-manager: 14 days. First we build a rig for experiments, then run hypotheses through it one by one — keep what improves recommendations, drop what doesn't.

# (wrote task specs for the team)
/hats:cto

me: Let's use what we have: this Mac, Claude Code Max 200.
ai-cto: M4 Pro, 24GB, no CUDA — RAPIDS and CatBoost-GPU are out. Stack: Python 3.11 via uv, polars instead of pandas, implicit, lightgbm, PyTorch on MPS. No MLflow — runs/<n>/ folder with config.json + result.json.

# (wrote the tech stack)
/hats:qa

ai-qa: Want me to write tests against the manager's specs?
me: Yes

# (wrote tests for environment/processes, TDD-style)
/hats:dev

me: Start
ai-dev: Done. Submit it.   # (an hour later)

That got us our first submission — 49.9 points.
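For what it's worth, the ai-cto's "no MLflow, just runs/<n>/" decision is easy to picture. Here's a minimal sketch of that logging convention in Python; the helper name and the example config and score are mine, not the actual pipeline:

import json
from pathlib import Path

RUNS = Path("runs")

def log_run(config: dict, result: dict) -> Path:
    """Write one experiment to runs/<n>/ as config.json plus result.json."""
    RUNS.mkdir(exist_ok=True)
    n = max((int(p.name) for p in RUNS.iterdir() if p.name.isdigit()), default=0) + 1
    run_dir = RUNS / str(n)
    run_dir.mkdir()
    (run_dir / "config.json").write_text(json.dumps(config, indent=2))
    (run_dir / "result.json").write_text(json.dumps(result, indent=2))
    return run_dir

# hypothetical usage: the model, parameters and score are placeholders
log_run(
    config={"model": "als", "factors": 128, "regularization": 0.01},
    result={"score": 49.9},
)

The appeal of a flat folder over a tracking server is that any of the roles (or me, over ssh) can read the whole experiment history with nothing but cat and grep.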

automating the experiments

After a couple of days of poking the developer with "you can do better!", i got tired of the manual labor — and i had to leave for the mountains anyway.

Clearly i needed more automation: someone had to sit with the ai team and keep feeding them ideas to test.

So i set up new roles (a sketch of the loop that ties them together follows the list):

  • Researcher — idea generator. Looks at what's been done and proposes 1–2 new hypotheses with specs.
  • Reflector — once a day, reads logs from 20–40 experiments and rewrites the strategy. "We see two ridges in the hyperparameters — don't collapse the search."
  • Coder — takes an idea, writes the code, drops it in the queue.
  • Uploader — Claude Desktop with Chrome access, uploading submissions through the contest UI every hour.
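
To make the hand-off concrete, here's the loop i picture between the Coder and the machine actually running experiments: a plain file queue feeding the same runs/<n>/ layout the Reflector reads. This is an illustration under my own assumptions (the real roles are Claude sub-agents, and run_experiment below is just a stub), not the plugin's actual code:

import json
import time
from pathlib import Path

QUEUE = Path("queue")   # the Coder drops experiment specs here as JSON files
RUNS = Path("runs")     # finished runs land in runs/<n>/, same layout as above

def run_experiment(config: dict) -> dict:
    """Stub for the real work: train one candidate recommender and score it."""
    return {"score": 0.0}

def runner_loop() -> None:
    """Poll the queue, run whatever the Coder queued, and log results for the Reflector."""
    QUEUE.mkdir(exist_ok=True)
    RUNS.mkdir(exist_ok=True)
    while True:
        for spec_path in sorted(QUEUE.glob("*.json"), key=lambda p: p.stat().st_mtime):
            config = json.loads(spec_path.read_text())
            result = run_experiment(config)
            n = max((int(p.name) for p in RUNS.iterdir() if p.name.isdigit()), default=0) + 1
            run_dir = RUNS / str(n)
            run_dir.mkdir()
            (run_dir / "config.json").write_text(json.dumps(config, indent=2))
            (run_dir / "result.json").write_text(json.dumps(result, indent=2))
            spec_path.unlink()   # hand-off complete: the spec leaves the queue
        time.sleep(60)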

With the system debugged, i went on vacation, keeping a Claude Code remote-control session on my phone.

ai devops

Came back and realized my Mac mini was at its limit. 40GB of swap, experiments crashing. No good new results coming through.

Talked it over with the ai-cto and we found a fix — use a Yandex Cloud grant. I installed yc cli, authenticated, and handed it to Claude Code. Asked it to be paranoid about spend, since you can really run up a bill there.
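
I don't know the exact guardrails it ended up with, but the spend paranoia i asked for boils down to "never leave a VM running for nothing". A rough sketch of that policy driving the yc CLI from Python; the subcommands are standard yc ones, while the JSON field names and the stop-everything rule are my assumptions:

import json
import subprocess

def yc_json(*args: str) -> list:
    """Run a yc command and parse its JSON output."""
    out = subprocess.run(
        ["yc", *args, "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)

def stop_running_instances(dry_run: bool = True) -> None:
    """Paranoid teardown: stop every compute instance that is still RUNNING."""
    for inst in yc_json("compute", "instance", "list"):
        # "id", "name" and "status" are what i expect in the output; check against your yc version
        if inst.get("status") == "RUNNING":
            print(("would stop" if dry_run else "stopping"), inst.get("name"), inst.get("id"))
            if not dry_run:
                subprocess.run(["yc", "compute", "instance", "stop", inst["id"]], check=True)

if __name__ == "__main__":
    stop_running_instances(dry_run=True)   # flip to False only when you mean it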

Things picked up. At peak, 8 VMs (64GB RAM each) were crunching experiments in parallel.

We squeezed out our peak — 322.2 points. The grant has now run out and that's probably the end :)

where ai fell short

The ai didn't come up with all the hypotheses on its own. It was great at generating variations: change the learning rate, add another feature, mix things together. But thinking more broadly was genuinely hard for it.

For example, the basic move of "search the internet for how this kind of task is usually solved" never crossed its mind. Same with "look for code on GitHub" or "go study solutions from similar competitions."

To figure out where it was getting stuck, i asked it to explain everything to me as if i were a child: what data we have, how we use it, "give examples", and so on. That's how we dug up things like this: different user types (eclectic listeners vs. people who replay the same stuff) need totally different approaches; we weren't using part of the data at all; we were optimizing the wrong metric.
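
The first of those findings is easy to make concrete. A rough polars sketch of splitting "repeaters" from "eclectic" listeners by how often they replay tracks; the column names are my guess at the anonymized schema, and the 0.5 cut-off is arbitrary:

import polars as pl

# hypothetical schema: one row per listen, everything anonymized to IDs
listens = pl.DataFrame({
    "user_id":  [1, 1, 1, 1, 2, 2, 2, 2],
    "track_id": [10, 10, 10, 11, 20, 21, 22, 23],
})

user_types = (
    listens
    .group_by("user_id")
    .agg(
        pl.len().alias("plays"),
        pl.col("track_id").n_unique().alias("unique_tracks"),
    )
    .with_columns((pl.col("unique_tracks") / pl.col("plays")).alias("diversity"))
    .with_columns(
        # arbitrary cut-off: low diversity = replays the same stuff, high = eclectic
        pl.when(pl.col("diversity") < 0.5)
          .then(pl.lit("repeater"))
          .otherwise(pl.lit("eclectic"))
          .alias("user_type")
    )
)
print(user_types)

Each user type can then get its own candidate generator instead of one model trying to serve both.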

Had to keep checking its work. Without intervention, it would hit a plateau fast.

bottom line

Can you win an ML contest without knowing ML?

Today, no — not if you're hoping for full automation. Ideas still have to come from you, and you have to watch that the resources go where they matter.

Yes — if you use ai and stay deeply in the loop. Understand how the whole thing should work and fix the parts ai doesn't catch.

the moral

This is a moment where you can learn pretty much anything. Try working through ai — it opens up some genuinely wild possibilities :)