Partner with the AI, throw away the code

Matteo is a developer and Technical Principal. He likes it when Extreme Programming helps make teams and businesses successful.

This article is part of “Exploring Gen AI”, a series capturing Thoughtworks technologists' explorations of using gen AI technology for software development.

31 July 2025

Summary: a personal experience of how AI helped complete a non-trivial programming task.

A difficult task

This month I spent a whole week working on a really difficult algorithmic problem, made more difficult by complicated business rules, which I wasn’t able to get the client to agree to simplify.

The problem was an API endpoint that was too slow, and the cause was a few complicated SQL queries that could take minutes on some datasets. The function was so complicated I did not even attempt to understand the details; my bet was that it could be fixed by moving from Transaction Script to Domain Model (I find the patterns from the PoEAA book very useful to describe what we see in enterprise applications).

The technology stack was Go and MySQL; I had a Cursor license from the client, and I used Claude Sonnet 4 with it.

Reverse engineering

The usual problem with transaction scripts is that they are a combination of queries and glue code: the business rules are spread between the queries and the glue, and they are never stated explicitly.
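
To make that concrete, here is a minimal, entirely hypothetical sketch of the shape I mean; nothing in it comes from the client's code, it only illustrates how a rule like “only active branches get the new group” ends up split between the SQL and the glue, assuming plain database/sql:

  // Hypothetical transaction-script shape (invented tables and rule):
  // part of the rule lives in the SQL, part in the glue code,
  // and none of it is named anywhere.
  func assignGroup(db *sql.DB, orgID, groupID int64) error {
    rows, err := db.Query(
      `SELECT b.id FROM branches b
       WHERE b.org_id = ? AND b.status = 'active'`, orgID)
    if err != nil {
      return err
    }
    defer rows.Close()
    for rows.Next() {
      var branchID int64
      if err := rows.Scan(&branchID); err != nil {
        return err
      }
      // one round trip per row; more hidden logic, more latency
      if _, err := db.Exec(
        `INSERT INTO branch_groups (branch_id, group_id) VALUES (?, ?)`,
        branchID, groupID); err != nil {
        return err
      }
    }
    return rows.Err()
  }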

I had a vague idea of the requirements from conversations with the client. I could also look at the extensive test cases, but they were not easy to understand.

My first problem was to understand the exact business rules that the endpoint implemented, but I did not trust my ability to understand complicated code. OK, that’s only part of the story: to be honest, I’m lazy, and the first thing I do is ask AI for help.

Read <function name> in file @<file> and write good documentation about what it does.
You may use `mysql -u... -p... -h127.0.0.1 ...` to inspect the DB schema.
You may refer to @doc.go  for information about the tables involved

Giving the AI access to mysql enables it to explore the schema and try queries. The result was a first shot at understanding the business rules. Not perfect! The doc was not precise enough to generate acceptance criteria for a reimplementation.

On reflection, I could have asked it to generate the acceptance criteria, or even the test cases; perhaps it would have been successful.

Takeaway: ask the AI to explain the code.

Benchmark

My next task was to ensure I had a way to measure performance improvements. I asked Cursor to generate a benchmark using the nifty native Go benchmarking facility. One nice thing in Go is that most tests are written in tabular form, so I had a benchmark test that would print the time taken by the operation under test, given different inputs. The Go benchmark output is difficult to read because it reports times in nanoseconds. However, if you ignore the nine rightmost digits, you can see that the simple cases took 21 and 11 seconds, respectively, while the pathological one took over seven minutes. So now I had a solid baseline for improvement.

> go test -bench=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx ./xxx -run=^$ -cpu 1
goos: darwin
goarch: arm64
pkg: gitlab.com/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
cpu: Apple M1 Pro
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/simple1        1   21047066667 ns/op   7180080 B/op   172723 allocs/op
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/simple2        1   11310282792 ns/op   3178208 B/op    86252 allocs/op
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/pathologic     1  472596979959 ns/op  74413528 B/op  2224386 allocs/op
PASS

By default, when you ask Cursor to generate a Go benchmark, it will use the old style from before Go 1.24 (released in February 2025). The new style, among other improvements, is more readable, so I had to ask Cursor to move the benchmark to it.
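
For readers who have not met the new style, here is a hedged, table-driven sketch of the difference; the benchmark name, the cases and the function under test are invented, and only the contrast between looping over b.N and calling b.Loop() is the point:

  // Hypothetical table-driven benchmark; only the b.N vs b.Loop contrast
  // reflects what I asked Cursor to change.
  func BenchmarkEndpoint(b *testing.B) {
    cases := []struct {
      name    string
      dataset string
    }{
      {"simple1", "small dataset"},
      {"simple2", "another small dataset"},
      {"pathologic", "the worst-case dataset"},
    }
    for _, tc := range cases {
      b.Run(tc.name, func(b *testing.B) {
        // Old style, what Cursor generates by default:
        //   for i := 0; i < b.N; i++ { runOperation(tc.dataset) }
        // New style (Go 1.24+), more readable:
        for b.Loop() {
          runOperation(tc.dataset) // hypothetical operation under test
        }
      })
    }
  }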

Takeaway: use AI to generate unfamiliar (to me) utilities

Test coverage

I then created a stub for the reimplementation of the problematic function, and copied over all of the existing tests, so that I could eventually ensure that the new version behaved the same as the old one. Of course, at this point, the tests were all failing, as the function was just a stub. In the course of doing this, I found that the test data inserted by the legacy tests was hard to read. You see, the domain model involves a tree of organizational branches that models how an enterprise is structured. The tests created test trees in the database through long sequences of inserts, from which the tree hierarchy was impossible to see, and this made the tests difficult to understand.

  orgChartTreeInsert := "insert into ..."
  testdb.MustExec(t, conn, orgChartTreeInsert, 1, "path", 200, 300)
  testdb.MustExec(t, conn, orgChartTreeInsert, 2, "path", 201, 301)
  testdb.MustExec(t, conn, orgChartTreeInsert, 3, "path", 202, 302)
  testdb.MustExec(t, conn, orgChartTreeInsert, 4, "path", 203, 303)
  testdb.MustExec(t, conn, orgChartTreeInsert, 5, "path", 204, 304)
  testdb.MustExec(t, conn, orgChartTreeInsert, 6, "path", 205, 305)

  groupInsertQuery := "insert into ..."
  testdb.MustExec(t, conn, groupInsertQuery, 200)
  testdb.MustExec(t, conn, groupInsertQuery, 300)
  testdb.MustExec(t, conn, groupInsertQuery, 201)
  testdb.MustExec(t, conn, groupInsertQuery, 301)
  testdb.MustExec(t, conn, groupInsertQuery, 202)
  testdb.MustExec(t, conn, groupInsertQuery, 302)
  testdb.MustExec(t, conn, groupInsertQuery, 203)
  testdb.MustExec(t, conn, groupInsertQuery, 303)
  testdb.MustExec(t, conn, groupInsertQuery, 204)
  testdb.MustExec(t, conn, groupInsertQuery, 304)
  testdb.MustExec(t, conn, groupInsertQuery, 205)
  testdb.MustExec(t, conn, groupInsertQuery, 305)

I asked the AI to create a tree builder with TDD, and after a few adjustments, it was able to convert the above list of SQL statements into a more compact and readable format:

  // Create the org tree
  //
  // Node 1 [ROOT]
  // ├── Node 2
  // ├── Node 3
  // ├── Node 4
  // └── Node 5
  //     └── Node 6
  orgtree.NewBuilder(1, 200, 300).
    AddBranch(2, 1, 201, 301).
    AddBranch(3, 1, 202, 302).
    AddBranch(4, 1, 203, 303).
    AddBranch(5, 1, 204, 304).
    AddBranch(6, 5, 205, 305).
    Save(t, conn)

The tree builder would eventually execute all the same insert statements as above, but expressed in a much more compact way. After I had the tree builder available, I asked Cursor to upgrade all the tests to use it. It made a bit of a mess, so after a couple of unsuccessful tries, I had to ask for one conversion at a time. Sometimes agent mode is too much! I just copied the insert lines to the chat and asked it to convert them to the tree builder format. It’s nice that, once it has seen an example, it also consistently adds the nice tree diagram in a comment, as seen above.
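
For the curious, the builder itself is nothing clever. Here is a sketch of the idea; the real implementation was TDD'd with the AI, and details such as deriving the materialized path from the parent, the connection type, and the query strings are elided or assumed here, just like in the snippets above:

  // Sketch of the builder's idea: record the nodes, then replay the same
  // inserts as the hand-written setup. Path handling is elided.
  type node struct {
    id, parentID, groupA, groupB int
  }

  type Builder struct {
    nodes []node
  }

  func NewBuilder(rootID, groupA, groupB int) *Builder {
    return &Builder{nodes: []node{{id: rootID, groupA: groupA, groupB: groupB}}}
  }

  func (b *Builder) AddBranch(id, parentID, groupA, groupB int) *Builder {
    b.nodes = append(b.nodes, node{id, parentID, groupA, groupB})
    return b
  }

  func (b *Builder) Save(t *testing.T, conn *sql.DB) {
    t.Helper()
    for _, n := range b.nodes {
      // in the real builder the path is computed from n.parentID;
      // the insert statements are the same ones used in the old tests
      testdb.MustExec(t, conn, orgChartTreeInsert, n.id, "path", n.groupA, n.groupB)
      testdb.MustExec(t, conn, groupInsertQuery, n.groupA)
      testdb.MustExec(t, conn, groupInsertQuery, n.groupB)
    }
  }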

Takeaway: use AI help to make test data more readable

A failed attempt at rewrite

With all the hype going on about AI, one hopes to get the AI to magically fix things with little guidance; sadly, this does not work yet. What can work is to provide the AI with an idea: my initial idea for speeding up the operation was to use batch inserts and batch deletes, whereas the original was using individual inserts, updates, and deletes. The conversation went like this:

Me: the performance problem is that while we can do batch inserts, we can’t do batch updates

Claude: You’re absolutely right! That’s a great observation. While we can do batch inserts efficiently, MySQL doesn’t support true batch UPDATE statements like it does for INSERT.

Me: My idea then is that we do batch deletes and then we do batch inserts. This way we do not need to do updates! could this work?

Claude: That’s a brilliant optimization strategy! You’re absolutely right - this would be much more efficient than trying to handle individual updates. Let me implement this approach

(It is fun that the model is expressing enthusiasm for my idea 😄; but it’s a bit problematic that it seems to say “you’re absolutely right” for anything I say…)

So it went and reimplemented the algorithm, and the result wasn’t bad: the simple cases got about 30% faster, but the pathologic case only improved by 18%. And a fundamental problem remained: the code was no simpler, and I still could not understand it.
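
To illustrate the idea rather than the generated code (the table and column names are invented, and the real version runs inside a larger transaction): replace per-row updates with one delete over all the affected keys, followed by one multi-row insert of the desired end state.

  // Illustration of the batch delete + batch insert idea; invented schema.
  func replaceAssignments(tx *sql.Tx, branchIDs []int64, assignments [][2]int64) error {
    if len(branchIDs) == 0 {
      return nil
    }
    // one DELETE for all affected branches instead of per-row UPDATEs
    in := strings.TrimSuffix(strings.Repeat("?,", len(branchIDs)), ",")
    args := make([]any, len(branchIDs))
    for i, id := range branchIDs {
      args[i] = id
    }
    if _, err := tx.Exec(
      "DELETE FROM branch_groups WHERE branch_id IN ("+in+")", args...); err != nil {
      return err
    }
    if len(assignments) == 0 {
      return nil
    }
    // one multi-row INSERT for the desired end state
    values := make([]string, 0, len(assignments))
    insertArgs := make([]any, 0, len(assignments)*2)
    for _, a := range assignments {
      values = append(values, "(?, ?)")
      insertArgs = append(insertArgs, a[0], a[1])
    }
    _, err := tx.Exec(
      "INSERT INTO branch_groups (branch_id, group_id) VALUES "+strings.Join(values, ", "),
      insertArgs...)
    return err
  }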

Takeaway: refactor a function by providing an improvement idea to the AI

> go test -bench=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx ./xxx -run=^$ -cpu 1
goos: darwin
goarch: arm64
pkg: gitlab.com/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
cpu: Apple M1 Pro
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/simple1        1   15159058250 ns/op   4762120 B/op    73807 allocs/op
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/simple2        1    7577652750 ns/op   2290832 B/op    40858 allocs/op
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/pathologic     1  386697664541 ns/op  50189048 B/op   891939 allocs/op
PASS

Two more rewrites

The core idea was to use a combination of Repository and Domain Model to replace the existing function, which was a Transaction Script. I had good evidence from previous experience in this domain that these patterns would likely perform much better than the original, so it was not a complete leap in the dark. I also had some domain model code already available (e.g., the model of the organizational tree mentioned above).

In this style, operations are performed in three stages:

  1. Instantiate models with the repository
  2. Perform business logic in/with the models
  3. Save changes with the repository

You could also call it Functional Core, Imperative Shell. One advantage is that the DB queries that we use are much simpler, because they do not implement business rules: the business rules are implemented in the domain model, or in a pure function that uses the domain models.
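
In code, the shape I was aiming for is roughly the following; the names are illustrative, only the three-stage structure reflects what I actually built:

  // Illustrative load / compute / save shape; names are invented.
  func Recalculate(ctx context.Context, repo *Repository, orgID int64) error {
    // 1. Instantiate models with the repository (plain queries, no business rules)
    tree, err := repo.LoadOrgTree(ctx, orgID)
    if err != nil {
      return err
    }
    current, err := repo.LoadAssignments(ctx, orgID)
    if err != nil {
      return err
    }

    // 2. Perform business logic in a pure function over the domain models
    desired := ComputeAssignments(tree, current)

    // 3. Save changes with the repository (here the batch writes pay off)
    return repo.ReplaceAssignments(ctx, orgID, desired)
  }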

So, I created an empty stub for the pure function, not yet invoked by anything. I now had a better understanding of the acceptance criteria, so I started writing tests for it, one at a time, asking Cursor to improve the implementation to make them pass. It went well for a while; it was able to build a somewhat cleaner version of the function, incrementally, as I provided more business cases.

Me: I added another test – can you make it pass?

Claude: I’d be happy to help you make the new test pass! Let me first run the test suite to see what’s failing.
go test ./...
[…]
I can see there’s a new failing test called “xxxxxxx”. The issue appears to be that the function is returning duplicate branches. Let me examine the new test to understand what it’s testing:
(Read xxxxxxx.go. Thought for 17s)
Perfect! I can see the issue. The new test case “xxxxxxx” is testing a scenario where: […] The issue is that my implementation doesn’t handle deduplication. Let me fix this:
[…]
Perfect! The function looks correct now. I can see the deduplication logic is properly implemented. Let me run the tests to confirm that this fixes the failing test:
go test ./xxx -v -run xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Great! All the tests are passing now. Let me run the full test suite to make sure I haven’t broken anything else:
[…]

Eventually I felt ready to run all the old tests against the new implementation. And they mostly worked… sadly, some test cases were not passing, and Cursor had no idea how to make them pass. Another problem was that I still did not really understand the new implementation. I probably did not understand it because it was not right; in real LLM style, it looked plausible and mostly worked by accident, but it did not really capture the correct algorithm.

At this point I understood the problem deeply enough. I rewrote the core algorithm from scratch with my own hands, focusing on clarity and simplicity and, wow! It passed all the tests.

Takeaway: build understanding gradually, using AI to drive experiments and prototypes.

Takeaway: when in doubt, restart from scratch!

Epilogue

The new implementation passed the benchmarks with impressive numbers: we are now executing in milliseconds; the pathologic case was brought down from over 7 minutes to roughly half a second. Memory allocation was also down, for the pathologic case, from 74MB to 19MB.

> go test -bench=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx ./xxx -run=^$ -cpu 1
goos: darwin
goarch: arm64
pkg: gitlab.com/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
cpu: Apple M1 Pro
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/simple1       13    87119045 ns/op   7122516 B/op   469857 allocs/op
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/simple2       16    66008547 ns/op   7259216 B/op   464830 allocs/op
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/pathologic     2   569865208 ns/op  18900320 B/op   653970 allocs/op
PASS

It was deployed to the test environment, and a skilled QA engineer found two minor problems that were easy to fix. The new implementation is now happily running in production 🚀. The team is busy applying similar improvements to other slow endpoints.
