Kiro for Test-Driven Development (TDD)

You’ve spent years mastering TDD’s red-green-refactor rhythm. Now AI coding assistants write code instantly. Should you abandon the discipline that made you a better developer, or do TDD and AI work together?

In this post I will use Kiro – an AI-powered IDE that enables developers to build software from prototype to production through spec-driven development, intelligent agent assistance, and automated workflows.

Before we see how AI handles TDD, let’s revisit the cycle that makes it work: Red (write a failing test), Green (make it pass with minimal code), Refactor (clean up while keeping tests green). This rhythm isn’t just methodology—it’s the discipline that prevents us from writing code we don’t need.

For my experiment I initially chose Roy Osherove’s “String Calculator” TDD kata. However, using such a well-known kata means the LLM tends to solve it by recalling one of the many implementations that already exist out there. So, before starting, I asked my trusty AI to suggest a similar problem it hadn’t seen before, and I present to you the Number Parser Kata:

Objective: Create a function that parses written numbers in English and returns their numeric sum.

  • Parse written numbers (“one”, “two”, “three”) and return their sum
  • Progressively add: handling “and” connectors, compound numbers (“twenty-three”), negatives (“minus five”)
  • Examples
    • parse("one") → 1
    • parse("five") → 5
    • parse("one two") → 3
    • parse("one and two") → 3
    • parse("thirteen and fifteen") → 28
    • parse("minus five") → -5
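To make the target concrete, here is a hypothetical end-state sketch (not from the post, and deliberately not how TDD arrives at it): one compact implementation that satisfies every example above, useful as a reference point for whatever the red-green-refactor cycles eventually converge on.

```python
# Assumed word-to-value table; only the words the kata's examples need.
WORDS = {
    "one": 1, "two": 2, "three": 3, "five": 5,
    "thirteen": 13, "fifteen": 15, "twenty": 20,
}

def parse(numbers: str) -> int:
    total, sign = 0, 1
    # Compound numbers like "twenty-three" are split into their parts,
    # which the summing loop then handles for free.
    for word in numbers.replace("-", " ").split():
        if word == "and":    # "and" is a connector, not a value
            continue
        if word == "minus":  # negate the next number word
            sign = -1
            continue
        total += sign * WORDS[word]
        sign = 1
    return total
```

For example, `parse("thirteen and fifteen")` returns 28 and `parse("twenty-three")` returns 23.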

You get the point. As a starting point, I wanted to see whether providing a clear task would lead Kiro to write the solution test-first.

“Teaching” Kiro about TDD using Steering files

For this TDD experiment, I created a steering file that explicitly taught Kiro the red-green-refactor cycle, ensuring it would write minimal failing tests first rather than jumping ahead to complete solutions.

Steering files are instruction documents you place in your project that teach Kiro your team’s coding practices and workflows—steering files encode your development methodology directly into the AI’s behavior.

Since I was working on a greenfield application, I also added steering files outlining the tech stack (Python), the project structure, and testing best practices.

[Screenshot] The Kiro workspace, with its SPECS, AGENT HOOKS, and AGENT STEERING & SKILLS panels, listing the steering files: project-structure, tdd-workflow, tech, and test-standards.

Here’s the key steering file that teaches Kiro TDD principles. Notice the “What ‘Minimum’ Means” section: it was the crucial part that prevents Kiro from jumping ahead with the implementation, writing one test and then the complete solution instead of working step by step.

---
inclusion: always
---
# TDD Workflow
## Philosophy
This project follows strict Test-Driven Development (TDD). All code must be written in response to a failing test. No production code exists without a corresponding test that drove its creation.
## The Red-Green-Refactor Cycle
- **Red**: Write a failing test that describes the desired behavior in plain English
- **Green**: Write the minimum production code to make the test pass — nothing more
- **Refactor**: Clean up the code while keeping all tests green
## Rules
- Never write production code before a failing test exists
- Tests must fail for the right reason before implementing
- Implement only what is needed to pass the current failing test
- After each green phase, consider if refactoring is needed
- All behavior must be described in plain English before generating a test
## What "Minimum" Means
**CRITICAL**: "Minimum" means the simplest possible code that makes ONLY the current test pass.
Examples of what NOT to do:
- ❌ Test checks "one" returns 1 → Don't implement a dictionary with "one" through "ten"
- ❌ Test checks addition of two numbers → Don't implement multiplication, division, etc.
- ❌ Test checks parsing a single word → Don't implement comma-separated parsing
Examples of correct minimal implementations:
- ✅ Test checks "one" returns 1 → Use `if numbers == "one": return 1`
- ✅ Test checks "two" returns 2 → Add `elif numbers == "two": return 2`
- ✅ After 3+ similar cases → Refactor to use a dictionary (driven by duplication, not anticipation)
**The Golden Rule**: If you can delete code and the test still passes, you wrote too much code.
**Resist the urge to be "clever" or "complete"**. Let the tests drive every single line of production code. Premature generalization violates TDD principles.
## Cycle Prompt Pattern
When asked to implement a feature, always:
1. First generate a failing test for ONE specific behavior
2. Confirm the test fails
3. Then generate the minimal implementation (see "What Minimum Means" above)
4. Confirm all tests pass
5. Suggest refactoring opportunities (only if duplication exists)
6. Review request and existing tests to find if additional tests are needed
- If at least one more test is needed, start another cycle by writing a failing test
- If not, declare that requirement was met
Break each feature into individual tasks for each step in the TDD lifecycle
## Self-Check Before Implementing
Before writing production code, ask yourself:
1. What is the EXACT assertion in the failing test?
2. What is the SIMPLEST code that makes that assertion pass?
3. Am I implementing anything the test doesn't verify?
4. If I remove this line, does the test still pass? (If yes, delete it)

Running the Red-Green-Refactor cycle

Now that I had my initial setup, I started with the first requirement:

Create a function that parses written numbers in English and returns their numeric sum. The method signature will be: int Add(string numbers).

If a single number is written, then the output should be that number’s numeric value.

And it worked! Kiro jumped in and created a first failing test, followed by a trivial implementation.

[Screenshot] Test for the parser’s Add function, validating that the input ‘one’ returns 1.
[Screenshot] The Kiro setup for the Number Parser in Python, from left to right: files, tests, implementation, and the agentic chat.
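The screenshots don’t reproduce well here, so this is roughly what that first cycle looked like, reconstructed from the captions (file layout and exact names are my assumptions):

```python
# parser.py -- Green: the minimal code that makes the test pass.
# Note there is deliberately no handling of any other input yet;
# the single failing test drove exactly this much code and no more.
def add(numbers: str) -> int:
    if numbers == "one":
        return 1

# test_parser.py -- Red: the first failing test Kiro wrote.
def test_add_single_number_one():
    assert add("one") == 1
```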

Kiro then continued, implementing all numbers between 1 and 10 as well as tests for the sum of two numbers, which was exactly the test I expected: trivial and simple:

[Screenshot] Python test function for adding multiple written number words, asserting that ‘one two’ equals 3.
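A hypothetical sketch of where the cycles land at this point: the repeated if/elif branches for “one” through “ten” have been refactored into a dictionary (driven by duplication, not anticipation), and the new sum test drives splitting the input on whitespace.

```python
# Assumed shape after the refactor step; names are illustrative.
NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
           "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def add(numbers: str) -> int:
    # Sum every recognized word in the input.
    return sum(NUMBERS[word] for word in numbers.split())

def test_add_two_numbers():
    assert add("one two") == 3
```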

All the while continuing the cycle of Red-Green-Refactor.

As Kiro ran, I noticed that while the code was refactored and improved, the tests remained the same. After Kiro finished, I asked for a test refactor and updated Kiro’s Steering file to ensure both tests and code would be refactored going forward.

Once all cycles were complete, I verified no additional tests were needed by requesting a quick test review. Kiro did jump ahead at one point—creating a parameterized test for all numbers between 4 and 10—but as someone who’s trained countless developers in TDD, this is a common “human behavior” if I ever saw one.
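For illustration, this is the kind of parameterized test that “jumping ahead” produces (a hypothetical reconstruction, not Kiro’s actual output): several values covered in one shot instead of one failing test per red-green cycle.

```python
import pytest

# Stand-in for the real parser, just enough for this test to run.
NUMBERS = {"four": 4, "five": 5, "six": 6, "seven": 7,
           "eight": 8, "nine": 9, "ten": 10}

def add(numbers: str) -> int:
    return NUMBERS[numbers]

# One parameterized test covering "four" through "ten" at once --
# efficient, but it skips the discipline of one test per cycle.
@pytest.mark.parametrize("word, expected", sorted(NUMBERS.items(), key=lambda kv: kv[1]))
def test_add_single_word(word, expected):
    assert add(word) == expected
```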

If you want to see the whole run (6 min 50 sec), you can watch it on YouTube:

Conclusion

This experiment answered my initial question: Yes, TDD remains valuable in the age of AI coding assistants—not despite AI’s capabilities, but because of them.

Kiro followed the red-green-refactor cycle when guided by steering files that encoded TDD principles. It wrote failing tests first, implemented minimal solutions, and refactored code while keeping tests green. The Number Parser Kata demonstrated that AI practices disciplined development when properly instructed.

Kiro occasionally “jumped ahead” with parameterized tests—a behavior I’ve seen countless times when training human developers in TDD. This reinforced an important insight: we don’t abandon the practices that made us better developers in the world of agentic coding. Instead, we encode them into how we guide our AI tools.

Some argue AI’s ability to generate complex code with tests makes TDD obsolete. This experiment suggests otherwise: TDD provides the verification layer that ensures AI-generated code actually implements the specified behavior correctly. As AI coding assistants increase in capability, the discipline of TDD becomes more important, not less—it’s the guardrail that keeps us from accepting code that compiles but doesn’t solve the right problem.

Try this yourself: use the steering file from earlier in this post and give TDD with AI a spin.
