Generating Tests with AI: Smart or Just Lazy?

AI can write unit tests and integration tests for you. But are those tests actually useful? A practical look at AI-generated test code.

Jean-Pierre Broeders

Freelance DevOps Engineer

February 28, 2026 · 6 min. read


Writing test code is the unloved chore of software development. Everyone knows it has to be done, nobody wants to do it. That's exactly why it's one of the first things AI code generation gets thrown at. But does it actually produce usable tests?

Where AI-generated tests shine

For simple utility functions and pure functions, AI test generation works surprisingly well. Feed it a function, and out comes a set of tests covering edge cases a human might not immediately think of.

Take this straightforward validator:

// Requires using System.Linq and System.Numerics.
public static bool IsValidIban(string iban)
{
    if (string.IsNullOrWhiteSpace(iban)) return false;
    iban = iban.Replace(" ", "").ToUpper();
    if (iban.Length < 15 || iban.Length > 34) return false;
    // Reject stray characters up front; BigInteger.Parse would throw on them.
    if (!iban.All(char.IsLetterOrDigit)) return false;
    
    // Move the first four characters (country code + check digits) to the end,
    // then expand letters to digits (A=10 … Z=35) for the ISO 13616 mod-97 check.
    var rearranged = iban[4..] + iban[..4];
    var numericIban = string.Concat(rearranged.Select(c => 
        char.IsLetter(c) ? (c - 'A' + 10).ToString() : c.ToString()));
    
    return BigInteger.Parse(numericIban) % 97 == 1;
}

What AI generates here is remarkably thorough: null input, empty strings, too short, too long, valid Dutch IBANs, invalid checksums, strings with spaces. That's easily ten tests that would otherwise need to be typed out by hand.
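A sketch of what that generated suite typically looks like, as parameterized xUnit tests. The `IbanValidator` class name is an assumption (the article shows only the method), and the IBAN values are illustrative — `NL91 ABNA 0417 1643 00` is the commonly used valid Dutch example:

```csharp
using System.Linq;
using System.Numerics;
using Xunit;

// The validator from the article, wrapped in a static class so the tests
// compile. A guard was added so BigInteger.Parse never throws on stray characters.
public static class IbanValidator
{
    public static bool IsValidIban(string iban)
    {
        if (string.IsNullOrWhiteSpace(iban)) return false;
        iban = iban.Replace(" ", "").ToUpper();
        if (iban.Length < 15 || iban.Length > 34) return false;
        if (!iban.All(char.IsLetterOrDigit)) return false;

        var rearranged = iban[4..] + iban[..4];
        var numericIban = string.Concat(rearranged.Select(c =>
            char.IsLetter(c) ? (c - 'A' + 10).ToString() : c.ToString()));

        return BigInteger.Parse(numericIban) % 97 == 1;
    }
}

public class IsValidIbanTests
{
    [Theory]
    [InlineData(null)]                      // null input
    [InlineData("")]                        // empty string
    [InlineData("   ")]                     // whitespace only
    [InlineData("NL91ABNA041716")]          // too short (14 characters)
    [InlineData("NL00ABNA0417164300")]      // invalid check digits
    public void IsValidIban_Rejects_InvalidInput(string iban)
        => Assert.False(IbanValidator.IsValidIban(iban));

    [Theory]
    [InlineData("NL91ABNA0417164300")]      // valid Dutch IBAN
    [InlineData("NL91 ABNA 0417 1643 00")]  // same IBAN, with spaces
    public void IsValidIban_Accepts_ValidInput(string iban)
        => Assert.True(IbanValidator.IsValidIban(iban));
}
```

Two `[Theory]` methods with `[InlineData]` rows cover all ten-odd cases without ten copy-pasted test bodies.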

The problem with more complex scenarios

Once dependencies enter the picture — databases, external APIs, message queues — things change. AI generates tests that compile and pass, but don't actually test anything meaningful.

A typical example:

[Fact]
public async Task CreateOrder_ShouldReturnSuccess()
{
    var mockRepo = new Mock<IOrderRepository>();
    mockRepo.Setup(r => r.SaveAsync(It.IsAny<Order>()))
        .ReturnsAsync(true);
    
    var service = new OrderService(mockRepo.Object);
    var result = await service.CreateOrderAsync(new OrderRequest 
    { 
        ProductId = 1, 
        Quantity = 5 
    });
    
    Assert.True(result.Success);
}

Looks fine, right? But this test only verifies that when the mock returns true, the service also returns true. The actual business logic — stock checking, price calculation, discount rules — gets completely bypassed. The mock is configured so everything always succeeds.

These kinds of tests create a false sense of security. Coverage goes up, but they won't catch bugs.
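What a sharpened version could look like: make the mock describe a real-world constraint and assert on behavior, not on the mock's own return value. The types and the stock-checking dependency below are assumptions — the article never shows `OrderService`'s internals — so this is a sketch of the pattern, not the actual service:

```csharp
using System.Threading.Tasks;
using Moq;
using Xunit;

// Minimal stand-ins for the article's types; shapes are assumed.
public class Order { public int ProductId { get; set; } public int Quantity { get; set; } }
public class OrderRequest { public int ProductId { get; set; } public int Quantity { get; set; } }
public class OrderResult { public bool Success { get; set; } }

public interface IOrderRepository { Task<bool> SaveAsync(Order order); }
public interface IStockService { Task<int> GetAvailableAsync(int productId); }

public class OrderService
{
    private readonly IOrderRepository _repo;
    private readonly IStockService _stock;
    public OrderService(IOrderRepository repo, IStockService stock) { _repo = repo; _stock = stock; }

    public async Task<OrderResult> CreateOrderAsync(OrderRequest request)
    {
        // The business rule the generated test never exercised: check stock first.
        if (await _stock.GetAvailableAsync(request.ProductId) < request.Quantity)
            return new OrderResult { Success = false };
        await _repo.SaveAsync(new Order { ProductId = request.ProductId, Quantity = request.Quantity });
        return new OrderResult { Success = true };
    }
}

public class CreateOrderTests
{
    [Fact]
    public async Task CreateOrder_WithInsufficientStock_Fails_WithoutSaving()
    {
        var mockRepo = new Mock<IOrderRepository>();
        var mockStock = new Mock<IStockService>();
        mockStock.Setup(s => s.GetAvailableAsync(1)).ReturnsAsync(2); // only 2 in stock

        var service = new OrderService(mockRepo.Object, mockStock.Object);
        var result = await service.CreateOrderAsync(new OrderRequest { ProductId = 1, Quantity = 5 });

        Assert.False(result.Success);
        // The assertion the generated test was missing: nothing gets persisted.
        mockRepo.Verify(r => r.SaveAsync(It.IsAny<Order>()), Times.Never);
    }
}
```

The difference is in the last two lines: the test pins down a consequence of the business rule (no save on insufficient stock) instead of echoing the mock's configuration back at itself.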

A better approach: AI as a starting point

What does work is using AI as a starting point and then sharpening things manually. Generate the boilerplate — the test class setup, the arrange-act-assert structure, the standard happy path. Then add the edge cases that require domain knowledge yourself.

For integration tests, a hybrid approach works well:

public class OrderIntegrationTests : IClassFixture<WebApplicationFactory<Program>>
{
    // Started once before the tests run, e.g. via IAsyncLifetime or a
    // collection fixture (omitted here for brevity).
    private static readonly PostgreSqlContainer _postgresContainer =
        new PostgreSqlBuilder().Build();

    private readonly HttpClient _client;
    
    public OrderIntegrationTests(WebApplicationFactory<Program> factory)
    {
        _client = factory.WithWebHostBuilder(builder =>
        {
            builder.ConfigureServices(services =>
            {
                // Swap the registered DbContext for one backed by a real
                // PostgreSQL instance running in a Testcontainers container
                services.RemoveAll<DbContextOptions<AppDbContext>>();
                services.AddDbContext<AppDbContext>(opts =>
                    opts.UseNpgsql(_postgresContainer.GetConnectionString()));
            });
        }).CreateClient();
    }

    [Fact]
    public async Task CreateOrder_WithInsufficientStock_Returns409()
    {
        // Seed: product with 2 items in stock (SeedProduct writes directly
        // through AppDbContext; implementation omitted)
        await SeedProduct(productId: 1, stock: 2);
        
        var request = new { ProductId = 1, Quantity = 5 };
        var response = await _client.PostAsJsonAsync("/api/orders", request);
        
        Assert.Equal(HttpStatusCode.Conflict, response.StatusCode);
    }
}

AI can handle the setup and structure just fine. That specific test — an order with insufficient stock should return a 409 — requires domain knowledge.

Mutation testing as a quality check

A useful trick to verify whether generated tests actually test something: mutation testing. Tools like Stryker.NET apply small changes to the source code (mutations) and check whether the tests then fail.

dotnet stryker --project OrderService.csproj

If a mutation survives — the source code was modified but all tests still pass — then the test suite doesn't really cover that piece of code. With AI-generated tests, typically 40-60% of mutations survive. With hand-written tests, that number is closer to 15-25%.

That difference speaks for itself.
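To make the surviving-mutant idea concrete: a classic Stryker mutation flips a boundary operator, say `<` into `<=` in a stock check. A hypothetical demonstration of why happy-path inputs let that mutant live (all names here are made up):

```csharp
using System;

public static class MutantDemo
{
    // Original business rule: reject when available stock is below the quantity.
    public static bool RejectOriginal(int available, int quantity) => available < quantity;

    // Stryker's mutant: "<" flipped to "<=".
    public static bool RejectMutant(int available, int quantity) => available <= quantity;

    public static void Main()
    {
        // Typical generated test inputs: clearly too little or clearly enough
        // stock. Both versions agree, so the mutant survives.
        Console.WriteLine(RejectOriginal(2, 5) == RejectMutant(2, 5));     // True
        Console.WriteLine(RejectOriginal(100, 5) == RejectMutant(100, 5)); // True

        // Only the exact boundary — stock equal to quantity — tells the two
        // apart. A test with this input kills the mutant.
        Console.WriteLine(RejectOriginal(5, 5) == RejectMutant(5, 5));     // False
    }
}
```

The generated suites tend to test far from the boundaries; the hand-written ones test at them, which is exactly where mutants die.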

When it's worth it and when it's not

Good use cases:

  • Pure functions and utilities
  • DTO validation
  • Serialization/deserialization tests
  • Test setup boilerplate
  • Generating parameterized tests for known input/output combinations

Better done manually:

  • Business logic with complex rules
  • Race conditions and concurrency tests
  • Security-related tests
  • Anything where the test should specify behavior rather than confirm it
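The serialization bullet from the first list is a good example of what AI can safely churn out: round-trip tests that are pure mechanics. A sketch with System.Text.Json (the DTO shape is made up for illustration):

```csharp
using System.Text.Json;
using Xunit;

// Hypothetical DTO — any shape works; the point is the round-trip pattern.
public record OrderDto(int ProductId, int Quantity, string? Note);

public class OrderDtoSerializationTests
{
    [Theory]
    [InlineData(1, 5, "rush order")]
    [InlineData(42, 0, null)]
    public void OrderDto_RoundTrips_ThroughJson(int productId, int quantity, string? note)
    {
        var original = new OrderDto(productId, quantity, note);

        var json = JsonSerializer.Serialize(original);
        var restored = JsonSerializer.Deserialize<OrderDto>(json);

        // Records give value equality, so one assert covers every property.
        Assert.Equal(original, restored);
    }
}
```

No domain knowledge required, no judgment calls — exactly the kind of test that costs a human ten boring minutes and an AI ten seconds.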

The coverage target trap

With AI, hitting 90%+ code coverage is temptingly easy. But coverage without quality is just a metrics game. A team with 80% coverage from thoughtful tests is better off than a team with 95% coverage where half of it is generated happy-path tests.

It's not about the number. It's about the confidence those tests give you when deploying on a Friday afternoon. And that confidence doesn't come from an AI that dutifully calls every method once.

Practical advice

Start with AI generation for the boring parts. The test fixtures, the builder patterns, the repetitive assertions. Spend the freed-up time writing the tests that actually matter — the edge cases that come from production incidents, the scenarios that only someone with domain knowledge can think of.

That's not laziness. That's using the tools available wisely.

Want to stay updated?

Subscribe to my newsletter or get in touch for freelance projects.

Get in Touch