AI Agents as Senior Engineers? What Senior SWE-Bench Actually Reveals
Frontier models score under 25% on Senior SWE-Bench. A practitioner's take on correctness, taste and practice adherence — with concrete .NET examples for grading your own agents.
Jean-Pierre Broeders
Freelance .NET Developer
"We treat agents like senior engineers, so why evaluate them like junior engineers?" That single line, front and centre on the Senior SWE-Bench landing page, captures exactly the tension I feel with clients every day. The benchmark hit the Hacker News front page this week, and the number everyone quoted was sobering: even the best frontier model scores a "tasteful solve" rate of just 24%. That means it fails to finish a task at senior level more than three-quarters of the time.
As a freelance .NET developer who now works with coding agents daily, I don't read that as a reason for cynicism but as a reason for precision. The question is no longer "can an agent write code" — it can, impressively well. The question is: can an agent write code you'd happily push through a senior review? Senior SWE-Bench tries to measure precisely that, and how it does so is more instructive than the headline figure.
What makes Senior SWE-Bench different?
The original SWE-bench and variants like SWE-bench Pro test agents on isolated GitHub issues with sharply defined acceptance tests. That is useful, but it looks nothing like my work. Nobody hands me a ticket with a ready-made test suite and the exact files to touch.
Senior SWE-Bench flips that approach in three ways that genuinely matter:
- Realistic instructions. The tasks "read like natural language messages rather than over-specified requirements", with a median instruction length 31% shorter than SWE-bench Pro. In other words: under-specification, just like real life. The agent has to reconstruct the intent itself.
- Real scope. Tasks span an average of 11 files per feature and require hundreds of execution steps. These are not one-liners but multi-phase jobs sourced from actual production PRs.
- Adaptive validation. Instead of rigid tests, a validation agent uses "expert-designed recipes to write behavioral tests that adapt to the submitted solution". There isn't one correct diff; there are infinitely many valid solutions, exactly as with real code.
The dataset consists of 50 public and 50 private tasks across Python, Go, TypeScript, SQL and Rust. C# is absent — a pity — but the dimensions are language-agnostic, and it's those dimensions that make this benchmark interesting for .NET folks.
Correctness, taste and practice adherence
Senior SWE-Bench grades along three axes. I recognise every one of them from my own reviews.
- Correctness — runtime verification and behavioural tests. Does it do what it should?
- Taste — quality metrics based on the conventions of the codebase, including a code-bloat limit (less than 2× the size of a human solution).
- Practice adherence — alignment with the repository's observed conventions (a threshold above 2 out of 5).
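The three thresholds above combine into a single pass/fail gate. Here is a minimal sketch of that gate, assuming the limits quoted in the list. `TaskResult` and `IsTastefulSolve` are hypothetical names for illustration, not part of the benchmark's tooling, and measuring size in added lines is my assumption:

```csharp
// Hypothetical shape of one graded task; field names are illustrative only.
public sealed record TaskResult(
    bool BehaviouralTestsPassed,   // correctness axis
    int AgentSolutionSize,         // assumed unit: lines added by the agent
    int HumanSolutionSize,         // assumed unit: lines added in the human solution
    double PracticeScore);         // practice adherence on a 0-5 scale

public static class TastefulSolve
{
    // A task counts only when all three axes clear their bar.
    public static bool IsTastefulSolve(TaskResult r) =>
        r.BehaviouralTestsPassed
        && r.AgentSolutionSize < 2 * r.HumanSolutionSize   // bloat limit: under 2x
        && r.PracticeScore > 2.0;                          // threshold: above 2 out of 5
}
```

The point of writing it as a conjunction: a solution that passes every test still counts as a miss if it is bloated or ignores the house conventions.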
The leaderboard numbers speak volumes:
| Model | Tasteful solve rate |
|---|---|
| Claude Opus 4.8 | 24.0% |
| Claude Sonnet 5 | 19.4% |
| GPT-5.5 | 16.0% |
Note what's being measured here. Not "is the code wrong", but "is the code wrong, ugly or un-idiomatic". That is exactly the difference between a junior who ships a green test suite and a senior who ships code the team is still happy with in five years.
Why "taste" is the hardest thing for an agent
Correctness is objective and therefore relatively easy to enforce: write tests, make them pass. Taste is contextual. Take a simple example I see regularly in agent-generated C#. The task: "filter active users and return their email addresses." An agent happily produces this:
```csharp
public List<string> GetActiveUserEmails(List<User> users)
{
    List<string> result = new List<string>();
    for (int i = 0; i < users.Count; i++)
    {
        if (users[i].IsActive == true)
        {
            if (users[i].Email != null)
            {
                result.Add(users[i].Email);
            }
        }
    }
    return result;
}
```
Correct? Yes. The tests go green. But in a codebase that is otherwise fully LINQ-idiomatic, this is tasteless: == true, nested ifs, a manual loop, a concrete List<string> return type. A senior writes:
```csharp
public IReadOnlyList<string> GetActiveUserEmails(IEnumerable<User> users) =>
    users
        .Where(u => u.IsActive && u.Email is not null)
        .Select(u => u.Email!)
        .ToList();
```
Both pass the same behavioural tests. Only the second fits the conventions of the house. This is exactly the gap the taste axis of Senior SWE-Bench tries to quantify, and it's immediately clear why models trip over it: there is no green checkmark for "fits us".
The under-specification is the real exam
What appeals to me most is the choice of short, ambiguous instructions. In my experience, something like 70% of senior work is understanding the task, not typing it. An instruction like "make the webhook processing idempotent" carries a mountain of implicit knowledge: which dedup key, how long you keep it, what you do on a race, how you log a duplicate delivery.
An agent that takes that instruction literally might build a naïve in-memory set. A senior knows that breaks under load and across multiple replicas. A tasteful .NET solution leans on the data store as the source of truth:
```csharp
public async Task<bool> TryProcessAsync(string eventId, CancellationToken ct)
{
    // Idempotency enforced by a unique constraint on EventId, not by an
    // in-memory set that is empty after a deploy or across replicas.
    var record = new ProcessedEvent { EventId = eventId, ProcessedAt = DateTimeOffset.UtcNow };
    _db.ProcessedEvents.Add(record);
    try
    {
        await _db.SaveChangesAsync(ct);
        return true; // first time: continue processing
    }
    // IsUniqueViolation() is not an EF Core API; it is a small helper you
    // write yourself for your database provider.
    catch (DbUpdateException ex) when (ex.IsUniqueViolation())
    {
        // Already processed: stop tracking the failed insert so a later
        // SaveChangesAsync on this context does not retry it, then report
        // the duplicate without side effects.
        _db.Entry(record).State = EntityState.Detached;
        return false;
    }
}
```
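The `ex.IsUniqueViolation()` call in the filter above is not something EF Core ships. One possible sketch of such a helper follows. It assumes a provider that reports the standard SQLSTATE through `DbException.SqlState`, as PostgreSQL providers do, where 23505 means unique violation. The class name `DbExceptionExtensions` is hypothetical. On SQL Server you would instead check `SqlException.Number` for 2601 or 2627.

```csharp
using System;
using System.Data.Common;

public static class DbExceptionExtensions
{
    // SQLSTATE 23505 is "unique_violation" in PostgreSQL.
    // Assumption: the provider populates DbException.SqlState.
    private const string UniqueViolationSqlState = "23505";

    // Walks the InnerException chain, because EF Core wraps the provider
    // error inside DbUpdateException.
    public static bool IsUniqueViolation(this Exception ex)
    {
        for (var current = ex; current is not null; current = current.InnerException)
        {
            if (current is DbException { SqlState: UniqueViolationSqlState })
            {
                return true;
            }
        }
        return false;
    }
}
```

Declaring the extension on `Exception` rather than on `DbUpdateException` keeps the sketch free of an EF Core reference; it still resolves for the `catch` filter above, since `DbUpdateException` derives from `Exception`.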
This is the kind of trade-off thinking that explains a 24% score. The model that executes the instruction literally scores on correctness but sinks on practice adherence and taste the moment the reviewer asks: "and how does this behave across three pods?"
What this means for your .NET workflow
I don't want you to conclude that agents are useless; quite the opposite. I ship measurably faster with agents. The lesson is that you have to build the senior-review harness, because the agent won't do it for you. Here are three concrete measures I use with clients.
1. Make "taste" explicit and machine-enforceable
A large part of what Senior SWE-Bench calls "practice adherence" can be captured in analyzers. Set the rules in .editorconfig to error, not suggestion, and set EnforceCodeStyleInBuild to true in the project file, because the IDE-prefixed code-style rules are otherwise skipped during dotnet build:
```ini
# .editorconfig
[*.cs]
# CA1826: don't use LINQ where a property exists
dotnet_diagnostic.CA1826.severity = error
# IDE0058: unused expression value
dotnet_diagnostic.IDE0058.severity = error
csharp_style_prefer_pattern_matching = true:error
dotnet_style_prefer_collection_expression = true:error
```
What humans call "taste" thus partly becomes a build error. The agent gets immediate feedback and the reviewer doesn't have to nitpick.
2. Treat every agent PR as a junior PR with a stricter gate
Never let an agent merge straight to main. A GitHub Actions workflow can mirror the three axes of Senior SWE-Bench:
```yaml
name: agent-pr-gate
on:
  pull_request:
    branches: [main]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history, so origin/main...HEAD has a merge base
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '9.0.x'
      # Correctness
      - run: dotnet test --configuration Release --logger trx
      # Taste + practice adherence: analyzers as errors
      - run: dotnet build --configuration Release -warnaserror
      # Rough stand-in for the Senior SWE-Bench code-bloat limit (<2x):
      # a fixed line budget, since CI has no human reference solution
      - name: Diff size guard
        run: |
          ADDED=$(git diff --numstat origin/main...HEAD | awk '{s+=$1} END {print s+0}')
          echo "Lines added: $ADDED"
          if [ "$ADDED" -gt 400 ]; then
            echo "::warning::Large diff. Ask for a smaller, more targeted change."
          fi
```
The diff size guard is a deliberate echo of the benchmark's code-bloat limit: a solution twice the size it needs to be is rarely the tasteful solution.
3. Write behavioural tests, not implementation tests
Senior SWE-Bench's adaptive validation tests behaviour, not a specific diff. Copy that. Tests that lock onto internal implementation break on every refactor the agent (or you) makes. Test the observable outcome:
```csharp
[Fact]
public async Task Duplicate_delivery_causes_no_second_side_effect()
{
    var handler = new WebhookHandler(_db, _emailSpy);

    await handler.HandleAsync(EventWith(id: "evt_123"));
    await handler.HandleAsync(EventWith(id: "evt_123")); // exact replay

    _emailSpy.SentCount.Should().Be(1); // behaviour, not implementation
}
```
A test like this leaves the agent free on the how but keeps the bar on the what. Exactly the philosophy the benchmark adopts.
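The `_emailSpy` in that test can be a very small test double. Here is a sketch of one possible shape. `IEmailSender` and its `SendAsync` signature are assumptions for illustration, since the real abstraction depends on your codebase:

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical abstraction the webhook handler depends on.
public interface IEmailSender
{
    Task SendAsync(string to, string subject, CancellationToken ct = default);
}

// Test double that only counts calls; it sends nothing.
public sealed class EmailSpy : IEmailSender
{
    private int _sentCount;

    // Number of SendAsync calls observed so far.
    public int SentCount => Volatile.Read(ref _sentCount);

    public Task SendAsync(string to, string subject, CancellationToken ct = default)
    {
        Interlocked.Increment(ref _sentCount);
        return Task.CompletedTask;
    }
}
```

Because the spy observes only the outgoing side effect, the same test keeps passing whether the handler deduplicates through a unique constraint, a distributed lock or something else entirely.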
The sober conclusion
Senior SWE-Bench is not proof that AI fails. It's an honest ruler showing where the boundary sits today: agents are strong on correctness, weak on taste and practice adherence, and that gap is widest when the task is under-specified, which is precisely where seniority shows. A 24% tasteful solve rate isn't a failure; it's a snapshot of a curve that is climbing fast.
For me, day-to-day practice doesn't change fundamentally: I keep using the agent for speed, but responsibility for taste, conventions and trade-offs still rests firmly with me. The smartest thing you can do right now is not to keep that judgement in your head but to encode it — in analyzers, in gates, in behavioural tests. Then the agent doesn't become a replacement for the senior, but a stunningly fast junior working inside a harness that enforces your taste.
Want to dig into the details of the benchmark? The Hacker News discussion and the benchmark itself are both worth your time.
