How to Actually Measure the Impact of AI Tools on Development

09 Mar 26 · Benjamin Igna · 19 min read

Every engineering leader I talk to right now has the same problem: they’re spending real money on AI coding tools, the vendor decks promise a revolution, and when they ask their teams whether it’s working, they get a shrug. Or worse, they get anecdotes. “It feels faster.” “I think it’s helping.” Feelings are not a strategy. If you’re investing in AI-assisted development, you need to measure its impact with the same rigor you’d apply to any other engineering initiative. The good news: there’s a clear framework for doing this. The bad news: most organizations aren’t doing it yet. This article lays out a practical approach to measuring AI tool impact across three dimensions — utilization, impact, and cost — with concrete survey questions, metrics, and rollout advice you can use immediately.

Why Most Organizations Are Flying Blind

Here’s the uncomfortable truth: most companies rolled out Copilot (or Cursor, or Cody, or whatever their tool of choice is) with a lot of enthusiasm and very little measurement infrastructure. The typical approach looks like this: buy licenses, announce the rollout, wait three months, then ask “so… is it working?”

This is roughly equivalent to hiring fifty people and never checking whether they’re doing anything useful. You wouldn’t do that with headcount. Don’t do it with AI tools.

The DX research team, who’ve been doing some of the most rigorous work in this space, frame the challenge well: the organizations that succeed with AI don’t just deploy tools; they measure adoption, track impact, and optimize cost in a deliberate sequence.

The Three Dimensions of AI Measurement

Any serious measurement effort needs to cover three areas. Miss one, and you’re working with an incomplete picture.

1. Utilization: Are people actually using the tools?

This is where every organization should start. Before you can measure impact, you need to know whether developers are even adopting the tools you’re paying for. Research shows that even leading organizations are only reaching around 60% active usage of AI tools. That means 40% of your licenses might be collecting dust.

Key metrics at this stage include daily and weekly active users, the percentage of pull requests that involve AI assistance, and the share of committed code that was AI-generated. For organizations exploring agentic AI, you’d also track tasks assigned to agents.
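
To make this concrete, here’s a minimal sketch, in Python, of how these utilization numbers might be computed, assuming you can export a per-event usage log and a PR list with an AI-assistance flag. The field names and data shapes are hypothetical, not any vendor’s API:

```python
from datetime import date, timedelta

# Hypothetical export: one row per AI-tool usage event.
usage_events = [
    {"user_id": "dev-01", "date": date(2026, 3, 2)},
    {"user_id": "dev-02", "date": date(2026, 3, 2)},
    {"user_id": "dev-01", "date": date(2026, 3, 5)},
]

# Hypothetical PR export with a boolean "ai_assisted" flag.
pull_requests = [
    {"id": 101, "ai_assisted": True},
    {"id": 102, "ai_assisted": False},
    {"id": 103, "ai_assisted": True},
]

licensed_seats = 50

def weekly_active_users(events, week_start):
    """Distinct users with at least one event in the 7-day window."""
    week_end = week_start + timedelta(days=7)
    return {e["user_id"] for e in events if week_start <= e["date"] < week_end}

wau = weekly_active_users(usage_events, date(2026, 3, 2))
adoption_rate = len(wau) / licensed_seats
ai_pr_share = sum(pr["ai_assisted"] for pr in pull_requests) / len(pull_requests)

print(f"Weekly active users: {len(wau)} ({adoption_rate:.0%} of seats)")
print(f"AI-assisted PR share: {ai_pr_share:.0%}")
```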

The story of Intercom is instructive here: by nearly doubling adoption of their AI code assistants, they achieved a 41% increase in AI-driven developer time savings. Adoption isn’t a vanity metric — it’s the prerequisite for everything else.

2. Impact: Is it actually making a difference?

This is where things get interesting — and where most organizations get stuck. The most reliable approach combines direct and indirect metrics rather than relying on any single measure.

Direct metrics include AI-driven time savings (how many hours per developer per week) and developer satisfaction with the tools. These give you immediate, actionable signals.

Indirect metrics involve tracking broader engineering productivity over time — PR throughput, perceived rate of delivery, and developer experience indices. These surface the longer-term benefits and, critically, any hidden risks.

And here’s where I’d add a warning that too few people talk about: balance velocity with quality. AI can make you faster today while creating a maintainability nightmare tomorrow. If your code generation volume goes up but your change failure rate climbs alongside it, you haven’t gained anything; you’ve just shifted the pain to the future.
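
One way to operationalize that warning is a simple guard comparing a baseline window against the current one. The threshold, field names, and figures below are illustrative assumptions, not a standard:

```python
def velocity_quality_check(baseline, current, cfr_tolerance=0.02):
    """Flag the 'faster today, broken tomorrow' pattern: code output rising
    while the change failure rate climbs beyond a tolerated drift."""
    volume_up = current["loc_generated"] > baseline["loc_generated"]
    cfr_worse = (current["change_failure_rate"]
                 > baseline["change_failure_rate"] + cfr_tolerance)
    if volume_up and cfr_worse:
        return "Warning: velocity gains may be shifting pain to the future."
    return "No velocity/quality divergence detected."

# Illustrative numbers only.
baseline = {"loc_generated": 12_000, "change_failure_rate": 0.08}
current = {"loc_generated": 19_000, "change_failure_rate": 0.13}
print(velocity_quality_check(baseline, current))
```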

3. Cost: Is your investment paying off?

Once you’ve passed the initial rollout phase, you need to get serious about the economics. This means tracking total AI spend (both aggregate and per-developer), calculating the net time gain per developer (time savings minus the cost of AI tooling), and, for agentic tools, computing something like an “agent hourly rate” to compare against human-equivalent effort.
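
A worked sketch of both calculations, with every figure invented purely for illustration:

```python
# All numbers below are illustrative assumptions, not benchmarks.
hours_saved_per_dev_per_week = 3.0    # from your survey / telemetry
loaded_hourly_rate = 90.0             # fully loaded cost of a developer hour
license_cost_per_dev_per_month = 40.0

# Net time gain: savings minus the tooling cost expressed in hours.
weekly_license_cost = license_cost_per_dev_per_month * 12 / 52
net_gain_hours = hours_saved_per_dev_per_week - (weekly_license_cost / loaded_hourly_rate)
print(f"Net time gain per developer: {net_gain_hours:.2f} h/week")

# Agent hourly rate: agent spend divided by human-equivalent hours delivered.
monthly_agent_spend = 2_500.0         # tokens + infrastructure
human_equivalent_hours = 120.0        # estimated effort the agents replaced
agent_hourly_rate = monthly_agent_spend / human_equivalent_hours
print(f"Agent hourly rate: ${agent_hourly_rate:.2f}/h "
      f"vs ${loaded_hourly_rate:.2f}/h human")
```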

This isn’t just about cost-cutting. It’s about identifying which use cases deliver the highest ROI so you can double down on what works and cut what doesn’t. It’s also the stage where governance matters most: setting model configurations, usage guidelines, and security protocols for scalable, compliant adoption.

Using Developer Surveys to Measure Impact

Telemetry and system metrics only tell part of the story. You also need to understand how developers experience these tools subjectively. A well-designed survey, run before and after rollout, gives you a baseline and a trajectory.

I recommend structuring the survey across three levels:

Individual Experience

At the individual level, you want to understand perceived speed, effort reduction, code quality, and overall productivity. Questions like “AI coding tools helped me finish my work faster” or “Using AI increased my productivity” on a Likert scale (Strongly Disagree to Strongly Agree) are straightforward and effective.

Also critical: ask about frequency of use. Someone who uses the tool “all or most of the time” versus “not very much” will have fundamentally different perspectives on impact. You need both data points.
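
Here’s a minimal sketch of scoring those Likert responses and segmenting them by usage frequency. The response labels follow the scale above; the data shape is an assumption:

```python
from statistics import mean

LIKERT = {
    "Strongly Disagree": 1, "Disagree": 2,
    "Neither Agree nor Disagree": 3, "Agree": 4, "Strongly Agree": 5,
}

# Hypothetical responses to "AI coding tools helped me finish my work faster".
responses = [
    {"answer": "Agree", "frequency": "all or most of the time"},
    {"answer": "Strongly Agree", "frequency": "all or most of the time"},
    {"answer": "Disagree", "frequency": "not very much"},
    {"answer": "Neither Agree nor Disagree", "frequency": "not very much"},
]

def segment_scores(responses):
    """Average Likert score per usage-frequency segment."""
    segments = {}
    for r in responses:
        segments.setdefault(r["frequency"], []).append(LIKERT[r["answer"]])
    return {seg: mean(scores) for seg, scores in segments.items()}

for segment, score in segment_scores(responses).items():
    print(f"{segment}: {score:.1f} / 5")
```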

Team Experience

The team-level questions are where hidden risks surface. Code reviews, codebase navigability, readability, complexity, and maintainability: these are the areas where AI-generated code can create problems that don’t show up in individual productivity numbers.

If your developers report that AI-generated code is harder to review or more complex than human-written code, that’s a signal worth taking seriously, even if individual speed metrics look great.

Quality and System Health

Finally, measure the impact on systems: testability, adherence to coding standards, overall codebase quality, reliability, and performance. These are the metrics that protect you from the “fast now, broken later” trap.

One question I particularly like at this level: “In a typical week, approximately what percent of your time is lost due to obstacles or inefficiencies in your work environment?” This gives you a broader efficiency baseline that contextualizes the AI-specific data.
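
If you collect that answer as a percentage, rolling it up into an organization-level baseline is simple arithmetic; the figures here are invented:

```python
from statistics import mean

# Hypothetical survey answers: "% of a typical week lost to obstacles".
time_lost_pct = [10, 25, 15, 30, 20]
workweek_hours = 40
team_size = 50

avg_lost = mean(time_lost_pct) / 100
print(f"Average time lost: {avg_lost:.0%} "
      f"(~{avg_lost * workweek_hours * team_size:.0f} team-hours/week)")
```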

How to Roll Out Metrics Without Creating a Mutiny

Let me be direct about something: measuring developer activity in the context of AI is a sensitive topic. The hype surrounding AI, combined with the telemetry these tools generate, has created genuine anxiety on engineering teams. If you handle the rollout poorly, you’ll destroy trust and get garbage data.

Three principles for a successful metrics rollout:

1. Never use these metrics for individual performance evaluation. Say this explicitly. Reinforce it by pointing to your existing performance review process. Metrics like code generation volume are trivially gameable, and once people start optimizing for the metric instead of the outcome, you’ve lost.

2. Frame measurement as being about developer experience, not output. The purpose is to understand how AI-assisted work affects the developer experience and software quality, not to micromanage output. This framing is not just a communication strategy; it should be genuinely true.

3. Explain the investment rationale. Data is necessary to guide organizational investment: which tools deliver real value, which workflows are worth scaling, and which experiments should be retired. Developers generally respect this when it’s communicated honestly.

Proactive communication is essential. Without it, speculation and fear fill the void. Anonymous surveys generally get better participation, and sharing results back with the team builds the trust you need for ongoing measurement.

The Agent Question

One of the most thought-provoking challenges right now is how to measure the impact of autonomous AI agents. Should they be treated as independent contributors or as extensions of the teams that deploy them?

The most practical approach, in my experience, is to treat agents as extensions of the developers and teams that oversee their work. When you’re assessing a team’s throughput, include both human-authored pull requests and those created by agents operating under that team’s direction.
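
In practice, this can be as simple as attributing agent-authored PRs to the team directing the agent before aggregating throughput. A sketch with hypothetical fields:

```python
from collections import Counter

# Hypothetical merged-PR export; agent PRs carry the team that directs them.
merged_prs = [
    {"author": "alice",        "author_type": "human", "team": "payments"},
    {"author": "refactor-bot", "author_type": "agent", "team": "payments"},
    {"author": "bob",          "author_type": "human", "team": "search"},
    {"author": "test-agent",   "author_type": "agent", "team": "payments"},
]

# Team throughput counts human and agent PRs alike.
throughput = Counter(pr["team"] for pr in merged_prs)
agent_prs = Counter(pr["team"] for pr in merged_prs if pr["author_type"] == "agent")

for team, total in throughput.items():
    share = agent_prs[team] / total
    print(f"{team}: {total} PRs merged ({share:.0%} agent-authored)")
```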

This reflects a broader shift we’re heading toward: every developer will increasingly operate as a “lead” for a team of AI agents. The skills of the human operator (knowing what to delegate, how to verify, when to intervene) will matter as much as the tools themselves. Developers will be measured the way managers are measured today: based on the performance of their teams, human and AI alike.

A Reality Check on the Numbers

One final point. If you’re an engineering leader trying to reconcile the astronomical performance claims you see online with the results in your own organization, you’re not alone. Among peers, researchers, and experienced leaders, there’s a shared understanding that many of these numbers don’t reflect reality.

Set goals based on real industry data, not vendor marketing. The landscape is evolving rapidly, and developer sentiment and AI-driven time savings are genuinely improving. But progress is incremental, not magical. Honest benchmarking against real organizations beats chasing headline numbers every time.

Ready-to-Use Survey: AI Coding Tool Impact Assessment

Below is a complete questionnaire you can copy into your preferred survey tool (Google Forms, Typeform, etc.). Run it once before rollout as a baseline, then on a regular cadence (monthly or quarterly) to track changes over time. All Likert-scale questions use: Strongly Disagree / Disagree / Neither Agree nor Disagree / Agree / Strongly Agree.

Tip: Make the survey anonymous. You’ll get better participation and more honest answers.


Start Here

If you take one thing from this article, let it be this: measure before you optimize. Get a baseline survey out before or alongside your AI tool rollout. Track utilization first, then impact, then cost. Use direct and indirect metrics together. Communicate openly with your teams. And resist the temptation to use AI metrics as a performance management tool.

The organizations that get this right will have a genuine competitive advantage, not because they adopted AI faster, but because they understood what was actually working.

Benjamin Igna is the founder of Stellar Work, a consulting company that helps R&D organizations transform how they work. He hosts the Stellar Work podcast and advises on agile transformation, developer experience, and AI-assisted engineering.

Here you can download the paper and table as a PDF: https://drive.google.com/file/d/1ufG1iGAcrnxa987FdsbSf3ksW46WKZOx/view?usp=sharing

Not Sure Where to Start?

Warp Speed Workshop

In this one-off interactive, gamified workshop, we’ll simulate real-world work scenarios at your organisation via a board game, helping you identify and eliminate bottlenecks, inefficient processes, and unhelpful feedback loops.
