How to Choose an AI Consultancy: A Buyer's Framework

Most AI consultancy engagements that disappoint share a single root cause: the buyer didn't ask the right questions before signing. Every consultancy can show a polished deck and name-drop a few clients. The hard part is separating the firms that ship working systems from the firms that ship strategy memos and walk away.
This is a working buyer's framework: a scoring rubric, the red flags to watch for, and a small set of questions that reveal what kind of partner you're really hiring.
A scoring rubric for AI consultancies
Score each candidate from 1 (worst case) to 5 (best case) on each of the nine dimensions below, for a maximum of 45. A total under 30 is a real concern. A total over 38 is a strong fit.
| Criterion | 1 / 5 looks like | 5 / 5 looks like |
|---|---|---|
| Discovery vs. delivery balance | 6–8 weeks of discovery before any code | 1–2 weeks of focused discovery, then a working prototype |
| What gets delivered | A strategy deck and recommendations | Working software, deployed in your environment |
| Team seniority | A PM fronting offshore juniors you never meet | Senior engineers doing the work directly, named on the contract |
| AI evaluation & quality | "We'll test it before launch" | Custom eval suites and structured logging built in from day one |
| Model choice & vendor neutrality | Locked into one provider regardless of fit | Claude, ChatGPT, Gemini evaluated per use case with clear rationale |
| Integration approach | One-off custom integrations for every tool | MCP servers, reusable patterns, agents that reach into your systems cleanly |
| Code & IP ownership | Licensed platform you must keep paying to access | You own every line of code, on your accounts, from day one |
| Ongoing maintenance | Handoff, then unavailable | Optional retainer or a clean handoff with real documentation |
| Risk transparency | "Nothing should go wrong if you follow our process" | Names specific risks upfront with mitigation plans |
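If you want to keep scores for several candidates side by side, the arithmetic is simple enough to script. The sketch below is one possible way to do it, not a prescribed tool: the criterion keys are shorthand invented here for the nine table rows, and the "in between" label for totals of 30 to 38 is an assumption, since the rubric only names the two outer bands.

```python
# Shorthand keys for the nine rubric rows (names are illustrative, not from the rubric).
CRITERIA = [
    "discovery_vs_delivery",
    "what_gets_delivered",
    "team_seniority",
    "evaluation_and_quality",
    "model_choice_and_neutrality",
    "integration_approach",
    "code_and_ip_ownership",
    "ongoing_maintenance",
    "risk_transparency",
]


def total_score(scores: dict[str, int]) -> int:
    """Sum the nine 1-5 scores, rejecting missing or out-of-range entries."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    for name in CRITERIA:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    return sum(scores[name] for name in CRITERIA)


def verdict(total: int) -> str:
    """Map a total out of 45 onto the two thresholds stated in the rubric."""
    if total < 30:
        return "real concern"
    if total > 38:
        return "strong fit"
    # Assumption: the rubric does not name the 30-38 band.
    return "in between"


# Example: a candidate scoring 4 everywhere except risk transparency.
candidate = {name: 4 for name in CRITERIA}
candidate["risk_transparency"] = 2
total = total_score(candidate)
print(total, verdict(total))  # prints "34 in between"
```

A single low score can hide inside a decent total, so it is worth reading the per-criterion numbers as well as the sum.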
Red flags to watch for
Any one of these alone isn't disqualifying. Three or more is.
- "We're excited about AI." Excitement isn't capability. Ask for specifics.
- Massive teams with unclear roles. Usually means you're paying for layered management.
- AI as a buzzword. No specific tools, frameworks, or model names mentioned.
- No mention of evaluation. If they can't tell you how they know the AI is working in production, they don't know either.
- Vague code ownership. "We'll work that out in the SOW" is a no.
- Hourly billing with no upper bound. Outcomes-based pricing aligns incentives. Hourly does the opposite.
- No honest disqualifiers. A consultancy that says it's right for every problem is right for none.
Six questions that reveal posture
The scoring rubric covers what to look for. These six questions tell you who you're actually dealing with. Ask all of them in a single conversation and pay attention to whether the answers are specific, honest, and grounded in real work.
What did you ship last quarter that's running in production today?
This separates the firms that build from the firms that talk. A good answer names a specific system, what it does, and how the client uses it. A bad answer is generic ("we recently helped a Fortune 500 client streamline their operations") or pivots into a deck. If they can't show you something running with users on it, the rest of the conversation doesn't matter much.
Show me an evaluation suite you built for an AI system.
Production AI is non-deterministic. The difference between a demo and a production system is whether you know when it breaks. A consultancy that can show you actual eval code, test cases, and logging dashboards is doing the work. A consultancy that can't is shipping demos that haven't been pressure-tested in front of real users yet.
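If you have never seen an eval suite, here is roughly what the smallest possible one looks like, so you know what to ask for. This is a sketch under stated assumptions, not any firm's actual tooling: `fake_model` is a hypothetical stand-in for a real model call, and the two test cases and their expected answers are made up for illustration.

```python
import json
import logging
from dataclasses import dataclass
from typing import Callable

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("evals")


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; swap in your own client."""
    if "refund" in prompt.lower():
        return "Refunds are available within 30 days of purchase."
    return "I'm not sure."


def run_evals(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case, log one structured record per case, return the pass rate."""
    passed = 0
    for case in cases:
        output = model(case.prompt)
        ok = case.check(output)
        passed += ok
        # One JSON line per case, so results can be searched and charted later.
        log.info(json.dumps({"case": case.name, "passed": ok, "output": output}))
    return passed / len(cases) if cases else 0.0


# Illustrative cases only; a real suite would have many more, drawn from real usage.
CASES = [
    EvalCase("refund_policy", "What is the refund window?",
             lambda out: "30 days" in out),
    EvalCase("admits_unknown", "What is the CEO's shoe size?",
             lambda out: "not sure" in out.lower()),
]

pass_rate = run_evals(fake_model, CASES)
```

A real suite will be larger and will usually run repeatedly, because the same prompt can produce different outputs. The shape is the same, though: named cases, an explicit pass/fail check, and a structured log you can inspect after the fact.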
Walk me through a project that went sideways. What did you do?
The honesty test. Every consultancy has projects that struggled. The ones that pretend otherwise are the dangerous ones. Listen for specifics, root-cause analysis, and what they changed in their process afterward. A partner that's open about past failures will be open about risks on your project too.
What kind of work do you refuse to take on, and why?
A consultancy with no disqualifiers is a consultancy that says yes to everything for revenue. Listen for actual scope refusals — types of work, types of clients, types of engagements where they know they're not the right fit. The clearer the no, the more credible the yes.
Who specifically will write the code, and can I talk to them today?
This catches the bait-and-switch where senior people pitch and junior people build. The right answer is "yes, here they are, let's set up a call this week." If the answer is "we'll introduce you after you sign the SOW," the people on the call aren't the people on the project.
If we wanted to take this fully in-house after launch, what would that take?
This reveals whether they're building for handoff or for lock-in. A good partner answers concretely: documentation your engineers can actually use, training sessions during the engagement, decision documents explaining why the system was built the way it was, and clean handover of accounts and credentials. A partner that hedges, or that immediately steers the answer toward "most clients keep us on a retainer," may be building something you'll struggle to operate independently.
FAQ
How long should I spend evaluating AI consultancies?
For a 4–16 week engagement, two to four weeks of evaluation is reasonable. Talk to three partners minimum, score them against the rubric above, and ask each one a question you already know the answer to (to check whether they bluff or admit they don't know).
What's a fair price range to expect?
It depends on scope. As a rough frame: a focused 4–6 week AI adoption assessment usually runs in the low five figures. A 4–12 week agents and automations build usually runs in the mid five to low six figures. An 8–16 week custom software replatform usually runs in the mid-to-high six figures. Outcomes-based pricing should be the norm. Firms that quote hourly with no upper bound are a red flag.
Should I run an RFP?
RFPs are useful when you need to compare apples-to-apples on a well-scoped problem. They're counterproductive when you're still figuring out what to build — they reward partners who write good documents, not partners who build good software. For AI work, a paid two-week discovery engagement with one or two finalists usually tells you more than an RFP ever will.
Should I ask for references?
Yes. Ask each reference three specific questions: "What surprised you about working with this firm?", "What would you do differently?", and "Would you hire them again for a different project?" The third question is the most honest signal you'll get.
What if my team isn't technical enough to evaluate AI-specific answers?
Bring in an independent advisor for the evaluation conversations. A one- to two-hour consult with someone senior who has actually shipped AI systems will cost far less than picking the wrong consultancy.
If you're looking for an AI consultancy that delivers working systems, where senior people do the work, and you own everything that gets shipped — FM might be the right fit. Most engagements start with a 30-minute scoping call. No decks, no hard sell.
Brian Fletcher
Principal, Co-founder @ FM