# Pilot A Custom RAG On One Project Type

**By Dan Cumberland** · Published April 30, 2026 · Categories: AI Strategy

## The Case for Scoping AI Down

A project-type RAG pilot is a 60–90 day implementation of a Retrieval-Augmented Generation system scoped to one architectural project type, say K-12 schools or healthcare tenant improvements (TI), and built on that type's archive of completed projects. It is the contained alternative to a firm-wide AI rollout.

Most AEC AI conversations bounce between two bad options\.  One is the vendor pitch \("buy our platform, sign the MSA, transform the firm"\)\.  The other is paralysis \("let's wait until it's mature"\)\.  There is a third path, and it is the one principals at $20M–$100M firms keep landing on once they actually scope the work\.

> *Scope is the most underused lever in AEC AI strategy\.*

The firms getting real value from AI in 2026 are not the ones rolling it out everywhere. They are the ones who picked one project type, ran a 60–90 day pilot, and let the results decide what came next. McKinsey's most recent state-of-AI work[1](/blog/blog-type-architecture#ref-1) points the same direction at the macro level: the organizations reporting EBIT impact concentrate use cases rather than spreading them thinly. This article gives you the playbook: the pilot plan, the build-vs-buy table, the honest cost range, and the three failure modes vendor content tends to skip.

If scoping is the lever, the question is *why* it works\.  The answer is partly technical, partly organizational\.

## Why One Project Type Beats a Firm\-Wide Rollout

Scoping a RAG pilot to one project type produces meaningfully better retrieval than a firm\-wide rollout because terminology, standards, and document structure are far more consistent within a single type\.  Narrower domain, cleaner grounding, faster value\.

Here is what that looks like in practice:

- **Terminology consistency\.**  K\-12 spec language is not healthcare spec language\.  Embeddings cluster cleaner when the corpus speaks one dialect\.
- **Standards consistency\.**  Within a single type, your firm has fewer conflicting "right answers" about details, sequencing, and submittals\.
- **Single PM champion\.**  A senior PM in one vertical is far easier to recruit than firm\-wide buy\-in\.
- **Honest scorecard\.**  Contained scope means you can score the pilot before committing capital\.

Anthropic's Contextual Retrieval research[2](/blog/blog-type-architecture#ref-2) showed retrieval failure rates dropping by up to 49%, and by 67% with reranking, when chunks carry their context. That effect compounds when the corpus itself is narrow. LlamaIndex's production RAG guidance[3](/blog/blog-type-architecture#ref-3) says the same thing in different words: production-grade retrieval depends on chunking, metadata filtering, hybrid search, and evaluation, and all four get easier when the domain is narrow and the metadata is clean.
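
As a rough illustration of what "chunks carry their context" means, here is a minimal Python sketch. It is a simplification: the published technique has a model write a short context for each chunk, whereas this sketch assumes a deterministic header built from project metadata. The `Chunk` fields, the `contextualize` helper, and the sample project are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    project: str   # hypothetical project label
    doc_type: str  # e.g. "Specification" or "RFI"
    section: str   # e.g. a spec section heading
    text: str      # the raw chunk text as extracted


def contextualize(chunk: Chunk) -> str:
    """Prepend a one-line header so the chunk describes itself when embedded."""
    header = f"[{chunk.project} | {chunk.doc_type} | {chunk.section}]"
    return f"{header}\n{chunk.text}"


# Illustrative data only, not taken from a real project.
example = Chunk(
    project="Example Elementary School",
    doc_type="Specification",
    section="08 71 00 Door Hardware",
    text="Provide classroom security function locksets at instructional spaces.",
)
contextual_text = contextualize(example)
print(contextual_text)
```

The string that gets embedded is `contextual_text`, not the bare chunk, so a query about door hardware on school projects has more to match against than one orphaned sentence.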

K-12 spec language is not healthcare spec language. Pretending one corpus can serve both at pilot stage is how pilots fail before they start.

Once you accept that scope is the lever, the next question is which lever to pull\.  Not every project type is a good pilot candidate\.

## How to Choose Which Project Type to Pilot On

Pick the project type with the most completed projects, the strongest repeat client demand, the cleanest documentation, and a willing senior PM as champion\.  The combination matters more than any single factor\.

The four\-criteria checklist:

1. **Document depth\.**  At least 15 completed projects in the type\.  Below that, the corpus is too thin to ground anything useful\.
2. **Repeat client demand\.**  A vertical you keep winning work in\.  Future leverage from the pilot compounds with each new pursuit\.
3. **Documentation hygiene\.**  Specs, RFIs, submittals, meeting minutes organized in something searchable\.  If documents are scattered across personal drives, that is a separate project before this one\.
4. **PM champion\.**  One senior PM in the type who *wants* it to work\.  Not a tolerant skeptic\.  A wanted\-it\-yesterday champion\.

> *If you cannot name 15 completed projects in the type, the corpus will be too thin to ground anything useful\.*

If your firm has 2–4 repeat verticals, score each on these four dimensions and pick the highest total\.  A pilot without a senior PM champion is a pilot that ships and dies\.
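
The scoring step can be sketched in a few lines of Python. This is an illustration under stated assumptions: the 1–5 scale, the candidate names, and every score below are made up, and treating the 15-project minimum as a hard eligibility gate is one reasonable reading of the checklist rather than a rule the checklist states.

```python
CRITERIA = ("document_depth", "repeat_demand", "documentation_hygiene", "pm_champion")
MIN_COMPLETED_PROJECTS = 15  # minimum corpus depth from the checklist


def total_score(scores):
    """Sum the four criterion scores (assumed 1-5 scale, equal weights)."""
    return sum(scores[name] for name in CRITERIA)


def pick_pilot_type(candidates):
    """Return the eligible project type with the highest total, or None."""
    eligible = {
        name: info
        for name, info in candidates.items()
        if info["completed_projects"] >= MIN_COMPLETED_PROJECTS
    }
    if not eligible:
        return None
    return max(eligible, key=lambda name: total_score(eligible[name]["scores"]))


# Hypothetical firm with three repeat verticals.
candidates = {
    "K-12 schools": {
        "completed_projects": 22,
        "scores": {"document_depth": 4, "repeat_demand": 5,
                   "documentation_hygiene": 3, "pm_champion": 5},
    },
    "Healthcare TI": {
        "completed_projects": 18,
        "scores": {"document_depth": 4, "repeat_demand": 4,
                   "documentation_hygiene": 4, "pm_champion": 2},
    },
    "Higher-ed labs": {
        "completed_projects": 9,  # below the 15-project minimum
        "scores": {"document_depth": 2, "repeat_demand": 3,
                   "documentation_hygiene": 5, "pm_champion": 5},
    },
}
print(pick_pilot_type(candidates))
```

In this made-up example the lab vertical scores well on hygiene and champion but never enters the comparison, because nine completed projects is too thin a corpus.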

Once you've picked the type, the work begins\.  Here's what 60–90 days actually looks like\.

## The 60–90 Day Pilot, Week by Week

A realistic project\-type RAG pilot runs 60–90 days across four phases: scope and corpus assembly \(weeks 1–3\), retrieval architecture build \(weeks 4–6\), evaluation and PM testing \(weeks 7–10\), and decision \(weeks 11–13\)\.

### Document corpus: in, defer, out

```html-table
<table><thead><tr><th>Document type</th><th>In / Defer / Out</th><th>Rationale</th></tr></thead><tbody><tr><td>Specifications</td><td><strong>In</strong></td><td>Highest-value retrieval target; structured by section</td></tr><tr><td>RFIs</td><td><strong>In</strong></td><td>Captures real-world clarification language</td></tr><tr><td>Submittals</td><td><strong>In</strong></td><td>Product/standard alignment with completed work</td></tr><tr><td>Meeting minutes</td><td><strong>In</strong></td><td>Decisions and rationale not captured elsewhere</td></tr><tr><td>Design narratives</td><td><strong>In</strong></td><td>Codifies firm-specific design logic</td></tr><tr><td>Lessons-learned docs</td><td><strong>In</strong></td><td>Highest-leverage content per page in the corpus</td></tr><tr><td>CAD / Revit / BIM models</td><td><strong>Defer</strong></td><td>v1 is text-first; visual models are a separate problem</td></tr><tr><td>Email threads</td><td><strong>Defer</strong></td><td>Privacy and signal-to-noise concerns</td></tr><tr><td>Active project files</td><td><strong>Out</strong></td><td>Pilot uses <em>completed</em> projects only</td></tr></tbody></table>
```

The four phases:

1. **Weeks 1–3: Scope & Corpus.** Confirm the project type. Identify 15+ completed projects. Lock the in/defer/out list. Tag every doc with project, year, phase, and document type.
2. **Weeks 4–6: Build.** Stand up a managed RAG service (more on the options in the next section). Ingest. Configure metadata filters. Smoke-test against ten gold-standard queries written by the PM champion.
3. **Weeks 7–10: Evaluate.** RAGAS-style scoring[4](/blog/blog-type-architecture#ref-4) on faithfulness, answer relevancy, context precision, and context recall. Head-to-head: the same 30 queries go to the RAG and to your current Newforma or Procore search plus senior PM memory. Three to five senior PMs run real queries from real pursuits.
4. **Weeks 11–13: Decide.** Score the pilot against the bar you set in week 1. Scale, refine, or kill. The bar is set in week 1, not week 13. That is the whole game.
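
The weeks 1–3 tagging step might look like the following Python sketch. The rule table mirrors the in/defer/out list above; the key names, the `DocTag` type, the `tag_for_ingest` helper, and the choice to keep unknown document types out are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# In / Defer / Out decisions, keyed by hypothetical document-type labels.
CORPUS_RULES = {
    "specification": "in",
    "rfi": "in",
    "submittal": "in",
    "meeting_minutes": "in",
    "design_narrative": "in",
    "lessons_learned": "in",
    "bim_model": "defer",
    "email_thread": "defer",
    "active_project_file": "out",
}


@dataclass(frozen=True)
class DocTag:
    project: str
    year: int
    phase: str
    doc_type: str


def tag_for_ingest(project: str, year: int, phase: str, doc_type: str) -> Optional[DocTag]:
    """Return a metadata tag for 'in' documents; None for deferred or excluded ones."""
    decision = CORPUS_RULES.get(doc_type, "out")  # assumption: unknown types stay out
    if decision != "in":
        return None
    return DocTag(project=project, year=year, phase=phase, doc_type=doc_type)


# Illustrative documents only.
kept = tag_for_ingest("Example Middle School", 2021, "CA", "rfi")
skipped = tag_for_ingest("Example Middle School", 2021, "CA", "email_thread")
print(kept, skipped)
```

Writing the list down as data forces the in/defer/out argument to finish before ingestion starts, which is the point of locking it early.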

Anthropic's *Building Effective Agents* guidance[5](/blog/blog-type-architecture#ref-5) is worth quoting in spirit: start with the simplest pattern that works, measure, and add complexity only when justified\.  Week one is a documents conversation, not an AI conversation\.  If you're still arguing about the corpus in week six, the pilot has already failed\.

Most of those phases assume you've already made one decision\.  Build or buy\.  That deserves its own section\.

## Build vs\. Buy: Managed Services for the Pilot

Buy managed for the pilot. Custom retrieval pipelines can outperform managed services at scale, but a $25K–$75K project-type pilot is the wrong moment to build your own retrieval stack from scratch. Amazon Bedrock Knowledge Bases[6](/blog/blog-type-architecture#ref-6), OpenAI Assistants File Search[7](/blog/blog-type-architecture#ref-7), and Azure AI Search[8](/blog/blog-type-architecture#ref-8) all handle ingestion, chunking, embedding, and retrieval out of the box.

```html-table
<table><thead><tr><th>Service</th><th>Ingestion</th><th>Chunking</th><th>Hybrid search</th><th>Where corpus lives</th><th>Best-fit pilot</th></tr></thead><tbody><tr><td>Amazon Bedrock Knowledge Bases</td><td>Managed; wide format support</td><td>Managed</td><td>Yes</td><td>S3 + managed vector store</td><td>Firms standardized on AWS, IT-led</td></tr><tr><td>OpenAI Assistants File Search</td><td>Managed via vector stores</td><td>Managed</td><td>Vector-first</td><td>OpenAI vector store</td><td>Smallest setup, fastest to first answer</td></tr><tr><td>Azure AI Search</td><td>Managed; deep enterprise hooks</td><td>Managed</td><td>Yes (vector + keyword)</td><td>Azure tenancy</td><td>Microsoft-365 firms, security-first reviews</td></tr></tbody></table>
```

Each handles ingestion, chunking, embedding, and retrieval\.  None of them solves your real problems: corpus quality, evaluation, and PM adoption\.  Those stay your responsibility\.

Custom pipelines win later, when scale, governance, or accuracy requirements cross a threshold managed services can't meet. In [AI implementation services](https://dancumberlandlabs.com/services/ai-implementation/) work at the pilot stage, that threshold is almost never crossed in the first 90 days. Both things are true: managed for the pilot, custom when it has earned its way in. The pilot's job is to find out whether RAG earns its keep on your project type. Building infrastructure is a separate project, and you don't need to do both at once.

With a managed\-services build, the cost question gets easier to answer\.

## What It Costs \(and What Drives the Variance\)

A project\-type RAG pilot for a mid\-size AEC firm typically runs $25,000 to $75,000 all\-in over 60–90 days\.  The variance is driven by data hygiene, internal vs\. fractional implementation, and how strict the evaluation bar is\.

This is the typical pilot range we see, not an industry stat\.  Phrase it that way internally too\.

What drives the spread:

- **Data hygiene of the chosen corpus\.**  Clean, well\-organized completed\-project folders sit at the low end\.  Documents scattered across PMs' personal drives push toward the high end before any AI work starts\.
- **Build team\.**  Internal hire vs\. fractional implementation partner vs\. boutique vendor\.  Each has a different cost shape\.
- **Evaluation rigor\.**  A three\-person PM head\-to\-head is cheap\.  A formal 100\-query benchmark with weighted scoring is not\.
- **Governance and IP review\.**  Federal, healthcare, and large institutional clients trigger more security review\.  Plan for it\.

Most of the spread between $25K and $75K is documents, not technology\.  What's *not* in this range: full firm\-wide rollout, multi\-type expansion, and custom infrastructure builds\.

Cost is one number\.  Whether the pilot worked is a different question, and a more important one\.

## How You'll Know It Worked

Three signals tell you whether the pilot worked: RAGAS\-style retrieval metrics, a head\-to\-head test against your current workflow, and whether senior PMs use it weekly without being told to\.  Two out of three is not enough\.

- **RAGAS metrics**[4](/blog/blog-type-architecture#ref-4): faithfulness, answer relevancy, context precision, context recall\.  Set a target on each in week 1\.  Score in weeks 7–10\.
- **Head\-to\-head\.**  Same 30 questions to the RAG and to your current workflow \(Newforma or Procore search plus a senior PM's memory\)\.  Score on usefulness, not cleverness\.
- **Adoption signal\.**  Weekly active senior PM use four weeks after launch, without prompting\.

The honest evaluation question is not "does it work?" but "does it answer better than a senior PM's memory plus a Newforma search?"  And: if PMs don't open it the week after launch, it doesn't matter how good the metrics look\.
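
The three-signal rule can be written down as a small Python check. Treat it as a sketch: the metric targets, the 50% head-to-head win rate, and the 50% adoption threshold are placeholder assumptions standing in for whatever bar your firm sets in week 1, and the sample numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotResult:
    metric_targets: dict[str, float]  # set in week 1
    metric_scores: dict[str, float]   # measured in weeks 7-10
    head_to_head_wins: int            # queries where the RAG answer was judged more useful
    head_to_head_total: int
    weekly_active_pms: int            # four weeks after launch, unprompted
    pms_in_pilot: int


def signals(result: PilotResult, min_win_rate: float = 0.5,
            min_adoption: float = 0.5) -> dict[str, bool]:
    """Evaluate each of the three signals independently."""
    metrics_ok = all(
        result.metric_scores.get(name, 0.0) >= target
        for name, target in result.metric_targets.items()
    )
    win_rate = result.head_to_head_wins / result.head_to_head_total
    adoption = result.weekly_active_pms / result.pms_in_pilot
    return {
        "metrics": metrics_ok,
        "head_to_head": win_rate > min_win_rate,
        "adoption": adoption >= min_adoption,
    }


def pilot_worked(result: PilotResult) -> bool:
    """All three signals must pass; two out of three is not enough."""
    return all(signals(result).values())


# Invented example: strong head-to-head and adoption, but context recall misses its target.
result = PilotResult(
    metric_targets={"faithfulness": 0.80, "answer_relevancy": 0.80,
                    "context_precision": 0.80, "context_recall": 0.80},
    metric_scores={"faithfulness": 0.86, "answer_relevancy": 0.82,
                   "context_precision": 0.81, "context_recall": 0.74},
    head_to_head_wins=21,
    head_to_head_total=30,
    weekly_active_pms=3,
    pms_in_pilot=4,
)
print(signals(result), pilot_worked(result))
```

Here the pilot wins 21 of 30 head-to-head queries and three of four PMs use it weekly, yet it still does not pass, because one retrieval metric missed the bar set up front.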

Even with good metrics, pilots fail\.  Not for the reasons you'd expect\.

## The Three Failure Modes Vendors Won't Name

Project\-type RAG pilots most often fail not for technical reasons, but for three predictable organizational ones: stale standards inside the corpus, project managers who never adopt the tool, and unresolved client\-confidentiality concerns\.  Each is fixable, but only if you name it before the pilot starts\.

1. **Stale standards\.**  The corpus contains 2017 superseded specs and retired vendor names alongside current ones\.  The RAG cites them with the same confidence\.  *Mitigation:* metadata\-tag every document with project completion year and standards version\.  Filter at retrieval\.  Retire docs whose standards have been superseded\.
2. **PM non-adoption.** Senior PMs default to Newforma muscle memory and never open the new tool. This is the most common failure mode and the least talked about. *Mitigation:* the PM champion runs weeks 7–10 testing. Embed retrieval into existing workflow surfaces, not a separate UI nobody opens. Even when [founders](https://dancumberlandlabs.com/for-founders/) and principals drive AI adoption top-down, the people doing the work still have to want it.
3. **IP nervousness\.**  Confidential client work, federal or healthcare clients, and security counsel can stop a pilot in week 9 if not addressed in week 1\.  *Mitigation:* address governance up front\.  Managed services with private\-tenancy options \(Bedrock, Azure\) typically clear most reviews\.  Get sign\-off before you ingest, not after\.
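
The mitigation for the first failure mode, stale standards, comes down to a metadata filter applied before anything reaches the model. Below is a minimal Python sketch of that idea. The `IndexedDoc` fields, the version labels, and the `filter_current` helper are hypothetical; in a managed service the same rule would be expressed through that service's own metadata-filter syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDoc:
    doc_id: str
    completion_year: int
    standards_version: str


def filter_current(docs, current_versions, min_year):
    """Keep only documents on a current standards version and recent enough to trust."""
    return [
        d for d in docs
        if d.standards_version in current_versions and d.completion_year >= min_year
    ]


# Illustrative index entries only.
docs = [
    IndexedDoc("door-hardware-spec-2017", 2017, "2016-standards"),
    IndexedDoc("door-hardware-spec-2023", 2023, "2022-standards"),
    IndexedDoc("roofing-spec-2021", 2021, "2022-standards"),
]
current = filter_current(docs, current_versions={"2022-standards"}, min_year=2020)
print([d.doc_id for d in current])
```

The 2017 spec stays in the archive but never becomes a citation candidate, so it cannot be quoted with the same confidence as current work.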

A RAG that confidently cites a 2017 superseded detail will lose your senior PMs in a single week\.  Most pilots fail in the politics, not the prompts\.

If the pilot clears those traps, the question becomes: what's next?

## When to Scale to a Second Project Type

The temptation after a successful pilot is to scale fast and broad\.  Don't\.  Scale to a second project type only when the first one's senior PMs use it weekly without prompting and the retrieval metrics clear an internal bar you set before the pilot started\.

Two trigger conditions, both required. Design the architecture so the second type plugs in; scoping is not the same as siloing. And treat the scale decision as a separate funding conversation, not an extension of the pilot.

Scaling a pilot that's still being adopted is how a $50K experiment becomes a $500K rebuild\.

## FAQ

**What is a project\-type RAG pilot?**  A scoped Retrieval\-Augmented Generation pilot that indexes one project type's documents \(say K\-12 schools\) to test custom AI knowledge retrieval before firm\-wide rollout\.

**How long does a custom RAG pilot take?**  60–90 days is realistic when built on managed services and scoped to a single project type\.

**How much does a custom RAG pilot cost?**  $25,000–$75,000 is a typical range for a mid\-size AEC firm, all\-in over 60–90 days\.

**Should we build custom or use Microsoft Copilot?**  Use Copilot for general productivity\.  Build a custom RAG when the value depends on firm\-specific specs, standards, and confidential client documents\.

**What documents go in the pilot corpus?**  Specifications, RFIs, submittals, meeting minutes, design narratives, and lessons\-learned docs from completed projects in the chosen type\.  Defer CAD/Revit/BIM models for v1\.

The whole argument compresses to one move: pick one project type, run a contained 60–90 day pilot, set the bar in week 1, and let the results decide what comes next. Everything else (managed vs. custom, $25K vs. $75K, scale or stop) falls out of that one decision.

## Working With an Implementation Partner

If scoping a project-type RAG pilot is on your firm's near-term list, an implementation partner can shorten the calendar and keep the pilot honest. The work is pilot scoping, project-type selection, build-vs-buy guidance, evaluation design, and PM-adoption coaching, and it's the work behind every [AI strategy](https://dancumberlandlabs.com/services/ai-strategy/) engagement we run with mid-market AEC firms. If that maps to where your firm is, [let's have a conversation](https://dancumberlandlabs.com/service/) about what the first 90 days could look like.

## References

1. McKinsey & Company, "The state of AI" \(2024\) — [https://www\.mckinsey\.com/capabilities/quantumblack/our\-insights/the\-state\-of\-ai](https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai)
2. Anthropic, "Introducing Contextual Retrieval" \(2024\) — [https://www\.anthropic\.com/news/contextual\-retrieval](https://www.anthropic.com/news/contextual-retrieval)
3. LlamaIndex, "Building Performant RAG Applications for Production" \(2024\) — [https://docs\.llamaindex\.ai/en/stable/optimizing/production\_rag/](https://docs.llamaindex.ai/en/stable/optimizing/production_rag/)
4. Exploding Gradients, "RAGAS — Evaluation Framework for RAG Pipelines" \(2024\) — [https://docs\.ragas\.io/](https://docs.ragas.io/)
5. Anthropic, "Building Effective Agents" \(2024\) — [https://www\.anthropic\.com/research/building\-effective\-agents](https://www.anthropic.com/research/building-effective-agents)
6. Amazon Web Services, "Knowledge Bases for Amazon Bedrock — User Guide" \(2024\) — [https://docs\.aws\.amazon\.com/bedrock/latest/userguide/knowledge\-base\.html](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html)
7. OpenAI, "Assistants API — File Search" \(2024\) — [https://platform\.openai\.com/docs/assistants/tools/file\-search](https://platform.openai.com/docs/assistants/tools/file-search)
8. Microsoft, "Retrieval Augmented Generation in Azure AI Search" \(2024\) — [https://learn\.microsoft\.com/en\-us/azure/search/retrieval\-augmented\-generation\-overview](https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview)


---

Source: https://dancumberlandlabs.com/blog/type-architecture/