# Your Submittal Log Is a Training Dataset

**By Dan Cumberland** · Published June 20, 2026 · Categories: AI Strategy


## Your Submittals Are Already Feeding AI—Just Not Yours

Procore's submittal tool saves 5–7 days per project through automated review and approval workflows[1](/blog/blog-training-architecture#ref-1)\. BuildSync processes submittals at 95% accuracy, integrating directly with BIM 360[2](/blog/blog-training-architecture#ref-2)\. These tools work because submittal data has always been structured— it just hasn't been used to train anything beyond the vendor's own platform\.

A submittal log contains more than most PMs realize\. The data fields that make automated review possible are exactly the fields you'd engineer into a training dataset:

- **Drawing number and spec section** — categorical identifiers that enable classification
- **Approval status** — the confirmed outcome \(approved or rejected\) that any AI model needs to learn patterns from
- **Revision count** — a proxy for review complexity and subcontractor performance
- **Time\-to\-decision** — captures reviewer behavior and workflow bottlenecks
- **Reviewer identity** — enables firm\-specific pattern detection over time
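
One way to picture those fields as a training example is a small typed record\. The sketch below is illustrative only: the class name `SubmittalRecord`, its field names, and the sample values are assumptions, not the schema of Procore or any other platform\.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SubmittalRecord:
    """One row of a submittal log, shaped as a labeled training example.

    All field names here are hypothetical, chosen for illustration.
    """
    drawing_number: str
    spec_section: str            # categorical identifier
    approved: Optional[bool]     # the label; None while the review is open
    revision_count: int
    submitted_on: date
    decided_on: Optional[date]
    reviewer_id: str

    @property
    def days_to_decision(self) -> Optional[int]:
        """Time-to-decision in calendar days, or None if undecided."""
        if self.decided_on is None:
            return None
        return (self.decided_on - self.submitted_on).days

    @property
    def is_labeled(self) -> bool:
        """Only records with a confirmed outcome can be used for training."""
        return self.approved is not None


# Made-up sample record.
record = SubmittalRecord(
    drawing_number="A-501",
    spec_section="08 44 13",
    approved=False,
    revision_count=2,
    submitted_on=date(2025, 3, 3),
    decided_on=date(2025, 3, 17),
    reviewer_id="reviewer-07",
)
print(record.days_to_decision, record.is_labeled)
```

Records where `approved` is still `None` have no confirmed outcome, so they would be held out of training until a decision lands\.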

Across 12 major construction platforms, modern submittal automation achieves 95%\+ accuracy in processing and categorizing construction submittals[8](/blog/blog-training-architecture#ref-8)\. That accuracy is evidence of structure\. The structure is the asset\.

But here's what those platforms don't tell you: they're using your data to run their models, not yours\. When you improve your [AI governance strategy](/blog/ai-governance-strategy) and look carefully at vendor agreements, you'll typically find your submittal data improves the vendor's shared model— not a model owned by your firm\.

> "Automation tools can process your submittal data with 95% accuracy\. That accuracy is evidence of structure\. The structure is the asset\."

> "When you run Procore's AI on your submittals, the platform gets smarter\. When you train your own model on your submittals, your firm gets smarter\. One of those investments compounds\. The other one is a subscription\."

## The Difference Between AI Reading Your Submittals and AI Learning From Them

Operational AI applies rules to each submittal in isolation\. Training AI detects patterns across thousands of submittals over time— patterns specific to your firm's reviewers, your project types, your subcontractors\. That's the difference between renting AI capability and owning it\.

```html-table
<table><thead><tr><th></th><th>Operational AI</th><th>Training AI</th></tr></thead><tbody><tr><td><strong>What it does</strong></td><td>Applies pre-trained rules to each submittal</td><td>Learns patterns from your firm's historical data</td></tr><tr><td><strong>Who owns the model</strong></td><td>The vendor</td><td>Your firm</td></tr><tr><td><strong>Competitive implication</strong></td><td>Every client on the platform gets the same capability</td><td>Your model gets smarter only from your data</td></tr></tbody></table>
```

Across 12 major construction technology platforms analyzed, none discloses AI model training on submittal data as a capability or service offering[9](/blog/blog-training-architecture#ref-9)\. That gap is documented, not speculated\. No vendor advertises the capability, which means a firm that built it would likely be the first to do so publicly\.

This is the chasm to cross: the gap between what operational AI does today \(reads your submittals\) and what training AI could do \(learns your firm's patterns permanently\)\. According to IBM and Kantar research on proprietary datasets[10](/blog/blog-training-architecture#ref-10), companies that pair their own data with workflow integration can build competitive advantages that are genuinely difficult to replicate\. The principle applies in construction\. No one has publicly applied it to submittal data yet\.

> "Operational AI is a service\. Trained AI on your data is an asset\. The first depreciates when you cancel the subscription\. The second compounds\."

## Three Barriers Keeping AEC Firms From Building Their Own Training Architecture

The barriers to training AI on submittal data are real and well\-documented\. Privacy concerns, data quality gaps, and talent shortages combine to explain why no firm has done this publicly— and why doing it now represents a genuine first\-mover window\.

### Privacy and Regulatory Uncertainty

Privacy concerns are among the most cited barriers to AI adoption in construction: 25\.7% of firms list them as their top obstacle, and GDPR violations can result in fines of up to €20 million or 4% of annual global turnover, whichever is higher[6](/blog/blog-training-architecture#ref-6)\. But most firms are worried about the wrong thing\.

GDPR creates privacy barriers around sharing construction data with vendors\. It does not prohibit firms from training AI internally on data they own, although personal data in the log \(reviewer names, for example\) still needs a lawful basis for processing\. Your submittals, generated on your projects, belong to your firm\. Training a model on that internal history does not trigger the core privacy barriers firms fear; what triggers those barriers is handing the data to a third party without proper agreements\.

The clarification matters\. Internal training architecture is a different legal question than vendor data sharing\. Standard data governance obligations still apply — documenting your legal justification for processing, keeping only the data you need, and setting rules for how long you keep it — but these are different in kind from the vendor data\-sharing restrictions most AEC firms worry about\. \(Firms with international project data or specific contractual clauses should confirm this with legal counsel— the general principle holds, but context matters\.\)

### Data Quality: 80% Lack Structure

Here's the harder truth: 80% of contractors lack structured submittal collection systems[5](/blog/blog-training-architecture#ref-5)\. The data exists— it's sitting in Procore or your project management platform right now\. But most firms couldn't export it in training\-ready format today\.

Vendors can process submittals at 95% accuracy because they do the cleanup on ingestion\. They normalize fields, resolve inconsistencies, handle missing data\. Your firm's raw export likely looks nothing like a clean training dataset\. This is a solvable problem, but it takes 6–12 months of data infrastructure investment before model training can begin\. The data quality work comes first\.

### Talent and Engineering Effort

McKinsey research on AI in construction[4](/blog/blog-training-architecture#ref-4) documents that approximately 80% of AI implementation effort goes to data engineering, preparation, and management— not to building or training the model itself\. And 46% of AEC firms cite talent and skill shortage as their primary adoption barrier[7](/blog/blog-training-architecture#ref-7)\.

That means you're not hiring a data scientist\. You're hiring data engineers to build the pipeline, ML infrastructure to run training, and domain experts who understand what a spec section rejection pattern actually means\. The [hidden costs of AI projects](/blog/hidden-costs-ai-projects) in this category are significant\. The firms that will succeed are the ones who scope the data engineering phase honestly, not the ones who assume the model\-building is the hard part\.

> "80% of AI implementation time goes to data engineering before a single model gets trained\. In construction, that number may be higher— because 80% of contractors lack structured submittal collection systems\."

These three barriers— privacy uncertainty, data quality gaps, and engineering effort— are documented and real\. They're also exactly why the first\-mover window is still open\. Every firm that's passed on building training infrastructure has cited at least one of these obstacles\. The question is whether yours will be the first to move through them\.

## Frequently Asked Questions

### Can we legally train AI on our own submittal data?

Yes, with important nuance\. The GDPR concerns most AEC firms raise are about sharing data with third parties, not about using your own data internally[6](/blog/blog-training-architecture#ref-6)\. Submittals generated on your projects belong to your firm\. Training a model on your internal submittal history does not trigger the core privacy barriers most AEC firms worry about, though personal data such as reviewer names still needs a lawful basis for processing\. Firms with international project data or specific contractual clauses should consult legal counsel, but the general principle is clear: internal training is a different legal question than vendor data sharing\.

### How much submittal data does a firm need before training makes sense?

There's no universal threshold, but data scientists typically look for thousands of labeled examples to produce meaningful models\. A firm with 5\+ years of structured Procore data and 10\+ projects is likely in range for initial feasibility testing\. Data quality matters more than volume: 1,000 clean, well\-structured records can outperform 10,000 inconsistent ones[5](/blog/blog-training-architecture#ref-5)\. The practical answer is that most $20M–$100M AEC firms have enough raw data; the work is cleaning it into training\-ready format\.

### What's the difference between using Procore's AI and training your own model?

Procore's AI is trained on aggregate submittal data from thousands of clients— it reflects industry patterns, not your firm's patterns[1](/blog/blog-training-architecture#ref-1)\. A proprietary model trained on your submittals learns your reviewers' tendencies, your subcontractors' performance history, your spec section failure rates\. Across 12 construction platforms analyzed, none currently discloses submittal training as a capability[9](/blog/blog-training-architecture#ref-9)\. The difference is generic intelligence versus institutional knowledge that took your firm years to accumulate\.

## What Training Architecture Looks Like for an AEC Firm

A firm\-specific training architecture for submittal data has four layers: data extraction, standardization, model training, and workflow integration\. The last layer is the one most firms would skip, and it's the most important one\.

Think of it as building an iceberg from the bottom up\. The visible tip is the AI model predicting submittal outcomes\. Everything underneath— the extraction pipeline, the normalization work, the feedback loops— is the infrastructure that makes the model valuable\. Without it, you don't have an asset\. You have a snapshot\.

**Layer 1 — Data Extraction** Pull structured data from Procore, BIM 360, or BuildSync exports\. Identify which fields matter for training: approval status, revision count, time\-to\-decision, spec section, subcontractor identity, reviewer\. This layer is largely mechanical— the data already exists, you're just making it accessible\.
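
A minimal sketch of that extraction step, assuming the platform can export the log as CSV\. The column names and sample rows are hypothetical; a real export will have its own headers and far more columns\.

```python
import csv
import io

# Hypothetical column names; a real export will use its own headers.
TRAINING_FIELDS = ["spec_section", "status", "revision_count",
                   "submitted_on", "decided_on", "subcontractor", "reviewer"]

# Made-up stand-in for an exported submittal log.
SAMPLE_EXPORT = """\
id,title,spec_section,status,revision_count,submitted_on,decided_on,subcontractor,reviewer,notes
101,Curtain wall shop drawings,08 44 13,Rejected,2,2025-03-03,2025-03-17,Acme Glazing,J. Ortiz,see markup
102,Rebar schedule,03 20 00,Approved,0,2025-03-05,2025-03-10,Delta Steel,M. Chen,
"""


def extract_training_rows(csv_file):
    """Keep only the training-relevant columns from an exported log."""
    reader = csv.DictReader(csv_file)
    missing = [f for f in TRAINING_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        # Fail loudly rather than train on a partial schema.
        raise ValueError(f"export is missing columns: {missing}")
    return [{field: row[field] for field in TRAINING_FIELDS} for row in reader]


rows = extract_training_rows(io.StringIO(SAMPLE_EXPORT))
print(len(rows), rows[0]["status"])
```

Checking for missing columns up front is a deliberate choice here: a silently dropped field is much harder to notice once the data has passed through later layers\.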

**Layer 2 — Standardization** Normalize timestamps, tag records by project phase, resolve reviewer identity consistency across projects, flag incomplete records\. This is where McKinsey's 80% effort estimate[4](/blog/blog-training-architecture#ref-4) becomes real\. It takes months, not weeks\. Most of the engineering work in any AI implementation lives here\.
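
As a rough illustration of what standardization involves, the sketch below normalizes mixed date formats, maps reviewer name variants to one identifier, and flags rows that are unusable\. The date formats, alias table, and field names are assumptions made up for the example\.

```python
from datetime import datetime

# Hypothetical formats and aliases, for illustration only.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")
REVIEWER_ALIASES = {"j. ortiz": "reviewer-07", "jortiz": "reviewer-07",
                    "m. chen": "reviewer-12"}
REQUIRED = ("spec_section", "status", "submitted_on", "reviewer")


def parse_date(text):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def standardize(row):
    """Normalize one raw row and flag it if a required field is unusable."""
    clean = dict(row)
    clean["submitted_on"] = parse_date(row.get("submitted_on", ""))
    name = row.get("reviewer", "").strip().lower()
    clean["reviewer"] = REVIEWER_ALIASES.get(name)  # None if unknown reviewer
    clean["status"] = row.get("status", "").strip().lower() or None
    clean["spec_section"] = row.get("spec_section", "").strip() or None
    clean["incomplete"] = any(clean[f] is None for f in REQUIRED)
    return clean


# Made-up raw rows showing the kinds of inconsistency a real export has.
raw = [
    {"spec_section": "08 44 13", "status": "Rejected",
     "submitted_on": "03/03/2025", "reviewer": "J. Ortiz"},
    {"spec_section": "03 20 00", "status": "approved ",
     "submitted_on": "05.03.2025", "reviewer": "JOrtiz"},
    {"spec_section": "", "status": "Approved",
     "submitted_on": "next week", "reviewer": "Unknown"},
]
cleaned = [standardize(r) for r in raw]
usable = [r for r in cleaned if not r["incomplete"]]
print(len(usable), "of", len(cleaned), "rows usable")
```

Flagging incomplete rows instead of deleting them keeps the option of repairing them later, which matters when the history cannot be regenerated\.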

**Layer 3 — Model Training** Batch train on your historical submittal corpus to build a firm\-specific approval prediction model— one that can flag new submittals likely to be rejected before your reviewer touches them\. This is the layer everyone thinks about first\. Plan it last\.
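
To make the idea concrete without committing to a modeling approach, here is a deliberately simple stand\-in: a smoothed rejection rate per spec section, learned in one batch pass over made\-up history\. It is a baseline, not the model a firm would ship; a real system would likely use a proper classifier with many more features\. All names and numbers are illustrative\.

```python
from collections import defaultdict


def train_rejection_baseline(history, prior_weight=2.0):
    """Estimate a smoothed rejection rate per spec section.

    history: iterable of (spec_section, was_rejected) pairs.
    Sections with few records are pulled toward the overall rate.
    """
    history = list(history)
    if not history:
        raise ValueError("cannot train on an empty history")
    overall = sum(rejected for _, rejected in history) / len(history)
    counts = defaultdict(lambda: [0, 0])  # section -> [rejections, total]
    for section, rejected in history:
        counts[section][0] += int(rejected)
        counts[section][1] += 1
    rates = {
        section: (rej + prior_weight * overall) / (total + prior_weight)
        for section, (rej, total) in counts.items()
    }
    return rates, overall


def rejection_risk(model, spec_section):
    """Look up the learned rate; unseen sections fall back to the overall rate."""
    rates, overall = model
    return rates.get(spec_section, overall)


# Made-up history: one section is rejected far more often than the other.
history = ([("08 44 13", True)] * 6 + [("08 44 13", False)] * 2
           + [("03 20 00", True)] * 1 + [("03 20 00", False)] * 7)
model = train_rejection_baseline(history)

for section in ("08 44 13", "03 20 00", "09 91 00"):
    risk = rejection_risk(model, section)
    flag = "flag for early review" if risk > 0.5 else "ok"
    print(section, round(risk, 4), flag)
```

Even a baseline like this is useful as a yardstick: any more sophisticated model should beat it on held\-out submittals before it earns a place in the review workflow\.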

**Layer 4 — Workflow Integration \(CRITICAL\)** Each reviewer decision feeds back into the training dataset\. Every approval or rejection your team makes is collected, and at regular intervals — per project completion, monthly, or on a scheduled cadence — the model retrains on the accumulated data\. This ongoing improvement cycle is what IBM and Kantar research[10](/blog/blog-training-architecture#ref-10) identifies as the mechanism that creates defensible competitive advantage— a model trained once and never updated is not an asset, it's a static snapshot that degrades over time\.
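
The feedback cycle can be sketched as a small collector that stores each decision and retrains after a fixed number of new ones\. The class `FeedbackLoop`, the count\-based trigger, and the toy training function are assumptions for illustration; a calendar or per\-project cadence would work the same way\.

```python
class FeedbackLoop:
    """Collect reviewer decisions and retrain once enough new ones accumulate."""

    def __init__(self, train_fn, retrain_every=3):
        self.train_fn = train_fn          # any function: list of records -> model
        self.retrain_every = retrain_every
        self.records = []                 # full decision history
        self.pending = 0                  # decisions since the last retrain
        self.model = None
        self.retrain_count = 0

    def record_decision(self, spec_section, was_rejected):
        """Append one reviewer decision; retrain when the batch is full."""
        self.records.append((spec_section, was_rejected))
        self.pending += 1
        if self.pending >= self.retrain_every:
            self.retrain()

    def retrain(self):
        """Rebuild the model from the full accumulated history."""
        self.model = self.train_fn(self.records)
        self.pending = 0
        self.retrain_count += 1


def overall_rejection_rate(records):
    """Stand-in 'model': the share of recorded submittals that were rejected."""
    return sum(rejected for _, rejected in records) / len(records)


# Made-up decisions; the fourth arrives after the first retrain.
loop = FeedbackLoop(overall_rejection_rate, retrain_every=3)
for section, rejected in [("08 44 13", True), ("03 20 00", False),
                          ("08 44 13", True), ("05 12 00", False)]:
    loop.record_decision(section, rejected)

print(loop.retrain_count, loop.pending, len(loop.records))
```

Retraining on the full history rather than only the newest batch keeps the sketch simple; with a large log, an incremental update might be the more practical choice\.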

> "The firms that will win with proprietary AI aren't the ones that train the best model\. They're the ones that integrate the model back into the workflow so it gets smarter every time a reviewer makes a decision\."

## Why Proprietary Submittal Data Could Be Your Firm's Most Defensible Asset

59% of AEC firms expect AI to transform their business by 2026[3](/blog/blog-training-architecture#ref-3)\. The firms building proprietary training data today aren't competing on the same AI as everyone else— they're competing with AI that no one else can copy\.

When every firm on Procore uses Procore's AI, they're all running the same model trained on aggregate industry data\. There's no differentiation\. The platform improves for everyone equally\. And when a competitor upgrades their Procore subscription, the gap closes overnight\.

A proprietary submittal model works differently\. What makes it defensible:

- **Volume × time** — a firm with 15,000 submittals over 10 years has a dataset that takes competitors a decade to replicate
- **Firm\-specificity** — the model encodes your reviewers' judgment patterns, your subcontractor reliability history, your spec section failure rates
- **Continuous compounding** — every project makes the model more accurate; competitors starting today are years behind
- **No vendor replication** — the data is yours; no platform can train the same model for someone else

This is a first\-mover opportunity with real barriers\. It's not proven ROI\. No firm has published a case study on proprietary submittal AI because no firm has done it publicly\. What we do know, from research on proprietary datasets across industries[10](/blog/blog-training-architecture#ref-10), is that the competitive window closes once the first movers establish their data advantage\.

For firms evaluating this investment, the success metrics look different than any other AI project— you're building a data asset, not measuring immediate time savings\. The model doesn't pay off in month two\. It pays off when your tenth project makes it smarter than your competitor's tenth project\. The payoff horizon is 2–3 years, not 2–3 months\. \(For context on how to track this kind of investment, see our guide to [measuring AI success](/blog/measuring-ai-success)\.\)

> "Generic AI is a commodity\. AI trained on 20 years of your firm's submittal patterns, your reviewers' judgment calls, your subcontractor reliability data— that's not a commodity\."

## Conclusion

The submittal log sitting in your project management platform is already a training dataset\. The question is whether your firm will be the one to use it\.

Construction AI is at its operational peak today— automated review, faster approvals, reduced administrative overhead\. That layer is real and worth having\. But the firms investing in training architecture now are positioning for the next wave: proprietary models that encode their own expertise, improve continuously, and create a data advantage competitors can't buy\.

The barriers are real\. Privacy, data quality, and engineering effort are genuine challenges, not excuses\. But they're solvable— and the competitive window is open precisely because they've kept everyone out so far\.

If you're evaluating where AI infrastructure investment makes sense for your firm, that's the kind of strategic question our [AI strategy work](/services/ai-strategy) is built around\. The goal isn't to sell you on building a proprietary model\. It's to help you understand what you're actually sitting on— and to start that conversation before your competitors do\.

## References

1. Procore, "Procore Submittal Automation & Analytics Features" \(2024\) — [https://www\.procore\.com/features/submittals](https://www.procore.com/features/submittals)
2. BuildSync, "BuildSync Submittal Review AI" \(2024\) — [https://www\.buildsync\.io](https://www.buildsync.io)
3. Autodesk, "State of the Industry Report 2024" \(2024\) — [https://www\.autodesk\.com/state\-of\-industry](https://www.autodesk.com/state-of-industry)
4. McKinsey, "AI in Construction: Transforming the Industry" \(2024\) — [https://www\.mckinsey\.com/industries/construction/our\-insights/ai\-in\-construction](https://www.mckinsey.com/industries/construction/our-insights/ai-in-construction)
5. QualisFlow, "Construction Data Quality Report 2025" \(2025\) — [https://www\.qualisflow\.com/industry\-report\-2025](https://www.qualisflow.com/industry-report-2025)
6. EU GDPR Regulations & Construction Industry Privacy Compliance Survey \(2024\) — [https://gdpr\-info\.eu/](https://gdpr-info.eu/)
7. AEC Foundry, "AI Skills Gap in Construction 2024" \(2024\) — [https://aecfoundry\.org/ai\-skills\-gap](https://aecfoundry.org/ai-skills-gap)
8. RedTeam, "Submittal Workflow AI" \(2024\) — [https://www\.redteam\.com/blog/submittal\-workflow\-ai](https://www.redteam.com/blog/submittal-workflow-ai)
9. Multi\-vendor analysis: Procore, Autodesk, BuildSync, Ezelogs, RedTeam, Touchplan, QualisFlow, Bridgit, PlanGrid, Egnyte, Bridgit Bench, AConex \(2024\) — [https://www\.procore\.com/features/submittals](https://www.procore.com/features/submittals)
10. IBM, Kantar, Bowmark Research, "Proprietary Data as Competitive Advantage in AI" \(2023–2024\) — [https://www\.ibm\.com/cloud/blog/proprietary\-data\-competitive\-advantage](https://www.ibm.com/cloud/blog/proprietary-data-competitive-advantage)


---

Source: https://dancumberlandlabs.com/blog/training-architecture/
