The next $10B AI company will own the weirdest data

Why the most defensible moats aren’t in LLMs—they’re in forgotten workflows and friction-filled datasets

Apr 22, 2025

Everyone’s chasing the next foundational model.
But the AI company that builds a true monopoly?
It won’t win with scale.
It’ll win with strange, high-friction, nobody-else-wants-it data.

Weird data is hard to clean. Hard to label. Hard to get access to.
That’s what makes it a moat.

Why weird data wins

In AI, defensibility doesn’t come from technical novelty anymore. It comes from what nobody else can see.

Tesla’s lead in autonomous driving isn’t from algorithms—it’s from billions of real-world miles.
Medivis builds AR-guided surgical tools trained on proprietary 3D medical imaging datasets, painstakingly curated from hospitals and surgeons over years.
Channel19 optimizes trucking logistics with a digital marketplace fueled by real-time freight and carrier data, scraped from small operators others overlook.
Klaviyo powers e-commerce marketing with a proprietary dataset of customer behavior—clickstreams, purchase histories, and engagement signals—collected from 143,000 merchants, giving it an edge no generic AI can replicate.

These are messy. Expensive. Domain-specific.
And that’s the point.

The secret: friction is the moat

The more annoying the data is to collect, the more likely it is to be valuable and defensible.

Healthcare transcripts full of acronyms, hesitations, and human nuance
Factory sensor logs with no consistent format
Field notes from oil rigs, classrooms, or farming tools
Customer support voice calls that require redaction, labeling, and industry context

You can’t buy this off the shelf. You have to earn it—through workflow, access, and patience.

“The deepest AI moats are built on data no one else wanted badly enough to collect.”

How to build a weird data moat

🛠️ Founder playbook:

Build for data capture: Your product should generate proprietary signal by solving a real pain point.
Go vertical first: Focus on one narrow niche where the data is high value and underutilized.
Own the loop: Make the data useful in product, improving retention and performance.
Treat ops like IP: Labeling, cleaning, structuring—it’s not overhead. It’s defensibility.

“10 golden samples are more valuable than 10,000 public ones.”

What I’m watching

The most exciting AI startups today aren’t flexing 100B parameters. They’re quietly dominating:

Payer-side claims workflows in healthcare
Campaign orchestration and attribution in martech
Real-time workflow automation in enterprise software
Sensor + voice fusion data in elder care

These aren’t just models—they’re monopolies on niche, critical, impossible-to-replicate data, solving real customer pain.

The investor takeaway

The sharp investors I know are asking a different set of questions:

What feedback loop are you capturing?
What signal improves with every user interaction?
What would a competitor have to do to replicate your dataset?

Because the next $10B outcome?
It probably won’t come from another chatbot.

It’ll come from a startup quietly logging overlooked signals from workflows you’ve never seen, until suddenly they’re the only ones with the data that matters.

What's the weirdest dataset you wish you could own?

(I'd love to hear your answer—just hit reply.)

Chintan Zalani

Apr 23 (edited)

Interesting "weird data" take, Parul. I would argue that AI has also made it easier to scan public sources and structure unstructured information. I do this with AI startup funding data. But I guess truly defensible moats would involve a combination of two things: access to data sources that aren't easily available to others, and some form of existing distribution or network effects that improve as more users contribute data.


1 reply by Parul Singh

Paola Raska

Apr 25

This is so on target -> 10 is better than 10k. For health, this is going to translate to cohorts of engaged patients collecting all sorts of data on themselves: data that isn't available anywhere else because it is always siloed, fragmented by data type, and difficult to integrate at the patient level. Rich, deep data, curated through the patient's own lens of their lived experience. Their data across ALL modalities. That's what CEtHI is about: creating real-world evidence in real time via engaged patients. I love the validation this brings us.
