Why your AI is only as good as the data foundations on which it is built

Many of the questions organisations want AI to answer depend on connecting imperfect data first

In the rush to apply AI across the enterprise, many organisations are asking the wrong first question.

They ask which model to use, which copilot to deploy, which workflow to automate, which agent to build.

But before any of that, there is a more basic question that is often a key factor in whether AI delivers meaningful value:

Has the data needed to answer the business question been connected well enough for AI to see the situation clearly?

In many organisations, the answer is no.

The insight people need is hampered by fragmented datasets. Customer records sit in one system, behavioural signals in another, campaign exposure elsewhere, and survey research somewhere else again. The identifiers needed to connect these datasets are incomplete, inconsistent, or missing altogether. Exact joins (on email address, for example) work for some records, but not for enough of them to create a complete and usable view.

This is where probabilistic matching becomes important. Not as a magic fix or an end in itself, but as a practical method for connecting imperfect data well enough to make better analysis, richer insight, and more useful AI possible.

The real problem is rarely “lack of data”.

Most organisations do not suffer from a total absence of data. They suffer from data that is abundant, but disconnected.

The evidence needed to answer an important business question often exists already. It is just scattered across different platforms, teams, and formats. Each source captures only part of the story. One dataset contains declared preferences, another actual behaviours. One contains transactions, another interactions. One shows exposure, another response.

Individually, each source is useful. Together, they may be powerful. But only if they can be connected in a credible way.

This matters even more in the AI era. AI systems can only reason over the relationships they are able to “see”, and while they can infer patterns from partial data, important gaps in how data is connected can still limit what they uncover.

That is why upstream data connection is not a technical side issue. It is often the common limiting factor in whether AI produces shallow output or meaningful insight.

What probabilistic matching actually does

Probabilistic matching helps estimate which records are likely to belong together when there is no perfect shared identifier.

That is the important distinction.

A deterministic match asks a simple question: do these records share the same key? If yes, they match. If not, they do not.

A probabilistic match asks a more nuanced one: based on the attributes available, how likely is it that these records refer to the same person, household, account, event, or entity?

That does not mean abandoning rigour. It means recognising the reality of many business environments where exact IDs are often incomplete, inconsistent, or unavailable across all the datasets needed to answer a given question.

Probabilistic matching can help bridge that gap. It does not create certainty where none exists. But it can create a fuller, more usable view than exact matching alone.
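The contrast can be sketched in a few lines of code. The example below is illustrative only, in the spirit of the Fellegi-Sunter framework for record linkage: the field names, weights, and threshold are all hypothetical, and a production system would estimate weights from data rather than hard-code them.

```python
# Hypothetical per-field evidence weights (Fellegi-Sunter style):
# agreement adds evidence for a match, disagreement adds evidence against.
WEIGHTS = {
    "surname":  (4.0, -2.0),   # (weight if fields agree, weight if they disagree)
    "postcode": (3.0, -1.5),
    "dob":      (5.0, -3.0),
}

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Accumulate field-level agreement weights into a single match score."""
    score = 0.0
    for field, (agree_w, disagree_w) in WEIGHTS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        score += agree_w if a == b else disagree_w
    return score

def is_probable_match(rec_a: dict, rec_b: dict, threshold: float = 6.0) -> bool:
    """A deterministic join asks: do the records share an exact key?
    A probabilistic match asks: does the accumulated evidence clear a threshold?"""
    return match_score(rec_a, rec_b) >= threshold
```

Two records with no shared customer ID or email can still be linked if surname, postcode, and date of birth all agree, while a pair agreeing on nothing scores well below the threshold and is left unlinked.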

That said, coverage is not the same as accuracy. More data is only useful if the additional connections are credible. The value of a probabilistic approach lies in balancing broader coverage with a managed level of uncertainty that is understood, tested, and controlled.
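One practical way to keep that uncertainty understood, tested, and controlled is to evaluate candidate match thresholds against a hand-labelled sample of record pairs, and inspect the trade-off between precision and how many connections are accepted. A minimal sketch (the scores and labels below are illustrative):

```python
def evaluate_threshold(scored_pairs, threshold):
    """scored_pairs: list of (score, is_true_match) from a hand-labelled sample.
    Returns (precision, acceptance_rate) at the given cut-off, so the trade-off
    between extra connections and error rate can be inspected before rollout."""
    accepted = [truth for score, truth in scored_pairs if score >= threshold]
    if not accepted:
        return 0.0, 0.0
    precision = sum(accepted) / len(accepted)        # share of accepted pairs that are correct
    acceptance_rate = len(accepted) / len(scored_pairs)  # share of candidate pairs linked
    return precision, acceptance_rate
```

Lowering the threshold links more records (broader coverage) but admits more false matches; raising it does the reverse. Making that curve visible, rather than picking a cut-off blindly, is what separates a managed level of uncertainty from an unmanaged one.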

And in many cases, that is enough to unlock better decisions.

When probabilistic matching is the answer

Probabilistic matching is the right approach when a business question depends on multiple datasets that do not join cleanly, and exact matching alone leaves too many gaps.

That scenario is more common than many organisations admit.

A marketing team wants to understand which customers are most likely to respond to a campaign, but campaign exposure, CRM history, website engagement, and purchase data all sit in different systems.

A customer insight team wants to know whether what people say in surveys actually predicts what they do in market, but attitudinal and behavioural data have been collected separately.

A media planner wants to understand where incremental reach is truly coming from across platforms, but cross-channel audience data does not fit together neatly at the person or household level.

In each case, the desired insight sits across multiple sources. No single dataset contains the answer on its own. And in many organisations, deterministic matching alone can leave a large share of relevant records unconnected, limiting the usefulness of downstream analysis.

In those situations, probabilistic matching is not just helpful. It may be the method that makes the analysis possible in the first place.

Its value lies in creating a broader and more representative input layer. Instead of limiting analysis to the proportion of records that happen to share an exact identifier, it can help recover more of the underlying structure in the data, while introducing a managed level of uncertainty that needs to be understood and controlled.

This is not about adopting a specific tool or platform, but about recognising when this type of approach is required.

When probabilistic matching is not the answer

It is equally important to be clear about when probabilistic matching is not the answer.

Not every data challenge requires it. Not every insight problem benefits from it. And using it where it adds no value can create unnecessary complexity.

If the business question can already be answered from one dataset, there may be no need to match at all. If the relevant systems already share a strong and reliable common identifier, deterministic matching is usually preferable. If the available variables are too weak, sparse, or inconsistent to support credible linkage, a probabilistic approach may not be robust enough to justify use. And where the cost of a wrong match is especially high, very conservative or exact methods may be required.

This is why probabilistic matching should not be treated as a universal answer to messy data. It is a method suited to particular types of problem, such as situations where the insight depends on connecting imperfect sources, and where uncertainty can be managed transparently and responsibly.

In some cases, poorly specified matching can introduce systematic bias, for example by over-linking certain types of records while missing others, which can distort downstream analysis if not properly tested.

That nuance matters. It is also what makes the approach credible.

Why this matters now more than ever

There is a temptation in AI conversations to assume that more advanced models will compensate for poor data structure upstream.

They can compensate for some limitations. They cannot remove them altogether.

AI may be able to identify patterns, summarise complexity, generate hypotheses, and support decisions. But it still depends on the quality and connectedness of the inputs it receives. If key signals are missing, isolated, or badly linked, the system may still generate an answer, just not a complete or reliable one.

This becomes even more important in agentic or decisioning contexts, where AI is expected not only to analyse but to recommend, prioritise, or act.

The more autonomy organisations want from AI, the more confident they need to be in the connectedness of the data layer beneath it.

That requires not just methodology, but transparency, validation, and appropriate governance, so that the level of uncertainty is understood and can be trusted.

That is why trust in AI does not begin with explainability at the model layer. It starts earlier, in how the underlying data has been assembled, matched, and prepared.

What this looks like in practice

The clearest way to understand the role of probabilistic matching is through real business questions.

A customer insight lead may want to know whether what people say in a survey is reflected in what they actually do in the market. Attitudinal data and behavioural data often sit in different systems and do not share a clean respondent-level key. Probabilistic matching can help connect the two well enough to reveal which attitudes genuinely translate into action, and which do not.

A marketing director may want to understand which customer segments are genuinely persuadable and through which channels. The relevant signals, such as campaign exposure, engagement, CRM records, and purchase behaviour, often exist, but not in a single joined-up view. A probabilistic approach can shift the view from partial signals to a more complete journey-level picture, often changing which segments appear most persuadable and where budget is best allocated.

A media planner may want to know where incremental reach is really being delivered across platforms. Cross-platform datasets rarely line up perfectly, yet the planning decision depends on understanding overlap, duplication, and under-reach. Here again, probabilistic matching can help create the broader audience view needed for more useful analysis.

A sales or revenue operations leader may want to know which accounts are most likely to convert or expand. But account signals are spread across CRM systems, product usage, support histories, and marketing engagement data. Without a better linked account view, AI scoring may reflect only fragments. With one, the business can prioritise effort based on stronger commercial evidence.

A customer experience lead may want to understand what is really driving poor experience. Complaints, feedback, journey data, and operational records often exist in separate systems. Matching them more effectively can reveal where friction is occurring, which moments are most damaging, and which interventions are likely to make the greatest difference.

In each case, the end user is not asking for probabilistic matching for its own sake. They are asking for an answer. Matching matters because the answer sits across datasets that do not naturally fit together.

Probabilistic matching is not the insight

This is one of the most important points to get right.

Business users do not wake up wanting probabilistic matching. They want to know which customers are at risk, which audiences are under-reached, which accounts are most likely to grow, or which attitudes predict real behaviour.

Probabilistic matching is valuable because it helps create the connected data layer needed to answer those questions when exact joins are not enough.

That may sound like a subtle distinction, but it is not. It changes how the method is positioned, how it is evaluated, and how it is discussed with non-technical stakeholders.

The real value lies not in matching records for its own sake, but in making better insight possible.

A practical test

For organisations considering where probabilistic matching fits, five simple questions can help.

  • Does the business question depend on more than one dataset?
  • Do those datasets lack a reliable common identifier?
  • Would exact matching leave too little coverage to be useful?
  • Can uncertainty be handled in a methodologically responsible way?
  • Will a fuller view materially improve the insight or decision?

If the answer to most of these is yes, probabilistic matching is likely to add value. If not, it is worth challenging whether it is needed at all.

The bigger point

As AI becomes more deeply embedded in commercial decision-making, the organisations that benefit most are unlikely to be the ones that simply adopt models fastest. They will be the ones that take the upstream data problem seriously.

Because the challenge is not only whether an organisation has data. It is whether the data needed to answer the business question has been connected well enough for analysis, AI, and decision-makers to see the full picture.

When exact matching is possible, use it.

When one dataset is enough, keep things simple.

But when the answer sits across messy, imperfect, fragmented sources, probabilistic matching may be the method that turns disconnected evidence into usable insight.

And in many AI strategies, that step often comes earlier, and matters more, than the models themselves.