Prevent and detect data quality issues before they hit your downstream tools
If you look at the market today, there are two key trends affecting how effectively you can use customer data.
One is that there is simply too much data, so much that it's no longer useful. IDC predicts that 180 zettabytes of data will be created annually by 2025. That's insane. When your company has too much data, and it wasn't generated deliberately, it often sits in a database somewhere, unused. Mismatched data creates distrust, and even though we have more data than ever, people revert to gut-based decision making.
Second, the data organizations collect is not always useful or meaningful. When data is collected without a plan or agreed-upon conventions, issues inevitably arise: unclear event names, typos in events, events with missing properties, and more. This is why data scientists spend most of their time cleaning and organizing data just to make it usable for the teammates who rely on it to do their jobs.
These two forces create a number of challenges that likely impact teams throughout your company.
Business teams lack the context they need to use data effectively
Chances are your company relies on data to make decisions every day. Data-informed decision making is only possible when you have clean and accurate data. And even with clean and accurate data, if teams don't know what the data means, they can't use it effectively.
One of the biggest problems companies face is that teams lack the context they need to use data effectively. Without a centralized spec or tracking plan, marketing, analytics, and product teams are left on their own to make sense of the data available to them. This often creates two challenges:
Challenge 1: Data is no longer self-serve. When analytics, product, or marketing teams need to use data, they first have to spend time figuring out what data is available and what it means.
Challenge 2: The bigger challenge is that each individual ends up dissecting the data to make it useful for themselves, while individuals on other teams do exactly the same work in parallel. You're left with multiple people cleaning the same dataset just to make it usable. Instead of going back to the source and fixing how the data is collected, each team fixes the data in its downstream tools.
Duplicate events complicate analysis
Another challenge we see is that without tools to enforce data quality, duplicate or unclear events complicate analysis. When you're trying to run a report or create an audience, how do you even know which event you're supposed to use? Is it Sign_up? Signup as one word? Or is it User Signed Up?
End users of any tool fed by this dataset can't reliably tell which events are production quality and safe to use in queries. Selecting the wrong event could lead to business decisions based on the wrong data. In most end tools, the only information provided is the event name, without any additional context, so you typically need to track down the person who created the data to understand what it means.
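The ambiguity above disappears when there is a shared tracking plan to check names against. A minimal sketch, assuming a hypothetical plan with a few made-up event names (this is illustrative, not an actual Protocols API):

```python
# Hypothetical sketch: validating incoming event names against a shared
# tracking plan. The plan contents and helper name are assumptions.

TRACKING_PLAN = {"User Signed Up", "Order Completed", "Lead Captured"}

def check_event_name(name: str) -> bool:
    """Return True only for events defined in the tracking plan."""
    return name in TRACKING_PLAN

# Without a plan to check against, variants of the same concept all look valid:
for candidate in ["Sign_up", "Signup", "User Signed Up"]:
    print(candidate, "->", "ok" if check_event_name(candidate) else "not in plan")
```

With a check like this in place, only the canonical spelling ("User Signed Up" here) passes; the ad-hoc variants are flagged before anyone builds a report on them.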
Unexpected property values make attribution impossible
Even when the event name is clear, the values of its properties and traits are not always what you expect. Without knowing where a lead came from, a "Lead Captured" event isn't all that useful. These issues lead to broken campaigns, wasted ad spend, broken dashboards, and poor predictions.
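Property checks can catch this class of problem before it reaches an ad platform or dashboard. A minimal sketch, assuming a hypothetical "Lead Captured" spec with a made-up required property (`source`) and allowed values; none of these names come from a real spec:

```python
# Hypothetical sketch: checking that a "Lead Captured" event carries the
# property attribution depends on. Property name and allowed values are
# assumptions for illustration.

REQUIRED = {"source"}
ALLOWED_SOURCES = {"organic", "paid_search", "referral"}

def validate_lead_captured(properties: dict) -> list:
    """Return a list of human-readable violations (empty means valid)."""
    errors = []
    for key in REQUIRED - properties.keys():
        errors.append(f"missing required property: {key}")
    source = properties.get("source")
    if source is not None and source not in ALLOWED_SOURCES:
        errors.append(f"unexpected source value: {source!r}")
    return errors

print(validate_lead_captured({"email": "a@b.com"}))       # missing source
print(validate_lead_captured({"source": "paid_search"}))  # valid
```

An event that fails this check is exactly the kind that silently breaks attribution: it still shows up in downstream tools, it just can't be tied back to a campaign.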
In some cases a single tracking outage can cost more than $600K, and the time spent troubleshooting and cleaning is significant.
The current approach is failing
The reason for these problems is that one team or even one person is typically responsible for ensuring high quality data. When data quality issues show up, fixing them becomes everyone’s job, distracting engineering, marketing, and analytics from their core work.
Ensuring high quality data requires tools and processes to prevent and detect issues. Most companies lack the tools to actually enforce their tracking plan, so they resort to manually auditing or reviewing implementations, often catching mistakes only after they reach production. We know how hard this is, and that's why we have solid recommendations and processes for so-called Data QA.
Protocols for data you can trust
Protect the integrity of your data and the decisions you make with it.
- Have confidence in your data
- Standardize data across your business
- Prevent issues at the source before the data is sent to downstream tools
- Diagnose data quality issues before they impact production
Protocols centers on the idea of a tracking plan, or spec, within your CDI / CDP. Its tools and features align with three pillars for ensuring high quality data:
- Align: tools to keep your teammates in sync with an accessible data dictionary
- Validate: tools to diagnose data quality issues with enhanced reporting and alerts
- Enforce: configurations to automatically block invalid events to protect your downstream tools
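The Enforce pillar can be pictured as a gate that drops non-conforming events instead of forwarding them downstream. A minimal sketch of that idea, with an assumed plan shape and made-up event and property names (not the actual Protocols implementation):

```python
# Hypothetical sketch of the "Enforce" idea: forward an event downstream only
# if its name is in the tracking plan and its required properties are present.
# Plan contents, event names, and property names are all assumptions.

TRACKING_PLAN = {
    "User Signed Up": {"referrer"},
    "Lead Captured": {"source"},
}

def forward_if_valid(event: dict, downstream: list) -> bool:
    """Append the event to `downstream` only if it matches the plan."""
    required = TRACKING_PLAN.get(event.get("event"))
    if required is None:
        return False  # unplanned event name: block it
    if not required <= event.get("properties", {}).keys():
        return False  # missing required properties: block it
    downstream.append(event)
    return True

warehouse = []
forward_if_valid({"event": "Signup", "properties": {}}, warehouse)   # blocked
forward_if_valid({"event": "Lead Captured",
                  "properties": {"source": "referral"}}, warehouse)  # forwarded
print(len(warehouse), "event(s) reached the warehouse")
```

The key design point is that validation happens at the gate, before delivery: the misspelled "Signup" never pollutes the warehouse, so no downstream team has to clean it up later.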