Five data resolutions for 2022
As you sit waiting for sugar-plum fairies (or just enjoying good holiday cheer and relaxation), it’s a good time to start thinking about what you want to change up in 2022. Data is a big part of my working life, so as I look forward to the new year, I’m hoping you’ll consider improving your data practices.
Most of us, I think, work in what we would like to describe as a data-driven business. But it’s shocking to me how often data-driven businesses make fundamental mistakes in how they use, consider and handle data. So, here’s a simple list of resolutions that can help you build a stronger data discipline for the coming year.
Resolution 1: Always test your assumptions
For me, the four scariest words in a supposedly data-driven business are ‘the answer is obvious.’ Validating your assumptions doesn’t have to be hard. First, identify the statement you want to validate (e.g., ‘this is the best market for us to go into’ or ‘this onboarding flow is better’). Next, identify the data you have, or need to collect, that would allow you to test that assertion. Then collect that data and analyze it. This doesn’t need to be some fancy AI or ML model: the vast majority of practical analyses leverage the most basic of statistics (and are stronger for it!). Finally, draw your conclusions from your analysis.
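As one illustration of how basic the statistics can be, here is a minimal sketch of testing the claim ‘this onboarding flow is better’ with a two-proportion z-test. The counts are made up, and the function name is my own; it assumes you have simply counted how many users completed each flow.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the difference between two rates.

    Assumes both groups are non-empty and the pooled rate is strictly
    between 0 and 1 (otherwise the standard error is zero).
    """
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    # Pooled rate under the null hypothesis that both flows convert equally.
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 1,000 users saw each onboarding flow.
z, p = two_proportion_z_test(success_a=120, total_a=1000, success_b=150, total_b=1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value suggests the difference is unlikely to be noise; a large one suggests the ‘obvious’ answer is not yet supported by the data you have.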
In software development there’s an axiom that the cost of a bad decision grows exponentially over time. Caught early, it may cost you hours. Caught late, it can easily cost you person-years of effort and opportunity (and maybe even customers and revenue!). The question should never be whether you can afford to validate your assumptions but whether or not you can afford not to.
Resolution 2: Identify your bad data
Bad data comes in many forms and is extremely common. We all suffer from it. A critical step in running a data-driven business is identifying your bad data so that you can deal with it. Here are some of the most common forms of bad or ‘dirty’ data:
- Duplicated or missing data – Sometimes your data collection process itself is broken, and you end up with missing or duplicated data. Perhaps you misconfigured something or your provider had an issue. The key here is to look for these issues in your data before you start any analysis. You would be shocked (because I constantly am) at how often this occurs and goes unnoticed. We ourselves at SparkPost some years ago had an issue where Google Analytics data was collected twice for some portions of our website due to misconfiguration. The issue, of course, isn’t that it happened; it’s failing to catch and address it, because duplicated or missing data can falsely inflate or deflate your stats.
- Unsound or mislabelled data – This is the term I use for data where the collection process itself, or the questions which were asked, are fundamentally misaligned with what you’re trying to answer. For survey data this could be a result of asking bad questions or having a broken collection process. In email an example might be counting clicks to an unsubscribe link as part of your click-through data for engagement.
- Restricted data – This is data which is unavailable for usage because its use is illegal or restricted by policy. As privacy initiatives become more broadly adopted, this is becoming more commonplace. Data whose collection falls under the purview of GDPR or CCPA or some other privacy regime is a great example of this. It’s also a case where review of your data collection and usage policies by appropriate professional resources is important.
- Incorrect data – This is data which is present, but which is untrustworthy. This often requires strong subject matter expertise in the data to fully spot. A great example of this is geographic and device data for Gmail users. If you look at your geolocation data for Gmail users based on their open data, you will locate them all in the US. That’s not because they live in the US; it’s because Gmail proxies all of their traffic through a US-tagged network block. It’s not possible to ‘clean’ this without dropping that location/device data: it is fundamentally untrustworthy.
- Incomplete data – Data can also be incomplete. A common reason for this is that you cleaned some data that was dirty for other reasons – for instance dropping location data from Gmail opens, excluding iOS15 opens to accommodate MPP changes, etc. Incomplete data is very common, and the real risk lies in not understanding whether the systemic exclusion of data introduces bias into your overall analysis. Unfortunately, some of the privacy related restrictions can do this: excluding all Europeans from certain data sets, excluding all Apple users (or 60-some percent of them) from analyses relying on opens, excluding Gmail users from device usage stats – all of these can be very large blind spots.
- Stale data – This is data that was likely ‘good’ at some past date but is no longer relevant. Some examples of stale data are using signup locale as the user’s current residency, using IP-to-sender maps, or even using historical deliverability as an indication that an address is still valid. I think all of these cases are great examples of how data quality is nuanced. In all of these cases the data is hypothetically unreliable the minute it is used. (I could have signed up while traveling, an IP range can be decommissioned at any time, an address could have been decommissioned immediately after the engagement event.) In all these cases, though, the risk of the data going from hypothetically unreliable to actually bad grows over time. A critical part of using it is understanding (and hopefully measuring) the rate of decay of its reliability and incorporating that uncertainty into your assumptions. It also illustrates that none of this is a boolean good/bad – much of making real world decisions is making decisions based on assumptions, and that is fine as long as you accurately understand your assumptions.
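The first item on that list, duplicated or missing data, is also the easiest to check for mechanically. Here is a minimal sketch of such a check; the record layout and field names (`event_id`, `recipient`, `country`) are assumptions for illustration, not any particular provider’s schema.

```python
from collections import Counter

# Hypothetical event records; the field names are illustrative assumptions.
events = [
    {"event_id": "e1", "recipient": "a@example.com", "type": "open", "country": "US"},
    {"event_id": "e2", "recipient": "b@example.com", "type": "click", "country": None},
    {"event_id": "e1", "recipient": "a@example.com", "type": "open", "country": "US"},
    {"event_id": "e3", "recipient": None, "type": "open", "country": "DE"},
]


def find_duplicates(records, key="event_id"):
    """Return the key values that appear more than once."""
    counts = Counter(r[key] for r in records)
    return sorted(k for k, n in counts.items() if n > 1)


def missing_rates(records, fields):
    """Return the fraction of records with a missing (None) value per field."""
    total = len(records)
    return {f: sum(r.get(f) is None for r in records) / total for f in fields}


duplicates = find_duplicates(events)
missing = missing_rates(events, ["recipient", "country"])
print("duplicate event ids:", duplicates)
print("missing rates:", missing)
```

Running a check like this before every analysis is cheap, and it catches the kind of double-collection issue described above before it inflates your numbers.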
Resolution 3: Understand your proxies
It is very common to want to measure something that is immeasurable (either literally or just in a practical sense). A perfect example is wanting to measure whether a recipient truly engaged with your mail, in the sense of opening it and actually reading its contents. This is not possible to measure, as we don’t have little spy cams that monitor our recipients’ behavior. So instead, we use what are known as proxies for this information: something we can measure that we believe correlates well enough to stand in for the thing we can’t measure. In the case of user engagement, the industry has traditionally used open pixel beacons that (in theory, and prior to Apple’s MPP changes at least) do a reasonably good job of measuring whether a user opened a mail. Maybe you (or your provider) also license Verizon’s campaign performance feed and can use its aggregate, non-pixel-based stats on how often recipients opened your mails. In either case, you’re not actually measuring the important thing (since neither of those indicates that a user actually put eyeballs-to-screen on your message content), but a proxy for it.
Proxies are a common and necessary part of data analysis, and the goal isn’t to avoid them but to be acutely aware of when you’re using one, so that you can make your own contextual judgment of how well the proxy stands in for the thing you wish to measure in any particular case.
Resolution 4: Clean your data
So, we just identified a bunch of common issues that affect almost every email data set out there. Does that mean that it’s hopeless and we should just give up on using email data? Absolutely not!
The most common way to deal with bad data is to remove the data that you know you can’t trust. This is a key part of what is commonly known as ‘cleaning’ data. So to ‘clean’ data affected by Apple’s MPP changes, we would identify the subset of data affected (opens from current Apple devices) and remove that data from the analysis.
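Here is a rough sketch of that kind of cleaning step. The `mpp_suspected` flag is purely an assumption on my part: how you actually identify MPP-affected opens depends on your provider and tooling. The point is only that the removed rows are kept aside so you can count and inspect them.

```python
# Hypothetical open events. The "mpp_suspected" flag is an assumption: how you
# detect MPP-affected opens depends on your provider and tooling.
opens = [
    {"recipient": "a@example.com", "mpp_suspected": False},
    {"recipient": "b@example.com", "mpp_suspected": True},
    {"recipient": "c@example.com", "mpp_suspected": False},
    {"recipient": "d@example.com", "mpp_suspected": True},
]


def split_trusted(records, is_untrusted):
    """Partition records into (kept, removed) using an untrusted predicate."""
    kept = [r for r in records if not is_untrusted(r)]
    removed = [r for r in records if is_untrusted(r)]
    return kept, removed


kept, removed = split_trusted(opens, lambda r: r["mpp_suspected"])
print(f"kept {len(kept)} of {len(opens)} opens; removed {len(removed)}")
```

Returning the removed rows, rather than silently discarding them, makes it easy to report how much data the cleaning step cost you.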
It’s also important to understand whether or not a ‘badness’ in the data is relevant to a given analysis. This often requires significant subject matter expertise. For instance, we know from extensive testing that Apple’s MPP opens are fetched from the user’s device itself and not through a proxy server farm like Gmail’s. We further know that, even with the iCloud+ Safari privacy protection features activated, geolocation data is still accurate to the region. Thus, while the opens from those devices can’t be trusted as actual engagement, we could use the data in an analysis of where users are located, at least to regional granularity.
It is also important to measure the impact of your missing/dirty data. This serves two purposes. First, it allows those using your analysis to understand how it applies to them. For instance, if we run a report that leverages information that can’t be used for EU residents, it may still be perfectly usable in a US context but would probably be very low value in a European context, unless there was a reason to believe that the behavior being reported on was similar between the two regions.
Second, it tells you whether the analysis is still viable at all: elimination of enough data may render the sample too small or too biased to be useful, and we may need to go back to the drawing board.
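One simple way to measure that impact is to report what share of records was excluded, both overall and per segment. The sketch below assumes hypothetical rows with a `region` and a `restricted` flag; the function name and fields are mine, for illustration only.

```python
from collections import Counter


def exclusion_report(records, is_excluded, segment_of):
    """Return overall and per-segment share of records removed by cleaning."""
    totals = Counter()
    dropped = Counter()
    for r in records:
        seg = segment_of(r)
        totals[seg] += 1
        if is_excluded(r):
            dropped[seg] += 1
    per_segment = {seg: dropped[seg] / totals[seg] for seg in totals}
    overall = sum(dropped.values()) / len(records)
    return overall, per_segment


# Hypothetical records: region plus a flag saying the row cannot be used.
rows = (
    [{"region": "US", "restricted": False}] * 8
    + [{"region": "EU", "restricted": True}] * 3
    + [{"region": "EU", "restricted": False}] * 1
)

overall, per_segment = exclusion_report(
    rows, lambda r: r["restricted"], lambda r: r["region"]
)
print(f"overall excluded: {overall:.0%}")
for region, share in per_segment.items():
    print(f"{region}: {share:.0%} excluded")
```

In this made-up data the overall exclusion rate looks modest, but nearly all of it falls on one region, which is exactly the kind of blind spot that an overall number hides.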
Resolution 5: Keep your analysis as simple as possible, as complex as necessary
In a world where it seems like attaching the acronyms AI or ML to a product can immediately increase its cachet, there is value in keeping things as simple as possible. The practical reality is that the simplest analysis is often the strongest. Most problems can be tackled with basic statistics, and they have the benefit of being transparent to both you and anyone consuming your analysis. When you have to reach deep into your bag of modelling tricks, you often end up with models that are difficult to understand and render analyses that can be difficult (or dangerous) to extrapolate to other cases.
So that’s my list of data resolutions for 2022. What are you thinking about doing differently with your data in the coming year?