Data cleanliness and quality have been hot topics in the business intelligence (BI) world for years, and rightfully so. Without clean and non-siloed data, companies risk making bad business decisions based on flawed insights.

The same principle also applies to media intelligence. If your media data isn’t clean, consistent, and consolidated according to best practices across all lines of business or regional offices, you risk similar negative outcomes based on erroneous insights.

That’s why data cleaning (also known as “data cleansing” or “data scrubbing”) is one of the most important – yet often overlooked – steps when conducting media intelligence and analysis activities.

The risks of skipping data cleaning are massive

Although companies that skip data cleaning do so at their peril, it’s easy to see why this often happens: Data cleaning is time-consuming and often tedious work that typically isn’t celebrated around the boardroom table.

Indeed, data cleanliness is usually only noticed by stakeholders when something looks wrong. 

Data cleaning is also the most time-consuming step in producing a monthly media analysis report, often taking even longer than content curation, tagging for categories and sentiment, or report writing.

Things get even more laborious for large enterprises with thousands of mentions per day – after all, the larger the dataset, the more potential issues it can contain, and the longer it takes to clean and quality assure (QA) the data.

But skipping this step almost always costs organizations more in the long run. Clean data builds trust among readers – and just like a house built on a rickety foundation, a media analysis report built on bad data quickly collapses when faced with any real scrutiny from stakeholders.

How unclean data can put your credibility at risk

Dozens of potential issues and mistakes can lurk within your media data if you’re not careful and meticulous about data cleanliness. Here are just a few potential problems:

  • Press release content: While some enterprises deem PR content from the wire fair game for inclusion, most regard it as irrelevant noise. In the latter case, there’s nothing more damning than investigating a large mid-month coverage spike only to discover it was driven by reproductions of a company-issued release.
  • Publication data: Media outlets and publications often have slightly different naming conventions, depending on your content provider (for example, articles under The New York Times, NY Times, or New York Times could all arrive in your system the same day). Without an efficient way of combining all these permutations into one record with accurate metadata such as reach and region (see the sketch after this list), you’ll soon have a significant data problem on your hands.
  • Journalist data: Journalists can also have variations on their names, but even more common are the various affiliations that come with freelance or wire journalists (whose content can appear in dozens of different publications). Clean data is the difference between a single Carla K. Johnson from AP with 55 articles to her credit and 55 different Carla K. Johnsons from multiple outlets.
  • Duplicate articles: Duplicate articles are the bane of any media analyst. That’s partly because they’re so common – especially among online news feeds that often serve up dozens of copies of the same story – and partly because including a large batch of duplicates by mistake can completely blow up your analysis.
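
Two of these cleanup passes – collapsing outlet-name permutations into one canonical record and flagging duplicate articles – are straightforward to automate. Below is a minimal, illustrative Python sketch, assuming a simplified article schema (dicts with outlet, title, and url fields) and a hand-maintained alias table; both are hypothetical stand-ins for whatever your content provider actually delivers, not a description of any vendor’s workflow.

```python
import re
from collections import defaultdict

# Hypothetical alias table: every known permutation of an outlet name maps to
# one canonical record. In practice this is built up from your provider's feeds.
OUTLET_ALIASES = {
    "the new york times": "The New York Times",
    "ny times": "The New York Times",
    "new york times": "The New York Times",
}

def canonical_outlet(name: str) -> str:
    """Collapse outlet-name permutations into a single canonical name."""
    return OUTLET_ALIASES.get(name.strip().lower(), name.strip())

def normalized_title(title: str) -> str:
    """Lowercase and strip punctuation so near-identical headlines compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def flag_duplicates(articles):
    """Group articles by normalized headline; return groups with more than one hit."""
    groups = defaultdict(list)
    for article in articles:
        groups[normalized_title(article["title"])].append(article)
    return {key: hits for key, hits in groups.items() if len(hits) > 1}

# Example: two feeds deliver the same story under different outlet spellings.
articles = [
    {"outlet": "NY Times", "title": "Acme Corp. Expands Into Europe", "url": "a"},
    {"outlet": "The New York Times", "title": "Acme Corp expands into Europe", "url": "b"},
]

for article in articles:
    article["outlet"] = canonical_outlet(article["outlet"])

print(flag_duplicates(articles))  # one group of two near-duplicates, one outlet record
```

Real-world matching is fuzzier than an exact alias lookup, of course, but even a simple pass like this catches the most common permutations before they distort outlet-level metrics.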

Other common data quality issues can include outdated, irrelevant, missing, or spam articles; incorrect article metadata (including author details, category tagging, URL, or outlet name); formatting and consistency errors (such as inconsistent sentiment and date formatting, or the same outlet with different reach values); and old-fashioned typos in the data.
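
Many of these checks can also be scripted as a first QA pass. The sketch below, reusing the same hypothetical dict-based article schema as the previous example, flags records with missing metadata and outlets that appear with conflicting reach values; the required-field list is an assumption you would adjust to your own data.

```python
from collections import defaultdict

# Hypothetical required-metadata list; adjust to whatever your provider delivers.
REQUIRED_FIELDS = ("outlet", "author", "url", "date", "reach")

def find_missing_metadata(articles):
    """Return (article, missing_fields) pairs for records with incomplete metadata."""
    problems = []
    for article in articles:
        missing = [field for field in REQUIRED_FIELDS if not article.get(field)]
        if missing:
            problems.append((article, missing))
    return problems

def find_inconsistent_reach(articles):
    """Flag outlets that appear with more than one reach value in the dataset."""
    reach_by_outlet = defaultdict(set)
    for article in articles:
        if article.get("reach") is not None:
            reach_by_outlet[article["outlet"]].add(article["reach"])
    return {outlet: values for outlet, values in reach_by_outlet.items() if len(values) > 1}
```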

How often should I clean my media data?

There are two main approaches to data cleaning:

  • Ongoing data cleaning: Regular data cleanups and QA on a frequent basis.
  • Intermittent data cleaning: Irregular and infrequent data cleanups and QA, typically as data preparation before running a media analysis report. 

Some organizations clean their data only in the days leading up to a media analysis, ensuring their quarterly report (for example) is built on quality data but leaving day-to-day data issues unchecked. 

While this approach can be cheaper from a scheduling perspective because teams aren’t required to check data very often, it also leaves enterprises precariously exposed to inaccurate results if an ad hoc or last-minute report is required. This approach can also lead to costly bottlenecks (and potential delays or mistakes) in the days leading up to a report, as staff scramble to prepare large volumes of data for analysis.

At the end of the day, however, any data cleaning is better than none at all – and some communications groups don’t need anything other than intermittent updates. It all depends on what makes the most sense for your organization.

How Fullintel keeps your data clean and effective

Fullintel’s AMEC Award-winning media analysts employ a well-defined data cleaning and QA workflow that’s templated, repeatable, and effective. We keep your data clean and reliable according to your needs – from ongoing data checkups and maintenance for ad hoc reporting to a comprehensive data cleanup and consolidation well before your media analysis report is created.

Get in touch to learn more about how our AMEC-certified team can improve your media intelligence activities and outcomes.  

Andrew is Co-Founder and President of Fullintel. His 25-plus years of media intelligence experience help large organizations and Fortune 1000 companies such as Textron, AAA, Clorox, Kraft Heinz, MUFG, and Bell plan and implement day-to-day and crisis media monitoring and analysis strategies and best practices. He also co-founded dna13, the world’s first software-as-a-service media monitoring platform, which was eventually acquired by PR Newswire.