Data Quality: You Don’t Know What You Don’t Know

SAP BusinessObjects delivers great Business Intelligence solutions so that organizations can report off their existing data sources.  But what is the point of reporting off data that isn’t accurate?  Although it is true that accurate data is pretty useless if you can’t get access to it, the converse is also true.  What is the point of a great end-user-enabled system that contains inaccurate data?

My Top 5 Customers – Really?

Take a look at the report below.  (If you want to download this Xcelsius Model it is available below.)

Who are my top 5 customers?

Top 10 Customers

Did you say:  General Electric, Procter & Gamble, PepsiCo, Home Depot and Walmart?

Well, sorry.  I’m afraid that would be incorrect.

You see, what often happens in real-world situations is that organizations think they have more customers than they actually do.  That’s because within their CRM system, employees are able to add the same customer multiple times with multiple spellings.  This has happened in our case as well.  Let’s apply BusinessObjects Data Quality to this real-world situation.  With SAP BusinessObjects, you can take company names, customer names, addresses, etc. and standardize them, e.g. UPS = United Parcel Service = UPS Inc., WalMart = Wal*Mart = Wal-Mart, First Commerce Bank = 1st Commerce Bank.
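To make the idea concrete, here is a minimal Python sketch of name standardization.  It is only an illustration and not how BusinessObjects Data Quality works internally: the alias table, the `match_key` helper and the `standardize` function are all hypothetical names made up for this example, and a real tool uses far richer matching rules than a simple lookup.

```python
import re

# Hypothetical alias table (illustrative only): canonical name -> known variants.
ALIASES = {
    "United Parcel Service": ["UPS", "UPS Inc.", "United Parcel Service"],
    "Walmart": ["WalMart", "Wal*Mart", "Wal-Mart", "Walmart"],
    "First Commerce Bank": ["1st Commerce Bank", "First Commerce Bank"],
}


def match_key(name: str) -> str:
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Reverse lookup: simplified variant -> canonical name.
LOOKUP = {
    match_key(variant): canonical
    for canonical, variants in ALIASES.items()
    for variant in variants
}


def standardize(name: str) -> str:
    """Return the canonical name if the variant is known, else the trimmed input."""
    return LOOKUP.get(match_key(name), name.strip())


if __name__ == "__main__":
    for raw in ["UPS Inc.", "Wal*Mart", "wal-mart", "1st Commerce Bank", "Acme Corp"]:
        print(f"{raw!r} -> {standardize(raw)!r}")
```

Because the key ignores case, spaces and punctuation, spellings such as "Wal*Mart" and "wal-mart" land on the same record.  Names that are not in the table pass through unchanged, so nothing is silently merged by accident.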

My Top 5 Customers – Really!

Let’s have a look at this same report with Data Quality applied:

Top 10 Customers with Data Quality

Do you see the changes?

Walmart has jumped up into second place and United Parcel Service is now in fifth place.  We can also see that our profitability at Walmart is higher than we thought (26.8% instead of 18.7%) and United Parcel Service is actually lower than we thought (28.6% instead of 26.3%).  When you are making business decisions off your corporate data, it’s imperative that the data is accurate and complete.

Here is the source data behind this chart and you can see how the lack of standardization has led to the incorrect results.  I have highlighted the offending records for you:

Raw Customer Data Behind the Top 10 Customers Report

Once we apply data quality and standardize the names, the order changes and I have a new top 5!  Oftentimes our biggest customers, vendors, partners and products don’t get the credit they deserve for contributing to our success.  Once you’ve got data quality, you can know that you know the true numbers.
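Here is a small Python sketch of why both the ranking and the profitability figures move once duplicates are merged.  The rows, the `CANONICAL` mapping and the `rank` function are hypothetical and made up for illustration; they are not the figures from the report above or the contents of the Xcelsius model.

```python
from collections import defaultdict

# Hypothetical rows: (customer name as entered, revenue, profit).
ROWS = [
    ("General Electric", 900.0, 270.0),
    ("Walmart", 500.0, 90.0),
    ("Wal-Mart", 300.0, 100.0),
    ("Wal*Mart", 200.0, 80.0),
    ("PepsiCo", 700.0, 210.0),
]

# Hypothetical mapping from variant spellings to one standardized name.
CANONICAL = {"Wal-Mart": "Walmart", "Wal*Mart": "Walmart"}


def rank(rows, mapping=None):
    """Sum revenue and profit per customer, then sort by revenue, highest first."""
    mapping = mapping or {}
    totals = defaultdict(lambda: [0.0, 0.0])
    for name, revenue, profit in rows:
        key = mapping.get(name, name)
        totals[key][0] += revenue
        totals[key][1] += profit
    ranked = sorted(totals.items(), key=lambda item: item[1][0], reverse=True)
    # Margin is total profit over total revenue, not an average of row margins.
    return [(name, rev, round(100 * prof / rev, 1)) for name, (rev, prof) in ranked]


if __name__ == "__main__":
    print("Before standardization:")
    for row in rank(ROWS):
        print("  ", row)
    print("After standardization:")
    for row in rank(ROWS, CANONICAL):
        print("  ", row)
```

With these made-up rows, the three Walmart spellings each look like a mid-sized customer on their own.  Once they are merged, the combined revenue puts Walmart ahead of the others and its margin is recalculated from the combined profit and revenue, which is the same effect as in the charts above.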

I’ve introduced this topic under the name of Data Quality, but Data Quality really falls under the broader topic of Data Stewardship or Data Governance.

You Don’t Know What You Don’t Know

The bottom line around data quality is that you don’t know what you don’t know.  If you manage a data warehouse that accepts feeds from dozens of systems, then it’s highly likely that you have a data quality problem and don’t even know it.  Data quality is a critical aspect of data warehousing, because operational systems are notorious for bad data.  Last year, I read an excellent, practical guide to data quality called Data Quality Assessment.  The book itself does not endorse a specific software vendor, but all the principles found in the book would apply to any organization looking to improve its corporate data quality.

Downloads – See It Live

If you’d like to see an Xcelsius model of this chart live, I’ve made it available for download.  The source code for the .xlf is also available:
http://trustedbi.com/files/Importance of Data Quality.zip

Truth Is Stranger Than Fiction

Sometimes in life you run across situations that are hard to believe.  Here is an example where truth is stranger than fiction.  When you want to get someone’s attention when it comes to data quality, just tell them this example.  This data quality situation really happened and the results were disastrous.  This video is from Timo Elliott. When you click on it, it will take you to his website:

Data Quality Issues
Timo's Data Quality Presentation (2min)

Do you have any good stories to share?  I’d love to hear them.


3 replies on “Data Quality: You Don’t Know What You Don’t Know”

  1. The problem also comes from the poor design of the CRM.

    A CRM shouldn’t have free-text fields that allow, and even encourage, the user to type any value he wishes.

    Fields like address and company name should always be standardized and come from embedded drop-down lists.

    If such huge amounts of money are invested every year in fixing errors in data that come from typing mistakes, simply standardize your CRM tools or create the appropriate business rules for how to create a new customer, how to work with lists, etc.

    If the problems happen at the beginning, where rows are created for the first time, then the money should be spent there and not afterward, at least from my point of view.

    Thanks

    Yoav

    1. Yoav – Great feedback.

      You are right. I have often found that a lack of standardization among responses leads to this type of problem. Addresses are a unique problem, however, because unless you do third-party validation at the time of data entry, it’s likely that incorrect or incomplete address and contact information will be entered.

      I also found, for example, that a medical testing facility had not standardized side effects, so it was difficult to see how often a specific side effect had occurred. You had entries that said: headache, head ache, migraine, cranial pressure, head pain, aching of head, etc. You really need the best of both worlds – a pick list of the most common responses and an “other” category so that items aren’t forced into incorrect categories.

  2. I agree,

    This will remain a problem till the next CRM generation:
    One that can be standardized easily and dynamically, and that can work with external databases and live secured feeds and process sequences of data streams to ensure data is correct, just as my credit card is validated in real time when I purchase on the internet.

    Thanks

    Yoav

