Data Quality Paradox
It is perhaps the greatest paradox of data quality that the primary cause of duplication of entries in any database is the mistaken use of unique keys as the unique identifiers of records.
The problem starts in the data modelling stage of analysis when data analysts use codes as the unique identifiers (UIDs) of data entities.
Codes Are Not Identifiers
Data Modeling 101: The unique identifier of a data entity is NEVER A CODE!!
This may cause many data analysts to have apoplexy, as it is a practice that they have followed most of their lives. But simple logic can show it to be incorrect.
For something to be a unique identifier it must first of all qualify as an identifier. Codes do not identify! Don’t believe me?
A Simple Test
If I take you to the middle of a warehouse with 500 unlabelled parts laid out on the floor and tell you to go and pick up part 12562 you would be unable to proceed. Why? Because 12562 does not in any way identify any of the parts. If I say “pick up a motor” then you will be able to make a start. You might need to ask “an electric motor or a hydraulic motor?” but you will be able to identify a motor among the parts.
We see from this that codes are not identifiers. If they are not identifiers then they cannot be unique identifiers.
What is a UID?
Unique identifiers enable us to answer the question; “What is it, with respect to this business, that makes one occurrence of a data entity uniquely different from every other occurrence of that data entity?” If we know this then we will have the unique identifier for that data entity.
So lets ask the question with regard to a finance company that loans money for the purchase of motor vehicles. ”With regard to this business, what is it that makes one customer uniquely different from another customer?” For the purposes of this example we will stick to customers who are individuals, as opposed to corporate customers.
Straight away, some data analysts will shout out “it’s the customer number!” Wrong!
Example UID Problem
To demonstrate this I could take you to many finance companies and show you the customer data files with records that would look very similar to the following.
|Cust No||First Name||Surname||Street||Town||D.O.B|
|1001||John||Smith||123 High St||Bristol||22 Dec 1950|
|1002||John||Smith||123 High St||Bristol||22 Dec 1950|
|1003||John||Smith||123 High St||Bristol||20 Aug 1771|
|1004||Mary||Jones||27 West St||Walsall||12 May 1975|
|1005||Mary||Green||27 West St||Walsall||12 May 1975|
|1006||Mary||Jones||22 Grove St||London||12 May 1975|
Every customer has a different Customer Number, but do these six records represent six distinct customers?
It is highly likely that customer No 1001 and 1002 are the same person. But what about 1003? The same person again with an error in the date of birth or a younger person of the same name at that address; the son of the older John Smith perhaps?
Looking at Customer Numbers 1004 and 1005, we ask is Mary Green the same person as Mary Jones – now married to a Mr Green perhaps? Or is she an entirely different person who just happens to have the same date of birth?
What about Customer 1006? Is this the same Mary Jones who has moved house? If she had a bad debt record at 27 West St would you want to do business with her at 22 Grove St?
Data Uniqueness is a Business Decision
This set of records forces us to repeat the question; “What is it, in the context of this business, that makes one customer uniquely different from every other customer?”
One thing that is for certain is that this will NOT be Customer Number!
Uniqueness must be defined by the business. It is the job of data analysts to help the business arrive at this definition.
It is the job of database designers to implement this uniqueness in a manner that prevents duplicate records being created in the database.
If the business has not defined uniqueness from a business point of view and the data analysts have not modelled this, then it is impossible for database designers to implement uniqueness.
Unique Keys are Not UIDs
The use of the word “unique” in the term “unique key” and in “unique identifier” at some point in the past got completely misinterpreted by data analysts and database designers, in that they thought that unique keys were a mechanism for implementing unique identity in a database.
Because of this, what they did was to introduce to every database a mechanism that would, above all else, allow duplication to occur.
Amazingly, this error is still practiced worldwide on a daily basis, and not just by those analysts and designers who made the mistake many years ago but also, sadly, by those coming new to the data analysis and database design.
Stop it Now!
Removing Duplicate Records
How do you remove all of these duplicate records from your database? Buy a clever piece of software that will do it for you? You could, but it would simply be changing the problem not solving it. You would either end up with a database with no duplicates removed or with a whole lot of records merged that should never have been merged. Try to sort that one out!
The fact is that, if the business has not defined what it sees as the elements that make a data entity unique, then no piece of software can.
A full description of how to model UIDs for data entities is contained in the eBook:
available from the Online Store.
The data shown is only a short, simplified and denormalised example, but is typical of data held by various finance companies for whom I have done consultancy.
I am not making an argument against unique primary keys in tables, as these are essential, but against confusing unique keys with unique identifiers.