Data Quality Paradox
It is perhaps the greatest paradox of data quality that the primary cause of duplication of entries in any database is the mistaken use of unique keys as the unique identifiers of records.
The problem starts in the data modelling stage of analysis when data analysts use codes as the unique identifiers (UIDs) of data entities.
Codes Are Not Identifiers
Data Modeling 101: The unique identifier of a data entity is NEVER A CODE!!
This may cause many data analysts to have apoplexy, as it is a practice that they have followed most of their lives. But simple logic can show it to be incorrect.
For something to be a unique identifier it must first of all qualify as an identifier. Codes do not identify! Don’t believe me?
A Simple Test
If I take you to the middle of a warehouse with 500 unlabelled parts laid out on the floor and tell you to go and pick up part 12562 you would be unable to proceed. Why? Because 12562 does not in any way identify any of the parts. If I say “pick up a motor” then you will be able to make a start. You might need to ask “an electric motor or a hydraulic motor?” but you will be able to identify a motor among the parts.
We see from this that codes are not identifiers. If they are not identifiers then they cannot be unique identifiers.
What is a UID?
Unique identifiers enable us to answer the question; “What is it, with respect to this business, that makes one occurrence of a data entity uniquely different from every other occurrence of that data entity?” If we know this then we will have the unique identifier for that data entity.
So lets ask the question with regard to a finance company that loans money for the purchase of motor vehicles. ”With regard to this business, what is it that makes one customer uniquely different from another customer?” For the purposes of this example we will stick to customers who are individuals, as opposed to corporate customers.
Straight away, some data analysts will shout out “it’s the customer number!” Wrong!
Example UID Problem
To demonstrate this I could take you to many finance companies and show you the customer data files with records that would look very similar to the following.
| Cust No | First Name | Surname | Street | Town | D.O.B |
| 1001 | John | Smith | 123 High St | Bristol | 22 Dec 1950 |
| 1002 | John | Smith | 123 High St | Bristol | 22 Dec 1950 |
| 1003 | John | Smith | 123 High St | Bristol | 20 Aug 1771 |
| 1004 | Mary | Jones | 27 West St | Walsall | 12 May 1975 |
| 1005 | Mary | Green | 27 West St | Walsall | 12 May 1975 |
| 1006 | Mary | Jones | 22 Grove St | London | 12 May 1975 |
Every customer has a different Customer Number, but do these six records represent six distinct customers?
It is highly likely that customer No 1001 and 1002 are the same person. But what about 1003? The same person again with an error in the date of birth or a younger person of the same name at that address; the son of the older John Smith perhaps?
Looking at Customer Numbers 1004 and 1005, we ask is Mary Green the same person as Mary Jones – now married to a Mr Green perhaps? Or is she an entirely different person who just happens to have the same date of birth?
What about Customer 1006? Is this the same Mary Jones who has moved house? If she had a bad debt record at 27 West St would you want to do business with her at 22 Grove St?
Data Uniqueness is a Business Decision
This set of records forces us to repeat the question; “What is it, in the context of this business, that makes one customer uniquely different from every other customer?”
One thing that is for certain is that this will NOT be Customer Number!
Uniqueness must be defined by the business. It is the job of data analysts to help the business arrive at this definition.
It is the job of database designers to implement this uniqueness in a manner that prevents duplicate records being created in the database.
If the business has not defined uniqueness from a business point of view and the data analysts have not modelled this, then it is impossible for database designers to implement uniqueness.
Unique Keys are Not UIDs
The use of the word “unique” in the term “unique key” and in “unique identifier” at some point in the past got completely misinterpreted by data analysts and database designers, in that they thought that unique keys were a mechanism for implementing unique identity in a database.
Because of this, what they did was to introduce to every database a mechanism that would, above all else, allow duplication to occur.
Amazingly, this error is still practiced worldwide on a daily basis, and not just by those analysts and designers who made the mistake many years ago but also, sadly, by those coming new to the data analysis and database design.
Stop it Now!
Removing Duplicate Records
How do you remove all of these duplicate records from your database? Buy a clever piece of software that will do it for you? You could, but it would simply be changing the problem not solving it. You would either end up with a database with no duplicates removed or with a whole lot of records merged that should never have been merged. Try to sort that one out!
The fact is that, if the business has not defined what it sees as the elements that make a data entity unique, then no piece of software can.
A full description of how to model UIDs for data entities is contained in the eBook:
available from the Online Store.
For a quick introduction to the fundamentals pick up the Rapid Guide to Data Structure Modelling also available from the Online Store
Notes
The data shown is only a short, simplified and denormalised example, but is typical of data held by various finance companies for whom I have done consultancy.
I am not making an argument against unique primary keys in tables, as these are essential, but against confusing unique keys with unique identifiers.
Multi-Dimensional MEM (MDM) Online Course
Online Mentoring
Information Flow Modelling eBook
Business Process Modelling eBook
Data Structure Modelling eBook

It seems that we are talking about two different things. The UID in the example could be used inside of a database; however, to refer to a customer by this UID outside of this DB context would not be helpful.
Hi Dave
Thanks for your comment.
The main points of the article are to show:
o That the Unique Identifier (UID) of a Party will NEVER be a code!
o That it is the use of unique keys in a database that enable duplication.
o That the Unique Identifier of a Party (whether they are playing the role of Customer or Supplier) must be known and defined at the BUSINESS level – not at the database level.
It is at the database level that the business defined UIDs must be implemented in a manner that ensures that no duplication can occur and this will never be achieved by using a code!
Regards
John
Fantastic article! So true!
Hi Sanja,
Thank you for your kind remarks.
John
Good article, John
I would take this one step further by stating that using nonsensical numbers as identifiers is a usability issue. That is, if the actual users of such identifiers were consulted in the database design process, the users could indicate the best identifiers for ease-of-use. The best identifiers would also likely be “quality” identifiers in that their ease-of-understanding would be in line with actual users of the data.
Just brainstorming.
Thanks, Scott,
The identifiers you refer to are actually what I call QUACKs (Quick Unique Alternative Code or Keys) in the Integrated Modelling Method.
These can be a very useful business mechanism for easily referring to data and business entities but are NOT unique identifiers (UIDs) and are only useful when implemented in parallel with UIDs.
John,
For large scale data warehouses we have worked around this problem by maintaining a ‘checksum key’ In our case we have a standard implementation of a combination of MD5 and MIME64 (Link above). This allows us to generate an anonymous code that is unique but still relates to the data. It also significantly speeds up loading and validation on load routines
rgds
davidw
Thanks David
It would definitely make sure that the data in each customer record was unique but not that each related to a unique customer, unless the business had previously enforced this by defining (and implementing) the unique identifier for customer.
John