Master data: the 15 most common data quality problems and their solutions

1. Incomplete data

This problem is the most common. When it occurs, key columns are missing information – because of faulty ETL jobs or gaps in the source data – and this negatively impacts further analytics.

Solution. Implement a reconciliation control framework. It checks the number of records arriving at each level of the analytics pipeline and sends an alert if any level receives fewer records than expected.
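As a toy illustration, here is a minimal Python sketch of such a check. The layer names and counts are made up; in practice the counts would come from the pipeline itself.

```python
# Minimal sketch of a record-count reconciliation check.
# Layer names and counts are illustrative placeholders.

def reconcile(counts_by_layer):
    """Compare record counts across pipeline layers and report drops."""
    alerts = []
    layers = list(counts_by_layer.items())
    for (prev_name, prev_count), (name, count) in zip(layers, layers[1:]):
        if count < prev_count:
            alerts.append(
                f"ALERT: {name} has {prev_count - count} fewer records than {prev_name}"
            )
    return alerts

# Example: counts collected at each level of the pipeline
print(reconcile({"raw": 10_000, "staging": 10_000, "mart": 9_950}))
# ['ALERT: mart has 50 fewer records than staging']
```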


2. Default values

Have you ever found a transaction dated 01/01/1891 while analyzing data? That is simply a default value at work – unless your client base really does include people over a hundred and thirty. The problem becomes especially acute when you don’t have enough documentation.

Solution. Perform data profiling and figure out why default values are being used. Engineers usually resort to this alternative if the actual data is unavailable.
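A simple profiling pass can surface these values. Below is a minimal Python sketch; the sample rows and the 25% threshold are assumptions to tune for your own data.

```python
# Minimal profiling sketch: surface suspiciously frequent values in a column.
# `rows` is a hypothetical extract; real profiling would run against the source.
from collections import Counter

rows = ["2023-05-14", "1891-01-01", "2023-05-15", "1891-01-01", "1891-01-01"]

counts = Counter(rows)
total = len(rows)
for value, n in counts.most_common():
    share = n / total
    if share > 0.25:  # arbitrary threshold: any single value this common is suspect
        print(f"{value!r} appears in {share:.0%} of records - possible default value")
```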

3. Inconsistent data formats

This problem mainly affects string-type columns, because they can store data in any format: a client’s first and last name in inconsistent letter case, or an e-mail address with broken formatting. It happens when information is stored across different systems with no shared formatting rules.

Solution. Make the data homogeneous, that is, standardize it in the source system – or at least in the pipeline, at the stage where the data is transferred to the lake or warehouse.
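As an illustration, here is a minimal Python sketch of this kind of normalization for names and e-mail addresses; the exact rules would depend on your source systems.

```python
# Minimal normalization sketch for names and e-mail addresses.
# In a real pipeline this would run at the ingestion stage.

def normalize_name(name):
    """Trim and collapse whitespace, apply consistent Title Case."""
    return " ".join(part.capitalize() for part in name.split())

def normalize_email(email):
    """Lower-case and trim; e-mail addresses are case-insensitive in practice."""
    return email.strip().lower()

print(normalize_name("  jOHN   sMITH "))         # 'John Smith'
print(normalize_email(" John.Smith@MAIL.com "))  # 'john.smith@mail.com'
```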


4. Duplicate data

Not hard to identify, but hard to eliminate. If dirty duplicate data is brought into the system along with a critical attribute, it will break all further processing and can lead to other quality problems.

Solution. Implement master data management tooling, at least at the level of uniqueness checks. Such a solution can find and remove exact duplicate records, and it can notify a data engineer or administrator about a duplicate so they can track down its cause.
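Here is a minimal Python sketch of an exact-duplicate check; the records and the choice of business key (e-mail plus name) are illustrative.

```python
# Sketch of an exact-duplicate check on a chosen business key.
# `records` and the key fields are illustrative.
from collections import defaultdict

records = [
    {"id": 1, "email": "a@x.com", "name": "Ann"},
    {"id": 2, "email": "b@x.com", "name": "Bob"},
    {"id": 3, "email": "a@x.com", "name": "Ann"},  # duplicate of id 1
]

seen = defaultdict(list)
for rec in records:
    seen[(rec["email"], rec["name"])].append(rec["id"])

for key, ids in seen.items():
    if len(ids) > 1:
        print(f"Duplicate records {ids} share key {key} - notify a data engineer")
```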

5. Inconsistency at the systems level

Often seen in large organizations that have gone through mergers and acquisitions. Different legacy systems have slightly different views of the world. Customer name, address or DOB may contain inconsistent or incorrect information.

Solution. As with the previous problem, implement master data management so that all the differing information resolves to a single record. You don’t need a perfect result; it is usually enough to set a fuzzy-matching threshold.
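As a rough illustration, here is a fuzzy-matching sketch using only Python’s standard library; the 0.85 similarity threshold is an assumption you would tune per data set.

```python
# Fuzzy-matching sketch using only the standard library (difflib).
# The 0.85 similarity threshold is an assumption to tune per data set.
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

system_a = "Jonathan Smith, 12 Main Street"
system_b = "Jonathon Smith, 12 Main St."

score = similarity(system_a, system_b)
if score >= 0.85:
    print(f"Likely the same customer (similarity {score:.2f}) - merge into one record")
```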

6. Drastic changes in data volume

Imagine that you receive a data file every day – customer addresses. It usually contains 5,000 records, but today it contains only 30. The file meets all other requirements for uniqueness, validity, and accuracy, and yet most of the data is missing.

Solution. As I suggested in point 1, a reconciliation framework can help solve this problem at the analytics level. If the data comes from another source, implement a control file that confirms how many records were sent, then automatically reconcile that count against the records actually received after the transfer.
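A minimal Python sketch of that reconciliation step might look like this; the counts and the zero-tolerance default are assumptions.

```python
# Sketch of reconciling a delivered file against its control file.
# The counts and the zero-tolerance default are assumptions.

def check_delivery(expected_count, received_count, tolerance=0.0):
    """Return True when the received volume matches what the sender declared."""
    allowed_drop = expected_count * tolerance
    if received_count < expected_count - allowed_drop:
        print(f"ALERT: expected {expected_count} records, got {received_count}")
        return False
    return True

# Control file said 5,000 records were sent; only 30 arrived
check_delivery(expected_count=5_000, received_count=30)
```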

7. Lost data

This is a special case of the inconsistency problem: data exists in one system but not in another. A customer exists in table A, but their account is missing from table B – an “orphaned customer”. Conversely, an account in table B with no associated customer is an “orphaned account”.

Solution. A data quality rule that checks for consistency whenever data comes in from tables A and B can help identify this problem. To fix the situation, the cause of the inconsistency must be identified in the source system.
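Here is a minimal Python sketch of such an orphan check, with the two tables modelled as lists of dicts for illustration.

```python
# Sketch of an orphan check between two tables, modelled as lists of dicts.
customers = [{"customer_id": 1}, {"customer_id": 2}]            # table A
accounts = [{"account_id": 10, "customer_id": 2},
            {"account_id": 11, "customer_id": 99}]              # table B

customer_ids = {c["customer_id"] for c in customers}
account_owner_ids = {a["customer_id"] for a in accounts}

orphaned_customers = customer_ids - account_owner_ids  # in A, no account in B
orphaned_accounts = [a["account_id"] for a in accounts
                     if a["customer_id"] not in customer_ids]

print("Orphaned customers:", orphaned_customers)  # {1}
print("Orphaned accounts:", orphaned_accounts)    # [11]
```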

8. Broken chronology of stored data

Correct chronology is critical to any data warehouse implementation. Data arrives in chronological order, and history is kept using Type 2 slowly changing dimensions (SCD). But if the wrong rows are opened and closed, the picture of the last valid record is distorted – which in turn breaks the chronology of the data and every downstream process.

Solution. Make sure that the column with the correct date is used to determine the chronology.
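For illustration, here is a minimal Python sketch of a chronology check over SCD Type 2 rows for one business key. The column names valid_from/valid_to and the 9999-12-31 “open row” convention are assumptions.

```python
# Sketch of a chronology check over SCD Type 2 rows for one business key.
# valid_from/valid_to column names and the 9999-12-31 marker are assumptions.
from datetime import date

OPEN_END = date(9999, 12, 31)  # marks the currently valid (open) row

history = [
    {"valid_from": date(2021, 1, 1), "valid_to": date(2022, 6, 30)},
    {"valid_from": date(2022, 7, 1), "valid_to": OPEN_END},
]

def check_chronology(rows):
    rows = sorted(rows, key=lambda r: r["valid_from"])
    open_rows = sum(r["valid_to"] == OPEN_END for r in rows)
    if open_rows != 1:
        print(f"ALERT: {open_rows} open rows, expected exactly 1")
    for prev, cur in zip(rows, rows[1:]):
        if prev["valid_to"] >= cur["valid_from"]:
            print("ALERT: overlapping validity periods:", prev, cur)

check_chronology(history)  # silent: this history is consistent
```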

9. Irrelevant data

Nothing is more frustrating than collecting ALL available information – it’s expensive and not environmentally friendly.

Solution. Agree on the principles of data collection: each attribute must have an end goal, otherwise that attribute doesn’t need to be captured.

10. Unclear data definitions

Talk to Sam in finance and Jess in customer service. They interpret the same data point differently – sound familiar? Clarity is a quality dimension that doesn’t get talked about much, because in today’s data stack it is filed under business glossaries and data catalogs. In essence, though, it is a quality problem.

Solution. Match data definitions whenever a new metric or data point is created.

11. Redundant data

Different teams in a company collect the same data multiple times. If a company is active both online and offline, in brick-and-mortar locations, repeatedly collecting the same information means the data ends up scattered across different systems. All of this leads to redundancy, which hurts not only the company’s profitability but also the quality of its customer service.

Solution. Use a single master system where all company representatives get their data. Again, master data management will help as well.

12. Old and outdated data

Beyond a certain retention period, stored data adds no value to the stack: it drives up costs, confuses the data engineer, undermines analytics, and becomes irrelevant (see point 9).

Solution. Apply the storage-limitation principle of the General Data Protection Regulation (GDPR): keep data no longer than necessary.
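As a toy illustration, here is a Python sketch of a retention filter; the seven-year window is an arbitrary example, not a legal recommendation.

```python
# Retention sketch: keep only records younger than an agreed retention window.
# The seven-year window is an illustrative policy, not a legal recommendation.
from datetime import date, timedelta

RETENTION = timedelta(days=7 * 365)

records = [
    {"id": 1, "created": date(2010, 3, 1)},
    {"id": 2, "created": date(2023, 3, 1)},
]

cutoff = date.today() - RETENTION
kept = [r for r in records if r["created"] >= cutoff]
purged = [r["id"] for r in records if r["created"] < cutoff]
print(f"Keeping {len(kept)} record(s), purging ids {purged}")
```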

13. Inconsistent keys

Imagine that you are creating a new data warehouse with primary and surrogate keys for the Core Data model. The repository is growing, with new data coming in every day, including seasonal peaks. And at some point, you realize that the natural keys are not unique. This discovery undermines the model design, compromising referential integrity.

Solution. Perform comprehensive data profiling, including seasonal data, to ensure that the natural key on which the surrogate key depends is always unique.
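A uniqueness check on the candidate natural key is easy to sketch in Python; the rows and the order_no key below are illustrative.

```python
# Sketch of a uniqueness check on a candidate natural key.
# `rows` and the order_no key column are illustrative.
from collections import Counter

rows = [
    {"order_no": "A-100", "season": "summer"},
    {"order_no": "A-101", "season": "summer"},
    {"order_no": "A-100", "season": "winter"},  # natural key collides across seasons
]

counts = Counter(r["order_no"] for r in rows)
collisions = {key: n for key, n in counts.items() if n > 1}
if collisions:
    print(f"Natural key is NOT unique: {collisions} - "
          "unsafe to hang a surrogate key on it")
```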

14. Lack of access to the data

People who make decisions must have access to the data they need to do so. There is no benefit to data locked in a repository without integration and access by data analysts, administrators and data scientists.

Solution. Implement an operational model, including data access permissions for certain teams.

15. Late data

Data needs to arrive when it’s needed to make critical decisions. If marketing campaigns run every week, you need the data by a certain day of the week to launch each campaign. If it arrives late, the campaign is likely to produce weak results.

Solution. Agree on a delivery timeline with the engineering team. It also makes sense to go back to basics and make sure the source systems can actually meet those SLAs.
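For illustration, here is a minimal Python sketch of a freshness check against such an SLA; the timestamps and the 24-hour window are assumptions.

```python
# Sketch of a data-freshness check against an agreed SLA.
# The SLA window and the timestamps are assumptions.
from datetime import datetime, timedelta

SLA = timedelta(hours=24)  # data must be no older than 24h at campaign launch

last_arrival = datetime(2024, 5, 6, 2, 0)   # when the latest data actually landed
launch_time = datetime(2024, 5, 7, 9, 0)    # when the campaign goes out

age = launch_time - last_arrival
if age > SLA:
    print(f"ALERT: data is {age} old, exceeds the {SLA} SLA - campaign at risk")
```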
