This week we're talking about the Five Cs of Data.
What does the word "data" mean? There are a lot of descriptions and definitions associated with the word. One definition of data is information that has not been processed or structured and that may be used as a basis for reasoning, discussion, calculations or analysis. One may argue whether data has to be factual, but the processing of data is what often determines what is factual.
Data for business can come from many sources and be stored in a variety of ways. However, there are five characteristics of data that will apply across all of your data: clean, consistent, conformed, current, and comprehensive.
The Five Cs of Data
The five Cs of data apply to all forms of data, big or small. Your data processing should include checks for all of the following attributes.
Clean data means data that has no missing values, no inaccurate values, no out-of-range values, no typos, and so on. Having 100% clean data can be difficult to achieve, and there are times when things like missing values can be accounted for in your analysis process instead. Cleaning data is also subject to the law of diminishing returns: each additional hour you spend cleaning your data may pay off less than the hour before. At some point you may have to proceed with data that is "clean enough" for your purposes. An emphasis on acquiring clean data in the first place can eliminate a lot of the cleaning and post-processing that can bog down your data analysis processes.
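To make the idea of checking for clean data a little more concrete, here is a minimal Python sketch that flags missing and out-of-range values. The function name `find_problems`, the field names, and the age range of 0 to 120 are all made up for illustration; your own rules will depend on your data.

```python
def find_problems(rows, required, ranges):
    """Return (row_index, field, problem) tuples for missing or out-of-range values."""
    problems = []
    for i, row in enumerate(rows):
        # A required field counts as missing if it is absent, None, or empty.
        for field in required:
            value = row.get(field)
            if value is None or value == "":
                problems.append((i, field, "missing"))
        # Range checks only apply to values that are actually present.
        for field, (low, high) in ranges.items():
            value = row.get(field)
            if value is None or value == "":
                continue
            if not (low <= value <= high):
                problems.append((i, field, "out of range"))
    return problems


# Hypothetical sample records, invented for this example.
rows = [
    {"customer": "Ada", "age": 36},
    {"customer": "", "age": 41},
    {"customer": "Grace", "age": 212},
]
problems = find_problems(rows, required=["customer", "age"], ranges={"age": (0, 120)})
for problem in problems:
    print(problem)
```

A report like this also helps you decide when the data is "clean enough": you can see how many problems remain and of what kind before deciding whether another hour of cleaning is worth it.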
Consistent data means that the data is the same no matter where it appears in your organization and the definition(s) associated with your data are consistent across your organization. For example, what does "recent sales" mean? It should mean the same thing for data ingestion, data analysis, data reporting, etc. In addition, consistent data supports the idea of "one (or single) version of the truth". Context is crucial for determining what is consistent. For example, when talking about credit scores, are you referring to a FICO score? A score from Equifax? Experian? TransUnion? An internal score? Finally, consistent data refers to how data is represented. Are names a combination of first, last and middle? Do you use a middle initial? Is the middle name or initial optional? Consistency is strongly coupled with your data management and data governance policies.
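One way to keep a definition like "recent sales" consistent is to write it down once, in code, and have ingestion, analysis, and reporting all use that single definition. The sketch below assumes "recent" means within the last 30 days; that window, and the names `RECENT_SALES_WINDOW_DAYS` and `is_recent_sale`, are hypothetical and not a standard.

```python
from datetime import date, timedelta

# Assumption for illustration: "recent" means within the last 30 days.
RECENT_SALES_WINDOW_DAYS = 30


def is_recent_sale(sale_date, today):
    """Return True if sale_date is on or before today and inside the window."""
    age = today - sale_date
    return timedelta(0) <= age <= timedelta(days=RECENT_SALES_WINDOW_DAYS)


today = date(2024, 3, 15)
print(is_recent_sale(date(2024, 3, 1), today))
print(is_recent_sale(date(2024, 1, 1), today))
```

If the business later decides that "recent" should mean 90 days, the change happens in one place and every team picks it up.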
Conformed data is data that fits within established boundaries. For example, if you have data on students, does it apply to all students, full-time equivalents, daily attendance, or something else? If you have a measurement in ounces, does it mean fluid ounces (volume), avoirdupois ounces (weight), or troy ounces (precious metals)? Even something that appears to be straightforward and universal, like days in a year, will be different when measured on Earth versus Mars. Some other examples that have caused major problems in business and science are imperial versus metric measurements and Fahrenheit versus Celsius temperature measurements.
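A simple way to guard against the Fahrenheit versus Celsius problem is to store the unit alongside every value and convert to one agreed unit on the way in. Here is a minimal sketch; the function name `to_celsius` and the unit codes "C" and "F" are assumptions made for this example.

```python
def to_celsius(value, unit):
    """Convert a temperature reading to Celsius, given an explicit unit code."""
    unit = unit.strip().upper()
    if unit == "C":
        return float(value)
    if unit == "F":
        return (value - 32) * 5 / 9
    # Refuse to guess: an unknown unit is a conformance problem, not a number.
    raise ValueError(f"unknown temperature unit: {unit!r}")


# Hypothetical readings, each tagged with the unit it was recorded in.
readings = [(212, "F"), (100, "C"), (32, "f")]
print([to_celsius(value, unit) for value, unit in readings])
```

The important design choice is the error at the end: a reading with no recognizable unit is rejected instead of being silently treated as one or the other.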
When it comes to data, current can be relative. Stock market data, for example, needs to be real-time if you are using it to trade stocks. On the other hand, if you are doing historical price analysis, then you want data that goes back for months or years. In addition, the closer to real-time you want your data, the larger the volume of data will be. Going back to the stock market example, there are thousands and perhaps millions of trades for stocks over the course of a trading session. But there is only one final closing price.
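The difference in volume between real-time and end-of-day data can be sketched in a few lines of Python. The trades below are invented, and the sketch simply treats the last trade of the session as the close, which is a simplification; exchanges have their own rules for setting an official closing price.

```python
def closing_price(trades):
    """Return the price of the latest trade in a session.

    Each trade is a (time, price) pair with time as a zero-padded
    "HH:MM:SS" string, so the strings sort in chronological order.
    """
    last_time, last_price = max(trades, key=lambda trade: trade[0])
    return last_price


# Hypothetical trades for one stock in one session.
trades = [
    ("09:30:00", 101.2),
    ("15:59:59", 102.5),
    ("12:15:42", 102.8),
]
print(len(trades), "trades reduce to one closing price:", closing_price(trades))
```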
Comprehensive data is data that spans all the required dimensions of your business case(s). This is the breadth of your data. If you need customer names, then first names and last names are a minimum, and middle initials or middle names are usually standard as well. Then there are the associated values: suffixes (Jr., Sr., III, etc.) and honorifics. Do you need to account for two middle names? The data also has to be of sufficient depth. If you want to analyze repeat customers, then you need to go back at least a year to find patterns in purchasing. In addition, once your data has been collected, you should review it to make sure you are not missing categories of data: columns, years, regions, etc.
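Reviewing collected data for missing categories can be as simple as comparing what you expected to see against what actually showed up. The following is a minimal sketch; the function name `missing_categories` and the sample sales records are hypothetical.

```python
def missing_categories(rows, field, expected):
    """Return the expected values of `field` that never appear in `rows`."""
    seen = {row[field] for row in rows if field in row}
    return sorted(set(expected) - seen)


# Hypothetical sales records, invented for this example.
sales = [
    {"year": 2021, "region": "East"},
    {"year": 2021, "region": "West"},
    {"year": 2023, "region": "East"},
]
print(missing_categories(sales, "year", [2021, 2022, 2023]))
print(missing_categories(sales, "region", ["East", "West", "North"]))
```

The same check works for any category you care about, as long as you can list the values you expect up front.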
Data should be well-managed. A data governance team should be coupled with a data management team. The two are related, but not the same. Data management is the day-to-day data operations of your business. Data governance is the standards, policies and procedures that your business teams should follow when acquiring, storing, using, and transferring data.
You may also be interested in The Five Vs of Big Data
Until next time, thanks for Talking Technology with me!