May 27, 2021 | Data Heroes
Susan Walsh, Founder at The Classification Guru Ltd.
Make sure the data has its coat on!
"Make sure the data has its coat on!" Susan's idiom is all about data maintenance and highlights the importance of ongoing data quality management: making sure data is always consistent, organized, accurate and trustworthy. In this episode of the Data Heroes Podcast, Susan Walsh (Founder at The Classification Guru Ltd) gives tips on getting on top of your data issues through normalization, deduplication, segmentation and categorization.
"Ultimately, I want them to take back responsibility and ownership of their data and manage it." Susan talks about the challenges her clients face and how she trains teams to develop database health strategies so that data is always as actionable as possible. Susan discusses supplier normalization, taxonomy development, and her upcoming book 'Between the Spreadsheets: Classifying and Fixing Dirty Data'.
"If you've never classified data in your life, you could pick it up, read it and follow the instructions in it and classify data!"
The Value of Data Normalization
Dirty Data Challenges That Susan Solves.
"Well, the first thing is normalization of suppliers. I find that there are normally multiple versions of the same suppliers within a data set. So you've got a IBM, I.B.M, IBM Inc, IBM Limited, etc. There's just so many different versions. Because of this they don't even know how much they're spending with one supplier. You wouldn't know how much you were buying with one supplier. So how can you negotiate better rates and things like that? And secondly, visibility on spend. So what are you spending your money on? How much in IT? How much in HR? How much in professional services and facilities? They don't know. And so, within a matter of weeks I can give them that visibility. Which ultimately brings down cost savings and spots any fraudulent activity, any rogue spending, any spending off contract. Normally there's preferred suppliers that you have to buy certain items through and people just do the wrong thing. But if you don't have the information and the data, then it's very hard to control."
The Data Isn’t Right.
"My clients approach me for help because they're already in trouble. Maybe there's some project that has been delayed or can't happen because the data's not right. They're in a position where they're forced to have to pay for the service. Your team can absolutely do it, but they have a day job. So it could take you a year. I could do it in a few weeks. I'll do this day in, day out. So the accuracy and the consistency will be there. I say to my clients, ‘the cost of my fees, even if it's saved you 1% of your gross turnover, you would still more than make your money back’. You have to talk to them in a way that they can relate to. Talking to a senior decision maker about how they're spending hours fixing their spreadsheets isn’t going to relate, because they've never had to do that. If I say to them, ‘the analytics that you've been receiving every month with your monthly reporting have been wrong for the past 12 months, because your data is not right. And you've been making decisions on that.’ then they're probably going to listen a bit more."
Normalization for Data Maintenance.
"Make sure that the data has it's coat on. Like a jacket. Make sure it's consistent, make sure it's organized, make sure it's accurate and make sure it's trustworthy. And once that coat is on, keep it on by maintaining your data. It's not just going to stay that way forever. You have to look after it. Take back responsibility and ownership data and manage it!"
Customized Taxonomies for Precise Reporting.
"I've been finding recently that I've been building customized taxonomies for my clients so that they can report exactly the way they want to. And in the detail that they want to. I've done a lot of work with UNSPSC and that goes into a lot of detail, but quite often it's too detailed for the type of data that I'm working with. So it can be a challenge to make it fit sometimes. And the UNSPSC is great in areas of things like biology, laboratory, hardware, but marketing is not so good."
Check Your Data For Changing Job Titles.
"I think the problem is that job titles are changing so fast as well, like really quickly! And then of course you get these people who call themselves gurus, you know, what kind of job title is that? So there are a whole lot of people out there that have given themselves a job title that doesn't mean anything to anyone. So they're being excluded. And the other thing is that people are changing their jobs far more frequently than they ever used to. So you should be checking and updating your data every probably six months. Check your data, people!"
- Segmenting Customers for Digital Experiences. Companies wanting to expand their repertoire and generate better digital experiences for customers need to become more effective categorizers and segmenters. According to Susan, companies selling products online should have strict rules for categorizing wares to make them easily findable, and she argues the same applies to marketing. With the increase in digital marketing material, emphasis should be placed on segmenting customers effectively so that targeting and messaging reach the right people. "We're moving from the old postal ways to online and you need to segment your customers. So categorizing them in the right areas. You don't want to be sending an old age pensioner a school leavers information leaflet, or email or something like that. You need to get your categorizations right, it's the same for planning."
- Duplicate Data Disaster. Susan recalls her most recent data disaster, which was 100 times more complex than she had expected. She grossly underestimated the number of duplicate rows among the millions of inconsistent data points coming from a variety of sources. "There were about 2.3 million rows of customer information, but I managed to deduplicate that down to about 1.3 million rows. And so it was a real challenge to get the numbers to add up with currency conversions and get all that data classified in time and make sure it was right."
- Consolidation Of Merging Values. Checking data thoroughly during normalization pays off when it's time to dedupe. To decide whether two similarly named suppliers are really the same company, Susan starts with the name suffixes and then uses company descriptions and values to reach a decision. "You can have ABC cars and ABC cars, but one's in the UK and is a taxi company and one's in the US selling vehicles. So using descriptions and values can help get you to that decision."
- Dirty Data Challenges: Classifying The Information. Susan describes common customer data challenges she encounters at The Classification Guru. These mainly concern column headers that differ from one source file to the next, which makes it hard to recognize columns that hold the same data but sit in different files. Another issue she deals with is poor invoice descriptions, which make the data very difficult to classify. "And a lot of detail or the descriptions could be misleading, things like services or hardware, you know, is it cleaning services? Is it IT services or is it tool hardware? Or is it IT hardware? You know, they're not particularly useful. So they're the things that make it the hardest. And then foreign language countries are always a bit of a challenge as well."
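Two of the problems in the list above, differing column headers across files and duplicate rows, can be sketched in a few lines of Python. This is a rough illustration under stated assumptions rather than a description of Susan's process: the header aliases, field names and sample rows are all hypothetical, and the duplicate key (supplier name plus country) is just one simple way to keep the UK and US "ABC Cars" apart.

```python
# Hypothetical mapping from headers found in different source files to one canonical name.
HEADER_ALIASES = {
    "supplier": "supplier_name",
    "supplier name": "supplier_name",
    "vendor": "supplier_name",
    "country": "country",
    "ctry": "country",
    "description": "description",
    "desc": "description",
}


def harmonize(row):
    """Rename a row's keys to canonical headers; unknown headers are kept in lowercase."""
    out = {}
    for header, value in row.items():
        key = header.strip().lower()
        out[HEADER_ALIASES.get(key, key)] = value.strip()
    return out


def dedupe(rows):
    """Keep the first row seen for each (supplier name, country) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row.get("supplier_name", "").lower(), row.get("country", "").upper())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows from two files with different headers, purely for illustration.
file_a = [{"Supplier Name": "ABC Cars", "Country": "UK", "Description": "Taxi services"}]
file_b = [
    {"Vendor": "abc cars ", "Ctry": "uk", "Desc": "Taxi hire"},
    {"Vendor": "ABC Cars", "Ctry": "US", "Desc": "Vehicle sales"},
]

rows = [harmonize(r) for r in file_a + file_b]
unique = dedupe(rows)
print(len(rows), "->", len(unique))
```

Including the country in the key means the two UK taxi rows merge while the US vehicle seller stays separate. In practice the description column would still need a human check before merging, as Susan notes.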