Data Scientists are failing due to poor Data Management #19


You can’t tell a great story without defining your characters. So why do so many organisations think they can derive value from their data science initiatives without investing in data management?


Anaconda’s report, The State of Data Science 2020, confirms what is already widely known: data scientists spend almost half of their time not solving business problems but cleansing and loading data.


This surely raises the question: why not invest at least half of what you spend on data science in data management?

In this week’s blog, I’ll share three key data management investments that will help make your data science initiatives successful:

#1 Improving data quality


#2 Defining a master data strategy


#3 Implementing data governance


Let’s dive into each of these:

#1 Improving data quality (the basics of data management)

Ideally, I’d quote a statistic that tells you just how many billions of dollars are being lost to poor-quality data.


But I don’t think that’s necessary, as the importance of having good data is well understood across the data community.


The report above describes how much time is lost to preparing, cleansing and organising data.


But what if data quality were built into the fabric of your big data platform? Ongoing maintenance and monitoring of data quality would hugely benefit every initiative that uses the data downstream.


So, what can you do about it?

  • Set up a simple data reconciliation framework. The number of times I’ve heard clients say, “the data has somehow gone missing through the loading cycles”, is hard to believe. A reconciliation framework ensures the records you loaded at the start are exactly the records you receive at the end. This is key before you start transforming the data.


  • Completeness and accuracy checks. This isn’t rocket science. A simple completeness check can quickly tell you where data is missing, so you can fix it. An accuracy check can tell you where data is present but looks implausible.


  • Uniqueness checks. Again, keep it simple: check for duplicates, understand where they are coming from, and fix them at source or in your platform. Don’t let them filter through your systems and affect the work at the other end.
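To make the three checks above a little more concrete, here is a minimal Python sketch using only the standard library. It is an illustration under assumptions, not a prescribed framework: the function names, the field names (customer_id, email, age) and the sample rows are all hypothetical, and the plausibility rule for age is just an example threshold.

```python
from collections import Counter


def reconcile(source_rows, target_rows, key):
    """Compare record keys at the start and the end of a load."""
    source_keys = {row[key] for row in source_rows}
    target_keys = {row[key] for row in target_rows}
    return {
        "source_count": len(source_rows),
        "target_count": len(target_rows),
        "missing_in_target": sorted(source_keys - target_keys),
        "unexpected_in_target": sorted(target_keys - source_keys),
    }


def completeness(rows, required_fields):
    """Count the rows in which each required field is missing or empty."""
    return {
        field: sum(1 for row in rows if row.get(field) in (None, ""))
        for field in required_fields
    }


def implausible(rows, field, is_plausible):
    """Return rows whose populated value fails a plausibility rule."""
    return [
        row for row in rows
        if row.get(field) not in (None, "") and not is_plausible(row[field])
    ]


def duplicates(rows, key):
    """Return key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, count in counts.items() if count > 1)


# Hypothetical sample data: what was sent versus what arrived after loading.
source = [
    {"customer_id": 1, "email": "a@example.com", "age": 34},
    {"customer_id": 2, "email": "b@example.com", "age": 51},
    {"customer_id": 3, "email": "c@example.com", "age": 28},
]
target = [
    {"customer_id": 1, "email": "a@example.com", "age": 34},
    {"customer_id": 2, "email": "", "age": 510},
    {"customer_id": 2, "email": "", "age": 510},
]

recon = reconcile(source, target, "customer_id")
gaps = completeness(target, ["customer_id", "email"])
suspect = implausible(target, "age", lambda age: 0 <= age <= 120)  # assumed rule
dupes = duplicates(target, "customer_id")

print(recon)
print(gaps)
print(len(suspect), "implausible rows")
print(dupes)
```

Note that the row counts match in this example (three in, three out), yet customer 3 has gone missing and customer 2 has been duplicated. That is why the sketch reconciles on keys rather than on counts alone.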


#2 Defining a master data strategy

So, each of your data science initiatives is organising, loading, and using its own data in a complete silo.


To avoid this horror, define a master data strategy.


In simple terms, depending on the business outcome, your scientists should be able to choose from the relevant mastered entities. For example, customer-related models should use the customer master entity.


The basics of your strategy should be able to answer the following questions:


  • What are the key domains you need to master?


  • Which key systems hold this data, and how will the records be matched and merged?


  • Which data will be enriched and/or standardised?


  • How will the data be audited and versioned?


  • How will the data be governed?
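As an illustration of the “match and merge” question above, here is a rough Python sketch of building a customer master record from two systems. Everything in it is an assumption made for the example: the system names (crm, billing), the choice of normalised email as the match key, and the survivorship rule of keeping the most recently updated non-empty value. A real strategy would choose its own matching and survivorship rules per domain.

```python
def normalise_email(email):
    """Standardise the match key so trivial differences do not block a match."""
    return email.strip().lower()


def match_and_merge(*systems):
    """Group records by normalised email, keeping the newest non-empty value per field."""
    golden = {}
    for system_name, records in systems:
        for record in records:
            match_key = normalise_email(record["email"])
            entry = golden.setdefault(
                match_key, {"email": match_key, "sources": [], "_updated": {}}
            )
            # Keep simple lineage: which systems contributed to this master record.
            if system_name not in entry["sources"]:
                entry["sources"].append(system_name)
            for field, value in record.items():
                if field in ("email", "updated") or value in (None, ""):
                    continue
                # ISO dates compare correctly as strings.
                if record["updated"] >= entry["_updated"].get(field, ""):
                    entry[field] = value
                    entry["_updated"][field] = record["updated"]
    for entry in golden.values():
        del entry["_updated"]  # internal bookkeeping only
    return list(golden.values())


# Hypothetical extracts from two source systems.
crm = [
    {"email": "Jane.Doe@Example.com ", "name": "Jane Doe", "phone": "",
     "updated": "2020-03-01"},
]
billing = [
    {"email": "jane.doe@example.com", "name": "J. Doe", "phone": "0123 456",
     "updated": "2020-01-15"},
    {"email": "sam@example.com", "name": "Sam Lee", "phone": "0987 654",
     "updated": "2020-02-01"},
]

masters = match_and_merge(("crm", crm), ("billing", billing))
for master in masters:
    print(master)
```

In this sketch Jane’s two records collapse into one mastered customer: the name comes from the more recent CRM record, the phone number is filled in from billing, and the sources list records where the data came from.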

#3 Implementing data governance

Your master data strategy will not succeed without governing the data. This is, of course, not the “maintenance/back-up” of data; that is for IT teams to deal with.


Governance is often confused with added bureaucracy and red tape. However, additional scrutiny is required to ensure the usage of the data is in line with regulations and corporate ethics.


It also gives the scientists the right people to rely on: those who know the data and can interpret it in support of the business outcome.


We are, of course, talking about setting up policies, procedures and a framework that define the following:


  • Accountable and responsible individuals for specific data domains: your data owners and stewards


  • Their key responsibilities, such as approving data usage, defining common business terms and capturing data lineage


  • The different data assets (groupings of data) and how they are classified into sensitive and non-sensitive data


  • Controls on data against regulatory requirements such as the GDPR
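One lightweight way to picture such a framework is a small register of data assets that records the owner, the steward and the classification, plus a rule for approving usage. The Python sketch below is purely illustrative: the DataAsset class, its fields, the example people and purposes, and the policy that non-sensitive data is open while sensitive data needs an owner-approved purpose are all assumptions, not a regulatory recipe.

```python
from dataclasses import dataclass, field


@dataclass
class DataAsset:
    name: str
    domain: str
    owner: str      # accountable individual (data owner)
    steward: str    # responsible individual (data steward)
    sensitive: bool
    approved_purposes: set = field(default_factory=set)


def request_access(asset, purpose):
    """Assumed policy: open access to non-sensitive data, approved purposes only otherwise."""
    if not asset.sensitive or purpose in asset.approved_purposes:
        return "granted"
    return f"refer to data owner: {asset.owner}"


# Hypothetical register entries.
customer_contacts = DataAsset(
    name="customer_contacts",
    domain="customer",
    owner="Head of Sales",
    steward="CRM Data Steward",
    sensitive=True,
    approved_purposes={"churn_model"},
)
product_catalogue = DataAsset(
    name="product_catalogue",
    domain="product",
    owner="Head of Product",
    steward="Catalogue Data Steward",
    sensitive=False,
)

print(request_access(customer_contacts, "churn_model"))
print(request_access(customer_contacts, "marketing_experiment"))
print(request_access(product_catalogue, "pricing_model"))
```

The point of the sketch is that a data scientist asking for sensitive data gets either an answer or the name of the person who can give one, which is the opposite of red tape.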


Improving the quality of your data, mastering it, and governing it are the three basic areas where you can invest and find a high return on investment (ROI) for your data science initiatives.


The goal is to ensure data scientists spend more time productionising models with actionable business outcomes and less time doing data management.


Do you agree with what I’ve said above? What are your thoughts? Feel free to reach out to me via my email [email protected] if you have some feedback or if you just want to say hello!


If you’re still reading this, I hope you’ve found some value in this blog post.


If you’d like to be kept informed of more content like this, subscribe to my newsletter.


Also, check out my other blog on: Why is everyone obsessed with data science?

Hanzala Qureshi


I’m a digital consultant at a leading consultancy firm. I mostly spend my life working on complex data projects. On this website I document my journey in consulting and thoughts on data & emerging technologies.
