Reduce, reuse, recycle – Principles for Data Governance and Data Cataloguing in the age of Data Mesh and AI
Introduction
Since the release of ChatGPT and other AI tools, companies have been reaching out to consultants and experts to help them implement AI models and train them on their data. Everyone wants to enhance their business by incorporating AI into their daily workflows, because AI has shown the potential to accelerate and automate many processes and tasks that currently take workers hours to complete, and even to enable tasks that were previously not possible.
For an AI model to be trainable on your business data and to give accurate answers and recommendations about your organization, you first need a firm grip on the assets within your data estate. But why is this so necessary?
Swimming in muddy water
When it comes down to it, a firm grip on your data estate matters because it gives you a single source of truth. Think of implementing AI without a robust data governance framework as asking it to swim in a half-filled pool of muddy water. It is no wonder that its answers make little sense and that it might even hallucinate: it has trouble seeing where it swims and often scrapes across the bottom of the pool.
A data governance framework, especially one designed for a data mesh, gives you not only the resources to fill the pool completely, but also the pumps and filters to clear out the mud. It makes gaps in your data estate visible and removes roadblocks for ideas and projects whose business value has so far gone unrealized. Implemented fully, it helps you create a clear view of all the data within your business. AI depends on this very view of your data estate to connect the dots, to be trained on proper training data, and to function correctly and deliver real value to your organization.
This is where the interdependent principles of reduce, reuse, and recycle come into play:
1. Reduce
Reducing data means minimizing storage waste. This includes techniques such as data deduplication to eliminate redundant data across data assets, proper data lifecycle management to ensure data is retained only as long as necessary, and archiving strategies that move infrequently accessed data to less expensive storage. Regular backups complement these measures by protecting against data loss from bit rot, the gradual decay of stored data. Enforced through storage rules and policies, these practices help maintain lean and efficient data storage.
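To make this concrete, here is a minimal Python sketch of two of the techniques named above: detecting exact duplicates by content hash, and deriving a lifecycle action from the last access date. The function names and the thresholds (archive after 180 days, delete after 3650 days) are illustrative assumptions, not recommendations; real retention periods depend on your policies and legal requirements.

```python
import hashlib
from datetime import date


def content_hash(payload: bytes) -> str:
    """Return a SHA-256 fingerprint of the raw content."""
    return hashlib.sha256(payload).hexdigest()


def find_duplicates(assets: dict[str, bytes]) -> dict[str, list[str]]:
    """Group asset names by content hash and keep only groups with copies.

    `assets` maps a (hypothetical) asset name to its raw bytes.
    """
    groups: dict[str, list[str]] = {}
    for name, payload in assets.items():
        groups.setdefault(content_hash(payload), []).append(name)
    return {h: sorted(names) for h, names in groups.items() if len(names) > 1}


def lifecycle_action(
    last_accessed: date,
    today: date,
    archive_after_days: int = 180,   # assumed threshold, for illustration only
    delete_after_days: int = 3650,   # assumed threshold, for illustration only
) -> str:
    """Return 'keep', 'archive', or 'delete' based on days since last access."""
    age_days = (today - last_accessed).days
    if age_days >= delete_after_days:
        return "delete"
    if age_days >= archive_after_days:
        return "archive"
    return "keep"
```

Hashing only catches byte-identical copies. Near-duplicates, such as the same table exported with a different column order, need schema-aware comparison, which is where a catalog with consistent metadata helps.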
2. Reuse
Making data easily accessible and reusable across different business domains is itself a crucial strategy for reducing data. Standardizing data formats, schemas, and interfaces enhances data reusability and reduces the need for, and the cost of, redundant data creation. Implementing a data product catalog helps teams search for, discover, and reuse existing data, fostering collaboration and reducing duplication of effort. In addition, this increases data consistency, because modifications no longer have to be carried out on multiple versions of the same data set.
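The idea of a data product catalog can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions, not the API of any real catalog tool: the `DataProduct` fields, the `Catalog` class, and its `register` and `search` methods are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical catalog entry describing one reusable data product."""
    name: str
    domain: str
    owner: str
    columns: tuple[str, ...]
    tags: tuple[str, ...] = ()


class Catalog:
    """Toy in-memory catalog: register products once, then search them."""

    def __init__(self) -> None:
        self._products: dict[str, DataProduct] = {}

    def register(self, product: DataProduct) -> None:
        # Rejecting duplicate names nudges teams to reuse instead of re-create.
        if product.name in self._products:
            raise ValueError(f"data product already registered: {product.name}")
        self._products[product.name] = product

    def search(self, term: str) -> list[str]:
        """Return names of products whose name, tags, or columns contain term."""
        needle = term.lower()
        hits = []
        for product in self._products.values():
            haystack = (product.name, *product.tags, *product.columns)
            if any(needle in text.lower() for text in haystack):
                hits.append(product.name)
        return sorted(hits)
```

A team that needs patient data would first call `search("patient")` and only build a new data set if nothing suitable comes back.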
3. Recycle
Recycling data involves repurposing data for different uses beyond its initial intent. Organizations can leverage advanced analytics and machine learning to extract new insights and value from existing data sets. Effective metadata management plays a vital role in understanding the potential of your data for repurposing, enabling more strategic use of data assets.
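As a rough illustration of how metadata supports repurposing, the sketch below filters data sets by two questions: does the data set contain the columns the new use needs, and is that use permitted? The `DatasetMetadata` fields and the `repurposing_candidates` function are hypothetical; real metadata models are richer and would also cover lineage, quality, and sensitivity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical metadata record for one data set."""
    name: str
    original_purpose: str
    columns: frozenset[str]
    permitted_uses: frozenset[str]


def repurposing_candidates(
    datasets: list[DatasetMetadata],
    needed_columns: set[str],
    new_use: str,
) -> list[str]:
    """Return names of data sets that can serve a use beyond their original one.

    A data set qualifies if it has every needed column, the new use is
    permitted, and the new use differs from the original purpose.
    """
    needed = frozenset(needed_columns)
    return sorted(
        d.name
        for d in datasets
        if needed <= d.columns
        and new_use in d.permitted_uses
        and d.original_purpose != new_use
    )
```

Without recorded columns and permitted uses, the same question can only be answered by asking around, which is exactly the effort good metadata management removes.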
These ecological principles are nothing new, and we already apply them daily in areas like forest management, food production, waste management, etc. What we aim to do is to apply those principles to the realm of data. Many users already live by them unknowingly, but in the context of implementing a data governance framework for a data mesh, they are applied in a structured and accountable way.
The toolbox is already open
Various data governance and cataloguing tools and platforms that are already available, such as Databricks Unity Catalog or Microsoft Purview, facilitate the reduction, reuse, and recycling of data. When it comes to enhancing data quality using AI, existing tools are being developed further and are expected to help with reliably establishing quality rules and filters, detecting sensitive information, and similar tasks. Data virtualization and integration tools help manage data efficiently across different domains. Organizational change, such as fostering a culture of data quality, sharing, and collaboration, is essential to support these practices. Encouraging cross-domain communication and establishing clear data governance policies for everyone to follow strengthens these data management efforts and often leads to their automation down the road, which in turn brings additional efficiency gains and cost savings.
The three principles in action
Imagine an organization in the healthcare sector adopting these principles to implement data governance in its data mesh. It begins by reducing data redundancy and storage costs through strict data deduplication and efficient data lifecycle management across its departments, ensuring that only necessary data is retained. To enhance data reusability, it standardizes data formats and creates a comprehensive data product catalog, enabling medical research teams and patient care units to access and use the same patient data seamlessly. By recycling data, it repurposes clinical trial data to develop new treatment protocols and improve patient outcomes using advanced analytics and machine learning. With all three principles coordinated, the efficiency gains compound and productivity rises substantially.
Conclusion
Applying the principles of reducing, reusing, and recycling data to the governance of a data mesh architecture offers significant benefits.
A data governance framework acts as a great enabler, while maintaining these three principles helps you prevent your data mesh from turning into a data swamp with redundant data silos all over the place. These practices not only optimize data utility and cost-efficiency but also contribute to a more sustainable approach to data management. Organizations can achieve better scalability, agility, and environmental sustainability by embracing these data management strategies. And finally, they are necessary prerequisites for realizing the maximum value of AI models for your business.