I have a lot of data. And now?
5 June, 2024

A critical analysis of companies’ (un)preparedness for the future.

For the more impatient readers, I would like to start by expressing a truth that I frequently observe in my field of work: the vast majority of companies are not effectively prepared to adopt intelligent systems, such as Artificial Intelligence (AI). This conclusion is not the result of unfounded pessimism but of a critical analysis of how information technologies are managed in companies.

Often, under the influence of the media, competitive pressure, or even government incentives for innovation, organizations decide to “embark” on new technologies in haste. Such decisions are frequently made without structure and skip crucial steps, such as adequate medium- and long-term strategic planning. Unfortunately, the same lack of preparation extends to how companies handle their data.

 

Data management: quantity versus quality

With interest and a certain amount of concern, I have watched the refreshing determination of some businesses to start collecting data on anything and everything. At first glance it seems like a good idea, but only in theory. In practice, for data collection to be truly effective and transformative, what matters is how it is conducted and by whom it is managed. It is not just a question of how much data is collected; its quality, relevance, and processing are equally important.

Effective data use is a fundamental step towards modernizing any organization. When used well, data can become a significant competitive advantage, helping to uncover internal inefficiencies and identify new market opportunities. However, even among organizations that are successfully implementing new data collection systems, one question keeps coming up: "We have a lot of data, what now?" This question reveals a lack of preparation for the digital era ahead.

Explaining this situation has become a regular part of my work. Collecting data on a massive scale does not automatically make an organization ready for a digitalized future. Preparing for this technological leap means carefully considering the data model of the domain, the processes for updating it and integrating new domains, the types of storage to be used, and the data formats and their normalization. These are just some of the points that are often overlooked in less technical discussions but are crucial to the long-term success of any data strategy.
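To make "data formats and their normalization" a little more concrete, here is a minimal sketch of two sources describing the same measurement in different formats being mapped onto one agreed model. Everything in it is a hypothetical illustration: the `Measurement` model, the field names (`machine`, `date`, `weight`, `unit`), and the assumption that slash-separated dates are day-first would all have to be decided for a real domain.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Measurement:
    """Hypothetical target model: one agreed shape for every source."""
    machine_id: str
    measured_on: date
    weight_kg: float


# Assumption: these are the only two date formats the sources use,
# and slash-separated dates are day-first.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text: str) -> date:
    """Try each known format in turn; fail loudly on anything else."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def normalize(raw: dict[str, str]) -> Measurement:
    """Map one raw record onto the target model."""
    # Accept a decimal comma as well as a decimal point.
    weight = float(raw["weight"].replace(",", "."))
    # Assumption: a missing unit means kilograms; grams are converted.
    if raw.get("unit", "kg").lower() == "g":
        weight /= 1000.0
    return Measurement(
        machine_id=raw["machine"].strip().upper(),
        measured_on=parse_date(raw["date"]),
        weight_kg=weight,
    )


# Two records for the same event, as two hypothetical sources might send it.
from_csv = normalize(
    {"machine": " m-07 ", "date": "05/06/2024", "weight": "1250", "unit": "g"}
)
from_api = normalize({"machine": "M-07", "date": "2024-06-05", "weight": "1,25"})

# After normalization the two records are identical.
print(from_csv == from_api)
```

The point of the sketch is not the parsing itself but that every rule in it (which formats are accepted, which unit is the default) is a decision someone has to make and document before the data can be trusted downstream.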

When I speak of "few but good" data, I am referring to the data's intrinsic quality and relevance. To my knowledge, no device or tool has yet been invented that can measure, in absolute terms, whether data is "good". However, several frameworks and reference models can and should be used to improve and validate the quality of the collected data. Improving data quality is an iterative process that takes time and may require significant adjustments to existing implementations.
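As a rough illustration of what such validation can look like in practice, the sketch below computes three indicators for a single field. It is not taken from any particular framework; the names `completeness`, `validity`, and `uniqueness`, the `profile_field` helper, the sample records, and the deliberately naive e-mail rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class QualityReport:
    completeness: float  # share of records where the field is filled in
    validity: float      # share of filled-in values that pass the rule
    uniqueness: float    # share of filled-in values that are distinct


def profile_field(
    records: list[dict[str, Any]],
    field: str,
    is_valid: Callable[[Any], bool],
) -> QualityReport:
    """Compute three simple quality indicators for one field."""
    total = len(records)
    # Treat a missing key, None, and the empty string as "not filled in".
    present = [r[field] for r in records if r.get(field) not in (None, "")]
    if total == 0 or not present:
        return QualityReport(0.0, 0.0, 0.0)
    valid = sum(1 for value in present if is_valid(value))
    return QualityReport(
        completeness=len(present) / total,
        validity=valid / len(present),
        uniqueness=len(set(present)) / len(present),
    )


def looks_like_email(value: str) -> bool:
    """Deliberately naive rule, for illustration only."""
    return "@" in value and "." in value.split("@")[-1]


# Hypothetical sample: one empty value, one malformed value, one duplicate.
records = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "not-an-email"},
    {"id": 4, "email": "ana@example.com"},
]

report = profile_field(records, "email", looks_like_email)
print(report)  # completeness is 0.75: three of the four records have a value
```

Numbers like these do not say whether the data is "good", but tracking them over successive iterations shows whether quality is moving in the right direction.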

Data quality is fundamental to developing other systems, applications, AI models and analytical reports. This is where the GIGO (Garbage-In, Garbage-Out) principle comes into play: if the initial data is of poor quality, the insights derived from it will be poor as well. I am convinced that many organizations believe they are working with high-quality data, but this is often not the reality.

 

How to process a large volume of data?

Before we take pride in the “immense data” we possess, it may be wise to take a step back to ensure that data collection is complete and storage is scalable and organized. Only then can we take two steps forward in adopting new intelligent systems based on this data. This process requires profoundly reconsidering how data is viewed and managed within organizations.

When it comes to the amount of data (volume), the question arises: “What really is a large volume of data?” The answer, as with many aspects of information technology, is: “It depends”. I generally adopt the following definition: a volume of data is large when traditional systems can no longer process it adequately. Where that threshold lies varies greatly with the reality of each organization; however, reaching it is not necessarily a signal to discard everything that exists and radically change the paradigm.

This is one of the biggest challenges organizations face. Few have specialists in data science and information systems on their staff who can take a holistic view of existing systems and design and implement a structured plan for adopting new technologies. Sometimes the solution adopted involves improvising, training someone in a hurry, or copying a solution that worked in a different context and implementing it quickly. While this approach may sometimes work, I advocate a more reasoned and deliberate one.

 


It is essential to recognize that not all organizations have the necessary resources to maintain a team of experts in this field. However, it is possible to take conscious steps in this direction. It is crucial to establish a solid foundation before moving towards a more or less radical technological change. Training is an important, necessary and, although slow, extremely rewarding process. What's the point of owning a Ferrari if I don't have a driving license or if the roads I want to drive on are not properly prepared?

This article is not intended to be a manifesto against technological innovation in organizations but rather a reflection on how we prepare our companies for the future. Data engineering is not just a matter of collecting and storing immense amounts of information but rather of doing so in an intelligent, structured and sustainable way. After all, in the race for technological innovation, it's not enough to just run... you need to know where to place each step.

 

Author: Pedro Guimarães
Technical Responsible for the Data Engineering Area (EPMQ) at CCG/ZGDV Institute
