The Shift to Data-Centric AI: Embracing Better Data over Big Data

It's not about big data, but better data

Artificial Intelligence (AI) has seen a meteoric rise in the last decade, transforming how we live, work, and interact with the world. At the heart of this revolution has been the concept of 'big data.' Characterized by its volume, velocity, variety, and veracity, big data has been the fuel that powers AI's engines, driving insights and enabling predictions that were previously impossible. However, as the field of AI matures, a fundamental shift is taking place in how we approach data.

There was a time when 'big' was synonymous with 'better' in the realm of data. The prevailing notion was that the more data we fed into our AI systems, the better they would perform. This notion drove businesses and researchers alike to collect and analyze vast amounts of data in the hope of uncovering valuable insights. In many cases, this approach yielded remarkable results, but it also brought a set of challenges, ranging from technical issues with data storage and analysis to ethical dilemmas surrounding data privacy and security.

In recent years, there has been a growing recognition that more data doesn't necessarily mean more value. The idea of 'better data over big data' has begun to resonate in the industry, marking a shift from a purely quantity-driven approach to one that emphasizes the quality, relevance, and timeliness of data.

This transition underpins the rise of what's now being referred to as 'data-centric AI.' In contrast to traditional model-centric approaches, which focus on refining complex AI models, data-centric AI prioritizes the improvement of the dataset itself. It's a paradigm shift that's changing the way we develop, deploy, and manage AI systems. This article will delve into this fundamental transformation in the AI landscape, exploring the shift from big data to better data and the rise of data-centric AI.

Understanding Big Data and Its Limitations

Big data, as its name implies, refers to massive volumes of data that traditional database systems cannot handle effectively. These datasets are typically characterized by their '4 Vs': volume, velocity, variety, and veracity. 'Volume' refers to the sheer size of these datasets, 'velocity' to the speed at which new data is generated and processed, 'variety' to the range of data types and sources, and 'veracity' to the reliability and accuracy of the data.

The advent of big data has revolutionized numerous sectors, from healthcare and finance to marketing and logistics. One area where it has had a particularly profound impact is artificial intelligence. Here's why: AI systems, particularly those that use machine learning, improve their performance by learning from data. As a general rule, the more relevant data these systems have access to, the better they can learn and adapt. This principle has made big data central to AI.

In addition to providing fuel for AI systems, big data comes with a number of inherent advantages. It allows organizations to uncover patterns and insights that would remain hidden in smaller datasets, enabling them to make more informed decisions. It also supports the use of predictive analytics, allowing companies to forecast future trends and proactively address potential challenges.

However, as powerful as big data can be, it is not without its limitations. One of the biggest challenges is ensuring the quality of the data. As the volume of data increases, so does the likelihood of inconsistencies, inaccuracies, and irrelevancies creeping into the dataset. 'Garbage in, garbage out' is a common mantra in the field of AI, signifying that AI systems are only as good as the data they are trained on. If the quality of the data is poor, the performance of the AI systems will inevitably suffer.

Privacy concerns also arise with the use of big data. As companies collect more and more data, often from consumers who are not fully aware of what they're consenting to, the risk of data breaches or misuse of data grows. This raises significant ethical and legal issues that must be carefully managed.

Finally, there's the issue of diminishing returns. In the initial stages of an AI project, increasing the amount of data can lead to substantial improvements in the performance of AI models. However, as the dataset grows, the benefit from each additional unit of data tends to decrease. At the same time, the costs and complexities associated with managing and processing the data continue to increase.
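To make the diminishing-returns point concrete, here is a minimal sketch. It assumes, purely for illustration, that model error falls as a power law in dataset size (error = a * n^(-b)). The function names and the constants a and b are hypothetical, not measurements from any real system.

```python
def assumed_error(n_examples: int, a: float = 1.0, b: float = 0.3) -> float:
    """Hypothetical learning curve: error = a * n^(-b). Constants are illustrative only."""
    return a * n_examples ** (-b)


def gain_from_doubling(n_examples: int) -> float:
    """Absolute error reduction obtained by doubling the dataset size."""
    return assumed_error(n_examples) - assumed_error(2 * n_examples)


if __name__ == "__main__":
    # Each step doubles the data to store and process, yet removes less error.
    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(f"{n:>9} -> {2 * n:>9} examples: error drops by {gain_from_doubling(n):.4f}")
```

Under this assumed curve, every doubling removes the same fraction of the remaining error, so the absolute gain keeps shrinking while the amount of data to store and process doubles each time.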

Furthermore, the focus on big data often leads to a 'model-centric' approach to AI, where the emphasis is on creating increasingly complex models to extract insights from the data. But this approach also has limitations. Complex models can be resource-intensive and hard to interpret, and they do not necessarily perform better than simpler models, especially when the quality of the data is poor.

All these factors have led to a growing recognition of the limitations of the big data approach, and a shift towards a focus on 'better data' over big data.

The Rise of Better Data and Data-Centric AI

As we grapple with the challenges and limitations of big data, a new paradigm is taking shape in the AI landscape: the concept of 'better data' and the rise of 'data-centric AI.'

'Better data' is not just about reducing the volume of data. Instead, it emphasizes enhancing data quality, relevance, diversity, and timeliness. A dataset with these attributes, even if smaller, can often yield more valuable insights than a large, unrefined dataset. This shift from focusing on data quantity to data quality is fundamental to what is now being referred to as a 'data-centric' approach to AI.

Data-centric AI takes a fundamentally different approach to developing AI systems than traditional model-centric methods. Rather than focusing on building and refining complex models, it prioritizes improving the quality of the data itself. This approach recognizes that AI models, regardless of their sophistication, can only perform as well as the data they're trained on.

A focus on data quality over quantity has several benefits. It can lead to improved performance of AI models because high-quality data helps the model to learn more effectively and make more accurate predictions. It also promotes a more efficient use of resources. A smaller, curated dataset typically requires less storage and computational power to manage and train on than vast quantities of raw data, although the curation itself takes effort.

A data-centric AI approach also aligns well with ethical and responsible data practices. By prioritizing quality over quantity, companies can limit their data collection to what is necessary and relevant, reducing the risk of data breaches and addressing privacy concerns. It also allows organizations to be more transparent about the data they collect and how they use it, which can build trust with consumers and make it easier to comply with increasingly strict data privacy regulations.

Another benefit is interpretability. By focusing on data quality and relevant features, the AI models can often remain less complex while achieving comparable, if not better, performance. Simpler models are typically easier to interpret and understand, which is important for transparency and accountability in AI applications.

Data-centric AI can also lead to more robust AI systems. By ensuring that the data is diverse and represents a variety of real-world scenarios, we can build AI systems that perform well not just on average cases, but across a wide range of conditions.

In short, the shift towards better data and data-centric AI is reshaping the way we develop and deploy AI systems. It's an approach that aligns with responsible data practices, optimizes resources, and can lead to robust AI systems that perform effectively in diverse real-world conditions. It's not about having less data; it's about having better data.

Implementing a Data-Centric AI Approach

Implementing a data-centric AI approach involves several key steps, each of which contributes to the overall quality and relevance of your data.

  1. Data Cleaning: This step involves identifying and rectifying errors in your data, such as duplicates, inconsistencies, or inaccuracies. For example, a company might use automated scripts or tools to detect anomalies in their data, such as a customer being recorded with multiple different addresses, and then correct these errors to ensure the reliability of their data.

  2. Data Labeling: In supervised learning, AI models learn from labeled examples. Thus, ensuring accurate and consistent labeling is crucial. For instance, a medical imaging company developing AI for diagnosing diseases would need to carefully label images to indicate whether or not a disease is present. Incorrect labels could lead to the AI system making incorrect diagnoses.

  3. Data Augmentation: This technique involves creating new data based on your existing data, which can be particularly useful when you have limited data to work with. In image recognition tasks, for example, you might rotate, flip, or crop your images to create new examples for your AI model to learn from. This can help your model to generalize better and perform well on new, unseen data.

  4. Ensuring Data Diversity: Your data should reflect the diverse range of scenarios your AI system will encounter in the real world. For instance, an autonomous driving system should be trained on data representing different weather conditions, times of day, and types of roads to ensure its performance across various situations.
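The cleaning, labeling, and augmentation steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production pipeline: the record fields (customer_id, address), the (item_id, label) annotation format, and all function names are hypothetical, and images are represented as simple lists of rows.

```python
from collections import defaultdict


def find_conflicting_addresses(records):
    """Return {customer_id: addresses} for customers recorded with more than one address."""
    addresses = defaultdict(set)
    for rec in records:
        # Normalize case and whitespace so trivial variants are not flagged.
        addresses[rec["customer_id"]].add(" ".join(rec["address"].lower().split()))
    return {cid: found for cid, found in addresses.items() if len(found) > 1}


def find_label_conflicts(annotations):
    """Return {item_id: labels} for items whose annotations disagree."""
    labels = defaultdict(set)
    for item_id, label in annotations:
        labels[item_id].add(label)
    return {item: found for item, found in labels.items() if len(found) > 1}


def augment(image):
    """Return two variants of a 2D image (list of rows): horizontal flip, clockwise rotation."""
    flipped = [row[::-1] for row in image]
    rotated = [list(col) for col in zip(*image[::-1])]
    return [flipped, rotated]
```

Flagged customers and items would then go to a person or a rule for correction. Note that an augmentation is only useful if it preserves the label, so which transformations are safe depends on the task.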

Take the case of an AI system for recognizing human faces. If the training data mostly consists of faces of people from a certain ethnic group or age bracket, the system may not perform well when presented with faces from different ethnicities or age groups. A data-centric approach would involve collecting and using a diverse set of face images for training, ensuring representation of different genders, ethnicities, ages, lighting conditions, and facial expressions. This would lead to a more robust and fair facial recognition system.
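One simple way to spot the kind of imbalance described above is to measure how each expected group is represented in the training set. The sketch below is an assumption-laden illustration: the attribute name, the list of expected groups, and the 10% threshold are all hypothetical choices, not a standard.

```python
from collections import Counter


def underrepresented_groups(samples, attribute, expected_groups, min_share=0.10):
    """Return {group: share} for expected groups whose share of samples is below min_share.

    Groups that are expected but absent from the data are reported with a share of 0.0.
    """
    counts = Counter(sample[attribute] for sample in samples)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("samples must not be empty")
    shares = {group: counts.get(group, 0) / total for group in expected_groups}
    return {group: share for group, share in shares.items() if share < min_share}
```

For example, calling it with attribute="age_bracket" on face-image metadata would list the age brackets that need more examples before training. A check like this only covers attributes that are actually recorded, so it complements rather than replaces a careful review of how the data was collected.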

All these steps are critical to improving the quality of your data and thereby the performance of your AI models. While implementing a data-centric AI approach can require a significant investment of time and resources, it can often lead to better results, improved efficiency, and more ethical outcomes than a purely model-centric approach. It's a worthy investment for organizations looking to drive growth, efficiency, and innovation through AI.

Future Implications

The shift towards a data-centric AI approach holds significant implications for the future of AI and its application across different sectors.

In business, a data-centric approach could lead to more effective and efficient AI systems. By prioritizing data quality and relevance, businesses can extract more valuable insights from their data, improve their decision-making processes, and realize greater return on their AI investments. It also aligns with the push towards responsible AI, helping businesses comply with data privacy regulations and maintain consumer trust.

In government, a data-centric AI approach can help deliver more effective public services. For example, by ensuring data diversity and quality, government agencies can develop AI systems that better serve the diverse needs of their populations. However, governments will also need to consider how to regulate and oversee the use of AI, particularly in relation to data collection and use.

In research, a data-centric approach could shift the focus from developing increasingly complex models to improving datasets and evaluation methods. This could lead to more robust and reproducible research findings, addressing a common challenge in AI research.

However, this shift towards data-centric AI is not without its challenges. One significant hurdle is the resources needed for data cleaning, labeling, augmentation, and ensuring diversity. These tasks are often time-consuming and require a level of expertise that may not be available in all organizations.

Another challenge is the lack of standardized tools and best practices for implementing a data-centric approach. While the field is making progress, more work is needed to develop and share effective methods and tools.

Data privacy and security concerns also remain a critical challenge. Organizations must ensure that their data practices respect individual privacy rights and comply with relevant regulations.

Despite these challenges, the potential benefits of a data-centric AI approach are substantial. As the field continues to evolve, we can expect to see more tools, techniques, and frameworks developed to support this approach. By investing in better data practices, organizations can not only improve the performance of their AI systems, but also operate more responsibly and ethically in an increasingly data-driven world.


Conclusion

In the rapidly evolving landscape of AI, the focus is shifting from the pursuit of big data to the quest for better data. As we've explored in this article, this shift towards a data-centric AI approach has significant implications for the future of AI and its application across various sectors.

Big data, with its vast volumes and variety, has long been considered the fuel for AI. While it has revolutionized many sectors, it also comes with its own set of challenges, such as issues with data quality, privacy concerns, and diminishing returns from ever-larger datasets and increasingly complex models.

Conversely, the data-centric AI approach emphasizes the importance of improving the quality, relevance, diversity, and timeliness of data. By investing in these aspects of data, we can enhance the performance of AI models, make more efficient use of resources, and promote more ethical and responsible data practices.

Implementing a data-centric approach requires steps such as data cleaning, data labeling, data augmentation, and ensuring data diversity. While it can be resource-intensive, the resulting improvements in AI performance and efficiency, as well as alignment with ethical AI principles, make it a worthy investment.

Looking to the future, the shift towards data-centric AI holds significant implications for business, government, and research sectors. It also presents new challenges, such as the need for standardized tools and methods for data-centric AI, and continued concerns around data privacy and security.

In conclusion, as the field of AI continues to advance, the mantra of 'better data, not just big data' is likely to become increasingly important. The shift from big data to better data signifies a maturing of the field, reflecting a deeper understanding of what truly drives performance in AI systems. By embracing a data-centric AI approach, we can develop more robust, efficient, and ethical AI systems, unlocking the true potential of AI to benefit our society.


Sumo Analytics is a data science and AI laboratory, specializing in the realm of prediction science. We build and deploy advanced AI systems that elegantly marry human intelligence with the computational power of artificial intelligence, enabling our clients to achieve unparalleled performance.

