Beyond data lakes: exploring the vastness of data oceans for true equity
  • 26 Mar 2024
  • 15 Minutes to read



Thank you to Kem-Laurin Lubin for sharing her insights and stories in our knowledge base.

Click here to read on Medium.

“If you think of a data lake as a body of water, then without careful management, a lake can quickly become a swamp. Data lakes are only as useful as the data they contain and the insights they provide.” — Unknown.

This month, I had the honour of participating as a guest panelist at the Toronto design meet-up event, Design Meets, a long-standing feature in Toronto design circles. The event was hosted by Pivot Design Group, one of Toronto’s premier design service companies, and this instance was moderated by Michael Anton Dila from Oslo for AI, a non-profit project organization here in the city of Toronto. Oslo for AI specializes in community-driven work that informs better governance practices for AI, an untapped space for Tech Humanistic endeavours in a world so heavily indexed on the technology itself, along with the financial gains to be had. Both organizations embody my own ethos and self-dubbed title, Tech Humanist, seeking to ensure that human-centred design is central to every AI-powered system.

During my presentation, I was struck by a recurring insight, one that often surfaces in debates around artificial intelligence: the critical role of data, a resource highly prized by modern companies. That moment underscored the importance of reevaluating our terminology. Referring to extensive data collections as “data lakes” seems outdated and limiting, given their essential role and immense potential in fueling AI advancements.

We are, after all, the same humans who once embraced trends like the Macarena and MC Hammer pants with open arms (okay, legs), trends we now reminisce about with a mix of fondness and embarrassment, those photos hidden away, visible only to those who were there. If you know, you know. Similarly, we are the same humans who have, time and again, misinterpreted and misapplied metaphors in computing practices.

The metaphors of computing

Lev Manovich, a prominent figure in this discussion, offers insightful perspectives on the topic. His exploration of computing metaphors appears in The Language of New Media, a book I have frequently referenced since completing my Master’s degree the year it was published. The book sheds light on how terms like “navigation” and “interface” shape our interactions with digital technology, influencing both design and users’ perception of technological practices. Manovich’s work unveils the profound effect these metaphors have on our engagement with the digital landscape.

By dissecting these common digital terms, Manovich not only enriches our understanding of media theory but also challenges us to reconsider the digital environments we navigate daily. His analysis encourages a deeper contemplation of the often invisible frameworks that shape our digital experiences, prompting us to question and, potentially, to innovate.

This brings us to the crux of this post: the misuse of metaphors has led to misunderstandings and underestimations of AI’s true potential and applications. Yet it’s crucial to recognize that not all AI applications fall into this trap. Over the last five years, I’ve attended numerous presentations and conferences where innovative AI uses have been showcased. One striking example is how AI is utilized for livestock tracking in Africa, a presentation I happened to join as part of the University of Guelph’s Care AI presentation series. I am also partial to them because I live here.

This application of AI transcends the common narratives, demonstrating a practical and impactful use of technology that aligns with real-world needs and challenges, and not confined by limiting metaphors. It exemplifies how, when thoughtfully applied, AI can transcend misused metaphors and misconceptions, offering solutions that are both transformative and grounded in reality and in the interest of true human needs.

Language reflects and deflects

As a rhetorician with an academic foundation, language occupies much of my thoughts. I ponder its constraints and its power, the way it shapes our understanding and perception. Through the lens of my chosen words, I reflect on what they reveal about us — the nuances they highlight and the boundaries they impose.

It was the literary theorist Kenneth Burke who talked about how language reflects, selects, and deflects reality. He developed the concept known as “terministic screens,” which describes how our perceptions and understandings are shaped by the language and terms we use. This concept suggests that language directs our attention towards certain interpretations and away from others, thus reflecting, selecting, and deflecting different aspects of reality.

This notion of terministic screens is part of Burke’s broader examination of language and rhetoric and how they influence our interaction with the world. According to him, language plays a critical role in how we perceive and construct our reality, thus reflecting, deflecting, and selecting our experiences. In the context of contemporary data management, for example, the term “data lakes” functions similarly to Burke’s concept of terministic screens, shaping our understanding of and approaches towards data storage and analysis.

The phrase “data lakes” itself channels our perception, guiding and goading us towards imagining vast, fluid, and expansive repositories of raw data, untouched and unstructured, akin to the natural body of a lake. This linguistic choice influences not only the way organizations and individuals conceive of data storage possibilities but also the strategies they adopt for handling “big data.” — Is it really big, though?

Just as Burke argued that language could reflect, deflect, and select our realities, “data lakes” reflects the modern emphasis on vast data potential, selects the focus on flexibility and scalability, and deflects from more traditional, structured data storage methods like data warehouses.

Swampy potential — from crystal clear to murky depths

The metaphor of the “lake” extends beyond mere storage; it suggests a new paradigm where data remains in its natural state until needed, challenging and reshaping preconceived notions of data processing and utilization. This assumption that so-called data lakes are natural is also flawed.

Let me break it down even more.

The concept of a “data lake” in data science can be misleading. It evokes the image of a vast, unstructured reservoir of data, oversimplifying the complexities involved in managing various data types. Though data lakes are designed to store diverse kinds of data without rigid formatting, this very flexibility can introduce significant challenges.

In my experience, this scenario is all too familiar, resulting in an endless cycle of clarifying and re-clarifying with Data Analysts about the specific data needed for my work, versus what is available. This often leads to repetitive and unproductive discussions.

Without strict organization or quality control, data lakes can become “data swamps” — cluttered and useless due to bad organization and missing information. This makes it very hard to get useful insights from the data.
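
To make the “swamp” concrete, here is a minimal, hypothetical Python sketch: two teams drop records into the same lake path with no shared schema, and a naive downstream read quietly loses most of the data. The field names and records are invented for illustration.

```python
import json

# Hypothetical records dumped into the same "lake" path by two teams,
# with no agreed schema or quality control.
raw_records = [
    '{"customer_id": 101, "signup_date": "2023-04-01"}',  # team A's shape
    '{"custID": "101", "signupDate": "04/01/2023"}',       # team B's shape
    '{"customer_id": 102}',                                # missing a field
]

# A naive consumer assumes a single schema -- and silently loses data.
usable, swamp = [], []
for line in raw_records:
    record = json.loads(line)
    if "customer_id" in record and "signup_date" in record:
        usable.append(record)
    else:
        swamp.append(record)  # inconsistent or incomplete: the "swamp"

print(f"usable: {len(usable)}, stuck in the swamp: {len(swamp)}")
# usable: 1, stuck in the swamp: 2
```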

This metaphor, again, highlights the importance of proper data management and governance. Without these, a data lake, which is intended to store vast amounts of raw data, can become unmanageable and obscure, much like a swamp, thus losing its value for analytics and decision-making. This critique is not new; it has been reiterated by many data scientists, IT professionals, and thought leaders in various forms, and the problem continues to plague the industry.

The significance of robust data management and governance in data lakes cannot be overstated. These systems, designed to store vast amounts of unstructured data, require meticulous management to prevent them from deteriorating into unusable “data swamps.”

Experts from Datanami and Starburst stress the importance of employing integrated data lake management platforms and establishing comprehensive data governance frameworks. These measures ensure the reliability, security, and compliance of the data, facilitating its effective use in decision-making processes. McKinsey further underscores the necessity of organizational commitment, particularly from top management, to drive the success of data governance initiatives, highlighting the role of governance in aligning with corporate strategy and enhancing business outcomes.
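
As a rough sketch of what such governance frameworks buy you, consider a minimal dataset catalog in Python: nothing enters the lake without an owner, a declared schema, and a validation date. The class and field names here are my own invention; real platforms of the kind Datanami and Starburst describe are far richer.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DatasetEntry:
    """A minimal, hypothetical catalog record for one dataset in the lake."""
    name: str
    owner: str            # the accountable team or person
    schema: dict          # declared column names and types
    last_validated: date  # when quality checks last passed

catalog: dict = {}

def register(entry: DatasetEntry) -> None:
    """Refuse ungoverned data: no owner or schema, no entry in the lake."""
    if not entry.owner or not entry.schema:
        raise ValueError(f"{entry.name}: data without an owner and schema becomes swamp")
    catalog[entry.name] = entry

register(DatasetEntry(
    name="customer_signups",
    owner="growth-analytics",
    schema={"customer_id": "int", "signup_date": "date"},
    last_validated=date(2024, 3, 1),
))
print(sorted(catalog))  # ['customer_signups']
```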

It is at times like this I am reminded of the adage “It is better to be a big fish in a small pond than a small fish in a big pond,” except in reverse.

Drowning in data: murky waters, unsecured lakes and the quest for governance

We do not think about this often, but handling data lakes involves complex technical issues. The absence of strict data processing standards, particularly ACID compliance, can lead to incomplete data transactions and, consequently, unreliable data.

By virtue of working joined at the hip with Data Scientists, I know their charge is a big one, as they have to be ACID compliant. What is this, you ask?

In data science, ACID is a set of principles that help keep database transactions safe and reliable:

Atomicity ensures that all parts of a transaction are completed; if one part fails, the whole transaction fails and nothing is left half-done.

Consistency ensures that all data follows certain rules. If a transaction might break these rules, it’s not allowed to complete.

Isolation means that multiple transactions can happen at the same time without interfering with each other.

Durability ensures that once a transaction is completed, it’s permanently recorded, even if there’s a power cut or other issue.

Together, these four principles help keep data accurate and secure.
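
For readers who want to see these guarantees in action, here is a small Python sketch using SQLite, a database engine that enforces ACID transactions. A transfer between two accounts either fully commits or fully rolls back (atomicity), and a CHECK constraint blocks any state that breaks the rules (consistency). The table and account names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Consistency rule: balances may never go negative.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(amount: int) -> None:
    """Move funds atomically: both updates succeed, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,))
    except sqlite3.IntegrityError:
        print(f"transfer of {amount} rejected; rolled back, nothing left half-done")

transfer(150)  # would overdraw alice -> the whole transaction is rejected
transfer(40)   # valid -> both rows are updated together
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 60), ('bob', 40)]
```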

Data governance

Further to ACID compliance, data lakes are designed to accommodate diverse data types, and this flexibility can inadvertently lead to governance and security challenges. As data lakes expand, maintaining their manageability and ensuring their security becomes increasingly difficult. Effective management requires implementing detailed governance protocols and ensuring the integrity and scalability of data, which are crucial for producing dependable insights from such extensive data environments.

Additionally, data lakes often face challenges in terms of governance, security, and scalability. They typically need significant management to prevent them from becoming cluttered and unmanageable. And this is why data governance is so critical.

Frankly, many companies still struggle with creating a responsible environment in this new era where data equates to power. The inherent flexibility of a data lake often compromises its security and governance, making these essential features more challenging to implement within a system that handles such a diverse array of data formats.

And while the data lake metaphor suggests fluidity and vastness, it can be misleading by underestimating the complexities of managing large-scale data repositories. The challenges of data quality, governance, security, and operational integrity present real constraints that the metaphor fails to convey.

Data lake — terminology, phraseology and definition

Lastly, so as not to beat the metaphor to a pulp: the term “lake” itself suggests an area vast yet bound by geography, holding information that remains static and frozen in time. As a result, the decisions driven by AI technology tend to rely on historical data — collections from bygone days, not the immediate reality.

Oxford Dictionaries defines them as follows:

Lake: a large area of water surrounded by land.

Ocean: a very large expanse of sea, in particular each of the main areas into which the sea is divided geographically.

From parroting data to lifting masks: a call for data informing algorithms of equality

This retrospective quality of the metaphorical application — a lake — in decision-making affecting a broad spectrum of humanity on this planet need not be viewed negatively, but rather as an opportunity to improve. This same perspective is echoed by AI activists and thinkers like Timnit Gebru, Joy Buolamwini, and Safiya Noble, each arguing from their unique disciplinary standpoints. Similarly, I am presenting this argument from a linguistic and rhetorical perspective. The issue at hand, the “sample size,” is not merely restrictive; it is incomplete.

Timnit Gebru’s work primarily focuses on the ethical implications of AI and machine learning technologies, with a special emphasis on bias and fairness. Her co-authored piece, “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”, which resulted in her ousting from Google, makes the case for the limitations of the company’s machine learning protocols. She further argues that AI systems often reflect and perpetuate existing biases present in society, particularly racial and gender biases. Gebru advocates for more transparency, accountability, and diversity within the AI research community, emphasizing the need for interdisciplinary approaches that include social sciences and humanities to understand and mitigate biases in AI systems.

Gebru was dismissed from Google in 2020 following controversy over a research paper she co-authored addressing the risks and ethical concerns of large language models. Her paper criticized the environmental impact, bias perpetuation, and illusion of understanding by these models, which sparked a debate about academic freedom and ethics in AI. The incident led to significant backlash from the tech and academic communities, highlighting the tension between ethical research and tech companies’ interests. Gebru’s contributions, particularly in AI ethics and diversity, continue to influence the field significantly.

Joy Buolamwini, who co-authored “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification” with Gebru, is another advocate for expanding the datasets used in machine learning. Her research highlights the biases in facial recognition technologies. She demonstrates how these systems are less accurate in identifying the faces of women and people of color compared to white men.

Buolamwini’s work underscores the importance of inclusive and diverse datasets in training AI systems to ensure they perform equitably across different demographics. She calls for the adoption of more rigorous benchmarking standards that account for demographic and phenotypic diversity to prevent discriminatory outcomes. Her new book, Unmasking AI, is a must-read.

Last but not least is Safiya Noble, whom I had the honour of hosting as moderator at a University of Waterloo speaker series. Her argument, presented in her book Algorithms of Oppression, is that search engines and other algorithmic systems can perpetuate social inequalities and biases, particularly against women of color. She illustrates how algorithms in digital platforms, under the guise of neutrality, often prioritize certain types of information, thereby reinforcing stereotypes and marginalizing minority voices. Noble advocates for a rethinking of digital information systems, urging the incorporation of ethical considerations and social values in their design and implementation to combat systemic biases.

Accordingly, the raw material of AI’s computation is also vastly geographically limited.

Expanding on this argument, the limited and incomplete nature of our metaphorical ‘sample size’ shapes and confines our understanding and application of AI, leading to potential biases and oversights that can have widespread implications. We need a more global approach.

It’s not just about the data we use; it’s about how we frame and interpret that data, influenced by our metaphors and language.

By broadening our linguistic and conceptual frameworks, we can foster a more inclusive and comprehensive approach to AI that better reflects the diverse world it serves. This shift requires a reevaluation of our metaphors and an acknowledgment of their power in shaping technological development and application.

From lakes to oceans

AI, along with its raw material, the data in data lakes, is heralded as the transformative force of our time and brings with it undeniable advantages. Nonetheless, it’s not without its challenges. When speaking to various audiences, I adopt a neutral perspective, highlighting that the intrinsic value of data is neutral. Its impact, whether positive or negative, hinges on its application, the individuals wielding it, and their underlying intentions. This stance promotes a balanced understanding, stressing the need for ethical frameworks in AI and data analytics. But then arises the question:

Why should we prefer ‘data oceans’ over ‘data lakes’?

Before I continue and conclude my argument, let me clarify, I am no Luddite, and I am a staunch environmentalist and advocate for sustainability. It disheartens me to acknowledge how much of my existence is intertwined with technology, yet severing ties with the very systems that sustain us is not a viable solution. When I refer to “data oceans,” I’m not advocating for larger server farms or endorsing unchecked tech expansion. Quite the contrary. Additionally, I approach this discussion from a non-tech-deterministic standpoint, rejecting the notion that technology alone can solve all human challenges.

This leads to my crucial point: embracing “data oceans” symbolizes our shift towards a more fluid, dynamic approach to handling information, unlike the static “data lakes.” By aspiring to a data ocean, we embrace its vastness, depth, and, importantly, its constant movement and evolution. This perspective allows us to better mirror the real-time changes happening in our world, thereby fostering decisions that are more informed, nuanced, and adaptable to the ever-shifting landscapes of human need and global challenges.
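
To make the lake-versus-ocean distinction concrete in code, here is a hypothetical Python sketch: the “lake” is queried as a frozen snapshot, while the “ocean” is consumed as a stream whose answer updates as new readings arrive. The readings are invented; a real system would sit on a streaming platform such as Kafka or Flink.

```python
from typing import Iterator

# "Lake": a static snapshot, frozen at the moment it was loaded.
lake_snapshot = [12, 15, 11]
print("lake average:", sum(lake_snapshot) / len(lake_snapshot))

# "Ocean": an unbounded stream; insights update as the world changes.
def sensor_stream() -> Iterator[int]:
    """Stand-in for a live feed (in practice: Kafka, Flink, and the like)."""
    yield from [12, 15, 11, 19, 22]  # new readings keep arriving

count, total = 0, 0
for reading in sensor_stream():
    count += 1
    total += reading
    print(f"after reading {count}: running average = {total / count:.1f}")
```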

In this new paradigm, we’re not just reacting to past patterns; we’re anticipating future trends and sculpting solutions that are as fluid and responsive as the data itself. This approach doesn’t just change how we view data; it revolutionizes our interaction with the world around us, urging us to think beyond traditional confines and explore new horizons of possibility of technology for all.


References

  1. Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. https://luiscruz.github.io/green-ai/publications/2021-03-bender-parrots.

  2. Buolamwini, J., & Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. Proceedings of the 1st Conference on Fairness, Accountability, and Transparency, 77–91. http://proceedings.mlr.press/v81/buolamwini18a.html.

  3. Burke, K. (1966). Language as Symbolic Action: Essays on Life, Literature, and Method. University of California Press.

  4. Datanami. (n.d.). Data Management and Governance in the Data Lake. [Website]. Available at: https://www.datanami.com.

  5. Manovich, L. (2001). The Language of New Media. The MIT Press.

  6. McKinsey & Company. (n.d.). Designing data governance that delivers value. [Website]. Available at: https://www.mckinsey.com.

  7. Noble, S. U. (2018). Algorithms of Oppression: How Search Engines Reinforce Racism. New York University Press.

  8. Starburst. (n.d.). Data governance: Use cases, framework, tools, best practices. [Website]. Available at: https://www.starburst.io.

About me: Hello, my name is Kem-Laurin, and I am one half of the co-founding team of Human Tech Futures. At Human Tech Futures, we’re passionate about helping our clients navigate the future with confidence! Innovation and transformation are at the core of what we do, and we believe in taking a human-focused approach every step of the way.

We understand that the future can be uncertain and challenging, which is why we offer a range of engagement packages tailored to meet the unique needs of both individuals and organizations. Whether you’re an individual looking to embrace change, a business seeking to stay ahead of the curve, or an organization eager to shape a better future, we’ve got you covered.

Connect with us at https://www.humantechfutures.ca/contact

