The Future of Data Lakehouses: A Fireside Chat with Vinoth Chandar - Founder/CEO of Onehouse & PMC Chair of Apache Hudi
Exploring the Evolution of Lakehouse Technology: A Conversation with Vinoth Chandar, Founder & CEO of Onehouse

In this episode, Ananth, author of Data Engineering Weekly, talks with Vinoth Chandar, Founder & CEO of Onehouse and PMC Chair of Apache Hudi, about the latest developments in the Lakehouse technology space, particularly focusing on Apache Hudi, Iceberg, and Delta Lake. They discuss the intricacies of building high-scale data ecosystems, the impact of table format standardization, and technical advances in incremental processing and indexing. The conversation delves into the role of open source in shaping the future of data engineering and addresses community questions about integrating various databases and improving operational efficiency.

00:00 Introduction and New Year Greetings
01:19 Introduction to Apache Hudi and Its Impact
02:22 Challenges and Innovations in Data Engineering
04:16 Technical Deep Dive: Hudi's Evolution and Features
05:57 Comparing Hudi with Other Data Formats
13:22 Hudi 1.0: New Features and Enhancements
20:37 Industry Perception and the Future of Data Formats
24:29 Technical Differentiators and Project Longevity
26:05 Open Standards and Vendor Games
26:41 Standardization and Data Platforms
28:43 Competition and Collaboration in Data Formats
33:38 Future of Open Source and Data Community
36:14 Technical Questions from the Audience
47:26 Closing Remarks and Future Outlook

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.dataengineeringweekly.com
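For readers who want to try the incremental processing discussed in the episode, here is a minimal sketch of an incremental read through Hudi's Spark datasource. The table path and begin-commit instant are illustrative placeholders, and the snippet assumes a SparkSession with the Hudi bundle on its classpath:

```python
from pyspark.sql import SparkSession

# Assumes the Apache Hudi Spark bundle is already on the classpath.
spark = SparkSession.builder.appName("hudi-incremental-sketch").getOrCreate()

# Incremental query: return only records written after the given commit
# instant, instead of rescanning the whole table.
incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20250101000000")  # placeholder instant
    .load("s3://my-bucket/warehouse/orders_hudi")  # placeholder table path
)
incremental_df.show()
```

Pulling only the changes since a known commit, rather than recomputing over the full table, is what makes the incremental pipelines discussed in the episode cheaper than periodic full refreshes.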
--------
47:41
Agents of Change: Navigating 2025 with AI and Data Innovation
In this episode of DEW, the hosts and guests discuss their predictions for 2025, focusing on the rise and impact of agentic AI. The conversation covers three main categories:

1. The role of agentic AI
2. The future workforce dynamic involving humans and AI agents
3. Innovations in data platforms heading into 2025

Highlights include insights from Aswin and our special guest, Rajesh, on building robust agent systems, strategies for data engineers and AI engineers to remain relevant, data quality and observability, and the evolving landscape of Lakehouse architectures. The discussion also covers the challenges of integrating multi-agent systems and the economic implications of AI sovereignty and data privacy.

00:00 Introduction and Predictions for 2025
01:49 Exploring Agentic AI
04:44 The Evolution of AI Models
16:36 Enterprise Data and AI Integration
25:06 Managing AI Agents
36:37 Opportunities in AI and Agent Development
38:02 The Evolving Role of AI and Data Engineers
38:31 Managing AI Agents and Data Pipelines
39:05 The Future of Data Scientists in AI
40:03 Multi-Agent Systems and Interoperability
44:09 Economic Viability of Multi-Agent Systems
47:06 Data Platforms and Lakehouse Implementations
53:14 Data Quality, Observability, and Governance
01:02:20 The Rise of Multi-Cloud and Multi-Engine Systems
01:06:21 Final Thoughts and Future Outlook

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.dataengineeringweekly.com
--------
1:10:37
Data Engineering Trends With Aswin & Ananth
Welcome to another insightful edition of Data Engineering Weekly. As we approach the end of 2023, it's an opportune time to reflect on the key trends and developments that have shaped the field of data engineering this year. In this article, we'll summarize the crucial points from a recent podcast featuring Ananth and Aswin, two prominent voices in the data engineering community.

Understanding the Maturity Model in Data Engineering
A significant part of our discussion revolved around the maturity model in data engineering. Organizations must recognize their current position on the data maturity spectrum to make informed decisions about adopting new technologies. This approach ensures that new tools and practices align with the organization's readiness and specific needs.

The Rising Impact of AI and Large Language Models
2023 witnessed a substantial impact of AI and large language models on data engineering. These technologies are increasingly automating processes like ETL, improving data quality management, and evolving the landscape of data tools. Integrating AI into data workflows is not just a trend but a paradigm shift, making data processes more efficient and intelligent.

Lakehouse Architectures: The New Frontier
Lakehouse architectures have been at the forefront of data engineering discussions this year. The key focus has been interoperability among different data lake formats and the seamless integration of structured and unstructured data. This evolution marks a significant step towards more flexible and powerful data management systems.

The Modern Data Stack: A Critical Evaluation
The modern data stack (MDS) has been a hot topic, with debates around its sustainability and effectiveness. While MDS has driven hyper-specialization in product categories, challenges in integration and overlapping tool categories have raised questions about its long-term viability. The future of MDS remains a subject of keen interest as we move into 2024.

Embracing Cost Optimization
Cost optimization has emerged as a priority in data engineering projects. With the shift to cloud services, managing costs effectively while maintaining performance has become a critical concern. This trend underscores the need for efficient architectures that balance performance with cost-effectiveness.

Streaming Architectures and the Rise of Apache Flink
Streaming architectures have gained significant traction, with Apache Flink leading the way. Its growing adoption highlights the industry's shift towards real-time data processing and analytics. The support and innovation around Apache Flink suggest a continued focus on streaming architectures in the coming year.

Looking Ahead to 2024
As we look towards 2024, there's a sense of excitement about potential changes in fundamental layers like S3 Express and the broader impact of large language models. The anticipation is for more intelligent data platforms that effectively combine AI capabilities with human expertise, driving innovation and efficiency in data engineering.

In conclusion, 2023 has been a year of significant developments and shifts in data engineering. As we move into 2024, the focus will likely be on refining these trends and exploring new frontiers in AI, lakehouse architectures, and streaming technologies. Stay tuned for more updates and insights in the next editions of Data Engineering Weekly. Happy holidays, and here's to a groundbreaking 2024 in data engineering! This is a public episode.
If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.dataengineeringweekly.com
--------
38:11
DEW #133: How to Implement Write-Audit-Publish (WAP), Vector Database - Concepts and examples & Data Warehouse Testing Strategies for Better Data Quality
Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author's and our perspectives. On DEW #133, we selected the following articles:

lakeFS: How to Implement Write-Audit-Publish (WAP)
I wrote extensively about the WAP pattern in my latest article, An Engineering Guide to Data Quality - A Data Contract Perspective. Super excited to see a complete guide on implementing the WAP pattern in Iceberg, Hudi, and, of course, with lakeFS.
https://lakefs.io/blog/how-to-implement-write-audit-publish/

Jatin Solanki: Vector Database - Concepts and examples
Staying with vector search, a new class of vector databases is emerging in the market to improve semantic search experiences. The author writes an excellent introduction to vector databases and their applications.
https://blog.devgenius.io/vector-database-concepts-and-examples-f73d7e683d3e

Policygenius: Data Warehouse Testing Strategies for Better Data Quality
Data testing and data observability are widely discussed topics in Data Engineering Weekly. However, both techniques run only after the transformation task is completed. Can we test SQL business logic during the development phase itself? Perhaps unit test the pipeline?
The author writes an exciting article about adopting unit testing in the data pipeline by producing sample tables during development. We will see more tools around unit test frameworks for data pipelines soon. I don't think testing data quality on all the PRs against the production database is a cost-effective solution. We can do better than that, tbh.
https://medium.com/policygenius-stories/data-warehouse-testing-strategies-for-better-data-quality-d5514f6a0dc9

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.dataengineeringweekly.com
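To make the WAP pattern concrete, here is a minimal, engine-agnostic sketch in PySpark. The table names, paths, and audit checks are illustrative placeholders, and the linked lakeFS guide implements the isolation step with branches rather than a staging table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wap-sketch").getOrCreate()

# 1. WRITE: land the new batch in a staging table, isolated from production.
batch = spark.read.parquet("s3://landing/orders/dt=2023-07-01")  # placeholder path
batch.write.mode("overwrite").saveAsTable("analytics.orders_staging")

# 2. AUDIT: run data-quality checks against the staged data only.
staged = spark.table("analytics.orders_staging")
assert staged.count() > 0, "audit failed: empty batch"
assert staged.filter("order_id IS NULL").count() == 0, "audit failed: null keys"

# 3. PUBLISH: only audited data reaches the production table.
staged.write.mode("append").saveAsTable("analytics.orders")
```

The point of the pattern is that a failed audit stops the pipeline before the publish step, so consumers of analytics.orders never see the bad batch.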
--------
22:42
DEW #132: The New Generative AI Infra Stack, Databricks cost management at Coinbase, Exploring an Entity Resolution Framework Across Various Use Cases & What's the hype behind DuckDB?
Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author’s and our perspectives.On DEW #132, we selected the following articleCowboy Ventures: The New Generative AI Infra StackGenerative AI has taken the tech industry by storm. In Q1 2023, a whopping $1.7B was invested into gen AI startups. Cowboy ventures unbundle the various categories of Generative AI infra stack here.https://medium.com/cowboy-ventures/the-new-infra-stack-for-generative-ai-9db8f294dc3fCoinbase: Databricks cost management at CoinbaseEffective cost management in data engineering is crucial as it maximizes the value gained from data insights while minimizing expenses. It ensures sustainable and scalable data operations, fostering a balanced business growth path in the data-driven era. Coinbase writes one case about cost management for Databricks and how they use the open-source overwatch tool to manage Databrick’s cost.https://www.coinbase.com/blog/databricks-cost-management-at-coinbaseWalmart: Exploring an Entity Resolution Framework Across Various Use CasesEntity resolution, a crucial process that identifies and links records representing the same entity across various data sources, is indispensable for generating powerful insights about relationships and identities. This process, often leveraging fuzzy matching techniques, not only enhances data quality but also facilitates nuanced decision-making by effectively managing relationships and tracking potential matches among data records. Walmart writes about the pros and cons of approaching fuzzy matching with rule-based and ML-based matching.https://medium.com/walmartglobaltech/exploring-an-entity-resolution-framework-across-various-use-cases-cb172632e4aeMatt Palmer: What's the hype behind DuckDB?So DuckDB, Is it hype? or does it have the real potential to bring architectural changes to the data warehouse? The author explains how DuckDB works and the potential impact of DuckDB in Data Engineering.https://mattpalmer.io/posts/whats-the-hype-duckdb/ This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.dataengineeringweekly.com