Lakehouses, Data Governance, & AI/ML: Key Takeaways from Data Council 2024

Attending Data Council 2024 was truly a refreshing experience – a vendor-neutral conference unlike any other I’ve encountered recently. Connecting with passionate individuals deeply invested in the data community was inspiring. It was also a great opportunity for me and my colleagues to validate the work we’re doing to improve 1upHealth’s nextgen health cloud data platform. Here are four noteworthy topics I observed during the conference. 

The Expanding Role of Lakehouse Architecture

Lakehouses are gaining ground, and the many talks dedicated to them at Data Council confirmed their importance.

The lightning talk entitled “Charting the Lakehouse Trail: A Data Migration Adventure” covered the journey of migrating data to a lakehouse and the considerations involved, including performance tuning, observability requirements, resource optimization, and best practices for lakehouse architecture.

Another lightning talk entitled “Tackling I/O Challenges in Modern Data Lakes” focused on:

  • Saving costs with I/O in lakehouse architecture
  • How modern data stacks pose challenges to data locality
  • Why we should care about I/O in a lakehouse
  • Emerging caching techniques that help solve some of the I/O challenges and deliver cost savings

The talk “Open Data Foundations across Hudi, Iceberg and Delta” highlighted conversion capabilities across table formats such as Delta Lake, Iceberg, and Hudi, using the Apache incubating project XTable. This converter focuses on metadata conversion between these table formats without touching the underlying data in the lakehouse.

The presentation entitled “The Future of Data Engineering in a Post-AI World” provided valuable information on the modern data stack, AI/ML vocabulary ranging from descriptive and predictive analytics all the way to cognitive analytics[1][2], and the valuable role of human intervention.

Separating storage from compute is critical for cost savings and data portability. The talk entitled “Redefining Database Workloads: The Future with Modern Object Storage” focused on the disaggregation of storage and query engines, how the modern data stack is evolving with it, and how major warehouse vendors like Snowflake are building capabilities to load lakehouse data as external tables.

Finally, the presentation entitled “The Reality of Building a Modern AI Data Stack” explained the challenges the AI stack poses and features that we need to take into consideration to enable the platform for AI. 

All the discussions surrounding lakehouse architecture resonate deeply with 1upHealth’s vision of delivering cost-effective and highly scalable data management within our 1up FHIR Platform using lakehouse architecture. Lakehouse architecture with proper access controls allows us to enable data exchange use cases beyond APIs and support them at scale.

The Importance of Constructing Dependable Pipelines

Batch and real-time pipelines pose different challenges, and the presenter of the talk entitled “Bridging the Gap Between Batch and Real-Time with Mixed-Latency Pipelines” walked the audience through the platform capabilities needed to handle mixed workloads coming from both. Key points included data freshness, observability, and lineage requirements for mixed workloads.

Software reliability engineering is a familiar discipline. The talk entitled “Scaling Data Reliably: A Journey in Growing Through Data Pain Points” covered its data counterpart, “Data Reliability Engineering,” which focuses on the reliability of data delivery, data freshness, missing data, backfills, and reingestion of data. This talk touched on many data engineering practices around pipelines and was quite interesting.
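
As a rough illustration of two of those concerns, freshness and missing data, here is a small Python sketch. It is my own example rather than anything shown in the talk: the function names, the two-hour freshness target, and the dates are all assumptions chosen for illustration.

```python
from datetime import date, datetime, timedelta, timezone


def is_stale(last_loaded_at, max_age, now):
    """Return True when the newest load is older than the freshness target."""
    return now - last_loaded_at > max_age


def missing_days(loaded_days, start, end):
    """List the dates in [start, end] that have no loaded partition."""
    expected = (start + timedelta(days=i) for i in range((end - start).days + 1))
    return [day for day in expected if day not in loaded_days]


# Freshness: the last load finished 2.5 hours ago against a 2-hour target.
now = datetime(2024, 4, 1, 12, 0, tzinfo=timezone.utc)
last_load = datetime(2024, 4, 1, 9, 30, tzinfo=timezone.utc)
stale = is_stale(last_load, max_age=timedelta(hours=2), now=now)

# Missing data: daily partitions that never arrived are backfill candidates.
loaded = {date(2024, 3, 28), date(2024, 3, 29), date(2024, 3, 31)}
backfill = missing_days(loaded, date(2024, 3, 28), date(2024, 4, 1))

print("stale:", stale)
print("days to backfill:", [d.isoformat() for d in backfill])
```

Checks like these are cheap to run on a schedule, and the list of missing days doubles as the work queue for a backfill or reingestion job.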

The lightning talk entitled “Enabling Data Centric Solutions through Modern Schema Management” highlighted schema enforcement as a contract between producers and consumers. This talk demonstrated schema enforcement with Avro and a schema registry.
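
The idea of a schema as a contract can be sketched in a few lines of Python. To be clear about the assumptions: this is a hand-rolled checker over an Avro-style schema dictionary, not the Avro library or a real schema registry, it covers only a few primitive types, and the `PatientEvent` schema and its field names are hypothetical.

```python
# Avro-style record schema (hypothetical example, not from the talk).
PATIENT_EVENT_SCHEMA = {
    "type": "record",
    "name": "PatientEvent",
    "fields": [
        {"name": "patient_id", "type": "string"},
        {"name": "event_type", "type": "string"},
        {"name": "occurred_at_ms", "type": "long"},
    ],
}

# Maps a handful of Avro primitive type names to Python types.
PRIMITIVES = {"string": str, "long": int, "int": int, "double": float, "boolean": bool}


def contract_violations(record, schema):
    """Return a list of human-readable problems; an empty list means the record conforms."""
    errors = []
    declared = {field["name"] for field in schema["fields"]}
    for field in schema["fields"]:
        name, type_name = field["name"], field["type"]
        if name not in record:
            errors.append(f"missing field: {name}")
        elif type(record[name]) is not PRIMITIVES[type_name]:
            errors.append(f"wrong type for {name}: expected {type_name}")
    for name in sorted(record.keys() - declared):
        errors.append(f"unexpected field: {name}")
    return errors


good = {"patient_id": "p-123", "event_type": "admit", "occurred_at_ms": 1711972800000}
bad = {"patient_id": "p-123", "occurred_at_ms": "yesterday"}

print(contract_violations(good, PATIENT_EVENT_SCHEMA))
print(contract_violations(bad, PATIENT_EVENT_SCHEMA))
```

A producer that runs a check like this before publishing fails fast on its own side instead of breaking consumers downstream. A schema registry adds the shared, versioned copy of the contract that both sides validate against.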

1upHealth’s approach to constructing dependable pipelines through the EL-T paradigm, employing a modern data stack, prioritizing data observability with message tracing, and wholeheartedly embracing open-source technologies across every facet of our health cloud architecture was echoed in various presentations, each offering unique perspectives.

A Focus on Data Governance

The OpenLineage specification is emerging, and most data vendors are welcoming it, including Collibra, Datadog, and IBM. During the panel entitled “Data Lineage: We’ve Come a Long Way”, the speakers discussed the OpenLineage specification and its importance to the community.

Another panel discussion entitled “WTF Are We Doing?” emphasized the importance of governance with emerging tools and the AI stack in the data community. 

These talks support 1upHealth’s decisions to give utmost importance to data governance, metadata management, and the future of AI in the 1upHealth platform. 

AI and ML in Data

In a lightning talk entitled “Give Rust a Chance”, the speaker made an interesting case for Rust as a language of choice in the AI and ML world, noting that it is compatible with Python. The speaker covered the importance of Rust alongside JVM-based languages, why it’s easy, and why everyone should give Rust a chance when building libraries for Python.

Metaflow is a platform for ML and AI workflows. The talk “Beyond MLOps: Building AI systems with Metaflow” walked the audience through a sample ML-driven furniture purchase experience and how that customer experience is evaluated. According to the talk, building on Metaflow makes the tech stack, model development, feature building, and model deployment easier to manage.

Microsoft presented on A/B testing in a talk entitled “Case Studies from a Methodologist on an Experimentation Platform.” This provided a nice intro to the importance of A/B testing in ML/AI and insights into what 1upHealth should build as a capability in our platform for the ML space. 

Lastly, I found the talk “Is Kubernetes a Database?” funny and interesting. The presentation covered how to use Kubernetes as a database and how to query/join the data (small data) stored in K8s resources. 

Other talks that resonated well covered Apache Arrow and DuckDB, including a handful from InfluxDB and other startups. I would recommend watching the presentations on the Data Council website.

Lessons Learned at Data Council 2024 Echo 1upHealth’s Approach

The primary objective of the 1up platform is to efficiently acquire, manage, govern, and serve health data through various products. Many of the talks at Data Council 2024 echoed our approach, including shift-left observability, reliable data pipelines, strong data quality, better lineage, data governance to protect data assets, and enablement of AI capabilities in the platform.

For more thought leadership on these (and other) important data topics, take a moment to subscribe to our blog and follow us on LinkedIn and Twitter.