A data engineer posted a question that’s keeping ML platform teams up at night. Their team switched to Lance format for embeddings a few months ago, blazing fast for vector operations, zero complaints about performance. But now they’re staring at 50+ Lance datasets scattered across S3 with names like user_emb_v3_fixed.lance and user_emb_v3_final_actual.lance. Nobody knows what’s what. Their Iceberg tables live in a proper catalog with schemas, ownership, and lineage. The Lance datasets? They’re governed by a sophisticated system called “ask Kevin, he might remember.”
This isn’t a niche problem. It’s the canary in the coal mine for AI data infrastructure.
The Speed Trap
Lance format delivers on its promise. Designed for high-performance vector search and multimodal AI workloads, it offers filtered searches up to 19× faster than alternatives, 10–20% lower cold latency for full-text search, and ~20% faster HNSW search. Netflix and Runway have built production systems on it. The format’s architecture supports evolving schemas: start with a column of images, then add embeddings, captions, and features over time until you’ve built a comprehensive multimodal asset.
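That evolving-schema pattern can be illustrated without the Lance API itself. Here is a plain-Python stand-in, a minimal sketch where the dataset name, column names, and dates are all hypothetical; real Lance datasets evolve through operations like column addition and merges, and this only models the bookkeeping:

```python
from datetime import date

class EvolvingDataset:
    """Toy stand-in for a dataset with additive schema evolution.

    Records which columns exist and when each batch of columns was
    introduced -- the question a catalog answers and a bare S3 prefix
    cannot.
    """
    def __init__(self, name: str):
        self.name = name
        self.versions: list[tuple[date, list[str]]] = []

    def add_columns(self, added: list[str], on: date) -> None:
        # Each addition produces a new schema version, like a new
        # dataset version in a versioned table format.
        current = self.versions[-1][1] if self.versions else []
        self.versions.append((on, current + added))

    def columns_at(self, version: int) -> list[str]:
        return self.versions[version][1]

# Start with raw images, grow into a multimodal asset over time.
ds = EvolvingDataset("user_media")
ds.add_columns(["image"], on=date(2025, 1, 10))
ds.add_columns(["embedding_clip_v2"], on=date(2025, 3, 2))
ds.add_columns(["caption", "quality_score"], on=date(2025, 6, 18))

print(ds.columns_at(0))  # ['image']
print(ds.columns_at(2))  # ['image', 'embedding_clip_v2', 'caption', 'quality_score']
```

The point of the sketch: the schema history is data in its own right, and it has to live somewhere queryable.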
But that flexibility becomes a liability without guardrails. ML engineers in frontier labs treat embeddings as ephemeral artifacts: generate, use, discard. Enterprise data engineers see the same data as permanent assets requiring governance. This philosophical split creates an operational chasm. When your “temporary” embedding dataset becomes the foundation for three production models six months later, ask_kevin.py stops being a viable catalog strategy.
The Catalog Integration That Exposes the Gap
Apache Gravitino’s 1.1.0 release, dropped in December 2025, added a Lance REST service that exposes Lance datasets through a managed HTTP interface. The significance isn’t the technical feat; it’s the admission that vector data needs the same governance as tabular data. Gravitino now federates Lance datasets alongside Iceberg tables, giving organizations unified access control, schema management, and discovery across both structured and vector data.
This integration reveals something uncomfortable: MLOps tooling has a metadata maturity gap. While data engineering solved cataloging years ago with Hive metastores, Unity Catalog, and Iceberg REST specs, ML teams have been YOLO-ing their most valuable data assets into S3 buckets with naming conventions that would make a junior developer weep.
The Lance ecosystem acknowledges this. The official namespace implementations now support Apache Hive MetaStore, Apache Polaris, Unity Catalog, Iceberg REST Catalog, and AWS Glue. For teams not ready for external catalogs, there’s even a Directory Namespace, a catalog that lives in a directory, backed by a Lance table itself. It’s a bridge for teams transitioning from “storage-only” to “governed-assets.”
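The “catalog that lives in a directory” idea is simpler than it sounds. A minimal sketch of the concept follows, using a JSON manifest file; the file name, field names, and layout are assumptions for illustration, not the official Directory Namespace format (which is backed by a Lance table, not JSON):

```python
import json
import tempfile
from pathlib import Path

def register(root: Path, name: str, owner: str, purpose: str) -> None:
    """Append a dataset entry to a manifest living inside the directory."""
    manifest = root / "_namespace.json"  # hypothetical file name
    entries = json.loads(manifest.read_text()) if manifest.exists() else {}
    entries[name] = {"owner": owner, "purpose": purpose}
    manifest.write_text(json.dumps(entries, indent=2))

def lookup(root: Path, name: str) -> dict:
    """Answer 'what is this and who owns it' without asking Kevin."""
    return json.loads((root / "_namespace.json").read_text())[name]

root = Path(tempfile.mkdtemp())
register(root, "user_emb_v3.lance",
         owner="kevin", purpose="user embeddings for ranking")
print(lookup(root, "user_emb_v3.lance")["owner"])  # kevin
```

Even this toy version beats naming conventions: the metadata travels with the data, and the migration path to a real catalog is a bulk import.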
Two Cultures, One Catalog
The Reddit discussion surfaces a critical divide. Enterprise data engineers integrating Lance for AI features demand catalog support. Frontier AI engineers prefer storage-only solutions and manual management. This isn’t just preference; it’s a reflection of organizational maturity.
Data engineers have lived through the data lake chaos of 2015. They’ve seen what happens when petabytes of Parquet files accumulate without metadata. They know that user_emb_v3_fixed_final_USE_THIS_ONE.lance eventually becomes a lawsuit waiting to happen when GDPR auditors ask which model consumed personally-identifiable embeddings.
AI engineers, focused on iteration speed, view catalogs as bureaucracy. But that perspective breaks down at scale. When your vector dataset evolves from a single embedding column to a 200-column multimodal asset spanning billions of vectors, “just read the code” becomes an impossible documentation strategy.
The Governance Reality Check
Here’s what proper cataloging unlocks for Lance datasets:
- Schema Evolution Tracking: Lance supports adding columns over time, but without a catalog, you can’t see when `embedding_vit_b_16` was added versus `embedding_clip_v2`, or which models depend on each.
- Access Control: Gravitino’s RBAC and credential vending mean your inference service can access vector data without IAM keys scattered across Kubernetes secrets.
- Data Lineage: When a production model degrades, you need to trace which embedding version, source data, and transformation pipeline created the training dataset. Without a catalog, you’re grepping through six-month-old experiment logs.
- Discovery: Data scientists can’t reuse embeddings they can’t find. A unified catalog means searching across Iceberg tables and Lance datasets in one interface, discovering that the computer vision team already generated the exact embeddings you need.
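The lineage point deserves a concrete shape. Even a flat table of edges answers the audit question — “which models consumed this embedding dataset?” — without grepping experiment logs. A hedged sketch with entirely hypothetical model names, URIs, and version numbers:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LineageEdge:
    model: str            # consuming model (hypothetical names below)
    dataset: str          # dataset URI or catalog identifier
    dataset_version: int  # Lance datasets are versioned; an int stands in

EDGES = [
    LineageEdge("ranker_v7", "s3://emb/user_emb.lance", 12),
    LineageEdge("ranker_v8", "s3://emb/user_emb.lance", 14),
    LineageEdge("fraud_v2",  "s3://emb/txn_emb.lance",  3),
]

def consumers_of(dataset: str) -> list[str]:
    """The GDPR-audit query: which models depend on this dataset?"""
    return [e.model for e in EDGES if e.dataset == dataset]

print(consumers_of("s3://emb/user_emb.lance"))  # ['ranker_v7', 'ranker_v8']
```

A real catalog maintains these edges automatically at read/write time; the sketch only shows how little structure is needed before the question becomes answerable at all.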
The Counterintuitive Truth
The controversy isn’t whether Lance needs catalogs; it’s that embedding governance is data governance. For years, ML artifacts lived in a separate universe from “real” data. Models were versioned in MLflow, datasets in DVC, embeddings in… S3 prefixes? This fragmentation made sense when AI was experimental. It’s catastrophic when AI is production-critical.
Gravitino’s Lance integration forces a reckoning. If you’re treating embeddings as first-class data assets, they need first-class data infrastructure. That means catalogs, access controls, schema enforcement, and lineage tracking, not as bureaucracy that slows you down, but as the only way to scale.
The Path Forward
The good news? You don’t have to migrate everything overnight. Start with the Directory Namespace to catalog existing Lance datasets in place. Tag datasets with ownership, purpose, and model dependencies. Gradually federate into Gravitino or your existing catalog system as you productionize new pipelines.
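That first step, cataloging existing datasets in place, can start as a one-off scan: find every unregistered `.lance` prefix and emit a starter manifest for humans to fill in. A stdlib-only sketch, where the paths, file name, and manifest fields are illustrative assumptions:

```python
import json
import tempfile
from pathlib import Path

def bootstrap_manifest(root: Path) -> dict:
    """Stub a manifest entry for each *.lance directory found,
    flagging the ownership question instead of leaving it implicit."""
    manifest = {}
    for d in sorted(root.glob("*.lance")):
        manifest[d.name] = {
            "owner": "UNKNOWN -- assign before production use",
            "purpose": "",
            "model_dependencies": [],
        }
    (root / "_starter_manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest

# Simulate the infamous S3 prefix with a temp directory.
root = Path(tempfile.mkdtemp())
(root / "user_emb_v3_fixed.lance").mkdir()
(root / "user_emb_v3_final_actual.lance").mkdir()
print(sorted(bootstrap_manifest(root)))
# ['user_emb_v3_final_actual.lance', 'user_emb_v3_fixed.lance']
```

Against a real bucket you’d list prefixes instead of globbing a local directory, but the output is the same: a to-do list of ungoverned assets, which is the honest starting point.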
The bad news? That “temporary” embedding dataset you created last quarter is already in production, and Kevin is on vacation. You’re not just cataloging data, you’re cataloging technical debt.
The question isn’t whether Lance needs a catalog. It’s whether your MLOps stack is mature enough to treat AI data as real data. The answer, for most teams, is a resounding no, and the clock is ticking.

What’s Your Governance Model?
Are you cataloging Lance datasets, or is “ask Kevin” still your metadata strategy? The comment sections on this topic reveal a split: enterprise teams see catalogs as essential, AI-native startups see them as overhead. But as vector datasets grow from gigabytes to petabytes, that overhead becomes a requirement.
The real controversy isn’t the tooling, it’s the cultural shift. ML engineering needs to grow up, and fast.
