1 December 2024
Managing data across different platforms and formats can be a challenge for modern organizations. That’s where Lakehouse Federation in Databricks comes in. This tool allows users to access and query data across multiple sources without moving it, saving time, money, and resources. Below, you’ll find a simpler, practical breakdown of what Lakehouse Federation is, how it works, and why it matters.
What is Lakehouse Federation?
Lakehouse Federation is Databricks’ solution for executing queries across multiple data sources without requiring data migration to a centralized system. It leverages Unity Catalog to manage these federated queries by setting up read-only connections to popular databases through built-in drivers available in pro SQL warehouses, serverless SQL warehouses, and Databricks Runtime clusters. Unity Catalog also provides governance and data lineage, ensuring that all access to federated data is properly managed, tracked, and audited across Databricks workspaces.
Key Features and How They Work
Here are the main features of Lakehouse Federation, explained in a straightforward way:
1. Connect to Multiple Data Sources
- You can link Databricks to many databases, cloud services, and file storage systems, such as:
- SQL databases like MySQL, PostgreSQL, or Oracle.
- Cloud storage systems like AWS S3, Azure Data Lake, and Google Cloud Storage.
- Proprietary platforms like Snowflake or SAP.
How it helps: No need to centralize your data. You can analyze it directly from its native storage.
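If you want to confirm which connections already exist in your workspace, Unity Catalog can list them (a minimal check; the output depends on what has been configured):
-- List the connections registered in Unity Catalog
SHOW CONNECTIONS;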
2. No Data Movement
- Forget about large-scale data migrations. Federation lets you run queries directly on your data sources. This saves time and avoids unnecessary duplication.
Example Use Case: You can combine sales data from Google BigQuery with customer feedback stored in PostgreSQL to create a complete picture—without moving either dataset.
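As a sketch of that use case, assuming foreign catalogs named bigquery_catalog and postgres_catalog have already been created for the two sources (the catalog, schema, and table names here are illustrative):
-- Join BigQuery sales with PostgreSQL customer feedback, in place
SELECT s.customer_id, s.total_amount, f.feedback_text
FROM bigquery_catalog.sales.orders s
JOIN postgres_catalog.support.customer_feedback f
  ON s.customer_id = f.customer_id;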
3. Smart Query Optimization
- Databricks uses its Photon-powered query engine to optimize your SQL queries for each source. Where possible, it pushes filters and other computations down to the data source itself, reducing both the data transferred and the load on your central platform.
Why it matters: Faster performance and reduced costs, as you’re not moving huge amounts of data.
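To verify that a filter is pushed down rather than applied after transfer, you can inspect the query plan; a quick check using the illustrative table names from the setup section below:
-- Pushed-down predicates appear in the external scan node of the plan
EXPLAIN FORMATTED
SELECT * FROM federated_catalog.federated_schema.federated_table
WHERE purchase_date > '2024-01-01';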
4. Centralized Data Governance
- All your data, whether in Databricks or external systems, is subject to the same access controls and security rules. This ensures compliance and protects sensitive information.
Bonus Feature: You can also track metadata, like schemas and relationships, for all your data sources in one place.
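For instance, a federated table's columns and properties can be inspected exactly like a native table's (again using the illustrative table from the setup steps below):
-- View columns, data types, and table metadata for a federated table
DESCRIBE TABLE EXTENDED federated_catalog.federated_schema.federated_table;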
Why Should You Use Lakehouse Federation?
Here’s why organizations are adopting this tool:
- Simpler Data Workflows: Analysts and data scientists can access all the data they need without dealing with multiple platforms.
- Cost Efficiency: By skipping data duplication, you save on storage and avoid redundant processing.
- Flexibility: Works across hybrid (on-premise + cloud) or multi-cloud environments.
- Compliance: Helps meet regulations about where data must be stored by letting you analyze it in place.
How to Set It Up
- Create a connection to your source
- Use Databricks connectors for platforms like Oracle, SAP, AWS S3, and Snowflake. These connectors handle the communication between Databricks and the source system.
CREATE CONNECTION connection_postgresql TYPE postgresql
OPTIONS (
  host '<hostname>',
  port '<port>',
  user '<user>',
  password '<password>'
);
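Storing plaintext passwords in SQL is best avoided. Databricks also lets connection options reference a secret; a variant of the statement above, assuming a secret scope named db_secrets containing a postgres_pwd key already exists:
CREATE CONNECTION connection_postgresql_secure TYPE postgresql
OPTIONS (
  host '<hostname>',
  port '<port>',
  user '<user>',
  -- secret() resolves the value from a Databricks secret scope at runtime
  password secret('db_secrets', 'postgres_pwd')
);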
- Create a foreign catalog
- A foreign catalog mirrors an external database within Databricks, allowing users to query and manage data access through Unity Catalog. It is created using a predefined connection to the external data source, either via Catalog Explorer or the CREATE FOREIGN CATALOG SQL command in a notebook or the SQL editor. Metadata from the foreign catalog is synchronized with Unity Catalog whenever you interact with it. The supported data type mappings between Unity Catalog and each external system are documented for the respective data source.
CREATE FOREIGN CATALOG [IF NOT EXISTS] federated_catalog USING CONNECTION connection_postgresql
OPTIONS (database 'database_name');
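Once the foreign catalog exists, you can browse what it exposes before querying (the schema names depend on the source database):
-- Verify which schemas and tables the foreign catalog surfaces
SHOW SCHEMAS IN federated_catalog;
SHOW TABLES IN federated_catalog.federated_schema;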
- Run Federated Queries
- Write SQL queries in Databricks, specifying the data source if needed. For example:
SELECT *
FROM federated_catalog.federated_schema.federated_table
WHERE purchase_date > '2024-01-01';
- Optimize Performance
- Rely on predicate pushdown where the source supports it, and cache or materialize the results of frequently run federated queries to reduce latency and compute costs.
- Govern Access
- Set permissions and policies for users to ensure secure and compliant data use.
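A minimal sketch of this last step, assuming a group named analyst_role exists in your account; note that reading a table requires USE CATALOG and USE SCHEMA privileges in addition to SELECT:
-- Grant the minimum privileges needed to query one federated schema
GRANT USE CATALOG ON CATALOG federated_catalog TO analyst_role;
GRANT USE SCHEMA ON SCHEMA federated_catalog.federated_schema TO analyst_role;
GRANT SELECT ON SCHEMA federated_catalog.federated_schema TO analyst_role;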
Lakehouse Federation and Materialized Views
Integrating Lakehouse Federation with materialized views offers significant performance and efficiency advantages. Materialized views store the results of complex federated queries in a precomputed, cached form, reducing query latency and minimizing repeated data retrieval from external sources. This is especially valuable when querying large datasets from disparate systems, as it reduces the load on underlying databases and optimizes resource usage in Databricks. Additionally, materialized views can be scheduled to refresh at regular intervals, ensuring that users have access to up-to-date insights while maintaining high query performance.
Key Benefits:
- Provides consistent query latency and concurrency even when the underlying data lives outside the lakehouse.
- Enhances performance for cross-source joins and complex transformations.
- Reduces load on operational databases by utilizing materialized views, minimizing the impact of high or concurrent access.
Examples of Using Lakehouse Federation in Databricks
a) Building a 360° Customer View
Goal: Integrate data from different systems to get a comprehensive customer view.
Data Sources:
- Transactional SQL database (e.g., PostgreSQL) containing purchase history.
- Cloud storage (e.g., Amazon S3) holding marketing campaign data.
- CRM system (e.g., Salesforce) with customer interaction data.
Implementation:
With Lakehouse Federation, you can query data across all sources without moving it:
SELECT c.customer_id, c.name, SUM(t.total_amount) AS total_spent, m.last_campaign
FROM federated_catalog.crm.customers c
JOIN federated_catalog.sales.transactions t ON c.customer_id = t.customer_id
JOIN federated_catalog.marketing.campaign_data m ON c.customer_id = m.customer_id
WHERE t.purchase_date > '2023-01-01'
GROUP BY c.customer_id, c.name, m.last_campaign;
Benefits:
- Unified customer analytics without data duplication.
- Better marketing decisions through integrated insights.
b) Real-Time IoT Data Insights
Goal: Monitor sensor data and analyze trends in real-time.
Data Sources:
- Streaming data from Apache Kafka with live sensor readings.
- Historical data stored in Snowflake.
Implementation:
Create a materialized view to monitor temperature trends from IoT devices:
CREATE MATERIALIZED VIEW iot_temperature_trends AS
SELECT device_id, location, AVG(sensor_value) AS avg_temp, CURRENT_DATE AS report_date
FROM federated_catalog.kafka_stream.iot_data
GROUP BY device_id, location;
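Dashboards and alerts can then read from the precomputed view instead of repeatedly hitting the federated sources; for example (the 30-degree threshold is arbitrary):
-- Query the materialized view rather than re-scanning the sources
SELECT device_id, location, avg_temp
FROM iot_temperature_trends
WHERE avg_temp > 30;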
Benefits:
- Quick identification of anomalies in real-time.
- Seamless combination of live and historical data.
c) Multi-Cloud Data Integration
Goal: Analyze data stored across AWS and Azure platforms.
Data Sources:
- Sales data in Amazon Redshift.
- Logistics data in Azure Data Lake.
Implementation:
Run federated queries across both clouds (before running this query, ensure that you’ve created connections for both Redshift and Azure Data Lake in Unity Catalog and assigned appropriate permissions):
SELECT s.order_id, s.product_id, l.delivery_status
FROM federated_catalog.aws_redshift.sales_data s
JOIN federated_catalog.azure_datalake.logistics_data l ON s.order_id = l.order_id
WHERE l.delivery_status = 'Delayed';
Benefits:
- Cross-cloud analytics without complex ETL processes.
- Improved operational efficiency by integrating logistics and sales data.
d) Automated Materialized View Refreshes
Goal: Ensure users access fresh data without overloading source systems.
Implementation: Create and automate the refresh of a materialized view:
CREATE MATERIALIZED VIEW sales_summary AS
SELECT region, product_category, SUM(sales_amount) AS total_sales
FROM federated_catalog.external.sales_data
GROUP BY region, product_category;
-- Schedule an hourly refresh (Quartz cron: second, minute, hour, ...)
ALTER MATERIALIZED VIEW sales_summary
ADD SCHEDULE CRON '0 0 * * * ?';
Benefits:
- Reduced query load on source databases.
- Fast access to up-to-date data for analysts.
e) Centralized Data Access Management
Goal: Secure data access and ensure regulatory compliance.
Implementation: Set role-based access control in Unity Catalog:
GRANT SELECT ON TABLE federated_catalog.external.sales_data TO analyst_role;
Benefits:
- Centralized management of permissions across all data sources.
- Compliance with data governance policies and regulatory requirements.
Challenges and limitations
- Read-Only Queries: All federated queries are limited to read-only operations. Consider using Delta Live Tables for ETL if write operations are required.
- Connector Compatibility: Ensure the required connectors are available for all your data sources.
- Performance Limitations: Querying multiple sources can sometimes be slower than using centralized data, depending on the data size and source configuration. Leverage materialized views or caching for frequently used queries.
- Naming Constraints: Tables or schemas with names that are invalid in Unity Catalog are ignored. Additionally, all table and schema names are converted to lowercase, so lookups must use lowercase; if two names differ only by case, only one of them is imported.
- Data Handling: Each foreign table query triggers a subquery on the remote system, and the results are returned over a single stream, so large result sets can cause memory issues. Partition large datasets to avoid memory overload during query execution (see the sketch after this list).
- Access Restrictions: On compute in single-user access mode, only the owner of a connection can use it.
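One practical mitigation for the data-handling limit, sketched with the illustrative sales_data table and an assumed numeric order_id key: read a large foreign table in bounded key-range slices so that no single remote result stream grows too large.
-- Read a large foreign table in key-range slices to bound each result stream
SELECT * FROM federated_catalog.external.sales_data
WHERE order_id >= 0 AND order_id < 1000000;
-- Repeat with the next range (1000000 to 2000000, and so on)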
Conclusion
Lakehouse Federation in Databricks is a game-changer for businesses dealing with fragmented data systems. It simplifies access to data, reduces costs, and enables faster decision-making, all without the need for extensive data migrations. Whether you’re a data scientist, an analyst, or an IT leader, this tool offers the flexibility and power to make your workflows smoother and more efficient. As data environments grow increasingly complex, tools like Lakehouse Federation will become indispensable for organizations aiming to unlock the full potential of their distributed data assets.