Real-Time Streaming with Azure Databricks and Event Hubs

Project Architecture

Project Overview

This project demonstrates an end-to-end real-time data streaming solution using Azure Event Hubs, Databricks (with Spark Structured Streaming), Delta Lake, and Power BI for visualization. The project follows the Medallion architecture to process data through Bronze, Silver, and Gold layers.
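
As a rough illustration of the Bronze, Silver, and Gold flow, the following minimal sketch models the three layers in plain Python rather than Spark and Delta Lake. The layer responsibilities shown here (raw storage, parsing and validation, aggregation) follow the common Medallion convention and are an assumption, as are the function names and the sample fields.

```python
import json
from statistics import mean


def to_bronze(raw_events):
    """Bronze: keep every raw event exactly as received (assumed convention)."""
    return list(raw_events)


def to_silver(bronze_rows):
    """Silver: parse JSON and drop records that are malformed or incomplete."""
    silver = []
    for raw in bronze_rows:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed payloads remain in Bronze only
        if isinstance(record, dict) and "temperature" in record and "humidity" in record:
            silver.append(record)
    return silver


def to_gold(silver_rows):
    """Gold: aggregate cleaned records into a reporting-ready summary."""
    if not silver_rows:
        return {"count": 0, "avg_temperature": None, "avg_humidity": None}
    return {
        "count": len(silver_rows),
        "avg_temperature": mean(r["temperature"] for r in silver_rows),
        "avg_humidity": mean(r["humidity"] for r in silver_rows),
    }


# Hypothetical input: two valid readings and one malformed payload.
raw = ['{"temperature": 20, "humidity": 60}', "not json", '{"temperature": 22, "humidity": 50}']
summary = to_gold(to_silver(to_bronze(raw)))
```

In the actual project each layer would be a Delta table written by a streaming query; the sketch only shows how the data narrows and improves from one layer to the next.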

Architecture Stages

Medallion Architecture

Azure Event Hubs

Overview

Azure Event Hubs is a big data streaming platform and event ingestion service capable of receiving and processing millions of events per second. It can process and store events, data, or telemetry produced by distributed software and devices. It can also capture streaming data into Azure Blob Storage or Azure Data Lake Storage for long-term retention or micro-batch processing.

How Azure Event Data Flow Works
Features

Test Data Generation

In this project, we generate fake weather data in JSON format. Each record includes temperature, humidity, wind speed, wind direction, precipitation, and a text description of the conditions. Below is an example of the JSON format used:

{
  "temperature": 20,
  "humidity": 60,
  "windSpeed": 10,
  "windDirection": "NW",
  "precipitation": 0,
  "conditions": "Partly Cloudy"
}
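
A generator for records of this shape could look like the minimal sketch below. The field names come from the example above; the value ranges, the list of conditions, and the function name `generate_weather_event` are assumptions made for illustration, not the project's actual generator.

```python
import json
import random

WIND_DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
CONDITIONS = ["Sunny", "Partly Cloudy", "Cloudy", "Rainy"]  # assumed values


def generate_weather_event(rng=random):
    """Return one fake weather reading as a JSON string."""
    event = {
        "temperature": rng.randint(-10, 40),  # assumed range
        "humidity": rng.randint(0, 100),      # assumed range
        "windSpeed": rng.randint(0, 100),     # assumed range
        "windDirection": rng.choice(WIND_DIRECTIONS),
        "precipitation": rng.randint(0, 50),  # assumed range
        "conditions": rng.choice(CONDITIONS),
    }
    return json.dumps(event)


# A seeded generator makes the fake data reproducible between runs.
sample = json.loads(generate_weather_event(random.Random(42)))
```

Each JSON string returned by the function would then be sent to Event Hubs as the body of one event.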
    

Tools Used

Project Plan

Notes on Real-Time Data Processing with Azure Databricks and Event Hubs

Output Modes in Spark Structured Streaming
Output Mode: Behaviour
Complete: The entire updated result table is written to external storage.
Append: Only new rows appended in the result table since the last trigger are written to external storage.
Update: Only the rows that were updated in the result table since the last trigger are written to external storage. If the query doesn't contain aggregations, it is equivalent to Append mode.

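
The difference between the three modes can be sketched with a toy model in plain Python. This is not Spark's implementation: `rows_to_write` is a hypothetical helper that compares the result table before and after one trigger and returns the rows each mode would write.

```python
from collections import Counter


def rows_to_write(mode, previous, current):
    """Return the rows of `current` that a given output mode would write.

    `previous` and `current` are the result table (key -> value) before and
    after a trigger. Toy model only, not Spark's implementation.
    """
    if mode == "complete":
        return dict(current)  # the whole result table
    if mode == "append":
        return {k: v for k, v in current.items() if k not in previous}  # new rows only
    if mode == "update":
        return {k: v for k, v in current.items() if previous.get(k) != v}  # new or changed rows
    raise ValueError(f"unknown output mode: {mode}")


# Running count of readings per weather condition across two triggers.
table_before = Counter(["Sunny", "Cloudy"])
table_after = table_before + Counter(["Sunny", "Rainy"])

complete_rows = rows_to_write("complete", table_before, table_after)
append_rows = rows_to_write("append", table_before, table_after)
update_rows = rows_to_write("update", table_before, table_after)
```

Here Complete writes all three counts, Update writes only the changed "Sunny" row and the new "Rainy" row, and Append writes only "Rainy". In Spark itself, Append mode on an aggregation additionally requires a watermark, so that a row is written only once it can no longer change.
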
Checkpointing in Spark Structured Streaming

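Checkpointing records a streaming query's progress, such as the source offsets it has already processed, together with its intermediate state in a checkpoint location. After a job restart or failure, the writeStream query resumes from the last recorded position instead of starting over, which is what prevents duplicate outputs.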
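
The idea of resuming from recorded progress can be sketched in plain Python. This is a toy model under stated assumptions, not how Spark stores its checkpoints: the JSON file layout, the `run_stream` function, and the `max_events` parameter (used to simulate a job that stops early) are all hypothetical.

```python
import json
import os
import tempfile


def run_stream(events, checkpoint_path, sink, max_events=None):
    """Process events starting from the last checkpointed offset.

    `max_events` simulates a job that stops early; all names are illustrative.
    """
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]  # resume where the last run stopped

    processed = 0
    while offset < len(events) and (max_events is None or processed < max_events):
        sink.append(events[offset])  # write the output
        offset += 1
        processed += 1
        with open(checkpoint_path, "w") as f:  # then record the progress
            json.dump({"offset": offset}, f)
    return offset


events = ["e0", "e1", "e2", "e3"]
sink = []
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "checkpoint.json")
    first_offset = run_stream(events, checkpoint, sink, max_events=2)  # first run stops early
    final_offset = run_stream(events, checkpoint, sink)                # restart resumes at offset 2
```

After both runs the sink holds each event exactly once. In this toy, a crash between writing an output and recording its offset would still replay that one event, which is why a checkpoint is normally combined with an idempotent or transactional sink.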

Windowing in Spark Structured Streaming

Windowing Behavior: Consider a window whose upper bound is 3:20, with a 5-minute allowance for late data (the watermark delay). The latest window closes when an event with an event time later than 3:25 is received, because such an event time is greater than the window's upper bound plus 5 minutes.
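
The arithmetic behind this rule can be checked with a short sketch. The function name `window_is_closed` and the calendar date are hypothetical; the strict "later than" comparison follows the wording of the note above rather than any statement about Spark's internals.

```python
from datetime import datetime, timedelta


def window_is_closed(window_end, max_event_time, allowed_lateness):
    """Return True once the latest event time is past window_end + allowed_lateness."""
    return max_event_time > window_end + allowed_lateness


day = datetime(2024, 1, 1)  # arbitrary date; only the time of day matters
window_end = day.replace(hour=3, minute=20)
lateness = timedelta(minutes=5)

# An event at exactly 3:25 does not close the window; one at 3:26 does.
closed_at_325 = window_is_closed(window_end, day.replace(hour=3, minute=25), lateness)
closed_at_326 = window_is_closed(window_end, day.replace(hour=3, minute=26), lateness)
```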

Power BI Integration

To connect with Power BI, you can use Partner Connect in Databricks, which sets up the integration between the two.

View the full project on GitHub