Discover how to efficiently clean and transform JSON files into Lakehouse tables using Microsoft Fabric Notebooks. This article provides a detailed walkthrough, from leveraging API data to utilizing PySpark for data transformation.
Get JSON via API
We will use a Data Pipeline with a Copy Activity to store a JSON file from an API into a Lakehouse. You can find the full step-by-step for that part in this story. Here we will focus on the file transformation.

Let's focus on the second step of this pipeline, where we use a Notebook activity to call a notebook in the workspace that transforms the JSON file and saves it as a table.

What is a Notebook?
A notebook is an interactive computing environment that lets users combine executable code, rich text, visualizations, equations, and other media in a single document.
Notebooks let users write and execute code in a variety of programming languages (most commonly Python, R, and Julia) in discrete blocks, or cells, which enables step-by-step data exploration and analysis.
Fabric Notebooks
The Microsoft Fabric notebook serves as a crucial tool for crafting Apache Spark jobs and conducting machine learning experiments. While data engineers primarily utilize it for data ingestion, preparation, and transformation tasks, data scientists leverage these notebooks for a range of machine learning activities. These include designing experiments and models, tracking model performance, and managing deployment processes.
Currently, we have these JSON files in the flights folder in our Lakehouse.

This is the JSON format. Here we can see data related to flights with information like flight dates, departures, and arrivals.
{
  "pagination":{
    "limit":100,
    "offset":0,
    "count":100,
    "total":382609
  },
  "data":[
    {
      "flight_date":"2024-01-10",
      "flight_status":"scheduled",
      "departure":{
        "airport":"Nhulunbuy",
        "timezone":"Australia/Darwin",
        "iata":"GOV",
        "icao":"YPGV",
        "terminal":null,
        "gate":null,
        "delay":null,
        "scheduled":"2024-01-10T08:00:00+00:00",
        "estimated":"2024-01-10T08:00:00+00:00",
        "actual":"2024-01-08T08:51:00+00:00",
        "estimated_runway":"2024-01-08T08:51:00+00:00",
        "actual_runway":"2024-01-08T08:51:00+00:00"
      },
      "arrival":{
        "airport":"Darwin",
        "timezone":"Australia/Darwin",
        "iata":"DRW",
        "icao":"YPDN",
        "terminal":"D",
        "gate":null,
        "baggage":null,
        "delay":56,
        "scheduled":"2024-01-08T09:54:00+00:00",
        "estimated":"2024-01-08T09:54:00+00:00",
        "actual":null,
        "estimated_runway":null,
        "actual_runway":null
      },
      "airline":{
        "name":"empty",
        "iata":null,
        "icao":null
      },
      "flight":{
        "number":null,
        "iata":null,
        "icao":null,
        "codeshared":null
      },
      "aircraft":null,
      "live":null
    }
  ]
}
Notebook
You will need to create a new PySpark Notebook in your workspace with the following cells, linked to your Lakehouse.
In this first cell, we load all the JSON files from the flights folder into a PySpark DataFrame.
# Load all JSON files in the flights folder into a dataframe
df = spark.read.json("Files/flights/")
The result will be a DataFrame whose data column holds the array structure with all the data that you saw above.
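If you want to check what Spark inferred from the nested JSON before transforming it, you can print the schema and preview the raw DataFrame. This inspection cell is optional and not part of the original walkthrough; it only uses the df variable created above.
# Optional: inspect the schema Spark inferred from the nested JSON
df.printSchema()

# Optional: preview the raw rows (pagination and data columns)
display(df)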

In the following cell, the goal is to read the JSON content, apply some transformations to specific attributes, and then explode the data array so that the DataFrame ends up in a columnar format.
We import the PySpark SQL functions we need and apply them to the DataFrame. The transformation work may vary depending on the JSON structure.
# Apply transformations to the dataframe
from pyspark.sql.functions import col, to_timestamp, to_date, explode

exploded_df = df.select(explode(col("data")).alias("data"))

tf_df = exploded_df.select(
    to_date(col("data.flight_date"), "yyyy-MM-dd").alias("flight_date"),
    col("data.flight_status").alias("flight_status"),
    col("data.departure.iata").alias("departure_airport_code"),
    to_timestamp(col("data.departure.scheduled")).alias("departure_scheduled_date"),
    to_timestamp(col("data.arrival.scheduled")).alias("arrival_scheduled_date"),
    col("data.airline.iata").alias("airline_code"),
    col("data.flight.number").alias("flight_number")
)

display(tf_df)
First, we use the explode function to return a new row for each element in the given array or map; in this case, the array column is called data.
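To make the effect of explode more concrete, here is a minimal, self-contained sketch; the sample_df DataFrame and its values are invented purely for illustration and have nothing to do with the flights data.
# Toy example of explode: one input row with a 3-element array
# becomes three output rows, one per array element.
from pyspark.sql.functions import col, explode

sample_df = spark.createDataFrame([(1, ["a", "b", "c"])], ["id", "letters"])
exploded_sample = sample_df.select("id", explode(col("letters")).alias("letter"))
exploded_sample.show()
# Expected result: three rows -> (1, a), (1, b), (1, c)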

Then we select the data in our exploded_df DataFrame as columns using the PySpark SQL functions. Some of the transformations are:
- to_date(col("data.flight_date"), "yyyy-MM-dd").alias("flight_date") — convert flight_date to date with format yyyy-MM-dd.
- to_timestamp(col("data.departure.scheduled")).alias("departure_scheduled_date") — convert the scheduled column, which arrives as a string, to a timestamp.
We give every selected column an alias. When we display our tf_df DataFrame, we now have a columnar format.

For our last step, we want to save this result as a delta table in the Lakehouse.
Very simple 👇
# Save the dataframe as a Delta table in the Lakehouse
tf_df.write.format("delta").mode("overwrite").save("Tables/fctFlights")
Here's the final result:
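If you want to confirm that the table landed correctly, you can read it back from the Tables area. This check cell is optional and not part of the original notebook; it assumes the same fctFlights path used above.
# Optional: read the Delta table back and preview a few rows
check_df = spark.read.format("delta").load("Tables/fctFlights")
display(check_df.limit(10))
print(check_df.count(), "rows written")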

We can add this notebook to our pipeline so that each time a new JSON file is stored in the Lakehouse, the notebook runs and refreshes the table.
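Because the notebook reads every file in Files/flights and overwrites the table on each run, one possible variation, which I am only sketching here as an assumption and which is not covered in this walkthrough, is to append the latest batch instead; in that case you are responsible for making sure the input contains only new files, otherwise the same flights would be inserted twice.
# Hypothetical variation: append instead of overwrite.
# Only safe if the input folder contains just the new, unprocessed files;
# otherwise the same flights will be written more than once.
tf_df.write.format("delta").mode("append").save("Tables/fctFlights")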
