Ed Freeman

Tutorial

Notebooks can very easily become a large, unstructured dump of code with a chain of dependencies so convoluted that it becomes very difficult to track lineage throughout your transformations. With a few simple steps, you can turn notebooks into a well-structured, easy-to-follow repository for your code.

In this video, we take a look at a few good practices when working with notebooks as a primary artifact type in your code-base. We'll look at things like using commentary within your notebooks, achieving modularization by structuring your logic into separate classes and functions, and defining "schemas as code" to be explicit about the data you're reading and writing. The full transcript is available below.

The talk contains the following chapters:

  • 00:00 Intro
  • 00:12 Principles underpinning good notebook development
  • 00:56 Methods to achieve good notebook development
  • 02:27 Workspace artifact folder structure
  • 05:27 Notebook structure demo - Commentary
  • 06:19 Defining classes and functions
  • 08:47 Consuming classes and functions - magic commands
  • 09:36 Defining schemas as code
  • 12:05 Consuming schemas when creating Delta tables
  • 13:05 Round-up & Outro

Related content:

  • Microsoft Fabric End to End Demo Series
  • From Descriptive to Predictive Analytics with Microsoft Fabric
  • Microsoft Fabric First Impressions
  • Decision Maker's Guide to Microsoft Fabric

You can find all the rest of our content here.

Transcript

Ed Freeman: In this video, we're going to drill deeper into our Silver Notebooks and take a look at some good practices for efficient and sustainable development in Fabric. Let's crack on!

If we think about the principles we're after when we're maintaining Notebooks, we want to be able to reuse the code that we've got, that we're building for our different Bronze, Silver and Gold workloads. We want to be able to organise it so it's easy to discover artefacts and to maintain going forward. And overall, we're trying to make the code more maintainable, more understandable and more testable.

We can pull our common logic out into functions, and we can make sure that people understand what each function does. They can extend it if they need to, and we can also write tests against it if we need to validate that function's logic. The methods we have available to us are pretty generic to any notebook technology, really. First, we can use commentary: using the markdown cells that you can put inside of Notebooks, you can talk through what you're doing step by step in your notebook. Your future self will thank you no end for doing that, but so will your teammates who need to get familiar with the code that you're writing.

We want to use classes and functions. Pretty much every programming language, especially all of those that are available in Fabric, has its own notion of how you define classes and functions. These are nice, object-oriented ways of encapsulating and modularizing your code so that you can reuse it across Notebooks and across workloads. Potentially, we also want to separate out our Notebooks. If we're using Notebooks as our primary development artefact, we want to make sure we're not putting everything in one gigantic notebook, because those become very unmaintainable very quickly. So we can separate out these Notebooks and then call them from a parent notebook, so that everything is orchestrated in the dependency tree that you're after (see the sketch below).
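
As a rough sketch of that parent-notebook pattern (the notebook names, timeout and parameters below are placeholders, not the ones from this workspace), an orchestrator cell might look something like this, assuming Fabric's mssparkutils notebook utilities are available in the session:

```python
# Minimal orchestrator sketch. mssparkutils is pre-loaded in Fabric notebook sessions;
# the explicit import also works in Synapse/Fabric runtimes.
from notebookutils import mssparkutils

# Run the Silver processing notebook first; arguments are (notebook name, timeout in
# seconds, parameter dictionary). Names and parameters here are illustrative.
mssparkutils.notebook.run("ProcessPricePaidToSilver", 600, {"run_date": "2024-01-01"})

# Only once Silver has completed, run the Gold processing notebook.
mssparkutils.notebook.run("ProcessPricePaidToGold", 600, {"run_date": "2024-01-01"})
```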

And finally, we want to use folders and subfolders. Since the last video, Fabric has released folders at the workspace level, which is great, and I'll show you that right now. Let's go over to my other screen.

Those of you who have seen the previous episodes will probably not recognise this view as much as you have done so far. That's because, as I mentioned, there's a new folders feature in Fabric, which means we can now structure our artefacts in a much easier-to-comprehend way. So I've gone for this layout here. You don't need to go for this yourself, but I've found it quite useful. We have a Bronze, a Silver and a Gold folder, each of which includes any artefact that is primarily related to that particular layer of the medallion architecture.

I prefer doing it this way rather than aligning folders with data sources, because you don't always have one data source that goes from Bronze to Silver and then all the way through to Gold. Oftentimes, in the Silver layer you're combining multiple data sources into the same entity, and that type of structure doesn't really lend itself to that. So I like having Bronze, Silver and Gold. Now, there will be some things that are a bit of a grey area.

For example, you might be reading some data from Bronze to go into Silver, so you'll have to use your initiative as to where you put some of those artefacts. We've also got an entry points folder here, which includes my orchestrator, but also other types of orchestrator that I've built for other reasons, and a miscellaneous folder. So if I've got some artefacts in my repo that do a specific thing, like a helper pipeline to repopulate data for example, they would go in there.

But if we look into these things a little bit more: in Bronze, I've got the data reading helper that I need to use in some of the Silver Notebooks, but which is primarily there to integrate with Bronze and get the data out of it. We've got the ingest price paid data pipeline, the initial pipeline that grabs the data from the web source, and then we've got the Bronze lakehouse, which includes all of the files that I need to process for this workload.

Then in Silver, we have most of the Silver-related things: all of the Silver-related Notebooks, the data pipeline that processes the data to Silver, and the Silver lakehouse. And in Gold we have the Gold Wrangler, the Gold schema definitions and the Gold lakehouse as well.
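
To summarise the structure described in this walkthrough, the workspace layout looks roughly like this (folder and artefact names paraphrased from the walkthrough, not copied from the actual workspace):

```
Workspace
├── Bronze/          # data reading helper notebook, ingest price paid data pipeline, Bronze lakehouse
├── Silver/          # Silver notebooks, process-to-Silver data pipeline, Silver lakehouse
├── Gold/            # Gold Wrangler, Gold schema definitions, Gold lakehouse
├── Entry points/    # the main orchestrator, plus other orchestrators
└── Miscellaneous/   # one-off helpers, e.g. a pipeline to repopulate data
```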

So all of this is a much better way to organise artefacts in Fabric versus the big, long list that we had in the previous video. That's one good practice for you to take away when organising your artefacts in your Fabric workspace. But going back to what we're talking about today more specifically, we want to create modular code. Now, the main place where we actually execute all of our code is in a notebook like this "process price paid to Silver" one.

We saw this last time: we go through various steps. First of all, we've got the commentary. We've just got these headings for now, but you can imagine writing much more detailed commentary if you wanted to. What this means is that if you go into the view, you can bring up a table of contents and navigate around your notebook much more easily. So it not only gives you additional information about what you're doing; you can click on this table of contents and jump around your notebook, which makes it easier to move between cells rather than scrolling all the time.

But if we go to this "apply dataset transformations" step, this is where we've packaged up a bunch of our dataset transformations and put them in a class by themselves. That class is called the Price Paid Wrangler, and we've just got this generic "apply transformations" step. What actually happens in there? Well, let's go and have a look. If I go back to the workspace and go to the Silver layer, this is where I've created the class called Price Paid Wrangler. In Python it's really easy to create a class: you use the class keyword, and then you have a number of methods in that class.

Now, the one that's mainly exposed when using this class from a different notebook is this "apply transformations" method. But you'll see that the pattern I tend to follow, that we at endjin tend to follow, is to create the individual transformations that you want to build as separate local functions in their own right. So we have a function that creates a postcode area column and various other columns to do with the postcode. Then we've got conversion of values, we've got adding the year dimension, we've got cleaning the unique ID. All of these are just defined as their own function in this class.

What that means is that, at the top level, the only method we technically want used outside of this notebook is "apply transformations", but under the covers it's calling .transform and feeding the data frame through all of these different sub-functions until it comes out of the other end as a fully transformed data frame.
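
As a rough sketch of that pattern (the column names and transformation bodies here are illustrative, not the actual Price Paid Wrangler code), a Silver wrangler class might look something like this:

```python
from pyspark.sql import DataFrame
import pyspark.sql.functions as F


class PricePaidWrangler:
    """Composes the individual Silver transformations for the price paid dataset."""

    @staticmethod
    def _add_postcode_columns(df: DataFrame) -> DataFrame:
        # Derive a postcode area column from the full postcode (illustrative logic).
        return df.withColumn("postcode_area", F.split(F.col("postcode"), " ").getItem(0))

    @staticmethod
    def _add_year_dimension(df: DataFrame) -> DataFrame:
        # Add a year column derived from the transaction date.
        return df.withColumn("year", F.year(F.col("date_of_transfer")))

    @staticmethod
    def _clean_unique_id(df: DataFrame) -> DataFrame:
        # Strip the braces wrapping the unique transaction identifier.
        return df.withColumn("unique_id", F.regexp_replace(F.col("unique_id"), "[{}]", ""))

    @classmethod
    def apply_transformations(cls, df: DataFrame) -> DataFrame:
        # The only method intended for use outside this notebook: chain each
        # sub-function over the data frame using DataFrame.transform.
        return (
            df.transform(cls._add_postcode_columns)
              .transform(cls._add_year_dimension)
              .transform(cls._clean_unique_id)
        )
```

Consuming it from the processing notebook is then just something like silver_df = PricePaidWrangler.apply_transformations(bronze_df).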

So this is how we tend to structure our wranglers: we have a class, or a set of classes. Depending on how much work you need to do, you might want to split it out more, but at least a single class which is meant to align itself with a certain part of the medallion architecture. In this case, that's the Silver layer.

We essentially compose all of the bits of logic we want to run against our data frame in a single class. That's where this is all defined. Now, how we actually access that in this notebook is right at the top.

We load these helper Notebooks using the %run magic command. This allows us to point at a notebook somewhere else in our workspace and run it. All that does is essentially run that notebook in the context of the current notebook session that you're calling it from, so all of the classes, functions and logic held within the notebook that you're calling get executed as part of this context.
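
In practice that's just a handful of magic-command cells at the top of the processing notebook. A minimal sketch, with placeholder notebook names standing in for the helper notebooks mentioned in the video:

```python
# Each %run pulls another workspace notebook into the current Spark session,
# so its classes, functions and variables become available here.
%run DataReadingHelper
%run PricePaidWranglerSilver
%run DefineSilverSchemas
```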

So you'll see we're running the data reading helper, we're running the Silver layer wrangler, which is what I've just shown, and we're also running this "define Silver schemas" notebook. Those three Notebooks will automatically be available in this notebook's context. Now, one of the last things I wanted to show you is that "define Silver schemas" notebook. One thing that isn't always necessary when we're talking about lakehouses is doing schema-on-write. What that means is that, technically, when things are files in the lake, we don't always need to provide structure to those files. The CSV files in my Bronze lakehouse haven't been written conforming to a specific schema in my lakehouse; they're just delimited text files, not a database table with a schema already in place. That means we can read them with Spark using schema-on-read, allowing Spark to infer the schema. But when we write back out to Silver and to Gold, we don't always want to rely on that schema inference, because it might not give us the exact types that we want.

So one thing that we commonly do is define actual schema objects. There are various ways to do this; one way which we find quite simple is using the built-in StructType in Spark, which can be seen as a way of defining a schema. So you've got a StructType, and within the StructType you have what Spark calls StructFields. These, as you can see, take various bits of information: the name of the column, the data type of the column, the nullability, and then, as a fourth argument, a key-value set of metadata properties. So if you want to add additional metadata to your schema objects, you can do that, and that can sometimes help with control flow. For example, if you need to highlight which of your columns is the primary key, or which is a foreign key, you can add bits of metadata to those StructFields and use that in your control flow if you want to.
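
As a rough illustration (the column names and types here are a guess at the price paid dataset, not the actual schema notebook), defining such a schema object looks like this:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

# Illustrative Silver schema for the price paid data. The fourth StructField argument
# is a free-form metadata dictionary, used here to flag the primary key column.
price_paid_schema = StructType([
    StructField("unique_id", StringType(), nullable=False, metadata={"primary_key": True}),
    StructField("price", IntegerType(), nullable=False),
    StructField("date_of_transfer", DateType(), nullable=False),
    StructField("postcode", StringType(), nullable=True),
    StructField("postcode_area", StringType(), nullable=True),
    StructField("year", IntegerType(), nullable=False),
])

# The metadata can then drive control flow, e.g. finding the primary key column(s):
primary_keys = [f.name for f in price_paid_schema.fields if f.metadata.get("primary_key")]
```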

So what this helps us do is apply the schema to a data frame before we write it, and it also allows us to create our Delta tables upfront, rather than just creating them when we do a .saveAsTable. You'll see we have a couple of schema objects here, and they're actually used. If I go back in here, they are used down here. We have inherited the variables that have been defined in the other Notebooks, because we've run those Notebooks, so we can now access those variables from in here. So this price paid schema is being passed to this "create Delta table if not exists" step. If we haven't already got the Delta table, then we will create it from scratch with the schema that we've defined here.
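
A minimal sketch of that "create if not exists" step, assuming the delta-spark library and the illustrative schema object above (the table name is a placeholder):

```python
from delta.tables import DeltaTable

# Create the Silver Delta table upfront from the schema object, rather than letting
# the first saveAsTable call infer it. If the table already exists, this is a no-op.
(
    DeltaTable.createIfNotExists(spark)   # 'spark' is the session available in Fabric notebooks
    .tableName("silver_lakehouse.price_paid")
    .addColumns(price_paid_schema)
    .execute()
)

# Later, the transformed data frame is written into the pre-created table,
# so it must conform to the schema defined above:
# silver_df.write.format("delta").mode("append").saveAsTable("silver_lakehouse.price_paid")
```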

Then, once we actually write the table, everything should align with the schema, because we have already pre-defined it. The data frame on its way into the target table must conform to the same schema by default; if not, it will likely throw an error, unless you're using some of Delta Lake's other features to allow schema evolution, which we'll talk about in a different episode.

That's everything for this video. In the next video, we're going to see how we can go from the Silver layer to the Gold layer and the principles we follow while doing so. As ever, please hit like if you've enjoyed this video, and hit subscribe so you don't miss another episode. Thanks for watching.