Microsoft Fabric Data Warehouse is an advanced data warehousing solution within Microsoft Fabric for organizations seeking to manage and analyse large volumes of data efficiently. Data is stored in Onelake, Microsoft’s unified data lake, utilizing open formats like Delta Lake to facilitate seamless integration and data sharing across various workloads.
The Fabric Data Warehouse integrates seamlessly with other Microsoft Fabric components, such as Data Factory for data ingestion, Power BI for analysis and reporting, and Synapse Spark for data science and machine learning.
This Blog outlines how to assess and optimize the load on your Microsoft Fabric capacities. It also offers strategies for managing overload situations and provides guidance on optimizing compute resources for Fabric Data warehousing.
Risk of Depleting Resources:
The Fabric capacity model, while simplifying setup and enabling collaboration, risks depleting shared compute resources or paying for more resources than necessary.
Managed Service Features:
To address these risks, Fabric, as a managed service, handles resource allocation in two main ways:
- Bursting and Smoothing
- Throttling
Bursting & Smoothing:
Bursting allows CPU-intensive tasks to be completed quickly by temporarily increasing resources during demand spikes, without needing to upgrade to a higher SKU (a larger resource allocation). After the demand drops, the system returns to the normal resource level, helping to optimize costs.
For example, instead of running a job on 64 CU and completing in 60 seconds, bursting could use 256 CUs to complete the job in 15 seconds.
Smoothing ensures that resource usage is spread evenly over time, reducing the risk of reaching capacity limits. Both ensure that tasks are completed efficiently without requiring additional long-term resources.
In simpler terms, Let the system handle heavy tasks efficiently without requiring more permanent resources (e.g., a bigger server).
Advantages: Tasks can be performed at any time of the day, and CPU resources are used efficiently without over-provisioning.
Throttling:
Throttling delays or rejects non-essential operations when the system is experiencing sustained high demand for CPU resources, which exceeds the current SKU’s capacity limit.
Basically, when too much demand hits, it slows down or blocks non-critical tasks to keep the system stable.
Advantages: It prevents system overload by managing demand, ensuring that critical operations are prioritized while less important ones are delayed or blocked.
Example Scenario: Top Star Movie Ticket Booking During Release Week
Consider the following example for understanding the concepts. Imagine a major Top Star movie release, and millions of fans flock to an online ticket booking platform to get tickets for the first show on the release day.
| Scenario | Bursting | Smoothing | Throttling |
| What happens? | Huge surge of users trying to buy tickets as soon as the booking opens | Predictable spikes of traffic over time, like during the morning, afternoon, and evening hours. | Excessive demand, leading to delays or rejection of non-critical requests (like seat selection) to prioritize actual ticket booking. |
| User Experience | Users experience minimal delay when trying to book tickets. The system scales up resources (e.g., adds more servers) to ensure that the surge is handled smoothly. | Users experience slight delays in non-urgent tasks, like seat selection or order confirmation, but can still book tickets without interruption. | Users may experience a waiting queue or delays when selecting seats or accessing the platform. However, the ticket booking itself is prioritized. |
| System Action | Too many users trying to access the platform at once, leading to high server load, it temporarily increase the capacity due to spikes. | The platform delays or queues non-urgent requests, like seat selection or additional user features, spreading the demand over time. | The system limits or delays non-essential actions like viewing available seats, loading movie posters, or making background updates, but the ticket booking process remains unaffected. |
| Cost Implications | Higher costs due to the need to temporarily scale up infrastructure (e.g., adding extra servers or compute power). | No extra cost because it doesn’t add new infrastructure; it optimizes the use of current resources by spreading demand. | No extra cost as no extra infrastructure is added, but users may experience delays or limitations in non-essential tasks. |
Considerations:
Planning and Budgeting Tools for Capacity Sizing in Microsoft Fabric
When determining the right size for your capacity in Microsoft Fabric, it’s crucial to understand that compute requirements can vary based on several factors, including:
Operation Complexity: The type of operations (e.g., DAX queries, Python notebooks) performed and how well they are executed can significantly impact the compute needs.
Concurrency: The number of concurrent users or processes running at the same time can affect resource consumption.
Paused Capacity: Pausing your capacity can help manage costs during periods of low activity. However, if overutilization occurs before pausing, the excess usage may still result in charges.
Monitoring: Regularly monitoring your capacity usage and understanding how bursting, smoothing, and throttling work can help optimize performance and control costs effectively. Follow the Lesson to know more on Lesson 47 – Manage and monitor performance of the Warehouse in Microsoft Fabric
How to Plan and Budget:
Provision Trial Capacities: Before committing to a specific capacity size, you can provision trial capacities. This allows you to test different resource configurations and understand the actual usage patterns.
Pay-As-You-Go F SKUs: You can also use pay-as-you-go F SKUs, which provide flexibility in scaling resources up or down based on actual usage. This lets you measure the true capacity requirements before investing in a reserved F SKU instance.
Cost Savings in Fabric Data Warehouse
Microsoft Fabric Data Warehouse uses a serverless architecture, where capacity usage is calculated based on active per-query capacity unit seconds rather than the time the front-end nodes and backend nodes are provisioned. This approach helps minimize costs by only charging for the actual query execution time.
This means that instead of being charged for the time the infrastructure is running, you are only charged for the actual query execution time.
As a result, you avoid paying for excess services and nodes that are not actively being used, which helps in saving costs.
Key Cost-Saving Tips:
Optimize T-SQL Queries:
Write efficient queries by minimizing unnecessary columns, calculations, and aggregations to reduce query resource consumption.
Use Optimal Data Types: Choose the smallest data types possible (e.g., reducing VARCHAR(500) to VARCHAR(25) or DECIMAL(32,8) to DECIMAL(10,2)) to lower resource usage.
Star Schema Design: Use a star schema to reduce the number of rows read and minimize query joins, leading to faster queries and reduced resource usage.
Maintain Updated Statistics: Ensure statistics are current, especially after data changes, to help the SQL engine generate efficient query plans. Consider using FULLSCAN for more accurate statistics.
*** A FULLSCAN is a method of updating statistics in a database. This process ensures that the statistics are as accurate as possible, which helps the SQL engine generate efficient query execution plans.
Monitor Queries and Usage: Use built-in views like sys.dm_exec_requests and Fabric Toolbox for detailed insights into active queries and resource usage.
Query Insights: Leverage Query Insights to review historical query activity, identify resource-intensive queries, and track execution performance.
Tools for Monitoring:
Queryinsights.exec_requests_history view: It provides all the relevant details for each query execution, which can be linked with the operation IDs in the capacity metrics app. The key columns for monitoring capacity usage include: distributed_statement_id, command (query text), start_time, and end_time.
By applying these practices, you can reduce unnecessary compute usage and achieve cost savings in Fabric Data Warehouse.
Conclusion:
Microsoft Fabric Data Warehouse offers a powerful, cost-efficient solution for managing and analyzing large datasets. By utilizing a serverless architecture and smart features like Bursting, Smoothing, and Throttling, it ensures that resources are optimized, and costs are minimized without sacrificing performance. The ability to scale resources dynamically and only pay for actual query execution time further enhances cost efficiency.
By following best practices such as optimizing queries, choosing optimal data types, and maintaining up-to-date statistics, organizations can efficiently manage their data warehousing needs while keeping costs under control. Proper planning, monitoring, and the use of tools like Query Insights can help organizations maximize the value of their Microsoft Fabric Data Warehouse.
| Tags | Microsoft Fabric |
| Useful Links | |
| MS Learn Modules | |
Test Your Knowledge |
Quiz |
