This guide explains how to use shared metadata tables in Azure Synapse Analytics to open up additional ways of interacting with your Business Central data.
Azure Synapse Analytics allows the different computational engines in a workspace to share databases and tables between its Apache Spark pools and its serverless SQL pool.
Once a database has been created by a Spark job, you can use Spark to create tables in it that use Parquet, Delta, or CSV as the storage format. These tables immediately become available for querying by any of the Spark pools in the Azure Synapse workspace. The managed and external tables created by Spark are also made available as external tables with the same name in the corresponding synchronized database in the serverless SQL pool.
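As an illustration, the following Spark SQL statements, run in a Synapse notebook cell, create a database and a Parquet-backed table; the database, table, and storage account names are placeholders only.

```sql
-- Run in a Synapse Spark notebook cell (%%sql). All names below are placeholders.
CREATE DATABASE IF NOT EXISTS businesscentral;

-- External table over Parquet files already on the data lake; the schema is inferred from the files.
CREATE TABLE IF NOT EXISTS businesscentral.customer
USING PARQUET
LOCATION 'abfss://data@mydatalake.dfs.core.windows.net/data/customer/';
```

Once the metadata has been synchronized, the same table can be queried from the serverless SQL pool under the same name.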
You can read more about shared metadata tables in the Azure Synapse Analytics documentation.
Shared metadata tables combine the advantages of multiple approaches to data storage. Like a traditional database table, they can be queried using SQL. At the same time, they store their data on the data lake, reducing storage costs and eliminating the need for dedicated database compute. This also means that Power BI can connect to them in DirectQuery mode, using the serverless SQL endpoint of the Azure Synapse Analytics workspace. You can find this endpoint in the Azure Portal on the Overview blade of your Synapse workspace resource.
Connecting with DirectQuery allows you to create composite views in Power BI that are based on multiple tables, similar to API entities that have been exposed as read-only API pages or queries.
In addition, the tables can easily be read and modified using Spark notebooks written in Python, C#, Scala or SQL to support big data analytics or machine learning scenarios.
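For example, a notebook cell such as the following aggregates data from a shared metadata table with Spark SQL; the database, table, and column names are placeholders and need to be replaced with names from your own export.

```sql
-- Run in a Synapse Spark notebook cell (%%sql); database, table, and column names are placeholders.
SELECT salespersoncode,
       SUM(amount) AS total_amount
FROM businesscentral.salesinvoiceheader
GROUP BY salespersoncode
ORDER BY total_amount DESC;
```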
bc2adls can be configured to automatically create shared metadata tables when it is exporting data from Business Central. Perform the following steps:
- Create a serverless Apache Spark pool in Azure Synapse Analytics. The small compute size is sufficient.
- Ensure that the CreateParquetTable notebook has been imported into your Synapse environment. If not, import the notebook.
- When the notebook is run, it uses the credentials of the system-assigned managed identity of your Azure Synapse workspace. Since the notebook needs to read the files on the data lake, the managed identity requires at least read permissions on that resource. To ensure this, add a role assignment to your data lake, granting the Storage Blob Data Contributor role to your Synapse workspace managed identity (which has the same name as your Synapse workspace). See Step 3 of the Setup guide for an example of how to manage role assignments on the data lake.
- On the Export to Azure Data Lake Storage page in Business Central, ensure that CDM data format is set to Parquet.
- When executing the Consolidation_AllEntities pipeline, provide the name of the previously created Spark pool as the sparkpoolName parameter. The pipeline will still run if the parameter is missing, but no shared metadata tables will be created in that case.
After the pipeline run has completed successfully, a new lake database will be visible in the Data section of the Azure Synapse Analytics workspace. The database is named after the data lake container and contains a table for each entity. To verify that a table has been created successfully, you can generate an automatic SQL query that displays its contents:
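The generated script looks similar to the following; the database and table names shown here are examples only.

```sql
-- Example of the auto-generated script, run against the serverless SQL pool; names are examples.
SELECT TOP (100) *
FROM [businesscentral].[dbo].[customer];
```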
The Consolidation_OneEntity pipeline checks whether the entity it is currently processing already exists in the data folder of the data lake. If that is not the case, indicating that a new entity has been exported for the first time, the pipeline executes a pySpark notebook that creates the Spark table. Since the data is in Parquet format, Spark can read the schema from the Parquet files when creating the table. This means that columns will retain their defined data types (or the Spark equivalents).
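To check which data types Spark inferred for a table, a statement such as the following (placeholder names again) can be run in a notebook cell:

```sql
-- Lists the columns of the table together with the Spark data types inferred from the Parquet files.
DESCRIBE TABLE businesscentral.customer;
```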
Note: Table and database names are always in lower case, and non-alphanumeric characters, e.g. hyphens (-), are replaced with underscores (_).
To minimize the number of Spark jobs that are triggered, the CreateParquetTable notebook is only executed when a new table is exported. This also means that in existing bc2adls deployments, no shared metadata table will be created for entities already present in the data folder. This can be solved by modifying the SparkTableConditions activity in the Consolidation_OneEntity pipeline or by triggering the notebook manually for the desired entities. In that case, the container_name and entity_name parameters need to be supplied. Also run the session with the option Run as managed identity enabled.
Similarly, when the schema of a table changes, it is advised to drop and recreate the table. This is currently not done automatically.
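A statement such as the following, run from a notebook cell (the names are placeholders), removes the outdated table definition; the table can then be recreated, for example by triggering the CreateParquetTable notebook manually as described above.

```sql
-- If the table is an external (unmanaged) table, dropping it removes only the metadata;
-- the underlying Parquet files on the data lake are kept.
DROP TABLE IF EXISTS businesscentral.customer;
```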