Join Alex, Tory and Rudy as they tackle topics such as Azure Search, Data Explorer, and Big Data. Their first installment is now live.

 Authors: Alex Gebreamlak, Tory Waterman, Rudy Sandoval

Azure Search, Azure Data Explorer and Big Data – Part 1 of 3  |  Azure Search, Azure Data Explorer and Big Data – Part 2 of 3  |  Azure Search, Azure Data Explorer and Big Data – Part 3 of 3

Azure Blob Storage Client Library for .NET

Azure Blob storage is used to store large amounts of unstructured data. Unstructured data is data that does not adhere to a particular data model or definition, such as text or binary data. Blob storage offers three types of resources: 

  • The storage account 
  • A container in the storage account 
  • A blob in the container 

A blob stores the file data itself; a directory within a container is virtual, implied by a shared prefix in blob names. 
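
For example, the hierarchy is visible in a blob's URL (the account, container, and path names here are placeholders):

    https://mystorageaccount.blob.core.windows.net/mycontainer/provider1/2021/01/data.parquet

In this URL, mycontainer is the container and provider1/2021/01 is a virtual directory: simply a prefix of the blob name rather than a physical folder.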

The following diagram captures the relationship between the above resources: 

 

[Diagram: storage account, container, and blob relationship]

 

We needed to create several applications for one of our clients using the Azure Blob Storage client library for .NET. These applications interact with blob storage accounts holding a fairly large number of blobs (billions) to collect metadata, mainly the total blob count and total blob size per blob directory. 

Multiple data providers load data into their respective storage accounts. From there, the data goes through an orchestration process before it is stored in a dedicated blob container for each data provider within a single Data Lake Storage Gen2 storage account. Our first task was to collect blob metadata, such as file size, and store it in an Azure SQL database. We were also tasked with auditing whether all the blobs loaded into the providers' blob storage went through the orchestration process and made it to their final destination. 
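
As a rough illustration, the metadata we track per data source can be pictured as a row shaped like the following class (the names are illustrative placeholders, not our actual schema):

    using System;

    // Illustrative shape of the per-data-source metadata stored in Azure SQL.
    // Property names are placeholders, not the actual table definition.
    public class DataSourceMetadata
    {
        public int Id { get; set; }                  // Data source ID
        public string Provider { get; set; }         // Data provider the source belongs to
        public string DirectoryPath { get; set; }    // Virtual directory path within the container
        public long BlobCount { get; set; }          // Number of blobs under the directory
        public long TotalSizeBytes { get; set; }     // Combined size of those blobs
        public Guid FullScanBatchId { get; set; }    // Batch ID of the scan that produced the row
        public DateTime LastUpdatedUtc { get; set; } // When this row was last refreshed
    }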

We used the Azure Storage client library classes CloudStorageAccount, CloudBlobClient, CloudBlobContainer, CloudBlobDirectory, and BlobResultSegment to interact with blob storage resources. 

The following shows the general flow when interacting with blob storage through the client API: 

  • Get a reference to a storage account and create a CloudBlobClient object: 

            CloudStorageAccount account = CloudStorageAccount.Parse(storageConnectionString); 

            CloudBlobClient serviceClient = account.CreateCloudBlobClient(); 

  • Get a reference to a specific container: 

            CloudBlobContainer container = serviceClient.GetContainerReference(containerName); 

  • Get a reference to a BlobResultSegment: 

            BlobResultSegment resultSegment = await container.ListBlobsSegmentedAsync(…); 

            Or 

            var directory = container.GetDirectoryReference(directoryName); 

            BlobResultSegment resultSegment = await directory.ListBlobsSegmentedAsync(…); 

  • List the blobs, then get or set metadata, rename, delete, download, etc. on each blob: 

            foreach (var blobItem in resultSegment.Results) 
            { 
                // Process each blob here (e.g., read its size or other metadata). 
            } 

One of the drawbacks of the API is that, other than directly targeting a specific directory or blob, it does not support fetching blobs from blob storage with a query; for example, "give me all the blobs older than a certain number of days from a specific directory or container." Filtering has to be done after the blobs have been fetched and are in memory. As you can imagine, this does not scale and becomes an expensive process when you have millions of files. You can use a continuation token along with a batch size (the maximum number of blobs fetched per call) to help with performance, but this does not help with pre-filtering before the blobs are loaded. The code below shows how we used a continuation token with a batch size (maxResults). 
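
The sketch below shows a loop of that shape, assuming the Microsoft.Azure.Storage.Blob package; the connection string, container name, directory name, and batch size are placeholders rather than our production values:

    using System.Threading.Tasks;
    using Microsoft.Azure.Storage;        // CloudStorageAccount
    using Microsoft.Azure.Storage.Blob;   // CloudBlobClient, CloudBlobDirectory, BlobResultSegment

    public static class BlobDirectoryScanner
    {
        // Totals the size and count of all blobs under a virtual directory,
        // fetching them in batches of maxResults with a continuation token.
        public static async Task<(long totalSize, long blobCount)> ScanDirectoryAsync(
            string storageConnectionString, string containerName, string directoryName, int maxResults = 500)
        {
            CloudStorageAccount account = CloudStorageAccount.Parse(storageConnectionString);
            CloudBlobClient serviceClient = account.CreateCloudBlobClient();
            CloudBlobContainer container = serviceClient.GetContainerReference(containerName);
            CloudBlobDirectory directory = container.GetDirectoryReference(directoryName);

            long totalSize = 0;
            long blobCount = 0;
            BlobContinuationToken continuationToken = null;

            do
            {
                // Fetch at most maxResults blobs per call; passing true for
                // useFlatBlobListing includes blobs in nested virtual directories.
                BlobResultSegment resultSegment = await directory.ListBlobsSegmentedAsync(
                    true, BlobListingDetails.None, maxResults, continuationToken, null, null);

                foreach (IListBlobItem blobItem in resultSegment.Results)
                {
                    if (blobItem is CloudBlockBlob blob)
                    {
                        totalSize += blob.Properties.Length;
                        blobCount++;
                    }
                }

                // A null continuation token means there are no more batches to fetch.
                continuationToken = resultSegment.ContinuationToken;
            }
            while (continuationToken != null);

            return (totalSize, blobCount);
        }
    }

Running a loop like this per data source (directory) yields the total size and blob count that we then store in the Azure SQL database.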


Here is the strategy we used to provide our client with current blob metadata from Azure Blob storage, using the client API along with Azure Function Apps: 

  • Identified unique data sources per data provider and stored them in an Azure SQL database. These data sources are essentially virtual directory paths within a container. What we are mainly interested in is the total size of each data source and the number of blobs in it. 
  • Used each data source above as a directory reference name in the client API. 
  • Set up various fetching mechanisms: 
  1. Full Scan: scans and fetches metadata from a blob storage account for all of our data providers. We needed one initial Full Scan to fetch the folder size and number of files per data source, and we run additional Full Scans as needed for various reasons, for example, after an outage. Each Full Scan has a unique batch ID. 
  2. Partial Scan: updates the data from the last completed Full Scan with metadata for certain selected data sources. The Partial Scan is driven by a query over the data sources we need to update; the query can be changed from a configuration file. We created multiple instances of Partial Scans as needed. 
  3. Continuation Scan: restarts and completes a failed or stopped Full Scan. For all the scans, the SQL queries that we use to get data sources are ordered by ID. That way we know at which ID a scan stopped and can adjust the next query to continue from that point. A Full Scan may fail after running for several hours for various reasons, and the Continuation Scan allows us not to start it all over again. 
  4. Live Scan: updates data sources' metadata from individual blob metadata; in our case, .parquet files. For this, we created Azure Functions with an Azure Event Grid trigger per data provider and, within each function, added logic to determine from the blob's name the correct data source path to update. Live Scan complements the other scans and provides our client with up-to-the-minute data at any given time; it always updates the last successful Full Scan. It needs to be paired with an internal process that turns the Azure Functions off whenever a data provider is loading a large amount of data and turns them back on when the upload completes; otherwise, we would trigger the function for however many million files each data provider uploads. Live Scan was added to close the accuracy gap caused by new files arriving between Full or Partial Scans; beyond that, we were not interested in fetching individual blob files. A sketch of such an Azure function is shown below. 
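
This is a simplified sketch, assuming the in-process Event Grid trigger binding (Microsoft.Azure.WebJobs.Extensions.EventGrid) and the Microsoft.Azure.EventGrid.Models event types; MetadataRepository is a hypothetical stand-in for the Azure SQL update logic, not our actual implementation:

    using System;
    using System.Threading.Tasks;
    using Microsoft.Azure.EventGrid.Models;   // EventGridEvent, StorageBlobCreatedEventData
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.EventGrid;
    using Microsoft.Extensions.Logging;
    using Newtonsoft.Json.Linq;

    public static class ProviderBlobCreatedFunction
    {
        [FunctionName("ProviderBlobCreated")]
        public static async Task Run([EventGridTrigger] EventGridEvent eventGridEvent, ILogger log)
        {
            // Only blob-created events matter for the Live Scan.
            if (eventGridEvent.EventType != "Microsoft.Storage.BlobCreated")
            {
                return;
            }

            // For blob events the subject has the form:
            // /blobServices/default/containers/{container}/blobs/{virtual/directory/path/file.parquet}
            string subject = eventGridEvent.Subject;
            string blobPath = subject.Substring(
                subject.IndexOf("/blobs/", StringComparison.Ordinal) + "/blobs/".Length);

            // The data source is the virtual directory portion of the blob name.
            int lastSlash = blobPath.LastIndexOf('/');
            string dataSourcePath = lastSlash > 0 ? blobPath.Substring(0, lastSlash) : string.Empty;

            // The event payload carries the new blob's size.
            var blobCreated = ((JObject)eventGridEvent.Data).ToObject<StorageBlobCreatedEventData>();
            long blobSize = blobCreated.ContentLength ?? 0;

            log.LogInformation("Blob created under {DataSource}: {Blob} ({Size} bytes)",
                dataSourcePath, blobPath, blobSize);

            // Hypothetical helper: adds this blob's size and count to the data source's
            // totals for the last successful Full Scan in Azure SQL.
            await MetadataRepository.IncrementDataSourceAsync(dataSourcePath, blobSize);
        }
    }

    // Placeholder for the persistence logic; a real implementation writes to Azure SQL.
    public static class MetadataRepository
    {
        public static Task IncrementDataSourceAsync(string dataSourcePath, long blobSize)
        {
            return Task.CompletedTask;
        }
    }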


We created a .NET Core console app for the scans described above, other than the Live Scan. We deployed multiple instances of the console app as Azure Container Instances or Azure WebJobs to accommodate the various scans; we run multiple instances of the Partial Scan for some of the big data sources, scheduled at different times. We also run a Full Scan once a week. 

In summary, the client API has an inherent limitation when it comes to fetching targeted data from Azure Blob storage, which leads to long processing times to reach the data one wants to access. By employing the strategies listed above, we were able to meet our client's requirements; mainly, getting up-to-the-minute blob metadata per data source or blob directory. 

Some of the lessons we learned along the way are: 

  • Avoid Azure Blob Trigger completely if you have a lot of files in Azure Blob storage. See the following link for details: 

https://medium.com/@loopjockey/structuring-azure-blobs-for-functions-8305ba427356 

  • You need to use VNet integration to connect to Azure Blob storage from the client API running in Azure WebJobs or an Azure Container Instance; otherwise you will get unauthorized errors: 

            https://docs.microsoft.com/en-us/azure/app-service/web-sites-integrate-with-vnet 

 

Azure Search, Azure Data Explorer and Big Data – Part 2 of 3