In this scenario, we had adls2 blob storage with terabytes of parquet files in various formats. Much of this data was replicated in Synapse DW, though not completely. Our "source of truth" was the parquet files in adls2.
The aim was to deliver some sort of method to search across these files to aid users in finding relevant data.
We initially thought that, since the adls2 storage was the most reliable source in the long run, we should use it as our sole source of data. However, while Azure Search contained built-in connectors and indexers for sources like SQL and JSON, there was no connector to index adls2 blob storage. The data here was exactly what we were looking for; the problem was actually getting it into Azure Search.
The solution at the time was to use the Azure Search API, which allowed us to write a customized parser to manually go through each file one by one, parse the relevant fields, and then upload the results in batches. Creating this parser was not simple, however. Many of the parquet files had different formats, often with 50 or more columns. This potentially meant that the index holding them would need hundreds of columns.
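The parse-and-batch loop can be sketched roughly as below. This is a minimal illustration, not our actual parser: the field list is hypothetical, rows are represented as plain dicts, and in production each batch would be sent with `SearchClient.upload_documents` from the `azure-search-documents` package rather than merely collected.

```python
from itertools import islice

# Hypothetical list of fields worth indexing; everything else is dropped.
RELEVANT_FIELDS = {"field_id", "name", "description"}

def project(row):
    """Keep only the columns that belong in the search index."""
    return {k: v for k, v in row.items() if k in RELEVANT_FIELDS}

def batches(rows, size=1000):
    """Azure Search accepts at most 1000 documents per upload request."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

def prepare_uploads(rows):
    # In production each batch would be pushed with
    # SearchClient.upload_documents(documents=batch).
    return [[project(r) for r in batch] for batch in batches(rows)]
```

The 1000-document ceiling is Azure Search's per-request batch limit, which is why the upload had to be chunked at all.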
We could also have created a separate index for each column format, but that would potentially have meant creating dozens of different indexes.
We also quickly realized that most of the columns in these different formats were not relevant, i.e. users were unlikely to search for this data. Indexing it would just take up unnecessary storage space and potentially return irrelevant results to the user.
Coming up with a list of relevant fields to index was the next step. Creating a single index with all of these relevant fields and extracting from each unique parquet file only the fields that matched our list simplified the problem.
This proved difficult, however. Naming conventions for the fields we wanted to index were not always consistent. For example, "FIELD_ID" and "FIELDID" were essentially the same but would be considered different and not indexed. It became a headache to enumerate all the possible spellings of all the different fields we were looking for. Additionally, the indexing process using the API was simply running too slowly and would have taken weeks to complete. We wanted a highly reactive search service that would index new information quickly.
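One way to tame that naming drift is to canonicalize column names before matching them against the wish list, so that "FIELD_ID", "FieldId", and "FIELDID" all compare equal. A minimal sketch, where `canonical` and the sample field names are illustrative rather than taken from our actual schema:

```python
import re

def canonical(name: str) -> str:
    """Collapse case and strip separators: 'FIELD_ID' -> 'fieldid'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

# Hypothetical wish list of fields to index, keyed by canonical form.
WANTED = {canonical(f): f for f in ["FIELD_ID", "CREATED_DATE", "OWNER"]}

def match_columns(parquet_columns):
    """Map each parquet column name to the indexed field it represents, if any."""
    return {c: WANTED[canonical(c)] for c in parquet_columns
            if canonical(c) in WANTED}
```

This replaces per-field alias enumeration with a single normalization rule, at the cost of occasionally conflating columns that genuinely differ only by punctuation.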
Next, we ditched the custom uploads and used the built-in connectors for Synapse DW. Many of the parquet files sitting in adls2 were generated from Synapse DW, so we could instead go directly to Synapse DW. This gave us faster indexing (50 million rows/day with 3 replicas) and a better method of searching for relevant fields. Azure Search indexers have limitations, however: they require a unique ID for each row and some form of versioning (so they can detect incremental changes).
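The versioning requirement corresponds to what the Azure Search REST API calls a data change detection policy on the data source: you nominate a column whose value only ever increases, and the indexer uses it as a high water mark. The payload below is a sketch of such a data source definition expressed as a Python dict; the name, table, and connection string are placeholders.

```python
# Sketch of an Azure Search data source definition with a
# high-water-mark change detection policy. Connection details
# and names are illustrative placeholders.
data_source = {
    "name": "synapse-ds",
    "type": "azuresql",
    "credentials": {"connectionString": "<connection string>"},
    "container": {"name": "dbo.MyTable"},
    "dataChangeDetectionPolicy": {
        "@odata.type":
            "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
        "highWaterMarkColumnName": "rowversion",
    },
}
```

Without a usable `rowversion`-style column, the indexer has no way to tell new rows from old ones, which is exactly the problem described next.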
Most of the tables in Synapse DW had unique IDs, but the ones that didn't could not be indexed at all. We had to search for other solutions to index those tables. Simply adding unique IDs to the tables was not as simple as it sounds, because the tables were populated by full loads from adls2 and would be wiped out and entirely replaced. Additional steps would have to be taken during the full load to preserve IDs and row versions.
Even worse, we found that adding rowversion columns to the tables was not really feasible either. We would have to point the indexer at the best-looking datetime column and hope it was good enough to detect changes.
All of these concerns together gave us a lot to think about regarding our current solution for indexing and accessing data. Though 50 million rows/day with 3 replicas (S3 tier) was fast, it wasn't fast enough. Several tables had more than a billion rows and could take weeks to index.
Azure Data Explorer
This led us to Azure Data Explorer (ADX). ADX has two massive advantages: it can directly index and query adls2, and it can ingest data significantly faster than either of our earlier solutions. This was particularly appealing because it meant we could do away with the complicated procedure of loading adls2 data into Synapse DW.
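Querying adls2 directly from ADX works through external tables: you declare the schema and point the table at the blob storage path, then query it with KQL like any other table. A hedged sketch, with placeholder account, container, and column names:

```kusto
// Declare an external table over parquet files in adls2.
// Account, container, credentials, and columns are placeholders.
.create external table SampleData (FieldId: long, Name: string)
kind=storage
dataformat=parquet
(
    h@'https://<account>.blob.core.windows.net/<container>;<SAS token>'
)

// Query it like a regular table.
external_table("SampleData")
| where Name contains "widget"
| take 10
```

No copy into a warehouse, no custom uploader: the files stay where they are, and ADX reads them in place.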
In conclusion, Azure Search can be a very handy tool if you need to index smaller data sets with little to no variance in schema. Once you start introducing more varied schemas and data sets, you need to start introducing additional indexes as well. That situation, along with ingesting very large data sets, is where ADX shines.
Join us for part 3, Azure Search, Azure Data Explorer and Big Data – Part 3 of 3, and get in touch for more information.