We’re seeing more and more interest in, and adoption of, the Delta Lake format in large-scale data projects at Catapult Systems. One question that we are often asked is how data scientists, who may not be working in Databricks itself, can access the data stored in a Databricks Delta Lake. Although there is an open-source “deltalake” package for Python, it requires PySpark for data access, adding complexity that a data scientist may not want.

Recently, Databricks released a new Python package called “databricks-sql-connector”. This package uses Databricks SQL endpoints, which were recently promoted to general availability, to query tables in a Delta Lakehouse using native SQL syntax directly from Python.

In this guide, we will explain how to use this package in your own Python projects. We will assume that you have access to: 

  • An Azure Databricks workspace
  • A SQL endpoint in your Azure Databricks workspace connected to a Delta Lake
  • A Delta table that has been defined within your Databricks workspace 

Step 1: Get Connection Data for the Databricks SQL Endpoint

Navigate to the SQL view in your Databricks workspace, and select SQL endpoints from the left-hand menu: 

[Image: Databricks SQL Menu]

This will bring up a list of the SQL endpoints that are available to you. Click on the desired endpoint, and then click on “Connection details”. You will need to copy the “Server hostname” and the “HTTP path” fields:

[Image: SQL Endpoint Connection Details]

Step 2: Create a Databricks Personal Access Token (PAT)

In the left-hand menu, navigate to Settings -> User Settings -> Personal Access Tokens. Click on “Generate new token” and follow the instructions.

Some important notes:

  • All PATs eventually expire. If you are running production workloads with this connector, you will need to create a token rotation process to account for this.
  • Your token will only be available to copy at creation time, so copy it before closing the window that it is shown in.
  • We strongly recommend storing this token in a secure, encrypted location such as Azure Key Vault, as sketched below.
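
If you do keep the token in Azure Key Vault, it can be retrieved at runtime with the azure-identity and azure-keyvault-secrets packages. The snippet below is a minimal sketch; the vault URL and the secret name “databricks-pat” are placeholders, and it assumes your compute environment has an identity with read access to the vault.

```python
# Hypothetical example: retrieve the Databricks PAT from Azure Key Vault.
# The vault URL and secret name are placeholders -- substitute your own.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()  # managed identity, CLI login, etc.
secret_client = SecretClient(
    vault_url="https://<your-key-vault-name>.vault.azure.net/",
    credential=credential,
)

# "databricks-pat" is an assumed secret name for the token created above
databricks_token = secret_client.get_secret("databricks-pat").value
```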

Step 3: Install the Databricks SQL Python Package

In your compute environment (for example, Azure Machine Learning Studio), bring up a terminal. Activate the appropriate virtual environment and install the Databricks SQL connector for Python. You must be running Python 3.7 or higher for this to work:
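
A typical install command, assuming pip is available in the activated environment:

```
pip install databricks-sql-connector
```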

Step 4: Execute the SQL Query

Sample Python code for executing a query against a table in the Delta Lake is shown below. In this example we load the results into a pandas DataFrame, but this can be adjusted to suit specific project needs.
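
The following is a minimal sketch using the connector’s documented databricks.sql interface. The environment variable names and the table name my_database.my_table are placeholders; substitute the server hostname and HTTP path from Step 1 and the token from Step 2.

```python
import os

import pandas as pd
from databricks import sql

# Connection details from Step 1 and the PAT from Step 2.
# The environment variable names are placeholders -- load these values
# however your project manages secrets (e.g. from Azure Key Vault as above).
connection = sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
)

cursor = connection.cursor()
try:
    # "my_database.my_table" is a placeholder Delta table name
    cursor.execute("SELECT * FROM my_database.my_table LIMIT 1000")
    rows = cursor.fetchall()
    # cursor.description follows the DB-API convention; the first element
    # of each entry is the column name
    columns = [col[0] for col in cursor.description]
finally:
    cursor.close()
    connection.close()

# Load the query results into a pandas DataFrame
df = pd.DataFrame(rows, columns=columns)
print(df.head())
```

Because the endpoint accepts standard SQL, filtering, joins, and aggregations can be pushed down into the query itself rather than performed in pandas after the fact, which keeps the amount of data pulled into the Python environment small.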