We’re seeing more and more interest in, and adoption of, the Delta Lake format in large-scale data projects at Catapult Systems. One question that we are often asked is how data scientists, who may not be working in Databricks itself, can access the data stored in a Databricks Delta Lake. Although there is the open source “deltalake” package in Python, the package requires usage of PySpark for data access, adding complexity where a data scientist may not want it.
Recently, Databricks released a new Python package called “databricks-sql-connector”. This package makes use of SQL endpoints, recently promoted to general availability, in Databricks to call tables in a Delta Lakehouse using native SQL syntax, directly within Python.
In this guide, we will explain how to use this package in your own Python projects. We will assume that you have access to:
- An Azure Databricks workspace
- A SQL endpoint in Azure Databricks workspace connected to a Delta Lake
- A Delta table that has been defined within your Databricks workspace
Step 1 – Get Connection Data for the Databricks SQL Endpoint
Navigate to the SQL view in your Databricks workspace, and select SQL endpoints from the left-hand menu:
This will bring up a list of the SQL endpoints that are available to you. Click on the desired endpoint, and then click on “Connection details”. You will need to copy the “Server hostname” and the “HTTP path” fields:
Step 2: Create a Databricks Personal Access Token (PAT)
In the left-hand menu, navigate to Settings -> User Settings -> Personal Access Tokens. Click on “Generate new token” and follow the instructions.
Some important notes:
- All PATs will expire. If you are running production workloads using this, you will need to create a key rotation process to account for this.
- Your token will only be available to copy at creation time, so copy it before closing the window that it is shown in.
- We strongly recommend you store this token in a secure, encrypted location such as Azure Key Vault.
Step 3: Install the Databricks SQL Python Package
In your compute environment (for example in Azure Machine Learning Studio), bring up a terminal. Navigate to the appropriate virtual environment and install the Databricks SQL connector for Python. You must be running Python 3.7 or higher this to work:
pip install databricks-sql-connector
Step 4: Execute the SQL query
Sample Python code to execute a query against a table in the Delta Lake is below. In this example we output results to a Pandas DataFrame, but this can be tweaked depending on specific project needs.
from databricks import sql as dbsql
import pandas as pd
# Configuration variables - from your keyvault and Databricks
databricks_pat = "databricks_pat"
server_hostname = "server_hostname"
http_path = "http_path"
# Define the connection to the SQL endpoint
connection = dbsql.connect(
server_hostname = server_hostname
http_path = http_path
access_token = databricks_pat
# Represent the query as a string
sql_query = "SELECT * FROM database_name.table_name LIMIT 1"
# Execute the query
with connection.cursor() as cursor:
# Get the data
data = cursor.fetchall()
# Get the column names
cursor.execute("SHOW COLUMNS IN database_name.table_name")
columns = cursor.fetchall()
columns = [column for column in columns]
df = pd.DataFrame(data=data, columns=columns)