Tutorial: Reproducible Spark+Delta Tests Without the Hassle
Prerequisite: if you’re not familiar with Docker, I recommend you follow the official introductory guide.
Setting up a Spark instance can be quite the hassle. Add a data catalog like Delta or Apache Iceberg, and the complexity blows up!
But testing catalog-specific queries against a Spark instance should be easy. And, if you’re an automation nerd like me, it should be reproducible as well.
Unfortunately, the traditional method of installing OpenJDK, the JVM, Spark drivers, Hive plugins, and Delta JARs is anything but easy, let alone reproducible. (As a matter of fact, on the eve of publishing this article, the Live Notebooks on PySpark’s official quickstart page failed to launch due to a bug downloading OpenJDK… 😅)
Faced with this problem, I decided to take matters into my own hands — and published a Docker Hub repository for running Spark Connect servers!
Spark Connect
Spark Connect is Apache Spark’s gRPC-based protocol for connecting remote clients to a Spark server. Any client that implements the Spark Connect protocol can connect to the server and execute workloads seamlessly.
This Docker container takes care of all the setup for launching a Spark instance with a Spark Connect server, making the development experience as smooth as possible.
Running the Container
Running it is as straightforward as running any other container. Simply:
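(A sketch, reusing the image name, “delta” tag, and port 15002 that appear in the fixture later on; adjust flags as needed.)

docker run --rm -p 15002:15002 franciscoabsampaio/spark-connect-server:delta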
We use the “delta” tag if we wish to spin up the server with a Delta catalog, and the “iceberg” tag for the Iceberg catalog.
The docker run command will pull the image and start it, producing a generous slew of logs that ends with a message akin to “Spark Connect server started at 0.0.0.0:15002”, letting us know that the Spark instance is LIVE and ready to accept connections.
Now, connecting to it in PySpark is as easy as:
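(A minimal sketch, assuming the server is running locally on port 15002 as above.)

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .remote("sc://localhost:15002") \
    .getOrCreate()

# Quick sanity check that the remote session works.
spark.sql("SELECT 1").show()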
Note: you may have to install the Python packages grpcio, grpcio-status, and protobuf for pyspark to connect successfully.
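If they’re missing, installing them is the usual one-liner (exact versions are up to you):

pip install grpcio grpcio-status protobuf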
The Spark session is ON and fully functional for all your development needs.
Reproducible Tests
To take things a step further, we can manage the Spark server’s lifecycle entirely in our test suite, ensuring reproducible, specified-in-code, end-to-end dependencies.
To do so, we can create a pytest fixture, which is essentially a reusable resource for tests!
import docker
# You can use testcontainers instead of the docker package
# but I find it a bit bloated and buggy
# when working with custom images.
# The docker package is what testcontainers uses
# behind the scenes, and I find it more consistent.
import pytest
import time


# Utility function:
# Once the container has produced the desired message, return.
def wait_for_log(container, message, timeout=30):
    start = time.time()
    while True:
        logs = container.logs().decode("utf-8")
        if message in logs:
            return
        if time.time() - start > timeout:
            raise TimeoutError(f"Message '{message}' not found in logs")
        time.sleep(0.5)


# The default scope is "function",
# which means the fixture is created and destroyed every test.
@pytest.fixture(scope="function")
def spark_connect_server_url():
    docker_client = docker.from_env()
    # Let's run a Spark Connect server with a Delta catalog!
    container = docker_client.containers.run(
        image="franciscoabsampaio/spark-connect-server:delta",
        detach=True,
        ports={'15002/tcp': 15002}
    )
    wait_for_log(container, message="Spark Connect server started")
    # Pass the connection string to whoever is using the fixture.
    yield "sc://localhost:15002"
    # Cleanup
    container.stop()

Now, we can simply pass the fixture to whichever test requires a running Spark Connect server. For example:
from pyspark.sql import SparkSession


def test_basic_read(spark_connect_server_url):  # Use the fixture
    spark = SparkSession.builder \
        .remote(spark_connect_server_url) \
        .getOrCreate()
    table_name = "test_table"
    # CREATE TABLE USING DELTA
    spark.sql(f"""
        CREATE TABLE {table_name} (
            id INT,
            name STRING
        )
        USING DELTA
    """)
    # INSERT INTO TABLE
    spark.sql(f"INSERT INTO {table_name} VALUES (1, 'bird'), (2, 'spark')")
    # SELECT * FROM TABLE
    df = spark.sql(f"SELECT * FROM {table_name} ORDER BY id")
    rows = df.collect()
    # Test conditions
    assert len(rows) == 2
    assert rows[0]["id"] == 1
    assert rows[0]["name"] == "bird"

The yield statement pauses the fixture’s execution until the test finishes, at which point the cleanup code is executed.
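Assuming the fixture and test live in a file pytest can discover (a hypothetical test_spark_delta.py, for instance), running the suite is just:

pytest test_spark_delta.py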
And that’s it — you now have a fully reproducible, containerized Spark + Delta test setup, ready to integrate into any CI/CD pipeline or local test suite.
You can find the container on Docker Hub: 👉 franciscoabsampaio/spark-connect-server
If this saved you time (or frustration), give it a ⭐️ on GitHub or share it with your team — and let me know what feature you’d like supported next!
Happy testing! 🚀