Introduction
Daft is a fast, distributed Python query engine built in Rust.
One of its biggest features is its support for multimodal data.
As such, many developers are interested in getting their hands on daft and playing around with it.
Downloading and running daft on a single node (most likely your local machine) is simple: install daft into your Python project, import it into your script or notebook, and interface with it directly.
Experimenting with daft in a distributed environment, however, becomes quite a bit more challenging.
In order to use distributed daft, developers must use the Ray cluster management software (the SDK and/or the CLI tool).
However, familiarizing oneself with Ray and its intricacies is not simple.
Developers often need fairly arcane knowledge of how Ray intersects with their cloud provider of choice.
The experience can be so challenging that, after multiple attempts, a developer may give up on trying daft in a distributed setting altogether.
Simplifying launches
Ideally, developers should be able to bring their own cloud and get daft running in a distributed setting as quickly as possible.
This is where daft launcher comes into the picture.
Daft launcher is a command-line tool that provides simpler abstractions over Ray, enabling quick ramp-up during experimentation. It is also a great tool during actual development; the ability to quickly spin up and manage clusters is a powerful asset for any data engineer.
This book introduces daft launcher and shows how it helps developers get up and running quickly with daft.
Commands
Daft launcher currently exposes 6 commands to interface with and manage your cluster. They are:
- init-config
- up
- down
- list
- submit
- connect
Succinctly, the idea is that you are able to list clusters (list), start new clusters (up), and tear down existing clusters (down).
You are also able to submit jobs to the cluster (submit) and view the dashboard of a given cluster (connect).
The dashboard gives you access to the Ray web UI, which provides additional information about the status of the cluster and about current and past jobs.
Finally, as a convenience, you are also able to initialize a configuration file (init-config) that is pre-populated with some configuration options used by the other commands.
Let's dive into each command individually.
Init Config
This command is, in essence, the entrypoint to using daft launcher.
This will initialize a new configuration file, named .daft.toml, in the current working directory.
The file itself will contain some default values that you can tune to your liking.
Some of the values are required, while others are optional; each is denoted as such in the specification below.
Example
# initialize the default .daft.toml configuration file
daft init-config
# or, if you want, specify a custom name
daft init-config my-custom-config.toml
Config file specification
Each available configuration option is denoted below, as well as a small blurb on what it does and whether it is required or optional. If it is optional, its default value will be defined as well.
[setup]
# (required)
# The name of the cluster.
name = ...
# (required)
# The cloud provider that this cluster will be created in.
# Has to be one of the following:
# - "aws"
# - "gcp"
# - "azure"
provider = ...
# (optional; default = None)
# The IAM instance profile ARN which will provide this cluster with the permissions it needs to perform its actions.
# Please note that if you don't specify this field, Ray will create an automatic instance profile for you.
# That instance profile will be minimal and may restrict some of the features of Daft.
iam_instance_profile_arn = ...
# (required)
# The AWS region in which to place this cluster.
region = ...
# (optional; default = "ec2-user")
# The ssh user name when connecting to the cluster.
ssh_user = ...
# (optional; default = 2)
# The number of worker nodes to create in the cluster.
number_of_workers = ...
# (optional; default = "m7g.medium")
# The instance type to use for the head and worker nodes.
instance_type = ...
# (optional; default = "ami-01c3c55948a949a52")
# The AMI ID to use for the head and worker nodes.
image_id = ...
# (optional; default = [])
# A list of dependencies to install on the head and worker nodes.
# These will be installed using UV (https://docs.astral.sh/uv/).
dependencies = [...]
[run]
# (optional; default = ['echo "Hello, World!"'])
# Any post-setup commands that you want to run on the head node once it is initialized.
# This is a good location to install any custom dependencies or run some arbitrary script.
setup_commands = [...]
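Putting it all together, a filled-in .daft.toml might look something like the following (the values here are purely illustrative):
[setup]
name = "daft-demo"
provider = "aws"
region = "us-west-2"
ssh_user = "ec2-user"
number_of_workers = 2
instance_type = "m7g.medium"
image_id = "ami-01c3c55948a949a52"
dependencies = ["numpy"]
[run]
setup_commands = ['echo "Hello, World!"']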
Up
This command spins up a cluster given some configuration file. The configuration file contains all of the information that daft launcher requires in order to spin up that specific cluster.
Example
# spin up a cluster using the default .daft.toml configuration file
daft up
# or, if you want, spin up a cluster using a custom configuration file
daft up -c my-custom-config.toml
This command will do a couple of things:
- Firstly, it will reach into your cloud provider and spin up the necessary resources. This includes things such as the worker nodes, security groups, permissions, etc.
- When the nodes are spun up, the ray and daft dependencies will be installed into a python virtual environment.
- Next, any other custom dependencies that you've specified in the configuration file will then be downloaded.
- Finally, the setup commands that you've specified in the configuration file will be run on the head node.
The command will only return successfully when the head node is fully set up.
Even though this command will request the worker nodes to also spin up, it will not wait for them to be spun up before returning.
Therefore, when the command completes and you run daft list, the worker nodes may still be in a "pending" state.
Don't be concerned; they should be fully running within a couple of seconds.
Down
The down command is pretty much the opposite of the up command. It takes the cluster specified in the configuration file and tears it down.
Example
# spin down a cluster using the default .daft.toml configuration file
daft down
# or, if you want, spin down a cluster using a custom configuration file
daft down -c my-custom-config.toml
This command will tear down all instances in the cluster, not just the head node. When each instance has been requested to shut down, the command will return successfully.
List
The list command is extremely helpful for getting some observability into the current state of all of your clusters. List will return a formatted table of all of the clusters that you currently have, both running and terminated. It will tell you each of their instance names, as well as their public IPs (provided that they are still running).
Example
daft list
An example output after running the above command would be:
Running:
- daft-demo, head, i-053f9d4856d92ea3d, 35.94.91.91
- daft-demo, worker, i-00c340dc39d54772d
- daft-demo, worker, i-042a96ce1413c1dd6
The name of the cluster which was booted up is "daft-demo". The cluster consists of 3 instances: 1 head node and 2 worker nodes.
The list command can output multiple clusters as well. For example, let's say I created another configuration file and spun up a new cluster using it.
daft init-config new-cluster.toml
daft up -c new-cluster.toml
Then, after running daft list, the output would be:
Running:
- daft-demo, head, i-053f9d4856d92ea3d, 35.94.91.91
- daft-demo, worker, i-00c340dc39d54772d, 44.234.112.173
- daft-demo, worker, i-042a96ce1413c1dd6, 35.94.206.130
- new-cluster, head, i-0be0db9803bd06652, 35.86.200.101
- new-cluster, worker, i-056f46bd69e1dd3f1, 44.242.166.108
- new-cluster, worker, i-09ff0e1d8e67b8451, 35.87.221.180
Now, let's say I terminated the new cluster using daft down -c new-cluster.toml.
Then, after running daft list, the output would be:
Running:
- daft-demo, head, i-053f9d4856d92ea3d, 35.94.91.91
- daft-demo, worker, i-00c340dc39d54772d, 44.234.112.173
- daft-demo, worker, i-042a96ce1413c1dd6, 35.94.206.130
Shutting-down:
- new-cluster, head, i-0be0db9803bd06652, 35.86.200.101
- new-cluster, worker, i-056f46bd69e1dd3f1, 44.242.166.108
- new-cluster, worker, i-09ff0e1d8e67b8451, 35.87.221.180
The state of the new-cluster has changed from "Running" to "Shutting-down". In a couple of seconds, the state should be finalized to "Terminated".
Submit
The submit command enables you to submit a working directory and command to the remote cluster to be run. The working directory will be zipped prior to being sent over the wire and will then be unzipped on the remote head node.
An important thing to keep in mind is how dependencies are utilized by the source code in the working directory.
During the initial daft up command that you ran, the dependencies should have been specified in the configuration file.
During the cluster's initialization process, the cluster will download all the dependencies into a python virtual environment.
The working directory that you submit will then be run in that virtual environment, thus enabling it to access those pre-downloaded dependencies.
Example
# submit a job using the default .daft.toml configuration file
daft submit -i my-keypair.pem -w my-working-directory
# or, if you want, submit a job using a custom configuration file
daft submit -c my-custom-config.toml -i my-keypair.pem -w my-working-directory
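For example, if your configuration file had (hypothetically) declared dependencies = ["numpy"], a script inside the submitted working directory could rely on that package already being installed in the cluster's virtual environment. A minimal sketch:
# main.py (inside the submitted working directory)
import daft
import numpy as np  # pre-installed on the cluster via the `dependencies` list
# build a small dataframe from locally generated data and run a simple filter
df = daft.from_pydict({"values": np.arange(10).tolist()})
df.where(daft.col("values") > 5).show()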
Connect
The connect command enables you to view the Ray dashboard of a specified cluster that you currently have running.
The way this is done is by establishing a port-forward over SSH from your local machine to the head node of the cluster (connecting localhost:8265 to the remote head's port 8265).
The head node then serves some HTML/CSS/JS that you can view in your browser by pointing it towards localhost:8265.
An important thing to note is that this command requires the appropriate private SSH key to authenticate against the remote head's public key. You will need to pass this private key as an argument to the command.
Example
# establish the port-forward using the default .daft.toml configuration file
daft connect -i my-keypair.pem
# or, if you want, establish the port-forward using a custom configuration file
daft connect -c my-custom-config.toml -i my-keypair.pem
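For reference, the port-forward that daft connect establishes is roughly equivalent to running something like the following yourself (assuming the default ec2-user SSH user and the head node's public IP as reported by daft list):
# hypothetical manual equivalent of `daft connect`
ssh -i my-keypair.pem -N -L 8265:localhost:8265 ec2-user@35.94.91.91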
Sql
Daft now supports a SQL API.
This means that you can run raw SQL queries against your data using daft.
The SQL dialect is the postgres standard.
Example
# run a sql query using the default .daft.toml configuration file
daft sql -- "\"SELECT * FROM my_table\""
# or, if you want, run a sql query using a custom configuration file
daft sql -c my-custom-config.toml -- "\"SELECT * FROM my_table\""
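For reference, the same style of query can also be run directly from Python using daft's SQL API; a minimal local sketch (where the table is simply a DataFrame in scope) might look like this:
import daft
# daft.sql can reference DataFrames that are in scope by their variable name
my_table = daft.from_pydict({"id": [1, 2, 3], "value": ["a", "b", "c"]})
daft.sql("SELECT * FROM my_table").show()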
Example
Okay, let's try our hand at an example project. Let's spin up a cluster and submit a basic job to execute on it.
This project will proceed assuming you're using uv and aws.
However, the concepts should translate to whatever python package manager and cloud provider you choose.
Prerequisites
The following should be installed on your machine:
- The aws cli tool. (Assuming you're using aws as your cloud provider).
- Some type of python package manager. We recommend using uv to manage everything (i.e., dependencies, as well as the python version itself); it's much cleaner and faster than pip.
Permissions
...
Getting started
Run the following commands to initialize your project:
# create the project directory
cd some/working/directory
mkdir launch-test
cd launch-test
# initialize the project
uv init --python 3.12
uv venv
source .venv/bin/activate
# install daft launcher
uv pip install "daft-launcher"
At this point, you'll have a properly set up python project with a pretty basic working directory. It should look something like this:
/
|- .venv/
|- hello.py
|- pyproject.toml
|- README.md
|- .python-version
In your virtual environment, you'll have daft launcher installed.
You can verify this by running daft --version, which should return the latest available version of daft launcher.
You can even try running daft --help to see what commands are available.
Note that the other daft launcher commands may not work just yet. This is most likely because you haven't configured your AWS credentials. There are a couple of different ways of doing so, but for the purposes of this example, let's establish an SSO connection and verify it. Thus, run the following:
# configure your sso
aws configure sso
# login to your sso
aws sso login
This should open up your browser. Accept the requests, return to your terminal, and you should see a success message from the aws cli tool. At this point, your aws cli tool has been configured, and your environment is fully set up.
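If you want to double-check that the credentials are active before moving on, a quick sanity check is to ask AWS who you are authenticated as:
# verify that your AWS credentials are picked up correctly
aws sts get-caller-identity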
Running a job
First, let's just get some boilerplate code out of the way.
Let's create a working directory and move our hello.py file into it.
mkdir src
mv hello.py src
Next, let's import daft and run a simple query inside of hello.py.
import daft
df = daft.from_pydict({ "values": [0, 1, 8] })
df.with_column("result", daft.col("values").cbrt()).show()
Okay, now that we have some basic boilerplate code, let's actually try and run it using daft launcher.
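A minimal end-to-end run, using only the commands covered earlier (and assuming a keypair named my-keypair.pem), might look something like this:
# initialize the default .daft.toml configuration file (then edit it to your liking)
daft init-config
# spin up the cluster that it describes
daft up
# submit the src working directory (containing hello.py) to the cluster
daft submit -i my-keypair.pem -w src
# when you're done, tear the cluster back down
daft down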
Future Plans
The following is a non-exhaustive list of ideas for future improvements to the daft launcher project:
GCP and Azure support
Daft launcher currently only supports AWS. We want to extend the launcher to support GCP and Azure as well.
Improved stories around permissioning
We want to create minimal, stripped-down permission profiles that users can enable and run daft launcher with. Users can, of course, use powerful permission profiles (i.e., admin profiles) with daft launcher; however, this blocks off many users who are trying it out on cloud accounts that they don't own (e.g., employees working on their company's cloud).