Introduction

Daft is a fast, distributed Python query engine built in Rust. One of its biggest features is its support for multimodal data. As such, many developers are interested in getting their hands on Daft and playing around with it.

Downloading and running Daft on a single node (most likely your local machine) is simple: install Daft into your Python project, import it into your script or notebook, and interface with it directly.
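
For instance, a minimal single-node session looks something like this (a sketch; at the time of writing, the engine is published on PyPI as getdaft):

# install daft into your project's environment
pip install getdaft

# run a quick one-liner query against it
python -c "import daft; daft.from_pydict({'values': [1, 2, 3]}).show()"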

Experimenting with Daft in a distributed environment, however, is quite a bit more challenging. To use distributed Daft, developers must use the Ray cluster management software (the SDK and/or the CLI tool), and familiarizing oneself with Ray and its intricacies is not simple: developers often need fairly arcane knowledge of how Ray intersects with their cloud provider of choice. The experience can be so challenging that, after multiple attempts, a developer may give up on trying Daft in a distributed manner altogether.

Simplifying launches

Ideally, developers should be able to bring their own cloud and get Daft running in a distributed setting as quickly as possible. This is where Daft launcher comes into the picture.

Daft launcher is a command-line tool that provides simpler abstractions over Ray, enabling you to get a cluster up and running quickly when experimenting. It is also a great tool during actual development: the ability to quickly spin up and manage clusters is a powerful asset for any data engineer.

This book introduces Daft launcher and shows how it enables developers to get up and running quickly with Daft.

Commands

Daft launcher currently exposes six commands for interfacing with and managing your clusters:

  1. daft init-config
  2. daft up
  3. daft down
  4. daft list
  5. daft submit
  6. daft dashboard

Succinctly, the idea is that you can list your clusters (list), start new clusters (up), and tear down existing clusters (down). You can also submit jobs to a cluster (submit) and view the dashboard of a given cluster (dashboard). The dashboard command gives you access to the Ray web UI, which provides additional insight into the cluster's status and its current and past jobs. Finally, as a convenience, you can initialize a configuration file (init-config), pre-populated with the configuration options that the other commands use.
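
In practice, a typical session strings these commands together roughly as follows (my-keypair.pem and my-working-directory are stand-ins for your own keypair and working directory):

# one-time: create the configuration file, then edit it to your liking
daft init-config

# start a cluster and check on its instances
daft up
daft list

# run a job and watch it through the Ray web UI
daft submit -i my-keypair.pem -w my-working-directory
daft dashboard -i my-keypair.pem

# tear everything down when finished
daft down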

Let's dive into each command individually.

Init Config

This command is, in essence, the entrypoint to using Daft launcher. It initializes a configuration file, named .daft.toml, in the current working directory. The file contains some default values that you can tune to your liking. Some of the values are required, while others are optional; each is denoted as such.

Example

# initialize the default .daft.toml configuration file
daft init-config

# or, if you want, specify a custom name
daft init-config my-custom-config.toml

The contents of the file will be roughly:

[setup]
name = "$NAME - required"
provider = "aws"
region = "$REGION - optional, defaults to us-west-2"
ssh_user = "$SSH_USER - optional, defaults to ec2-user"
number_of_workers = "$NUMBER_OF_WORKERS - optional, defaults to 2 worker nodes"
instance_type = "$INSTANCE_TYPE - optional, defaults to m7g.medium"
image_id = "$IMAGE_ID - optional, defaults to ami-01c3c55948a949a52"
dependencies = ["$DEPENDENCIES - optional, defaults to an empty list"]

[run]
setup_commands = ['echo "Hello, World!"']

Each field in the setup section is annotated with whether it is required or optional. If it is optional, the default value that is used internally is listed as well.

Up

This command spins up a cluster based on a configuration file. The configuration file contains all of the information that Daft launcher requires to spin up that specific cluster.

Example

# spin up a cluster using the default .daft.toml configuration file
daft up

# or, if you want, spin up a cluster using a custom configuration file
daft up -c my-custom-config.toml

This command does a few things:

  1. First, it reaches into your cloud provider and spins up the necessary resources. This includes things such as the worker nodes, security groups, permissions, etc.
  2. Once the nodes are spun up, the Ray and Daft dependencies are installed into a Python virtual environment.
  3. Next, any other custom dependencies that you've specified in the configuration file are installed.
  4. Finally, the setup commands that you've specified in the configuration file are run on the head node.

The command only returns successfully once the head node is fully set up. Even though it also requests the worker nodes to spin up, it does not wait for them before returning. Therefore, when the command completes and you run daft list, the worker nodes may be in a "pending" state. Don't be concerned; they should be fully running within a few seconds.

Down

The down command is essentially the opposite of the up command: it tears down the cluster specified in the configuration file.

Example

# spin down a cluster using the default .daft.toml configuration file
daft down

# or, if you want, spin down a cluster using a custom configuration file
daft down -c my-custom-config.toml

This command tears down all instances in the cluster, not just the head node. Once every instance has been requested to shut down, the command returns successfully.

List

The list command is extremely helpful for getting observability into the current state of all of your clusters. It returns a formatted table of every cluster that you currently have, both running and terminated, showing each instance's name as well as its public IP (provided the instance is still running).

Example

daft list

An example output after running the above command would be:

Running:
  - daft-demo, head, i-053f9d4856d92ea3d, 35.94.91.91
  - daft-demo, worker, i-00c340dc39d54772d
  - daft-demo, worker, i-042a96ce1413c1dd6

The name of the cluster that was booted up is "daft-demo". The cluster comprises three instances: one head node and two worker nodes.

The list command can output multiple clusters as well. For example, let's say I created another configuration file and spun up a new cluster using it.

daft init-config new-cluster.toml
daft up -c new-cluster.toml

Then, after running daft list, the output would be:

Running:
  - daft-demo, head, i-053f9d4856d92ea3d, 35.94.91.91
  - daft-demo, worker, i-00c340dc39d54772d, 44.234.112.173
  - daft-demo, worker, i-042a96ce1413c1dd6, 35.94.206.130
  - new-cluster, head, i-0be0db9803bd06652, 35.86.200.101
  - new-cluster, worker, i-056f46bd69e1dd3f1, 44.242.166.108
  - new-cluster, worker, i-09ff0e1d8e67b8451, 35.87.221.180

Now, let's say I terminated the new cluster using daft down -c new-cluster.toml. Then, after running daft list, the output would be:

Running:
  - daft-demo, head, i-053f9d4856d92ea3d, 35.94.91.91
  - daft-demo, worker, i-00c340dc39d54772d, 44.234.112.173
  - daft-demo, worker, i-042a96ce1413c1dd6, 35.94.206.130
Shutting-down:
  - new-cluster, head, i-0be0db9803bd06652, 35.86.200.101
  - new-cluster, worker, i-056f46bd69e1dd3f1, 44.242.166.108
  - new-cluster, worker, i-09ff0e1d8e67b8451, 35.87.221.180

The state of new-cluster has changed from "Running" to "Shutting-down". Within a few seconds, its state should be finalized as "Terminated".

Submit

The submit command enables you to submit a working directory and a command to the remote cluster to be run. The working directory is zipped before being sent over the wire and is then unzipped on the remote head node.

An important thing to keep in mind is how the source code in your working directory gets its dependencies. The dependencies should be specified in the configuration file before you run daft up; during the cluster's initialization, they are all installed into a Python virtual environment. The working directory that you submit is then run inside that virtual environment, giving it access to those pre-installed dependencies.

Example

# submit a job using the default .daft.toml configuration file
daft submit -i my-keypair.pem -w my-working-directory

# or, if you want, submit a job using a custom configuration file
daft submit -c my-custom-config.toml -i my-keypair.pem -w my-working-directory

Dashboard

The dashboard command enables you to view the Ray dashboard of a cluster that you currently have running. It does this by establishing a port-forward over SSH from your local machine to the cluster's head node (connecting localhost:8265 to the remote head's port 8265). The head node then serves the dashboard's web UI, which you can view by pointing your browser at localhost:8265.

An important thing to note is that this command requires the private SSH key that pairs with the public key installed on the remote head node. You will need to pass the path to this keypair as an argument to the command.
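
For reference, the port-forward that daft dashboard establishes is roughly equivalent to running the following yourself (illustrative only; the head node's public IP comes from daft list, and ec2-user is the default ssh_user from the configuration file):

# forward local port 8265 to port 8265 on the cluster's head node
ssh -N -L 8265:localhost:8265 -i my-keypair.pem ec2-user@35.94.91.91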

Example

# establish the port-forward using the default .daft.toml configuration file
daft dashboard -i my-keypair.pem

# or, if you want, establish the port-forward using a custom configuration file
daft dashboard -c my-custom-config.toml -i my-keypair.pem

Example

Okay, let's try our hand at an example project. Let's spin up a cluster and submit a basic job to execute on it.

This walkthrough assumes you're using uv and AWS; however, the concepts should translate to whichever Python package manager and cloud provider you choose.

Prerequisites

The following should be installed on your machine:

  • The AWS CLI tool (assuming you're using AWS as your cloud provider).
  • Some type of Python package manager. We recommend uv to manage everything (i.e., dependencies, as well as the Python version itself); it's much cleaner and faster than pip.

Permissions

...

Getting started

Run the following commands to initialize your project:

# create the project directory
cd some/working/directory
mkdir launch-test
cd launch-test

# initialize the project
uv init --python 3.12
uv venv
source .venv/bin/activate

# install daft launcher
uv pip install "daft-launcher"

At this point, you'll have a properly set up Python project with a pretty basic working directory. It should look something like this:

/
|- .venv/
|- hello.py
|- pyproject.toml
|- README.md
|- .python-version

In your virtual environment, you'll have Daft launcher installed. You can verify this by running daft --version, which should print the installed version. You can also run daft --help to see which commands are available:
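
# print the installed version of daft launcher
daft --version

# list the available commands
daft --help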

Note that the other Daft launcher commands may not work just yet. This is most likely because you haven't configured your AWS credentials. There are a couple of different ways of doing so, but for the purposes of this example, let's establish an SSO connection and verify it. Run the following:

# configure your sso
aws configure sso

# login to your sso
aws sso login

This should open your browser. Accept the prompts and return to your terminal; you should see a success message from the AWS CLI. At this point, your AWS CLI has been configured, and your environment is fully set up.
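
If you want to double-check that your credentials are active, you can ask the AWS CLI which identity you're authenticated as:

# verify that your credentials resolve to the expected account and role
aws sts get-caller-identity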

Running a job

First, let's just get some boilerplate code out of the way. Let's create a working directory and move our hello.py file into it.

mkdir src
mv hello.py src

Next, let's import Daft and run a simple query inside hello.py:

import daft

# build a small dataframe and add a column containing the cube root of each value
df = daft.from_pydict({ "values": [0, 1, 8] })
df.with_column("result", daft.col("values").cbrt()).show()

Okay, now that we have some basic boilerplate code, let's actually run it using Daft launcher, as sketched below.
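
Putting the earlier commands together, the remaining steps look roughly like this (a sketch; my-keypair.pem stands in for your own keypair, and the default .daft.toml configuration is assumed):

# initialize the configuration file, then tune it to your liking
daft init-config

# spin up the cluster
daft up

# submit the working directory as a job
daft submit -i my-keypair.pem -w src

# tear the cluster down when you're done
daft down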

Future Plans

The following is a non-exhaustive list of ideas for future improvements to the Daft launcher project:

Real-time logs

During development, it is often useful to see the output of a job in real time. Currently, print statements from the job are not streamed back, which can hinder quick debugging. We want to extend the launcher so that, instead of submitting the job and immediately closing the connection, the connection remains open and the head node's stdout is printed in real time. The connection would only be closed when the remote process itself finishes.

Improved detection of local keypairs

Currently, you need to specify the path to the keypair in the configuration file, which is not ideal. Daft launcher should be able to detect the keypair automatically: query the remote instance for the name of its public key, then use that name to find the locally stored private key. This would allow for a more seamless experience when using the launcher. If you have renamed the private key locally, you can always fall back to manually specifying its path.