
How Does DuckDB on the Server Work?

Jason Cole
February 27, 2025

For a long time, Count has relied on DuckDB, running it locally in the browser to enable responsive and powerful analytical workflows. We believe this adoption has made Count the best product for building canvases that go beyond the limitations of static dashboards while inviting collaborators to explore, refine, and build upon the analysis.

With the release of "DuckDB on the server", we're pushing further in this direction by introducing on-demand DuckDB instances in the cloud, bringing DuckDB's capabilities closer to your data.

What is DuckDB on the server?

DuckDB is a technology we have enthusiastically embraced, used, and supported at Count to deliver flexibility and speed in analysis. We already use it in several ways across the product; running it in-browser, for instance, lets you benefit from its strong performance throughout your analysis workflows.

Our challenge has always been keeping you close to the original data for accuracy while still delivering a performant experience. "Big data" is great. Usable and responsive data is better.

With this release we are introducing an intermediate step between your databases (BigQuery, Databricks, etc.) and the computation completed in the browser. Rather than shipping ever-larger payloads to clients ever more frequently, we can now create DuckDB instances closer to your original data source, which then pass more optimised result sets down the analysis chain.

The results are twofold:

  • Performance in canvases can improve dramatically, simply because we no longer have to send unnecessarily large datasets downstream to continue analysis or support user exploration.
  • You can reduce compute costs within your data infrastructure by shifting query load to Count-backed caches instead of repeating large and expensive queries as you manipulate data within a canvas.

How does this work?

We should note first that this is just one element of what you should consider when optimising analysis within Count. Often, the biggest improvements come from simply caching larger queries against your data on a time basis. For instance, if you only need data to be correct within a 24-hour window, you can ensure that the most expensive queries are only triggered once within that window, with every subsequent query running against a Count-hosted cache.
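To illustrate the idea (and only the idea; this is not Count's caching code), here is a minimal sketch of a time-based cache in Python. The 24-hour TTL, the in-process dictionary, and the `run_expensive_query` stand-in are all assumptions made for the example.

```python
import time
import duckdb

CACHE_TTL_SECONDS = 24 * 60 * 60          # assumed 24-hour freshness window
_cache = {}                                # query text -> (timestamp, result)

def run_expensive_query(query: str):
    # Stand-in for an expensive query against the upstream warehouse.
    return duckdb.sql(query).arrow()

def cached_query(query: str):
    now = time.time()
    hit = _cache.get(query)
    if hit and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]                      # still within the window: reuse it
    result = run_expensive_query(query)
    _cache[query] = (now, result)          # refresh the cache window
    return result
```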

Our preference is always to deliver as much data to your browser as can be handled performantly. When you switch a downstream cell to DuckDB, we first test whether you already have (by choice or circumstance) all of the results you need to continue. If you don't, and when it would be more efficient, we can run the query within our server environment instead. We use Parquet and Arrow files to stream data from our stored versions of your data onwards into ephemeral DuckDB VMs within Google Cloud.
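To make that step concrete, here is a small sketch of what running such a query can look like with DuckDB's Python API: reading a Parquet result set over HTTPS (for example via a presigned URL) and executing a follow-on query against it. The URL, column names, and query are placeholders, not Count internals.

```python
import duckdb

# Ephemeral, in-memory DuckDB instance (a stand-in for the server-side VM).
con = duckdb.connect(database=":memory:")

# httpfs lets DuckDB read Parquet directly over HTTP(S),
# e.g. from a presigned object-storage URL.
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Placeholder presigned URL pointing at an upstream result set.
source_url = "https://storage.example.com/results/orders.parquet?signature=..."

# Run the downstream query against the remote Parquet file and pull the
# (much smaller) result back as an Arrow table.
result = con.execute(
    f"""
    SELECT customer_id, sum(amount) AS total
    FROM read_parquet('{source_url}')
    GROUP BY customer_id
    """
).arrow()
```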

As that VM processes your data, we stream it onwards to our storage buckets, from where it can be sent on to the browser for subsequent computation. You see this in cells, and you'll notice the sequential completion as data is streamed down the pipeline. Streaming lets us speed this up rather than waiting for complete datasets to be created, uploaded, and then downloaded.
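As a rough sketch of the hand-off back to object storage, the following shows a result being written to Parquet and pushed to a bucket. The bucket name, object path, and use of the google-cloud-storage client are illustrative assumptions, and the real pipeline streams data incrementally rather than uploading one finished file.

```python
import duckdb
from google.cloud import storage

con = duckdb.connect(database=":memory:")

# Write the result set to a local Parquet file; DuckDB handles
# compression and row-group layout.
con.execute(
    "COPY (SELECT * FROM read_parquet('input.parquet') WHERE amount > 100) "
    "TO 'result.parquet' (FORMAT PARQUET)"
)

# Hand the file onwards to a storage bucket (placeholder names),
# from where the browser can fetch it.
client = storage.Client()
bucket = client.bucket("example-results-bucket")
bucket.blob("canvases/cell-42/result.parquet").upload_from_filename("result.parquet")
```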

[Annotated image: the different cell processing states]

Relying on Arrow enables extremely performant random access to result sets, which aids downstream filtering and sorting of large datasets. Parquet is used selectively on larger result sets to compress data before it is handed over. By the time you see results in the canvas, we have already used DuckDB-WASM to convert those Parquet files back to Arrow so they can be used in their original form on the client.
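The snippet below sketches that conversion and the kind of random access Arrow enables. In the browser this is done by DuckDB-WASM, but the shape of the operation is the same in Python; the file name and column names are placeholders.

```python
import duckdb
import pyarrow.compute as pc

# Convert a (placeholder) Parquet result set back into an Arrow table.
table = duckdb.sql("SELECT * FROM read_parquet('result.parquet')").arrow()

# Arrow's columnar layout makes random access, filtering, and sorting cheap.
page = table.slice(offset=50_000, length=1_000)            # one "page" of rows
filtered = table.filter(pc.greater(table["amount"], 100))  # downstream filter
sorted_view = table.sort_by([("amount", "descending")])    # downstream sort
```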

For the most part, we lean on DuckDB to make the smartest decisions about compression, downloads, and memory-management. The upshot of this is that memory and capacity limits in your account, workspace, and individual cells are sometimes hard to understand and predict. We’ve set these based on our analysis of typical customer workloads, but will happily help find the best solution for you.
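For reference, DuckDB itself exposes knobs like the ones below. The values shown are purely illustrative and are not the limits Count applies to accounts, workspaces, or cells.

```python
import duckdb

con = duckdb.connect(database=":memory:")
# DuckDB's own memory and parallelism settings; illustrative values only.
con.execute("SET memory_limit = '2GB'")
con.execute("SET threads = 4")
```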

How do we assure your data security?

When it comes to moving your critical data between systems, it’s entirely reasonable to want to understand how we are assuring its security.

We take this assurance extremely seriously, and have built a robust process on top of our existing compliance and security processes. In short:

When we instantiate a DuckDB instance in the server process, we build on publicly available and auditable Docker images within Google Cloud. The additional layers Count adds are subject to the same Google security scans, which allow us to quickly remediate any potential third-party security vulnerabilities.

These VMs are powered by Google Cloud Run. They are completely ephemeral, and we don't intentionally persist data between runs. Because instances are shared for speed of execution, it is possible that some data persists between them, so we take additional steps to ensure security.

Upon startup, each instance is granted access to exactly the result sets it requires through presigned URLs within our storage buckets, where data is held encrypted at rest. The instances hold no credentials for any other data, yours or anyone else's.
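A presigned URL grants time-limited, read-only access to a single object and nothing else. The sketch below uses the google-cloud-storage client to generate one; the library choice, bucket, object path, and expiry are assumptions for illustration rather than a description of Count's setup.

```python
from datetime import timedelta
from google.cloud import storage

client = storage.Client()
blob = client.bucket("example-results-bucket").blob("canvases/cell-42/input.parquet")

# Time-limited, read-only URL for exactly one object; the DuckDB instance
# never receives broader storage credentials.
url = blob.generate_signed_url(
    version="v4",
    expiration=timedelta(minutes=15),
    method="GET",
)
```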

On top of that isolation, we also actively monitor the resources accessed by the DuckDB processes against a pre-calculated whitelist based on the parent cells in the canvas. If, through malicious or erroneous action, a process attempts to read beyond this, it is immediately terminated and no results are returned onwards into the canvas.
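In spirit, the check is simple. This hypothetical sketch is not Count's monitoring code, only an illustration of enforcing a pre-calculated allowlist; the URLs and function are made up for the example.

```python
import sys

# Hypothetical allowlist derived from the cell's parent cells in the canvas.
ALLOWED_URLS = {
    "https://storage.example.com/results/cell-40.parquet",
    "https://storage.example.com/results/cell-41.parquet",
}

def check_access(requested_url: str) -> None:
    # Any access outside the allowlist stops the process before results
    # can flow back into the canvas.
    if requested_url not in ALLOWED_URLS:
        sys.exit("blocked: attempted access outside the allowlist")
```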

We believe this appropriately addresses the risks of sharing instances, and that these steps (among others not described here) let us give you the performance of compute close to your data without exposing you to unacceptable risk.

Your data is yours. We keep it that way.