This Week in Databend #82

Databend is a modern cloud data warehouse, serving your massive-scale analytics needs at low cost and complexity. Open source alternative to Snowflake. Also available in the cloud: https://app.databend.com .

Special Note: This Week in Databend will be gradually migrated to the Databend Blog. We will keep the content in sync until the final migration is complete.

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

AST

  • select from stage support uri with connection options (#9926)

Catalog

  • Iceberg/create-catalog (#9017)

Expression

  • type decimal support agg func min/max (#10085)
  • add sum/avg for decimal types (#10059)

Pipeline

  • enrich core pipelines processors (#10098)

Query

  • create stage, select stage, copy, infer_schema support named file format (#10084)
  • query result cache (#10042)

Storage

  • table data cache (#9772)
  • use drop_table_by_id api in drop all (#10054)
  • native storage format support nested data types (#9798)

Code Refactoring 🎉

Meta

  • add compatible layer for upgrade (#10082)
  • More elegant error handling (#10112, #10114, etc.)

Cluster

  • support exchange sorting (#10149)

Executor

  • add check processor graph completed (#10166)

Planner

  • apply constant folder at physical plan builder (#9889)

Query

  • use accumulating to impl single state aggregator (#10125)

Storage

  • adopt OpenDAL's batch delete support (#10150)
  • adopt OpenDAL query based metadata cache (#10162)

Build/Testing/CI Infra Changes 🔌

  • release deb repository (#10080)
  • release with systemd units (#10145)

Bug Fixes 🔧

Expression

  • no longer return Variant as common super type (#9961)
  • allow auto cast from string and variant (#10111)

Cluster

  • fix limit query hang in cluster mode (#10006)

Storage

  • wrong column statistics when contain tuple type (#10068)
  • compact not work as expected with add column (#10070)
  • fix add column min/max stat bug (#10137)

What's On In Databend

Stay connected with the latest news about Databend.

Query Result Cache

As of this week, Databend supports caching of query results!

             ┌─────────┐ 1  ┌─────────┐ 1
             │         ├───►│         ├───►Dummy───►Downstream
Upstream────►│Duplicate│ 2  │         │ 3
             │         ├───►│         ├───►Dummy───►Downstream
             └─────────┘    │         │
                            │ Shuffle │
             ┌─────────┐ 3  │         │ 2  ┌─────────┐
             │         ├───►│         ├───►│  Write  │
Upstream────►│Duplicate│ 4  │         │ 4  │ Result  │
             │         ├───►│         ├───►│  Cache  │
             └─────────┘    └─────────┘    └─────────┘
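One way to read the diagram: each Duplicate step copies every upstream block to two outputs, and Shuffle sends one copy on to the downstream consumer and the other to the result cache writer. Here is a minimal Rust sketch of that fan-out. `Block` and `duplicate_and_route` are hypothetical names for illustration and are not Databend's actual processor API.

```rust
/// Hypothetical stand-in for a block of query result rows.
#[derive(Clone, Debug, PartialEq)]
pub struct Block(pub Vec<i64>);

/// Fan each upstream block out to two destinations: the downstream
/// consumer and a buffer destined for the result cache.
/// Returns `(downstream, cache_buffer)`.
pub fn duplicate_and_route(upstream: Vec<Block>) -> (Vec<Block>, Vec<Block>) {
    let mut downstream = Vec::with_capacity(upstream.len());
    let mut cache_buffer = Vec::with_capacity(upstream.len());
    for block in upstream {
        // "Duplicate": make one copy for the cache writer.
        cache_buffer.push(block.clone());
        // "Shuffle": the original block continues to the downstream consumer.
        downstream.push(block);
    }
    (downstream, cache_buffer)
}
```

Calling `duplicate_and_route` with two blocks returns two vectors holding the same two blocks in the same order, so the query output is unchanged while the cache writer receives its own copy.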

Table Data Cache

Databend now supports table data cache:

  • disk cache: raw (compressed) column data of a data block.
  • in-memory cache (experimental): deserialized column objects of a data block.

For cache-friendly workloads, the performance gains are significant.
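The two tiers can be pictured as a lookup chain: try the in-memory cache of deserialized columns first, then the disk cache of raw bytes, and only then go to remote storage. The Rust sketch below illustrates that order under simplifying assumptions. `TableDataCache`, `read_column` and the placeholder `deserialize` function are hypothetical, and both tiers are plain hash maps rather than Databend's real cache implementation.

```rust
use std::collections::HashMap;

/// Hypothetical two-tier cache: deserialized columns in memory,
/// raw (still compressed) column bytes in a map standing in for disk.
#[derive(Default)]
pub struct TableDataCache {
    in_memory: HashMap<String, Vec<i64>>,
    on_disk: HashMap<String, Vec<u8>>,
    /// Number of times remote storage had to be read.
    pub remote_reads: u64,
}

/// Placeholder for decompression plus deserialization:
/// each raw byte simply becomes one column value.
fn deserialize(raw: &[u8]) -> Vec<i64> {
    raw.iter().map(|b| i64::from(*b)).collect()
}

impl TableDataCache {
    /// Read a column, consulting the in-memory tier, then the disk tier,
    /// and calling `fetch_remote` only when both tiers miss.
    pub fn read_column<F>(&mut self, key: &str, fetch_remote: F) -> Vec<i64>
    where
        F: FnOnce() -> Vec<u8>,
    {
        // Tier 1: deserialized column objects.
        if let Some(column) = self.in_memory.get(key) {
            return column.clone();
        }
        // Tier 2: raw column bytes.
        let cached_raw = self.on_disk.get(key).cloned();
        let raw = match cached_raw {
            Some(raw) => raw,
            None => {
                // Both tiers missed: read remote storage and fill the disk tier.
                self.remote_reads += 1;
                let raw = fetch_remote();
                self.on_disk.insert(key.to_string(), raw.clone());
                raw
            }
        };
        // Fill the in-memory tier so the next read skips deserialization.
        let column = deserialize(&raw);
        self.in_memory.insert(key.to_string(), column.clone());
        column
    }
}
```

Reading the same key twice calls `fetch_remote` only once. A repeated read is the cache-friendly case in which the gains appear.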

Deb Source & Systemd Support

Databend now offers an official Deb package source and supports managing the service with systemd.

For DEB822 Source Format:

sudo curl -L -o /etc/apt/sources.list.d/datafuselabs.sources https://repo.databend.rs/deb/datafuselabs.sources
sudo apt update
sudo apt install databend
sudo systemctl start databend-meta
sudo systemctl start databend-query

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Service Activation Progress Report

When a Query/Meta node starts, it should run the necessary checks and print their results explicitly, so that users can diagnose faults and confirm status.

Example:

storage check succeed
meta check failed: timeout, no response. endpoints: xxxxxxxx .
status check failed: address already in use.
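As a rough illustration of the idea, the Rust sketch below collects the outcome of each check and renders one line per check in the style of the example. `CheckResult`, `render_report` and `all_passed` are hypothetical names, and the rule that every check must pass before the node serves traffic is an assumption, not something the issue specifies.

```rust
/// Outcome of one hypothetical startup check.
pub struct CheckResult {
    pub name: &'static str,
    pub outcome: Result<(), String>,
}

/// Render one report line per check, mirroring the example output.
pub fn render_report(results: &[CheckResult]) -> Vec<String> {
    results
        .iter()
        .map(|r| match &r.outcome {
            Ok(()) => format!("{} check succeed", r.name),
            Err(reason) => format!("{} check failed: {}", r.name, reason),
        })
        .collect()
}

/// Assumption: a node is considered ready only if every check passed.
pub fn all_passed(results: &[CheckResult]) -> bool {
    results.iter().all(|r| r.outcome.is_ok())
}
```

A caller would build one `CheckResult` per subsystem (storage, meta, status), print the rendered lines at startup, and use `all_passed` to decide whether to continue.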

Issue 10193: Feature: output the necessary progress when starting a query/meta node

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, ariesdevil, b41sh, Big-Wuu, BohuTANG, cameronbraid,
Chasen-Zhang, ClSlaid, dantengsky, drmingdrmer, everpcpc, johnhaxx7,
lichuang, mergify[bot], PsiACE, RinChanNOWWW, soyeric128, sundy-li,
suyanhanx, TCeason, Xuanwo, xudong963, youngsofun, zhang2014,
zhyass

Connect With Us

We'd love to hear from you. Feel free to run the code and see if Databend works for you. Submit an issue with your problem if you need help.

DatafuseLabs Community is open to everyone who loves data warehouses. Please join the community and share your thoughts.

This Week in Databend #81

Databend is a modern cloud data warehouse, serving your massive-scale analytics needs at low cost and complexity. Open source alternative to Snowflake. Also available in the cloud: https://app.databend.com .

What's New

Check out what we've done this week to make Databend even better for you.

Accepted RFCs 🛫

  • rfc: query result cache (#10014)

Features & Improvements ✨

Planner

  • support EXPLAIN ANALYZE statement to profile query execution (#10023)
  • derive new filter and push down (#10021)

Query

  • alter table add/drop column SQL support (#9851)
  • new table function infer_schema (#9936)
  • add privilege check for select (#9924)
  • improve signed numeric keys (#9987)
  • support to parse jwt metadata and add multiple identity issuer configuration (#9971)
  • support create file format (#10009)

Storage

  • adopt OpenDAL's native scan support (#9985)
  • add drop_table_by_id api (#9990)

Expression

  • add operation for decimal (#9926)

Functions

  • support array_any function (#9953)
  • support array_sort (#9941)

Sqllogictest

  • add time travel test for alter table (#9939)

Code Refactoring 🎉

Meta

  • move application level types such as user/role/storage-config to crate common-meta/app (#9944)
  • fix abuse of ErrorCode (#10056)

Query

  • transform_sort_merge uses heap to sort blocks (#10047)

Storage

  • introduction of FieldIndex and ColumnId types for clear differentiation of use (#10017)

Build/Testing/CI Infra Changes 🔌

  • run benchmark for clickbench result format (#10019)
  • run benchmark both s3 & fs (#10050)

Bug Fixes 🔧

Privilege

  • add privileges on system.one to PUBLIC by default (#10040)

Catalog

  • parts were not distributed evenly (#9951)

Planner

  • type assertion failed on subquery (#9937)
  • enable outer join to inner join optimization (#9943)
  • fix limit pushdown outer join (#10043)

Query

  • fix add column update bug (#10037)

Storage

  • fix sub-column of added-tuple column return default 0 bug (#9973)
  • new bloom filter that binds index with Column Id instead of column name (#10022)

What's On In Databend

Stay connected with the latest news about Databend.

RFC: Query Result Cache

Caching the results of queries against data that doesn't update frequently can greatly reduce response time. Once cached, the result will be returned in a much shorter time if you run the query again.
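To make the idea concrete, here is a minimal Rust sketch of a result cache. It assumes the cache is keyed by the exact SQL text and kept in memory; the RFC's real key design and storage are not described here, and `ResultCache`, `get_or_run` and `invalidate` are hypothetical names.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory result cache keyed by the exact SQL text.
#[derive(Default)]
pub struct ResultCache {
    entries: HashMap<String, Vec<String>>,
    pub hits: u64,
    pub misses: u64,
}

impl ResultCache {
    /// Return the cached rows for `sql`, or run `execute` once and cache its rows.
    pub fn get_or_run<F>(&mut self, sql: &str, execute: F) -> Vec<String>
    where
        F: FnOnce() -> Vec<String>,
    {
        if let Some(rows) = self.entries.get(sql) {
            self.hits += 1;
            return rows.clone();
        }
        self.misses += 1;
        let rows = execute();
        self.entries.insert(sql.to_string(), rows.clone());
        rows
    }

    /// Drop a cached result, for example when the underlying data changes.
    pub fn invalidate(&mut self, sql: &str) {
        self.entries.remove(sql);
    }
}
```

Running the same query text twice through `get_or_run` executes it only once. That is the response-time saving described above, and it only helps when the underlying data changes infrequently.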

How to Write a Scalar / Aggregate Function

Did you know that you can enhance the power of Databend by creating your own scalar or aggregate functions? Fortunately, it's not a difficult task!

The following guides are intended for Rust developers and Databend users who want to create their own workflows. The guides provide step-by-step instructions on how to create and register your own functions using Rust, along with code snippets and examples of various types of functions to walk you through the process.

Profile-Guided Optimization

Profile-guided optimization (PGO) is a compiler optimization technique that collects execution data while the program runs and uses it to tailor optimizations to both hot and cold code paths.

In this blog, we'll guide you through the process of optimizing Databend binary builds using PGO. We'll use Databend's SQL logic tests as an example to illustrate the step-by-step procedure.

Please note that PGO always requires generating perf data using workloads that are statistically representative. However, there's no guarantee that performance will always improve. Decide whether to use it based on your actual needs.

Learn More

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

To make our documentation clearer and easier to understand, we plan to restructure our function-related documentation to follow the same format as DuckDB's documentation. This involves breaking down the task into smaller sub-tasks based on function categories, so that anyone who wants to help improve Databend's documentation can easily get involved.

Issue 10029: Tracking: re-org the functions doc

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, b41sh, Big-Wuu, BohuTANG, dantengsky, dependabot[bot],
drmingdrmer, everpcpc, flaneur2020, johnhaxx7, leiysky, lichuang,
mergify[bot], PsiACE, RinChanNOWWW, soyeric128, sundy-li, TCeason,
wubx, Xuanwo, xudong963, xxchan, youngsofun, yufan022,
zhang2014, ZhiHanZ, zhyass

This Week in Databend #80

Databend is a modern cloud data warehouse, serving your massive-scale analytics needs at low cost and complexity. Open source alternative to Snowflake. Also available in the cloud: https://app.databend.com .

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

Meta

  • add databend-meta config grpc_api_advertise_host (#9835)

AST

  • select from stage with files/pattern (#9877)
  • parse decimal type (#9894)

Expression

  • add Decimal128 and Decimal256 type (#9856)

Functions

  • support array_indexof (#9840)
  • support array function array_unique, array_distinct (#9875)
  • support array aggregate functions (#9903)

Query

  • add column id in TableSchema; use column id instead of index when read and write data (#9623)
  • support view in system.columns (#9853)

Storage

  • ParquetTable support topk optimization (#9824)

Sqllogictest

  • leverage sqllogictest to benchmark tpch (#9887)

Code Refactoring 🎉

Meta

  • remove obsolete meta service api read_msg() and write_msg() (#9891)
  • simplify UserAPI and RoleAPI by introducing a method update_xx_with(id, f: FnOnce) (#9921)

Cluster

  • split exchange source to reader and deserializer (#9805)
  • split and eliminate the status for exchange transform and sink (#9910)

Functions

  • rename some array functions add array_ prefix (#9886)

Query

  • TableArgs preserve info of positioned and named args (#9917)

Storage

  • ParquetTable list file in read_partition (#9871)

Build/Testing/CI Infra Changes 🔌

  • support for running benchmark on PRs (#9788)

Bug Fixes 🔧

Functions

  • fix nullable and or domain cal (#9928)

Planner

  • fix slow planner when ndv error backtrace (#9876)
  • fix order by contains aggregation function (#9879)
  • prevent panic when delete with subquery (#9902)

Query

  • fix insert default value datatype (#9816)

What's On In Databend

Stay connected with the latest news about Databend.

Why You Should Try Sccache

Sccache is a ccache-like project started by the Mozilla team. It supports C/C++, Rust and other languages, and stores caches locally or in a cloud storage backend. The community first added native support for the GitHub Action Cache Service to Sccache in version 0.3.3, then improved it in v0.4.0-pre.6 so that production CI can now use it.

Now, OpenDAL, open-sourced by Datafuse Labs, acts as the storage access layer for sccache to interface with various storage services (s3/gcs/azblob etc.).

Learn More

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Try using build-info

To get information about git commits, build options and credits, we now use vergen and cargo-license.

build-info collects build information about your Rust crate. It might be possible to use it to refactor the relevant logic in common-building.

pub struct BuildInfo {
    pub timestamp: DateTime<Utc>,
    pub profile: String,
    pub optimization_level: OptimizationLevel,
    pub crate_info: CrateInfo,
    pub compiler: CompilerInfo,
    pub version_control: Option<VersionControl>,
}

Issue 9874: Refactor: Try using build-info

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, ariesdevil, b41sh, BohuTANG, dependabot[bot], drmingdrmer,
everpcpc, flaneur2020, johnhaxx7, leiysky, lichuang, mergify[bot],
PsiACE, RinChanNOWWW, soyeric128, sundy-li, TCeason, Xuanwo,
xudong963, youngsofun, zhang2014

This Week in Databend #79

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

AST

  • add syntax about parsing presign options with content type (#9771)

Format

  • add TSV file format back (#9732)

Functions

  • support array functions prepend and append (#9844)
  • support array concat (#9804)

Query

  • add topn runtime filter in native storage format (#9738)
  • enable hashtable state pass from partial to final (#9809)

Storage

  • add pruning stats to EXPLAIN (#9724)
  • cache bloom index object (#9712)

Code Refactoring 🎉

  • 'select from stage' use ParquetTable (#9801)

Meta

  • expose a single "kvapi" as public interface (#9791)
  • do not remove the last node from a cluster (#9781)

AST/Expression/Planner

  • unify Span and Result (#9713)

Executor

  • merge simple pipe and resize pipe (#9782)

Bug Fixes 🔧

Base

  • fix not linux and macos jemalloc fallback to std (#9786)

Config

  • fix table_meta_cache can't be disabled (#9767)

Meta

  • when import data to meta-service dir, the specified "id" has to be one of the "initial_cluster" (#9755)

Query

  • fix and refactor aggregator (#9748)
  • fix memory leak for data port (#9762)
  • fix panic when cast jsonb to string (#9813)

Storage

  • fix up max_file_size may oom (#9740)

What's On In Databend

Stay connected with the latest news about Databend.

DML Command - UPDATE

Modifies rows in a table with new values.

Note: Databend guarantees data integrity. In Databend, Insert, Update, and Delete operations are guaranteed to be atomic, which means that all data in the operation must succeed or all must fail.

Syntax

UPDATE <table_name>
SET <col_name> = <value> [ , <col_name> = <value> , ... ]
    [ FROM <table_name> ]
    [ WHERE <condition> ]

Learn More

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Support Arrow Flight SQL Protocol

Currently Databend supports the MySQL protocol, and it would be great if Databend could support the Arrow Flight SQL protocol as well.

Typically, a lakehouse stores data in Parquet files. To serve results over the MySQL protocol, Databend has to deserialize from Parquet to Arrow and then convert to MySQL data types. On the caller's end, users work with data frames or MySQL result iterators, which requires yet another round of type serialization. With Arrow Flight SQL, all of these back-and-forth serialization costs can be avoided.

Issue 9832: Feature: Support Arrow Flight SQL protocol

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, ariesdevil, b41sh, BohuTANG, dantengsky, dependabot,
drmingdrmer, everpcpc, flaneur2020, johnhaxx7, leiysky, mergify[bot],
PsiACE, RinChanNOWWW, soyeric128, sundy-li, TCeason, Xuanwo,
youngsofun, yufan022, zhang2014

This Week in Databend #78

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

SQL

  • eliminate extra group by scalars (#9706)

Query

  • add privilege check for insert/delete/optimize (#9664)
  • enable empty projection (#9675)
  • add aggregate limit in final aggregate stage (#9716)
  • add optional column names to create/alter view statement (#9715)

Storage

  • add prewhere support in native storage format (#9600)

Code Refactoring 🎉

IO

  • move io constants to common/io (#9700)
  • refine fuse/io/read (#9711)

Planner

  • rename Scalar to ScalarExpr (#9665)

Storage

  • refactor cache layer (#9672)
  • pruner.rs -> fuse_bloom_pruner.rs (#9710)
  • make pruner hierarchy to chain (#9714)

Build/Testing/CI Infra Changes 🔌

  • support setup minio storage & external s3 storage in docker image (#9676)

Bug Fixes 🔧

Expression

  • fix missing simple_cast (#9671)

Query

  • fix efficiently_memory_final_aggregator result is not stable (#9685)
  • fix max_result_rows only limit output results nums (#9661)
  • fix query hang in two level aggregator (#9694)

Storage

  • may get wrong datablocks if not sorted by output schema (#9470)
  • bloom filter is using wrong cache key (#9706)

What's On In Databend

Stay connected with the latest news about Databend.

Databend All-in-One Docker Image

Databend Docker Image now supports setting up MinIO storage and external AWS S3 storage.

Now you can easily use a Docker image for your first experiment with Databend.

Run with MinIO as backend

docker run \
    -p 8000:8000 \
    -p 9000:9000 \
    -e MINIO_ENABLED=true \
    datafuselabs/databend

Run with self-managed query config

docker run \
    -p 8000:8000 \
    -e DATABEND_QUERY_CONFIG_FILE=/etc/databend/mine.toml \
    -v query_config_file:/etc/databend/mine.toml \
    datafuselabs/databend

Learn More

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Vector search captures the meaning and context of unstructured data, and is commonly used for text or image processing, enabling the use of semantics to find similar results and obtain more valid results than traditional keyword retrieval.

Databend plans to provide users with a richer and more efficient means of querying by supporting vector search, and the introduction of Faiss Index may be an initial solution.
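For readers new to the topic, the core operation is finding the stored vector closest to a query vector. The Rust sketch below does this by exhaustive scan with cosine similarity. It is only a baseline illustration, not how Faiss or Databend would implement it, and `cosine_similarity` and `nearest` are hypothetical helper names.

```rust
/// Cosine similarity between two vectors.
/// Returns 0.0 when the lengths differ, a vector is empty, or a norm is zero.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Index of the stored vector most similar to `query`, found by scanning
/// every vector. Returns `None` when `vectors` is empty.
pub fn nearest(query: &[f32], vectors: &[Vec<f32>]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, v) in vectors.iter().enumerate() {
        let score = cosine_similarity(query, v);
        // Keep the first vector seen with the strictly highest score.
        if best.map_or(true, |(_, s)| score > s) {
            best = Some((i, score));
        }
    }
    best.map(|(i, _)| i)
}
```

The scan compares the query with every stored vector, so its cost grows linearly with the number of vectors. A dedicated vector index is meant to reduce that cost on large datasets.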

Issue 9699: feat: vector search (Faiss index)

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

We're gearing up for the v0.9 release of Databend. Stay tuned.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, ariesdevil, b41sh, BohuTANG, dantengsky, dependabot[bot],
everpcpc, flaneur2020, johnhaxx7, leiysky, mergify[bot], PsiACE,
RinChanNOWWW, sandflee, sundy-li, xudong963, zhang2014, zhyass
