spark-to-clickhouse-sink

A thick, write-only client for writing across several ClickHouse MergeTree tables located on different shards.
It is a good alternative to writing via the ClickHouse Distributed engine, which has proven to be a bad idea for several reasons.

The core functionality is the writer. It works on top of Apache Spark and takes a DataFrame as input.
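For a plain batch job, usage boils down to constructing the writer and handing it a DataFrame. The sketch below is illustrative only: `SparkToClickHouseWriter` and its `write` method are the calls shown in this README, while the Parquet source, the app name, and the `<my_conf>` placeholder stand in for whatever input and configuration you actually have.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("batch-to-clickhouse")
  .getOrCreate()

// Any DataFrame can serve as input; a Parquet file stands in here.
val df = spark.read.parquet("<path/to/input>")

// <my_conf> is the writer configuration placeholder used throughout this README.
val writer = new SparkToClickHouseWriter(<my_conf>)
writer.write(df)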

Streaming and Exactly-Once Semantics

The writer can also write data without duplicates from a repeatable source. For example, this is very useful for achieving exactly-once semantics (EOS) when a Kafka-to-ClickHouse sink is needed. It is a good alternative to the ClickHouse Kafka engine.

To make it work, the writer needs the following:

Here is pseudo-code, written with Spark Structured Streaming (ver. 2.4+), showing how the writer could be used to consume data from Kafka and insert it into ClickHouse:

import org.apache.spark.sql.DataFrame

val streamDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "<brokers>")
  .option("subscribe", "<topics>")
  .load()

val writer = new SparkToClickHouseWriter(<my_conf>)

// foreachBatch hands each micro-batch to the writer as a regular DataFrame.
streamDF.writeStream
  .foreachBatch { (df: DataFrame, batchId: Long) => writer.write(df) }
  .start()
  .awaitTermination()
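A note on why this works: with a repeatable source such as Kafka, Spark replays a failed micro-batch with the same contents and the same batch id. That replay guarantee is what allows the writer to recognize data it has already inserted and avoid writing duplicates, which is the property the exactly-once behavior described above relies on.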

Refs

There is a talk about the problems with the ClickHouse Distributed and Kafka engines, and about the reasons that forced us to implement this utility library.
