
[Feature] [InfluxDB Source] add read by chunk #6808

Open · wants to merge 5 commits into base: dev

Conversation

15767714253

Purpose of this pull request

This pull request enhances the InfluxDB source connector by introducing chunked data retrieval. The feature allows the connector to efficiently process large datasets by querying and fetching data in chunks.

Does this PR introduce any user-facing change?

No. The existing functionality of the InfluxDB source connector remains unchanged. Users who upgrade to a version containing this patch gain the additional option to enable chunked data retrieval for performance improvements, but this is not a breaking change, and the feature is opt-in.

How was this patch tested?

I have run multiple synchronizations in our company's production environment, against a single table containing more than 300 million rows. Note, however, that this feature puts a significant load on InfluxDB itself, proportional to the amount of data read.

Check list

@davidzollo davidzollo added the First-time contributor First-time contributor label May 7, 2024
@Hisoka-X Hisoka-X added feature New feature influxdb labels May 8, 2024
@Hisoka-X
Member

Hisoka-X commented May 8, 2024

cc @hailin0 @EricJoy2048

@zhilinli123
Contributor

zhilinli123 commented May 8, 2024

### For SeaTunnel Zeta Engine

> 1. You need to ensure that the [influxDB connector jar package](https://mvnrepository.com/artifact/org.apache.seatunnel/connector-influxdb) has been placed in directory `${SEATUNNEL_HOME}/lib/`.

## Key features
Contributor

Key Features


> 1. You need to ensure that the [influxDB connector jar package](https://mvnrepository.com/artifact/org.apache.seatunnel/connector-influxdb) has been placed in the directory `${SEATUNNEL_HOME}/lib/`.

## Key features
Contributor

Key Features


## Options

| name | type | required | default value |
Contributor

[screenshot] Refer to the MySQL connector documentation for the options table format.

```
    value = INT
    rt = STRING
    time = BIGINT
}
```
Contributor

identical?

Author

Using split_column is for shard querying, using chunk_size is for chunk querying, and using neither is for regular querying.
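For illustration, the three modes described above might be selected in a source config along these lines (a hedged sketch: the connection options follow the existing InfluxDB source docs, `chunk_size` is the new option this PR adds, and `split_column` stands in for the shard-query option; exact names should be checked against the final docs):

```
source {
  InfluxDB {
    url = "http://localhost:8086"
    database = "test"
    sql = "select value, rt, time from test_table"
    # shard querying: set split_column (and omit chunk_size)
    # split_column = "value"
    # chunk querying (this PR): set chunk_size (and omit split_column)
    chunk_size = 10000
    # regular querying: set neither
  }
}
```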

## Key features

- [x] [batch](../../concept/connector-v2-features.md)
- [ ] [stream](../../concept/connector-v2-features.md)
- [x] [exactly-once](../../concept/connector-v2-features.md)
- [x] [column projection](../../concept/connector-v2-features.md)
- [x] [parallelism](../../concept/connector-v2-features.md)
- [ ] [support multiple table reading](../../concept/connector-v2-features.md)
- [x] [parallelism](../../concept/connector-v2-features.md)
Member

[screenshot]

Repeated

```java
                },
                () -> {
                    log.error("this chunk reader influxDB complete");
```
Member

log.error?

Author

log.info

```java
        read(split, output);
    public void pollNext(Collector<SeaTunnelRow> output) throws InterruptedException {
        // reader influxDB By chunk
        if (StringUtils.isEmpty(config.getSplitKey()) && config.getChunkSize() > 0) {
```
Member

  1. Even if users set `split_key`, they can still use chunk read mode. Is that the intended behavior?
  2. I think chunk mode is the better way to read. A better approach would be to give `chunk_size` a default value when the user has not set one, and then read the data uniformly through the `readByChunkSize` method.

Author

I think this is not advisable, because I believe that chunked reading puts a lot of performance pressure on InfluxDB, and some users would still prefer to shift the pressure to SeaTunnel.

Member

> I think this is not advisable, because I believe that chunked reading puts a lot of performance pressure on InfluxDB, and some users would still prefer to shift the pressure to SeaTunnel.

Can the pressure on InfluxDB be relieved by reducing parallelism (e.g. setting parallelism to 1) or by configuring a speed limit in the env config? See https://seatunnel.apache.org/docs/2.3.5/concept/speed-limit.
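Per the linked speed-limit doc, such a cap is configured in the env block roughly like this (a sketch; the `read_limit` keys follow the SeaTunnel speed-limit documentation and the values are illustrative):

```
env {
  parallelism = 1
  # throttle reads so the source does not overwhelm InfluxDB
  read_limit.rows_per_second = 400
  read_limit.bytes_per_second = 1048576
}
```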

Author

The parallelism was already 1; the load has nothing to do with read speed, but with the cost of InfluxDB's own chunk operation.


```java
    private void readByChunkSize(InfluxDBSourceSplit split, Collector<SeaTunnelRow> output) {
        influxdb.query(
                new Query(split.getQuery(), config.getDatabase()),
```
Member

Using CountDownLatch is not a good idea. You can update the code like this (then `readByChunkSize` becomes a synchronous method and will not return until the reading is completed):

```java
final CompletableFuture<Void> queryCompleteFuture = new CompletableFuture<>();

influxdb.query(
        new Query(split.getQuery(), config.getDatabase()),
        config.getChunkSize(),
        (cancellable, queryResult) -> {
            if (cancellable.isCanceled()) {
                log.info("this chunk reader influxDB is canceled");
                queryCompleteFuture.complete(null);
                return;
            }
            if (queryResult.hasError()) {
                log.error(
                        "this chunk reader influxDB result has error [{}]",
                        queryResult.getError());
                queryCompleteFuture.completeExceptionally(
                        new InfluxdbConnectorException(/* error code & message */));
                return;
            }
            for (QueryResult.Result result : queryResult.getResults()) {
                List<QueryResult.Series> seriesList = result.getSeries();
                if (CollectionUtils.isNotEmpty(seriesList)) {
                    for (QueryResult.Series series : seriesList) {
                        for (List<Object> values : series.getValues()) {
                            SeaTunnelRow row =
                                    InfluxDBRowConverter.convert(
                                            values, seaTunnelRowType, columnsIndexList);
                            output.collect(row);
                        }
                    }
                } else {
                    log.info("this chunk reader influxDB series is empty");
                }
            }
        },
        () -> {
            log.info("this chunk reader influxDB complete");
            queryCompleteFuture.complete(null);
        },
        throwable -> {
            log.error(
                    "this chunk reader influxDB result has error [{}]",
                    throwable.getMessage());
            queryCompleteFuture.completeExceptionally(
                    new InfluxdbConnectorException(throwable));
        });

queryCompleteFuture.get();
```
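As a self-contained illustration of that suggestion (toy stand-in classes, not SeaTunnel's or influxdb-java's actual API), the pattern of wrapping a callback-style async query so the caller blocks until completion looks like this:

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Sketch of the CountDownLatch-to-CompletableFuture pattern the review
// suggests: the async query delivers chunks via callbacks, and the wrapper
// returns only after the completion callback fires.
public class ChunkedQueryDemo {

    // Stand-in for influxdb.query(query, chunkSize, onChunk, onComplete, onError).
    static void asyncQuery(Consumer<String> onChunk, Runnable onComplete) {
        new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                onChunk.accept("chunk-" + i); // one callback per result chunk
            }
            onComplete.run(); // signals that all chunks were delivered
        }).start();
    }

    // Synchronous wrapper: returns only after the async callbacks finish.
    static int readAllChunksSync() throws Exception {
        CompletableFuture<Void> done = new CompletableFuture<>();
        int[] count = {0};
        asyncQuery(
                chunk -> count[0]++,        // analogue of output.collect(row)
                () -> done.complete(null)); // CompletableFuture<Void> completes with null
        done.get(); // blocks the caller, like the suggested queryCompleteFuture.get()
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readAllChunksSync()); // prints 3
    }
}
```

Note that `CompletableFuture<Void>.complete` requires an argument (`null`), which the original suggestion omitted.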

Author

Yes, I already tried switching to CompletableFuture yesterday, but I haven't got it working yet due to limited time. I know CountDownLatch is not a good choice.

Author

But merely changing this method to a synchronous one is not enough, as it occupies the lock in the pollNext method.

Member

> But merely changing this method to a synchronous one is not enough, as it occupies the lock in the pollNext method.

It is normal to occupy a lock; we must ensure that the lock (`output.getCheckpointLock()`) is not released until a split read is completed.

Labels: feature (New feature), First-time contributor, influxdb
Projects: None yet
Development: Successfully merging this pull request may close these issues: None yet
5 participants