public interface Connector extends AutoCloseable
Lifecycle
The lifecycle of a connector is as follows:
Connectors are allowed to be stateful. DSBulk will always preserve the order of method invocations outlined above. It will never reuse a connector for more than one operation, and will never perform both a read and a write operation on the same connector instance.When reading, connectors act as publishers and emit data read from their datasources as a flow
of Records; when writing, connectors act as a transforming function: they receive Records from an upstream publisher, write them to the external datasource, then re-emit them to
downstream subscribers.
Read operations
A read() operation returns a Publisher of Publisher of Records.
All publishers returned by read methods are guaranteed to be subscribed only once;
implementors are allowed to optimize for single-subscriber use cases. Note however that the
workflow runner may invoke a read method twice during an operation, in case a data size sampling
is done before the operation is started. In this case, the connector is expected to "rewind" so
that each invocation of the read method behaves as if the data source was being read from the
beginning. If this is not supported, consider implementing supports(ConnectorFeature)
and disallowing CommonConnectorFeature.DATA_SIZE_SAMPLING.
Reading in parallel: connectors that are able to distinguish natural boundaries when reading
(e.g. when reading from more than one file, or reading from more than one database table) should
implement read() by emitting their records grouped by such resources; this allows DSBulk
to optimize read operations, and is specially valuable if records in the original dataset are
grouped by partition key inside each resource, because this natural grouping can thus be
preserved.
DSBulk will optimize reads according to the read concurrency. Implementors should implement
readConcurrency() carefully.
Write operations
The write() method is invoked only once per operation. The write function itself,
however, is expected to be invoked each time the workflow runner wishes the connector to write a
batch of records. Therefore implementors should expect this function to be invoked many times
(potentially thousands of times) during an operation, and invocation may also happen in parallel
from different threads. Implementors should therefore make sure that their implementation is
thread-safe.
Implementors should transform streams such that each record is written to the destination, then re-emitted to downstream subscribers, e.g. (pseudo-code using the Reactor Framework):
public Function<Publisher<Record>,Publisher<Record>> write() {
return upstream -> {
return Flux.from(upstream)
.doOnNext(writer::write)
.doOnTerminate(writer::flush);
}
The transformed publisher is guaranteed to be subscribed only once; implementors are allowed to optimize for single-subscriber use cases. Implementors are also allowed to memoize the transforming functions.
DSBulk will optimize writes according to the write concurrency. Implementors should implement
writeConcurrency() carefully.
Error handling
Unrecoverable errors (i.e., errors that put the connector in an unstable state, or in a state
where no subsequent records can be read or written successfully) should be emitted as onError signals to downstream subscribers. DSBulk will abort the operation immediately.
Recoverable errors however (i.e., errors that do not comprise the connector's ability to emit
further records successfully) should be converted to ErrorRecords and emitted as such.
This way, DSBulk is able to detect such errors and treat them accordingly (for example, by
redirecting them to a "bad file" or by updating a counter of errors).
| Modifier and Type | Method and Description |
|---|---|
default void |
close()
Closes the connector.
|
default void |
configure(Config settings,
boolean read,
boolean retainRecordSources)
Configures the connector.
|
RecordMetadata |
getRecordMetadata()
Returns metadata about the records that this connector can read or write.
|
default void |
init()
Initializes the connector.
|
Publisher<Publisher<Record>> |
read()
Reads all records from the datasource in a flow of flows that can be consumed in parallel.
|
int |
readConcurrency()
Returns the desired read concurrency, that is, how many resources are expected to be read in
parallel.
|
default boolean |
supports(ConnectorFeature feature)
Whether or not the connector supports the given feature.
|
Function<Publisher<Record>,Publisher<Record>> |
write()
Returns a function that handles writing records to the datasource.
|
int |
writeConcurrency()
Returns the desired write concurrency, that is, how many resources are expected to be written
in parallel.
|
@NonNull Publisher<Publisher<Record>> read()
The inner flows are guaranteed to be consumed with a parallelism no greater than the read concurrency.
If the underlying datasource cannot split its records into multiple flows in any efficient manner, then this method should return a publisher of one single inner publisher.
This method should only be called after the connector is properly configured and initialized.
Publisher of records read from the datasource, grouped by resources.@NonNull Function<Publisher<Record>,Publisher<Record>> write()
This method should only be called after the connector is properly configured and initialized.
Function that writes records from the upstream flow to the
datasource, then emits the records written to downstream subscribers.default void configure(@NonNull
Config settings,
boolean read,
boolean retainRecordSources)
throws IllegalArgumentException
settings - the connector settings.read - whether the connector should be configured for reading or writing.retainRecordSources - whether the connector should retain sources when emitting records; only applicable when the connector is being configured for
reads.IllegalArgumentException - if the connector fails to configure properly.default void init()
throws Exception
This method should only be called after the connector is properly configured.
Exception - if the connector fails to initialize properly.default void close()
throws Exception
close in interface AutoCloseableException - if the connector fails to close properly.default boolean supports(@NonNull
ConnectorFeature feature)
feature - the feature to check.true if this connector supports the feature, false otherwise.@NonNull RecordMetadata getRecordMetadata()
This method should only be called after the connector is properly configured and initialized.
int readConcurrency()
The workflow runner is guaranteed to never consume more inner flows in parallel than readConcurrency.
Depending on the read concurrency, and whether it is greater than the number of cores or not, the runner will try to assign one thread to each resource, thus eliminating any context switch, from the resource to read until the final statements to execute.
This method should only be called after the connector is properly configured and initialized.
int writeConcurrency()
The workflow runner is guaranteed to never invoke the write() function by more than
writeConcurrency threads concurrently.
Depending on the write concurrency, and whether it is greater than the number of cores or not, the runner will decide to assign one thread to each resource, thus eliminating any context switch, from the decoded rows until the final resource being written.
This method should only be called after the connector is properly configured and initialized.
Copyright © 2017–2021 DataStax. All rights reserved.