public abstract class AbstractFileBasedConnector extends Object implements Connector
| Modifier and Type | Class and Description |
|---|---|
protected static interface |
AbstractFileBasedConnector.RecordReader
A reader for
Records. |
protected static interface |
AbstractFileBasedConnector.RecordWriter
A writer for
Records. |
| Constructor and Description |
|---|
AbstractFileBasedConnector() |
| Modifier and Type | Method and Description |
|---|---|
protected Flux<Record> |
applyPerFileLimits(Flux<Record> records)
Applies per-file limits to a stream of records coming from
readSingleFile(java.net.URL). |
void |
close()
Closes the connector.
|
void |
configure(Config settings,
boolean read,
boolean retainRecordSources)
Configures the connector.
|
protected abstract String |
getConnectorName()
Returns the connector name, e.g.
|
protected URL |
getOrCreateDestinationURL()
Returns the URL that the connector should write to.
|
void |
init()
Initializes the connector.
|
protected boolean |
isDataSizeSamplingAvailable()
Checks whether it is safe to perform data size sampling on this connector's data source.
|
protected List<URL> |
loadURLs(Config settings)
Validates and computes the list of URLs to be read or written.
|
protected abstract AbstractFileBasedConnector.RecordReader |
newSingleFileReader(URL url)
Returns a new
AbstractFileBasedConnector.RecordReader instance; cannot be null. |
protected abstract AbstractFileBasedConnector.RecordWriter |
newSingleFileWriter()
Returns a new
AbstractFileBasedConnector.RecordWriter instance; cannot be null. |
protected void |
processURLsForRead()
Inspects the list or URLs as loaded by
loadURLs(Config) and determines exactly what
files and folders need to be read. |
protected void |
processURLsForWrite()
Inspects the list or URLs as loaded by
loadURLs(Config) and determines if the
connector should write to a single file, or to a directory of files. |
Publisher<Publisher<Record>> |
read()
Reads all records from the datasource in a flow of flows that can be consumed in parallel.
|
int |
readConcurrency()
Returns the desired read concurrency, that is, how many resources are expected to be read in
parallel.
|
protected Flux<Record> |
readSingleFile(URL url)
Reads a single text file accessible through the given URL.
|
protected Flux<URL> |
scanRootDirectory(Path root)
Scans a directory for readable files and returns the files found as a stream.
|
Function<Publisher<Record>,Publisher<Record>> |
write()
Returns a function that handles writing records to the datasource.
|
int |
writeConcurrency()
Returns the desired write concurrency, that is, how many resources are expected to be written
in parallel.
|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, waitgetRecordMetadata, supportsprotected static final String URL
protected static final String URLFILE
protected static final String COMPRESSION
protected static final String ENCODING
protected static final String FILE_NAME_PATTERN
protected static final String SKIP_RECORDS
protected static final String MAX_RECORDS
protected static final String MAX_CONCURRENT_FILES
protected static final String RECURSIVE
protected static final String FILE_NAME_FORMAT
protected boolean read
protected boolean retainRecordSources
protected Charset encoding
protected String compression
protected String fileNameFormat
protected boolean recursive
protected String pattern
protected long skipRecords
protected long maxRecords
protected int resourceCount
protected int maxConcurrentFiles
protected Deque<AbstractFileBasedConnector.RecordWriter> writers
protected AbstractFileBasedConnector.RecordWriter singleWriter
protected AtomicInteger fileCounter
protected AtomicInteger nextWriterIndex
public int readConcurrency()
ConnectorThe workflow runner is guaranteed to never consume more inner flows in parallel than readConcurrency.
Depending on the read concurrency, and whether it is greater than the number of cores or not, the runner will try to assign one thread to each resource, thus eliminating any context switch, from the resource to read until the final statements to execute.
This method should only be called after the connector is properly configured and initialized.
readConcurrency in interface Connectorpublic int writeConcurrency()
ConnectorThe workflow runner is guaranteed to never invoke the Connector.write() function by more than
writeConcurrency threads concurrently.
Depending on the write concurrency, and whether it is greater than the number of cores or not, the runner will decide to assign one thread to each resource, thus eliminating any context switch, from the decoded rows until the final resource being written.
This method should only be called after the connector is properly configured and initialized.
writeConcurrency in interface Connectorpublic void configure(@NonNull
Config settings,
boolean read,
boolean retainRecordSources)
Connectorconfigure in interface Connectorsettings - the connector settings.read - whether the connector should be configured for reading or writing.retainRecordSources - whether the connector should retain sources when emitting records; only applicable when the connector is being configured for
reads.public void init()
throws URISyntaxException,
IOException
ConnectorThis method should only be called after the connector is properly configured.
init in interface ConnectorURISyntaxExceptionIOException@NonNull public Publisher<Publisher<Record>> read()
ConnectorThe inner flows are guaranteed to be consumed with a parallelism no greater than the read concurrency.
If the underlying datasource cannot split its records into multiple flows in any efficient manner, then this method should return a publisher of one single inner publisher.
This method should only be called after the connector is properly configured and initialized.
@NonNull public Function<Publisher<Record>,Publisher<Record>> write()
ConnectorThis method should only be called after the connector is properly configured and initialized.
public void close()
Connectorclose in interface Connectorclose in interface AutoCloseable@NonNull protected abstract String getConnectorName()
connector.foo.@NonNull protected Flux<Record> readSingleFile(@NonNull URL url)
Implementors should not care about maxRecords and skipRecords, these will be
applied later on, see applyPerFileLimits(Flux).
url - The URL to read; must not be null; must be accessible and readable (but not
necessarily hosted on the local filesystem).Records; never null but may be empty.@NonNull protected abstract AbstractFileBasedConnector.RecordReader newSingleFileReader(@NonNull URL url) throws IOException
AbstractFileBasedConnector.RecordReader instance; cannot be null. Only used when reading. Each
invocation of this method is expected to return a newly-allocated instance. The reader is
expected to be initialized already, and ready to emit its first record. It is possible to throw
IOException if the reader cannot be initialized.IOException@NonNull protected abstract AbstractFileBasedConnector.RecordWriter newSingleFileWriter()
AbstractFileBasedConnector.RecordWriter instance; cannot be null. Only used when writing. Each
invocation of this method is expected to return a newly-allocated instance.@NonNull protected List<URL> loadURLs(@NonNull Config settings)
URLFILE and URL configuration options.
Expected to be called during the configuration phase.
protected void processURLsForRead()
throws URISyntaxException,
IOException
loadURLs(Config) and determines exactly what
files and folders need to be read.
Should be called at the beginning of the initialization process, but only when reading, never when writing.
This method expects that loadURLs(Config) has been previously called.
URISyntaxExceptionIOExceptionprotected void processURLsForWrite()
throws URISyntaxException,
IOException
loadURLs(Config) and determines if the
connector should write to a single file, or to a directory of files.
Should be called at the beginning of the initialization process, but only when writing, never when reading.
This method expects that loadURLs(Config) has been previously called, and also
expects exactly one URL to be present, which can be either a directory or a file.
URISyntaxExceptionIOException@NonNull protected Flux<URL> scanRootDirectory(@NonNull Path root)
@NonNull protected Flux<Record> applyPerFileLimits(@NonNull Flux<Record> records)
readSingleFile(java.net.URL).@NonNull protected URL getOrCreateDestinationURL()
This can be either a single file or a directory of files. If the former, each invocation of this method will return the same URL; if the latter, each invocation of this method will generate a new URL inside the directory, with a unique file name.
protected boolean isDataSizeSamplingAvailable()
This implementation simply checks that none of the urls to read is reading from standard input, since standard input is not rewindable. Any other URL protocol is considered safe.
Copyright © 2017–2021 DataStax. All rights reserved.