A client to prepare loading and issue the load SQL query to Vector.
Parameters: the Vector connection information, and the table to which this client will load data.
This client opens a JDBC connection when instantiated. To prevent leaks, the close method must be called once the client is no longer needed.
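Because the connection's lifetime is tied to the object, callers typically wrap it in a loan pattern so that close() is always reached. A minimal sketch, with hypothetical names (LoadClient, startLoad, withLoadClient) that only illustrate the pattern, not the connector's exact API:

```scala
import java.sql.{ Connection, DriverManager }

// Hypothetical sketch: the JDBC connection is opened eagerly, at construction
class LoadClient(jdbcUrl: String, table: String) extends AutoCloseable {
  private val connection: Connection = DriverManager.getConnection(jdbcUrl)

  // Issue the load SQL query for `table` (the statement text is a placeholder)
  def startLoad(loadSql: String): Unit = connection.createStatement().execute(loadSql)

  // Must be called to avoid leaking the JDBC connection
  override def close(): Unit = connection.close()
}

// Loan pattern: guarantees close() even when the load fails
def withLoadClient[T](url: String, table: String)(f: LoadClient => T): T = {
  val client = new LoadClient(url, table)
  try f(client) finally client.close()
}
```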
A partition corresponding to one DataStream.
It holds the list of indices of the parent partitions that are assigned to this DataStream and will subsequently be serialized and sent through a single connection.
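A minimal sketch of such a partition, assuming Spark's Partition trait; the field names are illustrative and the connector's actual class may differ:

```scala
import org.apache.spark.Partition

// One partition per DataStream; parentIndices lists the parent RDD partitions
// whose data will be funneled through this stream's single connection
case class DataStreamPartition(index: Int, parentIndices: Seq[Int]) extends Partition
```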
Class that contains the matching algorithm used to assign RDD partitions to Vector hosts, based on affinities.
The algorithm tries to assign each partition to a host for which it has affinity; for this reason, only partitions with affinity to at least one host are matched here, while the others are assigned to a random host. The algorithm also aims to minimize the maximum number of partitions assigned to any single host, i.e. the most data that one host will have to process.
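A sketch of this greedy strategy, assuming affinities are available as a function from partition index to hosts (the signature and names are assumptions, not the connector's API):

```scala
import scala.collection.mutable
import scala.util.Random

def assign(numPartitions: Int, hosts: Seq[String],
           affinities: Int => Seq[String]): Map[Int, String] = {
  val load = mutable.Map(hosts.map(_ -> 0): _*) // partitions assigned per host
  (0 until numPartitions).map { p =>
    val candidates = affinities(p).filter(load.contains)
    val host =
      if (candidates.nonEmpty) candidates.minBy(load) // least-loaded affine host
      else hosts(Random.nextInt(hosts.size))          // no affinity: random host
    load(host) += 1
    p -> host
  }.toMap
}
```

Picking the least-loaded affine host at each step is what keeps the maximum per-host partition count low.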
Information needed to connect to a VectorEndpoint (DataStream).
Configuration for read/write end points: one entry for each Vector end point expecting data.
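An illustrative shape for such an entry; the actual class may carry additional fields (e.g. credentials):

```scala
// One entry per Vector end point expecting data
case class VectorEndpoint(host: String, port: Int)
```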
Contains helpers to obtain VectorEndpoint information from Vector's SQL interface.
This information is currently obtained by issuing a select against a system table, a mechanism that will very likely change in the future.
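A sketch of what such a helper could look like, reusing the VectorEndpoint shape sketched above; the system table and column names are placeholders, precisely because this mechanism is expected to change:

```scala
import java.sql.Connection
import scala.collection.mutable.ListBuffer

def readEndpoints(jdbc: Connection): Seq[VectorEndpoint] = {
  val rs = jdbc.createStatement().executeQuery(
    "select host, port from <datastreams_system_table>") // placeholder table/columns
  val endpoints = ListBuffer.empty[VectorEndpoint]
  while (rs.next()) endpoints += VectorEndpoint(rs.getString("host"), rs.getInt("port"))
  endpoints.toList
}
```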
Helper to determine how much padding (the number of trash bytes) needs to be written to properly align a type of size typeSize, given that we are currently at position pos.
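A minimal sketch of this computation, assuming the alignment requirement equals typeSize: the padding is whatever advances pos to the next multiple of typeSize.

```scala
// e.g. padding(pos = 5, typeSize = 4) == 3, so the next value starts at offset 8
def padding(pos: Int, typeSize: Int): Int =
  if (pos % typeSize != 0) typeSize - (pos % typeSize) else 0
```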
Spark -> Vector loading part
Loading from Spark to Vector will be initiated through the SparkSQL/DataFrames APIs. The sequence of operations is as follows; a usage sketch appears after the list:

1. Given an input RDD with its corresponding data type information, its fields will either be matched to an existing table's columns or they will help generate a create table SQL statement that is first submitted to Vector.
2. DataStream connection information is obtained, for each Spark worker to be able to process, serialize and write binary data to Vector end points.
3. An InsertRDD is created containing as many partitions as there are DataStreams, and it creates a NarrowDependency to the input RDD.
4. The load SQL query is issued to the Vector leader node.
5. RowWriter and DataStreamWriter objects, part of the closure, are serialized and sent to worker processes.
6. Each worker processes the data of the partitions assigned to it (as decided when the InsertRDD was created), serializes it into ByteBuffers and then flushes them through the socket towards one (and only one) predetermined Vector end point.
7. The load ends with a commit or abort, depending on whether any of the workers failed. Note: we currently do not retry Spark workers, since partial loading is not supported in Vector yet.
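A hedged sketch of how such a load might be driven from SparkSQL, assuming a SQLContext named sqlContext is in scope; the data source class, option names and table names below are illustrative assumptions, not the verified API:

```scala
// Register a Vector table as a SparkSQL relation (option names are illustrative)
sqlContext.sql("""CREATE TEMPORARY TABLE vector_table
  USING com.actian.spark_vector.sql.DefaultSource
  OPTIONS (host "vector-host", instance "VI", database "db", table "my_table")""")

// Triggers the sequence above: InsertRDD creation, writer closures, commit/abort
sqlContext.sql("INSERT INTO TABLE vector_table SELECT * FROM spark_input")
```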
Unloading from Vector to Spark will be initiated through the SparkSQL/DataFrames APIs. The sequence of operations is as follows; a usage sketch appears after the list:

1. Given a SparkSQL select query, a com.actian.spark_vector.datastream.reader.ScanRDD is created containing as many partitions as there are DataStreams.
2. DataStream connection information is obtained, for each Spark worker to be able to read binary data from Vector end points, deserialize and process it.
3. The select query is issued to the Vector leader node.
4. RowReader and DataStreamReader objects, part of the closure, are serialized and sent to worker processes.
5. Each worker reads the data from its assigned DataStream (as decided when the ScanRDD was created) into ByteBuffers, deserializes it, and then through an Iterator[Row] we can collect the data row by row.
6. The unload ends with a commit or abort, depending on whether any of the workers failed. Note: we currently do not retry Spark workers, since partial loading is not supported in Vector yet.
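Continuing the hedged sketch from the loading section (same assumed registration of vector_table; column names are illustrative), a select against the registered table drives the ScanRDD path:

```scala
// Pulls rows from the Vector end points through the Iterator[Row] described above
val df = sqlContext.sql("SELECT * FROM vector_table WHERE id > 10")
df.show()
```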