A client to prepare loading and issue the load SQL query to Vector
connection information
the table into which this client will load data
This client opens a JDBC connection when instantiated. To prevent connection leaks, the close method must be called once the client is no longer needed.
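Because the connection is opened eagerly, callers would typically guard the client with try/finally or a small loan helper. A minimal sketch, where withClient is an illustrative helper and not part of the connector's API:

{{{
// Hypothetical loan helper: guarantees close() runs even if the load query fails.
// Assumes the client implements AutoCloseable (or at least exposes close()).
def withClient[C <: AutoCloseable, T](client: C)(op: C => T): T =
  try op(client) finally client.close()
}}}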
A partition corresponding to one DataStream
a list of indices of the parent partitions that will be assigned to this DataStream and will
subsequently be serialized and sent through a single connection
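For illustration, such a partition and the NarrowDependency it enables might look like the sketch below; DataStreamPartition and DataStreamDependency are assumed names, not necessarily the connector's actual classes:

{{{
import org.apache.spark.{ NarrowDependency, Partition }
import org.apache.spark.rdd.RDD

// A partition that remembers which parent (input RDD) partitions feed its DataStream.
case class DataStreamPartition(index: Int, parentIndices: Seq[Int]) extends Partition

// A NarrowDependency mapping each DataStream partition back to its parents, so that
// Spark can schedule the write close to the parent partitions' data.
class DataStreamDependency[T](rdd: RDD[T], parts: Array[DataStreamPartition])
    extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): Seq[Int] = parts(partitionId).parentIndices
}
}}}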
Class that contains the matching algorithm used to assign RDD partitions to Vector hosts, based on affinities.
The algorithm tries to assign partitions to hosts for which they have affinity. For this reason, only partitions that have affinity to at least one host are matched here; the others are assigned to a random node. The algorithm also aims to minimize the maximum number of partitions assigned to any single host, i.e. the largest amount of data a single host will have to process.
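As an illustration only, a greedy version of this idea could look like the following sketch; the simplified affinity representation (partition index to candidate hosts) and all names are assumptions, not the connector's actual code:

{{{
import scala.collection.mutable
import scala.util.Random

def assign(affinities: Map[Int, Seq[String]], hosts: Seq[String]): Map[Int, String] = {
  val load = mutable.Map(hosts.map(_ -> 0): _*) // partitions assigned per host
  affinities.map { case (partition, candidates) =>
    val host =
      if (candidates.nonEmpty) candidates.minBy(load) // least-loaded affine host
      else hosts(Random.nextInt(hosts.size))          // no affinity: random host
    load(host) += 1
    partition -> host
  }
}
}}}

Picking the least-loaded candidate at every step keeps the maximum per-host assignment low, although the connector's actual algorithm may be more sophisticated.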
Information to connect to a VectorEndpoint (DataStream)
Configuration for read/write end points - one entry for each Vector end point expecting data
Contains helpers to obtain VectorEndpoint information from Vector's SQL interface.
The way this information is obtained, by issuing a select against a system table, will very likely be changed in the future.
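For illustration only, the lookup might resemble the sketch below; the system table and column names are placeholders, not Vector's actual catalog:

{{{
import java.sql.DriverManager
import scala.collection.mutable.ArrayBuffer

def readEndpoints(jdbcUrl: String): Seq[(String, Int)] = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    // Placeholder table/column names: the real query differs and may change.
    val rs = conn.createStatement().executeQuery(
      "select host, port from some_datastreams_system_table")
    val endpoints = ArrayBuffer.empty[(String, Int)]
    while (rs.next()) endpoints += ((rs.getString("host"), rs.getInt("port")))
    endpoints.toSeq
  } finally conn.close()
}
}}}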
Helper to determine how much padding (# of trash bytes) needs to be written to properly align a type with size typeSize, given that we are currently at pos
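Assuming values are aligned to their own size, the computation is the usual one; a minimal sketch:

{{{
// Trash bytes needed so the next typeSize-sized value starts at an offset that is
// a multiple of typeSize (assumes alignment == typeSize).
def padding(pos: Int, typeSize: Int): Int =
  (typeSize - pos % typeSize) % typeSize

// Example: padding(pos = 5, typeSize = 4) == 3, so the value starts at offset 8.
}}}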
Spark -> Vector loading

Loading from Spark to Vector is initiated through the Spark SQL/DataFrames APIs. The sequence of operations is:
- Given an input RDD with its corresponding data type information, its fields are either matched to the existing table's columns or used to help generate a create table SQL statement that is first submitted to Vector
- End point (DataStream) information is obtained, so that each Spark worker is able to process, serialize and write binary data to Vector end points
- An InsertRDD is created containing as many partitions as there are DataStreams; this creates a NarrowDependency to the input RDD
- The load SQL query is issued to the Vector leader node
- RowWriter and DataStreamWriter objects, part of the closure, are serialized and sent to worker processes
- Each worker processes its assigned input partitions (as determined when the InsertRDD was created), serializes them into ByteBuffers and then flushes them through the socket towards one (and only one) predetermined Vector end point
- The load finishes with a commit or abort depending on whether any of the workers failed. Note that we currently do not retry Spark workers, since partial loading is not supported in Vector yet

Vector -> Spark unloading

Unloading from Vector to Spark is initiated through the Spark SQL/DataFrames APIs. The sequence of operations is:
- Given a Spark SQL select query, a com.actian.spark_vector.datastream.reader.ScanRDD is created containing as many partitions as there are DataStreams
- End point (DataStream) information is obtained, so that each Spark worker is able to read binary data from Vector end points, deserialize and process it
- The unload SQL query is issued to the Vector leader node
- RowReader and DataStreamReader objects, part of the closure, are serialized and sent to worker processes
- Each worker reads data from its assigned end point (as determined when the ScanRDD was created), deserializes it into ByteBuffers and then, through an Iterator[Row], collects the data row by row
- The unload finishes with a commit or abort depending on whether any of the workers failed. Note that we currently do not retry Spark workers, since partial loading is not supported in Vector yet
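At the API level, both directions are driven through a registered Spark SQL data source. A usage sketch, assuming a SQLContext in scope and options along these lines (check the connector's documentation for the exact option names):

{{{
sqlContext.sql("""CREATE TEMPORARY TABLE vector_table
  USING com.actian.spark_vector.sql.DefaultSource
  OPTIONS (host "vectorhost", instance "VI", database "dbname", table "my_table")""")

// Loading: rows from a Spark table/DataFrame are written to the Vector end points.
sqlContext.sql("INSERT INTO vector_table SELECT * FROM spark_input")

// Unloading: a select on the temporary table creates a ScanRDD under the hood.
val df = sqlContext.sql("SELECT col1, col2 FROM vector_table")
}}}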