A client to prepare loading and issue the load SQL query to Vector
connection information
the table into which this client will load data
This client opens a JDBC connection when instantiated. To prevent connection leaks, the close method must be called once the client is no longer needed.
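Because the connection is opened eagerly, callers would typically guard the client with try/finally or a small loan helper. A minimal sketch, where withClient is an illustrative helper and not part of the connector's API:

{{{
// Hypothetical loan helper: guarantees close() runs even if the load query fails.
// Assumes the client implements AutoCloseable (or at least exposes close()).
def withClient[C <: AutoCloseable, T](client: C)(op: C => T): T =
  try op(client) finally client.close()
}}}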
A partition corresponding to one DataStream
a list of indices of the parent partitions that will be assigned to this DataStream and will
subsequently be serialized and sent through a single connection
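For illustration, such a partition and the NarrowDependency it enables might look like the sketch below; DataStreamPartition and DataStreamDependency are assumed names, not necessarily the connector's actual classes:

{{{
import org.apache.spark.{ NarrowDependency, Partition }
import org.apache.spark.rdd.RDD

// A partition that remembers which parent (input RDD) partitions feed its DataStream.
case class DataStreamPartition(index: Int, parentIndices: Seq[Int]) extends Partition

// A NarrowDependency mapping each DataStream partition back to its parents, so that
// Spark can schedule the write close to the parent partitions' data.
class DataStreamDependency[T](rdd: RDD[T], parts: Array[DataStreamPartition])
    extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): Seq[Int] = parts(partitionId).parentIndices
}
}}}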
Class that contains the matching algorithm used to assign RDD partitions to Vector hosts, based on affinities.
The algorithm tries to assign partitions to hosts for which they have affinity. For this reason, only partitions that have affinity to at least one host are matched here; the others are assigned to a random node. The algorithm also aims to minimize the maximum number of partitions assigned to any single host, i.e. the largest amount of data a single host will have to process.
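As an illustration only, a greedy version of this idea could look like the following sketch; the simplified affinity representation (partition index to candidate hosts) and all names are assumptions, not the connector's actual code:

{{{
import scala.collection.mutable
import scala.util.Random

def assign(affinities: Map[Int, Seq[String]], hosts: Seq[String]): Map[Int, String] = {
  val load = mutable.Map(hosts.map(_ -> 0): _*) // partitions assigned per host
  affinities.map { case (partition, candidates) =>
    val host =
      if (candidates.nonEmpty) candidates.minBy(load) // least-loaded affine host
      else hosts(Random.nextInt(hosts.size))          // no affinity: random host
    load(host) += 1
    partition -> host
  }
}
}}}

Picking the least-loaded candidate at every step keeps the maximum per-host assignment low, although the connector's actual algorithm may be more sophisticated.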
Information to connect to a VectorEndpoint (DataStream)
Configuration for read/write end points - one entry for each Vector end point expecting data
Contains helpers to obtain VectorEndpoint information from Vector's SQL interface.
The way this information is obtained, by issuing a select against a system table, will very likely be changed in the future.
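For illustration only, the lookup might resemble the sketch below; the system table and column names are placeholders, not Vector's actual catalog:

{{{
import java.sql.DriverManager
import scala.collection.mutable.ArrayBuffer

def readEndpoints(jdbcUrl: String): Seq[(String, Int)] = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    // Placeholder table/column names: the real query differs and may change.
    val rs = conn.createStatement().executeQuery(
      "select host, port from some_datastreams_system_table")
    val endpoints = ArrayBuffer.empty[(String, Int)]
    while (rs.next()) endpoints += ((rs.getString("host"), rs.getInt("port")))
    endpoints.toSeq
  } finally conn.close()
}
}}}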
Helper to determine how much padding (# of trash bytes) needs to be written to properly align a type with size typeSize, given that we are currently at pos
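Assuming values are aligned to their own size, the computation is the usual one; a minimal sketch:

{{{
// Trash bytes needed so the next typeSize-sized value starts at an offset that is
// a multiple of typeSize (assumes alignment == typeSize).
def padding(pos: Int, typeSize: Int): Int =
  (typeSize - pos % typeSize) % typeSize

// Example: padding(pos = 5, typeSize = 4) == 3, so the value starts at offset 8.
}}}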
Spark -> Vector loading

Loading from Spark to Vector is initiated through the Spark SQL/DataFrames APIs. The sequence of operations is:
- Given an input RDD with its corresponding data type information, its fields are either matched to the existing table's columns or used to help generate a create table SQL statement that is first submitted to Vector
- End point (DataStream) information is obtained, so that each Spark worker is able to process, serialize and write binary data to Vector end points
- An InsertRDD is created containing as many partitions as there are DataStreams; this creates a NarrowDependency to the input RDD
- The load SQL query is issued to the Vector leader node
- RowWriter and DataStreamWriter objects, part of the closure, are serialized and sent to worker processes
- Each worker processes its assigned input partitions (as determined when the InsertRDD was created), serializes them into ByteBuffers and then flushes them through the socket towards one (and only one) predetermined Vector end point
- The load finishes with a commit or abort depending on whether any of the workers failed. Note that we currently do not retry Spark workers, since partial loading is not supported in Vector yet

Vector -> Spark unloading

Unloading from Vector to Spark is initiated through the Spark SQL/DataFrames APIs. The sequence of operations is:
- Given a Spark SQL select query, a com.actian.spark_vector.datastream.reader.ScanRDD is created containing as many partitions as there are DataStreams
- End point (DataStream) information is obtained, so that each Spark worker is able to read binary data from Vector end points, deserialize and process it
- The unload SQL query is issued to the Vector leader node
- RowReader and DataStreamReader objects, part of the closure, are serialized and sent to worker processes
- Each worker reads data from its assigned end point (as determined when the ScanRDD was created), deserializes it into ByteBuffers and then, through an Iterator[Row], collects the data row by row
- The unload finishes with a commit or abort depending on whether any of the workers failed. Note that we currently do not retry Spark workers, since partial loading is not supported in Vector yet
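At the API level, both directions are driven through a registered Spark SQL data source. A usage sketch, assuming a SQLContext in scope and options along these lines (check the connector's documentation for the exact option names):

{{{
sqlContext.sql("""CREATE TEMPORARY TABLE vector_table
  USING com.actian.spark_vector.sql.DefaultSource
  OPTIONS (host "vectorhost", instance "VI", database "dbname", table "my_table")""")

// Loading: rows from a Spark table/DataFrame are written to the Vector end points.
sqlContext.sql("INSERT INTO vector_table SELECT * FROM spark_input")

// Unloading: a select on the temporary table creates a ScanRDD under the hood.
val df = sqlContext.sql("SELECT col1, col2 FROM vector_table")
}}}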