Package com.actian.spark_vector

package spark_vector

Spark-Vector connector.

With this connector, data can be loaded from Spark sources into Vector, and the results of Vector computations can be consumed in Spark and transformed into a DataFrame. Both operations are performed in parallel: the data coming from every input RDD partition is serialized using Vector's binary protocol and passed through socket connections to Vector end points. Although network communication may be incurred at this point, the connector tries, whenever possible, to assign only local RDD partitions to each Vector end point. In the other direction, Vector data is unloaded through socket connections and consumed in Spark as an RDD.

Throughout the documentation we use DataStream and Vector end point interchangeably. A Vector DataStream is the logical stream through which Vector consumes binary data. Typically, DataStreams are processed in parallel (i.e. there are as many threads as DataStreams allocated), but in some cases a single Vector thread will handle multiple DataStreams. On the other hand, each connection to a Vector end point maps to exactly one DataStream.

Linear Supertypes
AnyRef, Any

Type Members

  1. implicit class ComposePartial[A, B] extends AnyRef

    Given two PartialFunctions f and g with no side effects, composes them into another partial function h such that h.isDefinedAt(x) if f.isDefinedAt(x) and g.isDefinedAt(f(x))

    returns

    a PartialFunction composed from f and g
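
    For illustration, a minimal sketch of such a composition. The method name andThenPartial is a hypothetical stand-in chosen for this example; the actual member names of ComposePartial may differ.

      object ComposePartialSketch {
        // Hypothetical stand-in for ComposePartial: composes f with g so that the
        // result is defined at x only when f is defined at x and g is defined at f(x).
        implicit class ComposePartial[A, B](f: PartialFunction[A, B]) {
          def andThenPartial[C](g: PartialFunction[B, C]): PartialFunction[A, C] =
            Function.unlift((a: A) => f.lift(a).flatMap(g.lift))
        }

        def main(args: Array[String]): Unit = {
          val f: PartialFunction[Int, Int] = { case x if x > 0 => x * 2 }
          val g: PartialFunction[Int, String] = { case x if x < 10 => s"small: $x" }
          val h = f.andThenPartial(g)
          println(h.isDefinedAt(3))  // true: f(3) = 6 and g is defined at 6
          println(h.isDefinedAt(-1)) // false: f is not defined at -1
          println(h.isDefinedAt(7))  // false: f(7) = 14 and g is not defined at 14
        }
      }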

  2. case class ProfAcc(name: String, acc: Long = 0) extends Product with Serializable

    An accumulator for profiling. Contains a label (name) and the accumulated number of milliseconds (acc)

  3. case class ProfAccMap(accs: Map[String, ProfAcc], started: Stack[ProfAcc] = Stack.empty[ProfAcc]) extends Product with Serializable

    Contains all the ProfAccs defined and, as a stack, those of them that are currently accumulating/measuring time

  4. trait Profiling extends AnyRef

    Trait to be used when profiling is needed.

    To profile a section of code, the following steps should be followed:

    • Call profileInit(<label_for_first_section>, <label_for_second_section>, ...) and store the result in an implicit value
    • Use profile(<section_name>) and profileEnd in a bracket opening/closing fashion; the code between a profile call and its corresponding profileEnd will have its execution time measured and added to its accumulator
    • Call profilePrint at the end to log the profiling information gathered
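
    A hedged usage sketch following the steps above, assuming that profileInit returns the ProfAccMap shown earlier; the exact signatures of profileInit, profile, profileEnd and profilePrint (e.g. whether they take additional arguments) are assumptions made for illustration.

      import com.actian.spark_vector.{ ProfAccMap, Profiling }

      // Assumed-shape example of the bracket-style profiling described above.
      class SerializationStep extends Profiling {
        def run(values: Seq[Int]): Int = {
          // 1. Declare the sections to be measured and keep the result implicit
          implicit val accs: ProfAccMap = profileInit("serialize", "flush")

          profile("serialize")              // 2. open the "serialize" bracket
          val bytes = values.map(_.toByte)  //    work measured under "serialize"
          profileEnd                        //    close the bracket

          profile("flush")                  //    open the "flush" bracket
          val written = bytes.length        //    work measured under "flush"
          profileEnd

          profilePrint                      // 3. log the gathered timings
          written
        }
      }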

Value Members

  1. package colbuffer

    Implements buffering (and value serialization) for Vector columns
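
    Not this package's actual API, but a conceptual sketch of what column buffering means here: values of a single column are accumulated into a ByteBuffer that can later be drained towards a Vector end point. The class and method names are illustrative only; the real column buffers implement Vector's actual value serialization for each supported type.

      import java.nio.ByteBuffer

      // Illustrative only: buffers int values for a single column and hands the
      // filled buffer over for writing.
      class IntColumnBufferSketch(maxValues: Int) {
        private val buffer = ByteBuffer.allocateDirect(maxValues * 4)

        def put(value: Int): Unit = buffer.putInt(value)

        // Flip the buffer so it can be drained (e.g. written to a socket channel)
        def drain(): ByteBuffer = { buffer.flip(); buffer }
      }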

  2. package datastream

    Spark -> Vector loading part

    Loading from Spark to Vector will be initiated through the SparkSQL/DataFrames APIs. The sequence of operations is:

    • Given an input RDD with its corresponding data type information, its fields are either matched to an existing table's columns or used to generate a CREATE TABLE SQL statement that is first submitted to Vector.
    • Helper com.actian.spark_vector.datastream.writer.RowWriter and com.actian.spark_vector.datastream.writer.DataStreamWriter objects are created; they contain all the information a Spark worker needs to process, serialize and write binary data to Vector end points.
    • A com.actian.spark_vector.datastream.writer.InsertRDD is created, containing as many partitions as there are DataStreams and establishing a NarrowDependency on the input RDD.
    • The driver initiates the load, issuing a SQL query to the Vector leader node.
    • The driver initiates the Spark job; the RowWriter and DataStreamWriter objects, being part of the closure, are serialized and sent to the worker processes.
    • Each worker process reads its corresponding write configuration and starts processing the input data assigned to it by the driver when the InsertRDD was created, serializing it into ByteBuffers and flushing them through the socket to one (and only one) predetermined Vector end point.
    • During this time, the driver remains blocked waiting for the SQL query to finish. Once all workers are done, the driver issues a commit or an abort, depending on whether any of the workers failed. Note that Spark workers are currently not retried, since partial loading is not yet supported in Vector.

    Unloading from Vector to Spark will be initiated through the SparkSQL/DataFrames APIs. The sequence of operations is:

    • Given a SparkSQL select query, a com.actian.spark_vector.datastream.reader.ScanRDD is created, containing as many partitions as there are DataStreams.
    • Helper com.actian.spark_vector.datastream.reader.RowReader and com.actian.spark_vector.datastream.reader.DataStreamReader objects are created; they contain all the information a Spark worker needs to read binary data from Vector end points, deserialize and process it.
    • The driver initiates the unload, issuing a SQL query to the Vector leader node.
    • The driver initiates the Spark job; the RowReader and DataStreamReader objects, being part of the closure, are serialized and sent to the worker processes.
    • Each worker process reads its corresponding read configuration and starts processing the output data assigned to it by the driver when the ScanRDD was created, reading it into ByteBuffers and deserializing it so that the data can be collected row by row through an Iterator[Row].
    • During this time, the driver remains blocked waiting for the SQL query to finish. Once all workers are done, the driver issues a commit or an abort, depending on whether any of the workers failed. Note that Spark workers are currently not retried, since partial loading is not yet supported in Vector.
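
    A hedged end-to-end sketch of initiating both operations through the SparkSQL API (Spark 1.x-style SQLContext). The provider class name (DefaultSource), the option keys (host, instance, database, table) and the exact SQL forms are illustrative assumptions, not authoritative documentation.

      import org.apache.spark.{ SparkConf, SparkContext }
      import org.apache.spark.sql.SQLContext

      object VectorLoadUnloadSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("spark-vector-sketch"))
          val sqlContext = new SQLContext(sc)

          // Register a Vector table through the connector's RelationProvider
          // (provider name and option keys are assumptions for illustration)
          sqlContext.sql(
            """CREATE TEMPORARY TABLE vector_target
              |USING com.actian.spark_vector.sql.DefaultSource
              |OPTIONS (host "vectorhost", instance "VI", database "testdb", table "target")""".stripMargin)

          // Load: Spark -> Vector (drives the InsertRDD/RowWriter path above)
          val df = sqlContext.read.json("hdfs:///data/input.json")
          df.registerTempTable("source")
          sqlContext.sql("INSERT INTO vector_target SELECT * FROM source")

          // Unload: Vector -> Spark (drives the ScanRDD/RowReader path above)
          val unloaded = sqlContext.sql("SELECT * FROM vector_target")
          println(unloaded.count())
        }
      }
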
  3. package sql

    Implements SparkSQL API, including a BaseRelation and a RelationProvider
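
    For orientation, a generic skeleton of the Spark interfaces this package implements; these are not the connector's actual classes, just the standard RelationProvider/BaseRelation shape that the USING clause resolves to.

      import org.apache.spark.rdd.RDD
      import org.apache.spark.sql.{ Row, SQLContext }
      import org.apache.spark.sql.sources.{ BaseRelation, RelationProvider, TableScan }
      import org.apache.spark.sql.types.{ IntegerType, StructField, StructType }

      // Generic shape of a Spark data source: the RelationProvider is looked up by
      // name in the USING clause and builds a BaseRelation from the OPTIONS map.
      class ExampleSource extends RelationProvider {
        override def createRelation(sqlContext: SQLContext,
                                    parameters: Map[String, String]): BaseRelation =
          new ExampleRelation(sqlContext, parameters("table"))
      }

      class ExampleRelation(override val sqlContext: SQLContext, table: String)
          extends BaseRelation with TableScan {
        // A real implementation would derive the schema from Vector column metadata
        override def schema: StructType = StructType(Seq(StructField("id", IntegerType)))
        // and return an RDD that reads from the Vector DataStreams (cf. ScanRDD)
        override def buildScan(): RDD[Row] = sqlContext.sparkContext.emptyRDD[Row]
      }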

  4. package srp

  5. package util

  6. package vector

    This package contains a series of helpers to create and manage JDBC connections to Vector, to define the equivalence of SparkSQL and Vector data types, to create tables when they do not exist (respecting the input DataFrame's schema) and to obtain column metadata for Vector tables, as well as high-level methods to initiate loading into Vector tables.
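
    Not this package's API, but a plain JDBC sketch of the kind of connection it manages. The URL format assumes the standard Ingres/Vector JDBC driver on the classpath; host, instance port name and database are placeholders and depend on the actual installation.

      import java.sql.DriverManager

      object VectorJdbcSketch {
        def main(args: Array[String]): Unit = {
          // Illustrative connection string and credentials
          val conn = DriverManager.getConnection(
            "jdbc:ingres://vectorhost:VW7/testdb", "user", "password")
          try {
            val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM target")
            while (rs.next()) println(rs.getLong(1))
          } finally conn.close()
        }
      }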
