Package com.actian.spark_vector

package spark_vector

Spark-Vector connector.

With this connector, data can be loaded from Spark sources into Vector, and the results of Vector computations can be consumed in Spark and transformed into a DataFrame. Both operations are performed in parallel: the data coming from every input RDD partition is serialized using Vector's binary protocol and passed through socket connections to Vector end points. Although network communication may be incurred at this point, the connector tries, whenever possible, to assign only local RDD partitions to each Vector end point. In the other direction, Vector data is unloaded through socket connections and consumed in Spark as an RDD.

Throughout the documentation we use DataStream and Vector end point interchangeably. A Vector DataStream is the logical stream through which Vector consumes binary data. Typically, DataStreams are processed in parallel (i.e. there are as many threads as DataStreams allocated), but in some cases a single Vector thread will handle multiple DataStreams. On the other hand, each connection to a Vector end point maps to exactly one DataStream.

Linear Supertypes
AnyRef, Any

Type Members

  1. implicit class ComposePartial[A, B] extends AnyRef

    Given two PartialFunctions f and g with no side effects, composes them into another partial function h such that h.isDefinedAt(x) if f.isDefinedAt(x) and g.isDefinedAt(f(x))

    returns

    a PartialFunction composed from f and g
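
    For illustration, a minimal sketch of such a composition. The method name andThenPartial is a hypothetical stand-in chosen for this example; the actual member names of ComposePartial may differ.

      object ComposePartialSketch {
        // Hypothetical stand-in for ComposePartial: composes f with g so that the
        // result is defined at x only when f is defined at x and g is defined at f(x).
        implicit class ComposePartial[A, B](f: PartialFunction[A, B]) {
          def andThenPartial[C](g: PartialFunction[B, C]): PartialFunction[A, C] =
            Function.unlift((a: A) => f.lift(a).flatMap(g.lift))
        }

        def main(args: Array[String]): Unit = {
          val f: PartialFunction[Int, Int] = { case x if x > 0 => x * 2 }
          val g: PartialFunction[Int, String] = { case x if x < 10 => s"small: $x" }
          val h = f.andThenPartial(g)
          println(h.isDefinedAt(3))  // true: f(3) = 6 and g is defined at 6
          println(h.isDefinedAt(-1)) // false: f is not defined at -1
          println(h.isDefinedAt(7))  // false: f(7) = 14 and g is not defined at 14
        }
      }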

  2. case class ProfAcc(name: String, acc: Long = 0) extends Product with Serializable

    An accumulator for profiling. Contains a label (name) and the accumulated number of milliseconds (acc)

  3. case class ProfAccMap(accs: Map[String, ProfAcc], started: Stack[ProfAcc] = Stack.empty[ProfAcc]) extends Product with Serializable

    Contains all the ProfAccs defined and, as a stack, those of them that are currently accumulating/measuring time

  4. trait Profiling extends AnyRef

    Trait to be used when profiling is needed.

    To profile a section of code, the following steps should be followed:

    • Call profileInit(<label_for_first_section>, <label_for_second_section>, ...) and store the result in an implicit value
    • Use profile(<section_name>) and profileEnd in a bracket opening/closing fashion; the code between a profile call and its corresponding profileEnd will have its execution time measured and added to its accumulator
    • Call profilePrint at the end to log the profiling information gathered
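
    A hedged usage sketch following the steps above, assuming that profileInit returns the ProfAccMap shown earlier; the exact signatures of profileInit, profile, profileEnd and profilePrint (e.g. whether they take additional arguments) are assumptions made for illustration.

      import com.actian.spark_vector.{ ProfAccMap, Profiling }

      // Assumed-shape example of the bracket-style profiling described above.
      class SerializationStep extends Profiling {
        def run(values: Seq[Int]): Int = {
          // 1. Declare the sections to be measured and keep the result implicit
          implicit val accs: ProfAccMap = profileInit("serialize", "flush")

          profile("serialize")              // 2. open the "serialize" bracket
          val bytes = values.map(_.toByte)  //    work measured under "serialize"
          profileEnd                        //    close the bracket

          profile("flush")                  //    open the "flush" bracket
          val written = bytes.length        //    work measured under "flush"
          profileEnd

          profilePrint                      // 3. log the gathered timings
          written
        }
      }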

Value Members

  1. package colbuffer

    Implements buffering (and value serialization) for Vector columns
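
    Not this package's actual API, but a conceptual sketch of what column buffering means here: values of a single column are accumulated into a ByteBuffer that can later be drained towards a Vector end point. The class and method names are illustrative only; the real column buffers implement Vector's actual value serialization for each supported type.

      import java.nio.ByteBuffer

      // Illustrative only: buffers int values for a single column and hands the
      // filled buffer over for writing.
      class IntColumnBufferSketch(maxValues: Int) {
        private val buffer = ByteBuffer.allocateDirect(maxValues * 4)

        def put(value: Int): Unit = buffer.putInt(value)

        // Flip the buffer so it can be drained (e.g. written to a socket channel)
        def drain(): ByteBuffer = { buffer.flip(); buffer }
      }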

  2. package datastream

    Spark -> Vector loading part

    Loading from Spark to Vector will be initiated through the SparkSQL/DataFrames APIs. The sequence of operations is:

    • Given an input RDD with its corresponding data type information, its fields are either matched to an existing table's columns or used to generate a CREATE TABLE SQL statement that is first submitted to Vector.
    • Helper com.actian.spark_vector.datastream.writer.RowWriter and com.actian.spark_vector.datastream.writer.DataStreamWriter objects are created; they contain all the information a Spark worker needs to process, serialize and write binary data to Vector end points.
    • A com.actian.spark_vector.datastream.writer.InsertRDD is created, containing as many partitions as there are DataStreams and establishing a NarrowDependency on the input RDD.
    • The driver initiates the load, issuing a SQL query to the Vector leader node.
    • The driver initiates the Spark job; the RowWriter and DataStreamWriter objects, being part of the closure, are serialized and sent to the worker processes.
    • Each worker process reads its corresponding write configuration and starts processing the input data assigned to it by the driver when the InsertRDD was created, serializing it into ByteBuffers and flushing them through the socket to one (and only one) predetermined Vector end point.
    • During this time, the driver remains blocked waiting for the SQL query to finish. Once all workers are done, the driver issues a commit or an abort, depending on whether any of the workers failed. Note that Spark workers are currently not retried, since partial loading is not yet supported in Vector.

    Unloading from Vector to Spark will be initiated through the SparkSQL/DataFrames APIs. The sequence of operations is:

    • Given a SparkSQL select query, a com.actian.spark_vector.datastream.reader.ScanRDD is created, containing as many partitions as there are DataStreams.
    • Helper com.actian.spark_vector.datastream.reader.RowReader and com.actian.spark_vector.datastream.reader.DataStreamReader objects are created; they contain all the information a Spark worker needs to read binary data from Vector end points, deserialize and process it.
    • The driver initiates the unload, issuing a SQL query to the Vector leader node.
    • The driver initiates the Spark job; the RowReader and DataStreamReader objects, being part of the closure, are serialized and sent to the worker processes.
    • Each worker process reads its corresponding read configuration and starts processing the output data assigned to it by the driver when the ScanRDD was created, reading it into ByteBuffers and deserializing it so that the data can be collected row by row through an Iterator[Row].
    • During this time, the driver remains blocked waiting for the SQL query to finish. Once all workers are done, the driver issues a commit or an abort, depending on whether any of the workers failed. Note that Spark workers are currently not retried, since partial loading is not yet supported in Vector.
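
    A hedged end-to-end sketch of initiating both operations through the SparkSQL API (Spark 1.x-style SQLContext). The provider class name (DefaultSource), the option keys (host, instance, database, table) and the exact SQL forms are illustrative assumptions, not authoritative documentation.

      import org.apache.spark.{ SparkConf, SparkContext }
      import org.apache.spark.sql.SQLContext

      object VectorLoadUnloadSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("spark-vector-sketch"))
          val sqlContext = new SQLContext(sc)

          // Register a Vector table through the connector's RelationProvider
          // (provider name and option keys are assumptions for illustration)
          sqlContext.sql(
            """CREATE TEMPORARY TABLE vector_target
              |USING com.actian.spark_vector.sql.DefaultSource
              |OPTIONS (host "vectorhost", instance "VI", database "testdb", table "target")""".stripMargin)

          // Load: Spark -> Vector (drives the InsertRDD/RowWriter path above)
          val df = sqlContext.read.json("hdfs:///data/input.json")
          df.registerTempTable("source")
          sqlContext.sql("INSERT INTO vector_target SELECT * FROM source")

          // Unload: Vector -> Spark (drives the ScanRDD/RowReader path above)
          val unloaded = sqlContext.sql("SELECT * FROM vector_target")
          println(unloaded.count())
        }
      }
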
  3. package sql

    Implements SparkSQL API, including a BaseRelation and a RelationProvider
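
    For orientation, a generic skeleton of the Spark interfaces this package implements; these are not the connector's actual classes, just the standard RelationProvider/BaseRelation shape that the USING clause resolves to.

      import org.apache.spark.rdd.RDD
      import org.apache.spark.sql.{ Row, SQLContext }
      import org.apache.spark.sql.sources.{ BaseRelation, RelationProvider, TableScan }
      import org.apache.spark.sql.types.{ IntegerType, StructField, StructType }

      // Generic shape of a Spark data source: the RelationProvider is looked up by
      // name in the USING clause and builds a BaseRelation from the OPTIONS map.
      class ExampleSource extends RelationProvider {
        override def createRelation(sqlContext: SQLContext,
                                    parameters: Map[String, String]): BaseRelation =
          new ExampleRelation(sqlContext, parameters("table"))
      }

      class ExampleRelation(override val sqlContext: SQLContext, table: String)
          extends BaseRelation with TableScan {
        // A real implementation would derive the schema from Vector column metadata
        override def schema: StructType = StructType(Seq(StructField("id", IntegerType)))
        // and return an RDD that reads from the Vector DataStreams (cf. ScanRDD)
        override def buildScan(): RDD[Row] = sqlContext.sparkContext.emptyRDD[Row]
      }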

  4. package srp

  5. package util

  6. package vector

    This package contains a series of helpers to create and manage JDBC connections to Vector, to define the equivalence of SparkSQL and Vector data types, to create tables when they do not exist (respecting the input DataFrame's schema) and to obtain column metadata for Vector tables, as well as high-level methods to initiate loading into Vector tables.
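
    Not this package's API, but a plain JDBC sketch of the kind of connection it manages. The URL format assumes the standard Ingres/Vector JDBC driver on the classpath; host, instance port name and database are placeholders and depend on the actual installation.

      import java.sql.DriverManager

      object VectorJdbcSketch {
        def main(args: Array[String]): Unit = {
          // Illustrative connection string and credentials
          val conn = DriverManager.getConnection(
            "jdbc:ingres://vectorhost:VW7/testdb", "user", "password")
          try {
            val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM target")
            while (rs.next()) println(rs.getLong(1))
          } finally conn.close()
        }
      }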
