Given two PartialFunctions f and g with no side effects, we compose them into another partial function h such that h.isDefinedAt(x) if and only if f.isDefinedAt(x) and g.isDefinedAt(f(x)).
Returns a PartialFunction composed from f and g.
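
A minimal sketch of such a composition, built on the standard library's PartialFunction.lift and Function.unlift (composePartial is a hypothetical name, not necessarily the connector's helper):

// Hypothetical helper: h is defined at x exactly when f is defined at x
// and g is defined at f(x).
def composePartial[A, B, C](f: PartialFunction[A, B],
                            g: PartialFunction[B, C]): PartialFunction[A, C] =
  Function.unlift((x: A) => f.lift(x).flatMap(g.lift))

// Example: h is defined only for even inputs whose half is strictly positive.
val f: PartialFunction[Int, Int] = { case x if x % 2 == 0 => x / 2 }
val g: PartialFunction[Int, Int] = { case y if y > 0 => y * 10 }
val h = composePartial(f, g)
// h.isDefinedAt(4) == true and h(4) == 20
// h.isDefinedAt(3) == false (f undefined); h.isDefinedAt(-4) == false (g undefined at -2)

Note that isDefinedAt on the unlifted function has to evaluate f(x), which is why this composition is only safe when f has no side effects, matching the precondition stated above.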
An accumulator for profiling. Contains a label and the number of milliseconds accumulated (acc).
Contains all ProfAccs defined and which of those are currently accumulating/measuring time (as a stack)
Trait to be used when profiling is needed. To profile a section of the code, the following steps should be followed:
- Call profileInit(<label_for_first_section>, <label_for_second_section>, ...) and store it into an implicit value
- Use profile(<section_name>) and profileEnd in a bracket opening/closing fashion, where the code between a profile call and its corresponding profileEnd will have its execution time measured and stored into its accumulator
- Call profilePrint at the end to log the profiling information gathered
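
A minimal, self-contained sketch of how such a trait could look; the real implementation lives in the connector, and this illustrative version only mirrors the documented call sequence:

import scala.collection.mutable

// Sketch of ProfAcc: a label plus the milliseconds accumulated so far (acc).
case class ProfAcc(label: String, var acc: Long = 0L, var started: Long = 0L)

// Sketch of ProfAccMap: all defined ProfAccs plus a stack of those currently measuring.
case class ProfAccMap(accs: Map[String, ProfAcc],
                      measuring: mutable.Stack[ProfAcc] = mutable.Stack.empty)

trait Profiling {
  def profileInit(labels: String*): ProfAccMap =
    ProfAccMap(labels.map(l => l -> ProfAcc(l)).toMap)

  def profile(label: String)(implicit pmap: ProfAccMap): Unit = {
    val acc = pmap.accs(label)
    acc.started = System.currentTimeMillis()
    pmap.measuring.push(acc) // opens the bracket
  }

  def profileEnd(implicit pmap: ProfAccMap): Unit = {
    val acc = pmap.measuring.pop() // closes the innermost open bracket
    acc.acc += System.currentTimeMillis() - acc.started
  }

  def profilePrint(implicit pmap: ProfAccMap): Unit =
    pmap.accs.values.foreach(a => println(s"${a.label}: ${a.acc} ms"))
}

// Usage, following the bracketed pattern described above:
//   implicit val accs = profileInit("serialize", "flush")
//   profile("serialize"); /* ...work... */ profileEnd
//   profilePrint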
Implements buffering (and value serialization) for Vector columns.
Spark -> Vector loading part.
Loading from Spark to Vector will be initiated through the SparkSQL/DataFrames APIs. The sequence of operations is:
- Once an insert query is issued, having an RDD with its corresponding data type information, its fields will either be matched to an existing table's columns or they will help generate a create table SQL statement that will be first submitted to Vector
- Vector returns the connection information that each Spark worker needs to be able to process, serialize and write binary data to Vector end points
- An InsertRDD is created with as many partitions as there are DataStreams and that will create a NarrowDependency to the input RDD
- The load query is submitted to the Vector leader node
- RowWriter and DataStreamWriter objects, part of the closure, are serialized and sent to worker processes
- Each Spark worker reads its partition's data (from wherever the InsertRDD was created), serializes it into ByteBuffers and then flushes them through the socket towards one (and only one) predetermined Vector end point
- Finally, the connector issues a commit or abort depending on whether any of the workers failed. Note: we currently do not retry Spark workers since partial loading is not supported in Vector yet.
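
As an illustration of the per-partition write step (a sketch with a naive stand-in serialization format, not the connector's binary protocol), each worker could stage rows into a ByteBuffer and flush it through a single socket channel:

import java.net.InetSocketAddress
import java.nio.ByteBuffer
import java.nio.channels.SocketChannel

// Serialize one partition's rows and flush them to a single Vector end point.
// Assumes every serialized row fits into the 1 MiB staging buffer.
def writePartition(rows: Iterator[Seq[Any]], host: String, port: Int): Unit = {
  val channel = SocketChannel.open(new InetSocketAddress(host, port))
  try {
    val buffer = ByteBuffer.allocateDirect(1 << 20)
    for (row <- rows) {
      val bytes = row.mkString("|").getBytes("UTF-8") // stand-in serialization
      if (buffer.remaining < bytes.length + 4) flush(channel, buffer)
      buffer.putInt(bytes.length).put(bytes) // length-prefixed row
    }
    flush(channel, buffer)
  } finally channel.close()
}

def flush(channel: SocketChannel, buffer: ByteBuffer): Unit = {
  buffer.flip()
  while (buffer.hasRemaining) channel.write(buffer)
  buffer.clear()
}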
Unloading from Vector to Spark will be initiated through the SparkSQL/DataFrames APIs.
The sequence of operations is:
- When a SparkSQL select query is issued, a com.actian.spark_vector.datastream.reader.ScanRDD is created containing as many partitions as there are DataStreams
- Vector returns the connection information that each Spark worker needs to be able to read binary data from Vector end points, deserialize and process it
- The select query is submitted to the Vector leader node
- RowReader and DataStreamReader objects, part of the closure, are serialized and sent to worker processes
- Each Spark worker reads its partition's binary data (from wherever the ScanRDD was created), deserializes it into ByteBuffers and then, through an Iterator[Row], collects the data row by row
- Finally, the connector issues a commit or abort depending on whether any of the workers failed. Note: we currently do not retry Spark workers since partial loading is not supported in Vector yet.
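
An illustrative mirror of the read side (again a sketch, not the connector's code): length-prefixed binary rows are pulled from an end point and surfaced through an Iterator, analogous to the Iterator[Row] mentioned above:

import java.io.{DataInputStream, EOFException}
import java.net.Socket

// Read length-prefixed rows from one end point until the stream ends.
def readRows(host: String, port: Int): Iterator[Array[Byte]] = {
  val socket = new Socket(host, port)
  val in = new DataInputStream(socket.getInputStream)
  new Iterator[Array[Byte]] {
    private var nextRow = fetch()
    private def fetch(): Option[Array[Byte]] =
      try {
        val len = in.readInt()        // length prefix
        val bytes = new Array[Byte](len)
        in.readFully(bytes)           // one serialized row
        Some(bytes)
      } catch { case _: EOFException => socket.close(); None }
    def hasNext: Boolean = nextRow.isDefined
    def next(): Array[Byte] = { val row = nextRow.get; nextRow = fetch(); row }
  }
}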
Implements the SparkSQL API, including a BaseRelation and a RelationProvider.
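
For orientation, this is the general shape of those two Spark interfaces; the classes below are a generic, minimal sketch rather than the connector's actual implementation:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// A BaseRelation describes the schema; TableScan produces the rows.
class ExampleRelation(override val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Seq(StructField("col", StringType)))
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("value")))
}

// A RelationProvider instantiates relations from the OPTIONS map of a
// CREATE TABLE ... USING statement.
class ExampleProvider extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new ExampleRelation(sqlContext)
}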
This package contains a series of helpers to create and manage JDBC connections to Vector, to define the equivalence of SparkSQL and Vector data types, to create tables when they do not exist (respecting the input DataFrame's schema), to obtain column metadata for Vector tables, and high-level methods to initiate loading to Vector tables.
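
A hedged sketch of the kind of helper this implies: deriving a create table statement from a DataFrame schema and submitting it over JDBC. The helper name and the type mappings below are illustrative assumptions, not the connector's actual ones:

import java.sql.DriverManager
import org.apache.spark.sql.types._

// Map a few SparkSQL types to plausible Vector column types (illustrative only).
def vectorType(dt: DataType): String = dt match {
  case IntegerType => "integer4"
  case LongType    => "integer8"
  case DoubleType  => "float8"
  case StringType  => "varchar(4096)"
  case other       => sys.error(s"no mapping for $other in this sketch")
}

// Build and submit a create table statement respecting the DataFrame's schema.
def createTable(jdbcUrl: String, table: String, schema: StructType): Unit = {
  val cols = schema.fields.map(f => s"${f.name} ${vectorType(f.dataType)}").mkString(", ")
  val conn = DriverManager.getConnection(jdbcUrl)
  try conn.createStatement().execute(s"create table $table ($cols)")
  finally conn.close()
}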
Spark-Vector connector.

With this connector, data can be loaded from Spark sources into Vector and results of Vector computations can be consumed in Spark and transformed into a DataFrame. Both operations are performed in parallel: the data coming from every input RDD partition is serialized using Vector's binary protocol and passed through socket connections to Vector end points. Although there is a chance that network communication is incurred at this point, most of the time this connector will try to assign only local RDD partitions to each Vector end point. Conversely, Vector data is unloaded through socket connections and consumed in Spark into an RDD.

Throughout the documentation we will use DataStream and Vector end point interchangeably. A Vector DataStream is the logical stream of consuming binary data in Vector. Typically, these DataStreams will be executed in parallel (i.e. there will be as many threads as DataStreams allocated), but there will be cases when a Vector thread will handle multiple DataStreams. On the other hand, each connection to a Vector end point maps to exactly one DataStream.
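
To make both directions concrete, a hedged end-to-end usage sketch (spark is a SparkSession; the provider class path and option keys are assumptions to be checked against the connector's documentation):

// Register a Vector table through the connector's SparkSQL API.
spark.sql("""CREATE TEMPORARY VIEW vector_table
             USING com.actian.spark_vector.sql.DefaultSource
             OPTIONS (host "vectorhost", database "dbname", table "my_table")""")

// Spark -> Vector: every partition of the source is loaded in parallel.
spark.sql("INSERT INTO vector_table SELECT * FROM some_spark_table")

// Vector -> Spark: results are unloaded in parallel and consumed as a DataFrame.
val df = spark.sql("SELECT * FROM vector_table")
df.show()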