Given two PartialFunctions f and g with no side effects, we compose them into another partial function h such that
h.isDefinedAt(x) if and only if f.isDefinedAt(x) and g.isDefinedAt(f(x))
a PartialFunction composed from f and g
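For illustration, a minimal sketch of this composition (the helper name composePF is ours and not part of this API; Scala 2.13's PartialFunction#andThen offers the same semantics):

{{{
// Sketch: h is defined at x only when f is defined at x and g is defined at f(x)
def composePF[A, B, C](f: PartialFunction[A, B], g: PartialFunction[B, C]): PartialFunction[A, C] =
  new PartialFunction[A, C] {
    def isDefinedAt(x: A): Boolean = f.isDefinedAt(x) && g.isDefinedAt(f(x))
    def apply(x: A): C = g(f(x))
  }

val f: PartialFunction[Int, Int] = { case x if x > 0 => x * 2 }
val g: PartialFunction[Int, String] = { case x if x % 4 == 0 => s"multiple of 4: $x" }
val h = composePF(f, g)
// h.isDefinedAt(2) == true  (f(2) = 4, g is defined at 4)
// h.isDefinedAt(1) == false (f(1) = 2, g is not defined at 2)
}}}

Note that f is evaluated both in isDefinedAt and in apply, which is only safe because f has no side effects; that is why the precondition above matters.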
An accumulator for profiling. Contains a label and the number of milliseconds accumulated (acc)
Contains all ProfAccs defined and tracks which of those are currently accumulating/measuring time (as a stack)
Trait to be used when profiling is needed. To profile a section of code, the following steps should be followed:
 - Call profileInit(<label_for_first_section>, <label_for_second_section>, ...) and store the result in an implicit value
 - Use profile(<section_name>) and profileEnd in a bracket opening/closing fashion, where the code between a profile call and its corresponding profileEnd will have its execution time measured and stored in its accumulator
 - Call profilePrint at the end to log the profiling information gathered
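A hedged usage sketch of these steps (the trait name Profiling, the surrounding class and the exact signatures are assumptions for illustration):

{{{
// Hypothetical usage of the profiling trait; names other than profileInit,
// profile, profileEnd and profilePrint are illustrative assumptions.
class LoadJob extends Profiling {
  implicit val accs = profileInit("serialize", "flush")

  def run(): Unit = {
    profile("serialize")
    // ... code measured and accumulated under the "serialize" label
    profileEnd
    profile("flush")
    // ... code measured and accumulated under the "flush" label
    profileEnd
    profilePrint // log the gathered timings
  }
}
}}}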
Implements buffering (and value serialization) for Vector columns
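As a rough sketch of what buffering a column means here (a simplified assumption for illustration; the real buffers implement Vector's actual binary formats, nullability and alignment):

{{{
import java.nio.ByteBuffer

// Simplified illustration: serialize int values of one column into a
// ByteBuffer, then flip it for draining to a socket once the batch is full.
class IntColumnBuffer(maxValues: Int) {
  private val buf = ByteBuffer.allocateDirect(maxValues * 4)
  def put(value: Int): Unit = buf.putInt(value) // serialize one value
  def isFull: Boolean = !buf.hasRemaining       // batch complete?
  def flip(): ByteBuffer = { buf.flip(); buf }  // ready the buffer for reading
}
}}}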
Spark -> Vector loading part
Loading from Spark to Vector will be initiated through the SparkSQL/DataFrames APIs.
The sequence of operations is:
 - Given an input RDD with its corresponding data type information, its fields will either be matched to an existing table's columns or they will help generate a create table SQL statement that will be first submitted to Vector
 - The necessary code is shipped to every Spark worker to be able to process, serialize and write binary data to Vector end points
 - A com.actian.spark_vector.datastream.writer.InsertRDD is created containing as many partitions as there are DataStreams, and that will create a NarrowDependency to the input RDD
 - A load query is issued to the Vector leader node
 - RowWriter and DataStreamWriter objects, part of the closure, are serialized and sent to worker processes
 - Each Spark worker reads its input partition (for which the InsertRDD was created), serializes it into ByteBuffers and then flushes them through the socket towards one (and only one) predetermined Vector end point
 - The driver then issues a commit or abort depending on whether any of the workers failed. Note, we currently do not retry Spark workers since partial loading is not supported in Vector yet.
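As a point of reference for the first step, a load initiated through the DataFrames API might look like the sketch below; the datasource class name and the option keys are assumptions for illustration, so consult the connector's user documentation for the exact spelling:

{{{
// Hedged sketch, assuming an active SparkSession named `spark`.
val events = spark.read.parquet("hdfs:///data/events")
events.write
  .format("com.actian.spark_vector.sql.DefaultSource") // assumed provider name
  .option("host", "vector-host")                       // assumed option keys
  .option("instance", "VI")
  .option("database", "testdb")
  .option("table", "events")
  .save()
}}}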
Unloading from Vector to Spark will be initiated through the SparkSQL/DataFrames APIs.
The sequence of operations is:
 - Once a SparkSQL select query is issued, a com.actian.spark_vector.datastream.reader.ScanRDD is created containing as many partitions as there are DataStreams
 - The necessary code is shipped to every Spark worker to be able to read binary data from Vector end points, deserialize and process it
 - An unload query is issued to the Vector leader node
 - RowReader and DataStreamReader objects, part of the closure, are serialized and sent to worker processes
 - Each Spark worker reads its partition (for which the ScanRDD was created), deserializes it into ByteBuffers and then, through an Iterator[Row], we can collect the data row by row
 - The driver then issues a commit or abort depending on whether any of the workers failed. Note, we currently do not retry Spark workers since partial loading is not supported in Vector yet.
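Correspondingly, a hedged sketch of initiating an unload through the DataFrames API (same assumptions about the datasource name and option keys as in the loading sketch above):

{{{
// Hedged sketch, assuming an active SparkSession named `spark`.
val vectorTable = spark.read
  .format("com.actian.spark_vector.sql.DefaultSource")
  .option("host", "vector-host")
  .option("instance", "VI")
  .option("database", "testdb")
  .option("table", "events")
  .load()
vectorTable.createOrReplaceTempView("events")
spark.sql("select count(*) from events").show() // pulls data through the ScanRDD
}}}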
Implements the SparkSQL API, including a BaseRelation and a RelationProvider
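To picture the two hooks named here, a generic minimal sketch (not the connector's actual classes): the RelationProvider builds a BaseRelation from user-supplied options, and a TableScan supplies the rows.

{{{
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{ Row, SQLContext }
import org.apache.spark.sql.sources.{ BaseRelation, RelationProvider, TableScan }
import org.apache.spark.sql.types.{ StringType, StructField, StructType }

// Generic illustration of the SparkSQL datasource hooks; names are ours.
class ExampleSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation =
    new ExampleRelation(sqlContext, parameters("table"))
}

class ExampleRelation(val sqlContext: SQLContext, table: String) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Seq(StructField("name", StringType)))
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(s"row from $table")))
}
}}}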
This package contains a series of helpers to create and manage JDBC connections to Vector, to define the equivalence of SparkSQL and Vector data types,
to create tables when they do not exist (respecting the input DataFrame's schema), to obtain column metadata for Vector tables, and high-level methods to initiate
loading to Vector tables.
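For instance, table creation from an input schema could look like the following sketch; the type mapping below is a simplified assumption, not the connector's actual SparkSQL-to-Vector equivalence:

{{{
import org.apache.spark.sql.types._

// Hedged sketch of schema-driven "create table" generation.
def toVectorType(dt: DataType): String = dt match {
  case IntegerType => "integer4"
  case LongType    => "integer8"
  case DoubleType  => "float8"
  case StringType  => "varchar(4096)"
  case other       => sys.error(s"unmapped type: $other")
}

def createTableSql(table: String, schema: StructType): String =
  schema.fields
    .map(f => s"${f.name} ${toVectorType(f.dataType)}${if (f.nullable) "" else " not null"}")
    .mkString(s"create table $table (", ", ", ")")

// createTableSql("events",
//   StructType(Seq(StructField("id", LongType, nullable = false), StructField("name", StringType))))
// == "create table events (id integer8 not null, name varchar(4096))"
}}}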
Spark-Vector connector.
With this connector, data can be loaded from Spark sources into Vector, and results of Vector computations can be consumed in Spark and transformed into a DataFrame. Both operations are performed in parallel: the data coming from every input RDD partition is serialized using Vector's binary protocol and passed through socket connections to Vector end points. Although there is a chance that network communication is incurred at this point, most of the time this connector will try to assign only local RDD partitions to each Vector end point. Conversely, Vector data is unloaded through socket connections and consumed in Spark into an RDD.

Throughout the documentation we will use DataStream and Vector end point interchangeably. A Vector DataStream is the logical stream of consuming binary data in Vector. Typically, these DataStreams will be executed in parallel (i.e. there will be as many threads as DataStreams allocated), but there will be cases when a Vector thread will handle multiple DataStreams. On the other hand, each connection to a Vector end point maps to exactly one DataStream.
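The locality preference mentioned above can be pictured with a small sketch (names and the matching strategy are illustrative assumptions, not the connector's internals):

{{{
// Hedged sketch: prefer assigning each partition to a DataStream whose end
// point runs on one of the partition's preferred hosts, else round-robin.
case class VectorEndpoint(host: String, port: Int)

def assign(partitionHosts: IndexedSeq[Seq[String]],
           endpoints: IndexedSeq[VectorEndpoint]): Map[Int, Seq[Int]] =
  partitionHosts.indices
    .groupBy { p =>
      val local = endpoints.indexWhere(e => partitionHosts(p).contains(e.host))
      if (local >= 0) local else p % endpoints.size
    }
    .map { case (stream, parts) => stream -> parts.toSeq } // DataStream -> partitions
}}}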