Spark-Vector loader
The Spark-Vector loader is a utility that facilitates loading files of different formats (currently CSV, Parquet, and ORC only) into
Vector through Spark, using the Spark-Vector connector.
For CSV parsing, the Spark-Vector loader uses the csv method of the org.apache.spark.sql.DataFrameReader class.
Example:
This Scala snippet, executed in spark-shell, reads CSV files and loads them into Vector:
sqlContext.sql("""CREATE TEMPORARY TABLE large_table
USING com.actian.spark_vector.sql.DefaultSource
OPTIONS (
  host "vectorhost",
  instance "VI",
  database "dbName",
  table "large_table"
)""")

sqlContext.sql("""CREATE TEMPORARY TABLE csv_files
USING csv
OPTIONS (
  path "hdfs://namenode:8020/data/csv_file*",
  header "false",
  sep "|"
)""")

val results = sqlContext.sql("""insert into table large_table select * from csv_files""")
is equivalent to
spark-submit --master spark://spark_master:7077 --class com.actian.spark_vector.loader.Main $SPARK_VECTOR/loader/target/spark_vector_loader-assembly-2.0-SNAPSHOT.jar load csv -sf "hdfs://namenode:8020/data/csv_file*" -sc "|" -vh vectorhost -vi VI -vd dbName -tt large_table
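The mapping between the two forms can be made explicit. The sketch below is plain Scala with a hypothetical helper (not part of the loader) that builds the connector's CREATE TEMPORARY TABLE statement from the same values the CLI passes via -vh, -vi, -vd, and -tt:

```scala
// Hypothetical helper (for illustration only, not part of the loader):
// builds the Spark-Vector connector registration SQL from the same
// values the loader receives through -vh/-vi/-vd/-tt.
def vectorTableSql(sparkTable: String, host: String, instance: String,
                   database: String, table: String): String =
  s"""CREATE TEMPORARY TABLE $sparkTable
     |USING com.actian.spark_vector.sql.DefaultSource
     |OPTIONS (host "$host", instance "$instance", database "$database", table "$table")""".stripMargin

// Produces the same statement used in the spark-shell example above.
val sql = vectorTableSql("large_table", "vectorhost", "VI", "dbName", "large_table")
```

Each CLI flag thus corresponds one-to-one to an OPTIONS entry of the connector's DefaultSource.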
To read a Parquet file, this
sqlContext.read.parquet("hdfs://namenode:8020/data/parquet_file.parquet").registerTempTable("parquet_file")
sqlContext.sql("""insert into table large_table select * from parquet_file""")
is equivalent to
spark-submit --master spark://spark_master:7077 --class com.actian.spark_vector.loader.Main $SPARK_VECTOR/loader/target/spark_vector_loader-assembly-2.0-SNAPSHOT.jar load parquet -sf "hdfs://namenode:8020/data/parquet_file.parquet" -vh vectorhost -vi VI -vd dbName -tt large_table
To read an ORC file, this
sqlContext.read.orc("hdfs://namenode:8020/data/orc_file.orc").registerTempTable("orc_file")
sqlContext.sql("""insert into table large_table select * from orc_file""")
is equivalent to
spark-submit --master spark://spark_master:7077 --class com.actian.spark_vector.loader.Main $SPARK_VECTOR/loader/target/spark_vector_loader-assembly-2.0-SNAPSHOT.jar load orc -sf "hdfs://namenode:8020/data/orc_file.orc" -vh vectorhost -vi VI -vd dbName -tt large_table
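The three spark-submit invocations above differ only in the format keyword, the source file, and (for CSV) the separator. As a sketch, a hypothetical helper (not part of the loader; flag names taken from the examples above) can assemble the loader's argument list for any of the supported formats:

```scala
// Hypothetical helper (illustration only): assembles the loader's CLI
// argument list. Flag names (-sf, -sc, -vh, -vi, -vd, -tt) are the ones
// used in the spark-submit examples above.
def loaderArgs(format: String, sourceFile: String, host: String, instance: String,
               database: String, table: String, sep: Option[String] = None): Seq[String] = {
  // The loader currently supports only these three formats.
  require(Set("csv", "parquet", "orc").contains(format), s"unsupported format: $format")
  Seq("load", format, "-sf", sourceFile, "-vh", host, "-vi", instance,
      "-vd", database, "-tt", table) ++
    sep.toSeq.flatMap(s => Seq("-sc", s)) // separator applies to CSV only
}

// Reconstructs the argument list of the CSV example above.
val csvArgs = loaderArgs("csv", "hdfs://namenode:8020/data/csv_file*",
                         "vectorhost", "VI", "dbName", "large_table", Some("|"))
```

The resulting sequence is what follows the assembly jar on the spark-submit command line.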
Of course, by using the Spark-Vector connector directly, one can load arbitrarily complex relations (not only files) into Vector,
as well as files of any format that Spark is able to read.
For a complete list of options available, see Args.