Spark Sample Codes in Scala

1. Preliminaries

  • The Scala code is built with Maven; for an example, refer to [1]. (See the note after this list.)
  • For instructions on installing Spark, refer to [2].
  • OS: CentOS 7
  • CPU: AMD Ryzen 7 3700X (8 cores)
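
Note: the spark-submit commands below load ./target/deequ-1.0-jar-with-dependencies.jar, so they assume the project is packaged as a single fat jar, for example with the maven-assembly-plugin's jar-with-dependencies descriptor via mvn package. The artifact name depends on the project's pom.xml and may differ in your build.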

2. Sample codes

2.1 Accumulation

(1) Build the following code, which is based on [3].

package package1
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("my-app").setMaster("local")
    val sc = new SparkContext(conf)

    val data = Array(1, 2, 3, 4, 5)

    // A long accumulator aggregates values from executor tasks back to the driver.
    val accum = sc.longAccumulator("my-app")
    val rdd = sc.parallelize(data)

    // foreach is an action, so each element is added to the accumulator exactly once.
    rdd.foreach(x => accum.add(x))
    println("Counter value: " + accum.value)

    sc.stop()
  }
}

(2) Execute the code with spark-submit [4].

~/spark-3.1.2/bin/spark-submit --class package1.App --master local[8] ./target/deequ-1.0-jar-with-dependencies.jar
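
If the job succeeds, the driver prints Counter value: 15, the sum of 1 through 5. Per [3], accumulator updates performed inside actions such as foreach are applied exactly once, while updates performed inside transformations may be re-applied if a task is re-executed. The hypothetical one-line variant below illustrates the unreliable pattern:

// Hypothetical variant: updating the accumulator inside a transformation.
// If a task is re-executed, these updates can be applied more than once.
val mapped = rdd.map { x => accum.add(x); x }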

2.2 Calculate π

(1) Build the following code, which is based on the Pi-estimation example in [5].

package package1
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("my-app").setMaster("local")
    val sc = new SparkContext(conf)

    val NUM_SAMPLES = args(0).toInt
    println(s"Number of samples: $NUM_SAMPLES")

    // Sample random points in the unit square; keep those inside the quarter circle.
    val count = sc.parallelize(1 to NUM_SAMPLES).filter { _ =>
      val x = math.random
      val y = math.random
      x*x + y*y < 1
    }.count()
    println(s"Pi is roughly ${4.0 * count / NUM_SAMPLES}")

    sc.stop()
  }
}
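
Since (x, y) is uniform over the unit square, the probability that a point falls inside the quarter circle x*x + y*y < 1 equals its area, π/4. The fraction count / NUM_SAMPLES therefore approximates π/4, and multiplying by 4 yields the estimate of π. The Monte Carlo error shrinks like 1/√NUM_SAMPLES, so one billion samples should give roughly four correct decimal places.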

(2) Execute the code with spark-submit [4]. The job takes about 25 seconds, which is faster than the 150 seconds taken by the equivalent SQL in PostgreSQL [6].

~/spark-3.1.2/bin/spark-submit --class package1.App --master local[8] ./target/deequ-1.0-jar-with-dependencies.jar 1000000000

2.3 Spark SQL

(1) Create a test dataset with the following Python code, passing 600000000 as the argument.

import random
import sys

args = sys.argv
random.seed(0)

# One random "value" column followed by five fixed dummy columns.
dummies = ",".join("01234567890123456789dummy%d" % i for i in range(1, 6))

print("value,dummy1,dummy2,dummy3,dummy4,dummy5")
for i in range(int(args[1])):
    print(str(random.random()) + "," + dummies)
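
The script writes CSV rows to standard output, so redirect it into the file the Spark job reads. The script name gen_data.py below is just a placeholder:

python3 gen_data.py 600000000 > data.csv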

(2) Build and execute the following code, which is based on [7]. The SQL query takes about 110 seconds, which is slower than PostgreSQL's 7 seconds for the same aggregation.

package package1
import org.apache.spark.sql.SparkSession

import java.io.File

object App {
  def main(args: Array[String]): Unit = {
    // Print the working directory so the relative path "./data.csv" can be verified.
    val currentDir = new File(".").getAbsoluteFile().getParent()
    println(currentDir)

    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .getOrCreate()

    // inferSchema triggers an extra pass over the file to determine column types.
    val df = spark.read.format("csv")
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .load("./data.csv")
    df.show()

    df.createOrReplaceTempView("test")

    // spark.sql only builds the plan; show() is the action that runs the query.
    val start = System.nanoTime()
    val result = spark.sql("SELECT avg(value) AS average_value FROM test")
    result.show()
    val end = System.nanoTime()
    println("Time elapsed: " + (end - start) / 1000000000.0 + " secs")

    spark.stop()
  }
}
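
Note that option("inferSchema", "true") makes Spark scan the whole 600-million-row file once just to determine column types before the query runs. A minimal sketch of an alternative, assuming the column layout produced by the generator in (1): supply the schema explicitly so the extra pass is skipped.

import org.apache.spark.sql.types._

// Explicit schema matching the generated CSV: one double plus five string columns.
val schema = StructType(Seq(
  StructField("value", DoubleType),
  StructField("dummy1", StringType),
  StructField("dummy2", StringType),
  StructField("dummy3", StringType),
  StructField("dummy4", StringType),
  StructField("dummy5", StringType)
))

val df = spark.read.format("csv")
  .option("sep", ",")
  .option("header", "true")
  .schema(schema)
  .load("./data.csv")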

3. References

[1] Various ways to print “Hello world” in Scala

[2] Installing Apache Spark from source

[3] RDD Programming Guide

https://spark.apache.org/docs/latest/rdd-programming-guide.html

[4] Submitting Applications

https://spark.apache.org/docs/latest/submitting-applications.html

[5] Apache Spark Examples

https://spark.apache.org/examples.html

[6] Calculate π in PostgreSQL

[7] Spark SQL, DataFrames and Datasets Guide

https://spark.apache.org/docs/2.3.0/sql-programming-guide.html
