Scala - Apache Spark DataFrame API Cheatsheet

2018-06-24

Having a good cheatsheet at hand can significantly speed up the development process. One of the best cheatsheets I have come across is sparklyr’s cheatsheet.

For my work, I’m using Spark’s DataFrame API in Scala to create data transformation pipelines. These are some functions and design patterns that I’ve found to be extremely useful.

Load data

val df = spark.read.parquet("filepath")
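
The same reader interface covers other formats. A minimal sketch for CSV, assuming a file with a header row (the path and options here are placeholders):

val df_csv = spark.read
  .option("header", "true")       // first line contains column names
  .option("inferSchema", "true")  // sample the file to guess column types
  .csv("filepath.csv")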

Get SparkContext information

println(sc.getConf.getAll.mkString("\n"))
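
To look up a single setting rather than dumping everything, getOption returns an Option and avoids an exception when the key is unset (the key below is just an example):

println(sc.getConf.getOption("spark.executor.memory"))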

Get Spark version

sc.version

Get number of partitions

df.rdd.getNumPartitions
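
To change the partition count, repartition performs a full shuffle, while coalesce only merges existing partitions and is the cheaper way to reduce the count (the numbers here are illustrative):

val df_wide = df.repartition(200)  // full shuffle into 200 partitions
val df_narrow = df.coalesce(10)    // merge down to 10 partitions, no full shuffle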

Count number of rows

df.count
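
For row counts per group, groupBy composes with count; animal_type is the column assumed in the pipeline example further down:

df.groupBy("animal_type").count().show()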

Print the schema

df.printSchema

Preview top 20 rows

df.show
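
show also accepts a row count and a truncation flag, which helps when inspecting wide string columns:

df.show(5, false)  // show 5 rows without truncating long values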

Design pattern for constructing a data transformation pipeline

import org.apache.spark.sql.DataFrame
import spark.implicits._ // enables the $"columnName" syntax; assumes spark is the active SparkSession
 
// Keep only cats and dogs weighing at most 100
def filter_slim_cat_dog(df: DataFrame): DataFrame = {
  df.filter(($"animal_type" isin ("cat", "dog")) && ($"weight" <= 100))
}
 
// Join with a vet_provider DataFrame, assumed to be in scope, on the shared key columns
def join_vet(df: DataFrame): DataFrame = {
  df.join(vet_provider, Seq("VETID", "ANIMALID"))
}
 
val animal_slim_cat_dog =
  animal.transform(filter_slim_cat_dog)
        .transform(join_vet)
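
The same pattern extends to parameterized steps through currying: putting the parameters first and the DataFrame last yields the DataFrame => DataFrame shape that transform expects. A minimal sketch, reusing the weight column from above with an illustrative threshold:

def filter_by_max_weight(maxWeight: Int)(df: DataFrame): DataFrame = {
  df.filter($"weight" <= maxWeight)
}

val animal_light = animal.transform(filter_by_max_weight(50))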

Drop duplicate rows

df.dropDuplicates(Seq("VETID", "ANIMALID"))
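
Called with no arguments, dropDuplicates compares entire rows, which is equivalent to distinct:

df.dropDuplicates()  // deduplicate on all columns
df.distinct()        // same result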

For an exhaustive list of functions, you can check out Spark’s Dataset class documentation.

Hope you’ve found this cheatsheet useful. Thank you!