Scala - Apache Spark DataFrame API Cheatsheet
2018-06-24
Having a good cheatsheet at hand can significantly speed up the development process. One of the best cheatsheets I have come across is sparklyr's cheatsheet.
For my work, I’m using Spark’s DataFrame API in Scala to create data transformation pipelines. These are some functions and design patterns that I’ve found to be extremely useful.
Load data
val df = spark.read.parquet("filepath")
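The same reader handles other formats too; for example, a CSV with a header row might be loaded like this (the path and options here are illustrative):
val dfCsv = spark.read
  .option("header", "true")      // first line contains column names
  .option("inferSchema", "true") // sample the file to guess column types
  .csv("filepath.csv")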
Get SparkContext information
println(sc.getConf.getAll.mkString("\n"))
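To look up a single setting instead of dumping them all, for instance the application name:
sc.getConf.get("spark.app.name") // throws if the key is not set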
Get Spark version
sc.version
Get number of partitions
df.rdd.getNumPartitions
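If a different partition count is needed, a sketch (the counts here are arbitrary):
val repartitioned = df.repartition(200) // full shuffle up to 200 partitions
val coalesced = df.coalesce(10)         // merge down to 10, avoiding a full shuffle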
Count number of rows
df.count
Print schema
df.printSchema
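For programmatic access to the schema rather than a printout:
df.columns // Array[String] of column names
df.dtypes  // Array[(String, String)] of (name, type) pairs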
Preview top 20 rows
df.show
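show also accepts arguments to control the row count and cell truncation:
df.show(5)                    // top 5 rows only
df.show(20, truncate = false) // print full cell contents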
Design pattern for constructing a data transformation pipeline
import org.apache.spark.sql.DataFrame
import spark.implicits._ // brings the $"colName" column syntax into scope

// Each step is a plain DataFrame => DataFrame function
def filter_slim_cat_dog(df: DataFrame): DataFrame = {
  df.filter(($"animal_type".isin("cat", "dog")) && ($"weight" <= 100))
}

// vet_provider and animal are assumed to be DataFrames already in scope
def join_vet(df: DataFrame): DataFrame = {
  df.join(vet_provider, Seq("VETID", "ANIMALID"))
}

val animal_slim_cat_dog =
  animal.transform(filter_slim_cat_dog)
    .transform(join_vet)
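Because every step is an ordinary function from DataFrame to DataFrame, each one can be unit tested in isolation, and adding, removing, or reordering a step in the chain is a one-line change. The pipeline also reads top to bottom in the order the transformations are applied.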
Drop duplicate rows
df.dropDuplicates(Seq("VETID", "ANIMALID"))
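Called with no arguments, dropDuplicates compares entire rows:
df.dropDuplicates() // deduplicate on all columns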
For an exhaustive list of functions, you can check out Spark's Dataset class documentation.
Hope you’ve found this cheatsheet useful. Thank you!