1. Spark Basics and the RDD Interface

Resilient Distributed Dataset (RDD)

The RDD is the core object that everything in Spark revolves around. It is fundamentally an abstraction for a very large set of data that is split into partitions and distributed across the cluster.

What is the difference between the map and flatMap functions?

The map function creates a one-to-one relationship: every entry in the original RDD is mapped to exactly one value in the new RDD, so both RDDs have the same number of entries. The flatMap function works the same way, except that each original entry can produce multiple results or none at all, and all of the results are flattened into a single new RDD.
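To make the difference concrete, here is a minimal sketch in plain Python that mimics the two behaviours on an ordinary list, so it runs without a Spark installation. The sample `lines` data is made up for illustration; the PySpark calls named in the comments (`rdd.map`, `rdd.flatMap`) are the real counterparts, assuming `rdd` holds the same strings.

```python
from itertools import chain

# Hypothetical input: three "lines" of text, one of them empty.
lines = ["the quick brown fox", "", "jumps over"]

# map: exactly one output per input entry.
# PySpark counterpart: rdd.map(lambda line: line.upper())
upper = [line.upper() for line in lines]

# flatMap: each input entry yields zero or more outputs,
# and the results are flattened into one collection.
# PySpark counterpart: rdd.flatMap(lambda line: line.split())
words = list(chain.from_iterable(line.split() for line in lines))

print(len(lines), len(upper))  # same number of entries in and out
print(words)                   # the empty line contributes nothing
```

Note that the empty line still produces an entry under map (an empty string), but disappears entirely under flatMap because splitting it yields no words.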

Functions that can be used with Spark

Map

https://images.squarespace-cdn.com/content/v1/619daecdeb74d32e6f533043/1661183395799-UFXXCXLO7EGS891UWRZY/understanding+RDD.png?format=750w

FlatMap

https://images.squarespace-cdn.com/content/v1/619daecdeb74d32e6f533043/1661183411670-OGURMXPQNDRUI5WMM206/flatmap+example.png?format=750w

Key Value Pair Sorting

https://images.squarespace-cdn.com/content/v1/619daecdeb74d32e6f533043/1661183477608-I4SOW104XLNR7TODZ9J7/key+value+pair+sort+example.png?format=750w
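As a rough sketch of what sorting key/value pairs involves, the plain-Python example below sorts a list of tuples by key, and then by value after swapping each pair. The `(word, count)` data is invented for illustration and is not taken from the figure; in PySpark the corresponding calls would be `rdd.sortByKey()` and `rdd.map(...)` on an RDD of tuples.

```python
# Hypothetical (word, count) pairs, standing in for a key/value RDD.
pairs = [("spark", 3), ("rdd", 5), ("map", 1)]

# Sort by key (the word).
# PySpark counterpart: rdd.sortByKey()
by_word = sorted(pairs, key=lambda kv: kv[0])

# To sort by count instead, swap each pair so the count becomes the key,
# then sort by key again.
# PySpark counterpart: rdd.map(lambda kv: (kv[1], kv[0])).sortByKey()
flipped = [(count, word) for word, count in pairs]
by_count = sorted(flipped, key=lambda kv: kv[0])

print(by_word)
print(by_count)
```

Swapping the pair first is a common pattern because sortByKey only looks at the first element of each tuple.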

2. Spark SQL, DataFrames and Datasets