Monday, 17 August 2020

Basic PySpark operations on a DataFrame

Prerequisite:

Create a free Community Edition account for practice:

https://community.cloud.databricks.com/


Step 1: Upload the file to DBFS  [DBFS is the Databricks File System]
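
Once the upload finishes, you can confirm the file is in DBFS from a notebook cell. A minimal sketch, assuming the UI placed the file under the default /FileStore/tables folder:

# List the uploaded files; dbutils is available in every Databricks notebook
display(dbutils.fs.ls("/FileStore/tables"))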


Step 2: Spark code for reading CSV data into a DataFrame.


file_location = "/FileStore/tables/data.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)


Output:
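
Because infer_schema was set to "false", every column in df is loaded as a string; that is why sal is cast to an integer in the next step. A quick way to verify the loaded types:

# Inspect the schema; with inferSchema disabled, all columns are strings
df.printSchema()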


Step 3: Casting and Aggregation:

# Cast sal to an integer column, then aggregate: sum of sal per city
from pyspark.sql.functions import col

df2 = df.withColumn('new_sal', col('sal').cast('integer'))
df2.groupBy('city').sum('new_sal').show()
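
A small variant: sum('new_sal') names the result column sum(new_sal) automatically. If you prefer an explicit column name, a sketch using agg with an alias (total_sal is just an illustrative name):

from pyspark.sql.functions import sum as sum_

# Same aggregation, with an explicit name for the sum column
df2.groupBy('city').agg(sum_('new_sal').alias('total_sal')).show()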


Step 4: Filter and Save operations:

# Filter operation: keep only the rows where city is Pune
df3 = df2.filter(col('city') == 'Pune')
df3.show()

# Write the DataFrame out as a table
df3.write.saveAsTable("my_test_table")
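
Note: saveAsTable raises an error if my_test_table already exists, so re-running the cell fails. A common fix is to set the write mode to "overwrite":

# Overwrite the existing table so the cell can be re-run safely
df3.write.mode("overwrite").saveAsTable("my_test_table")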



The table is created in the default schema:
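
To double-check, list the tables in the default database and read the saved table back; a minimal sketch:

# List tables in the default database, then read the table back
spark.sql("SHOW TABLES IN default").show()
spark.table("my_test_table").show()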