Monday, 17 August 2020

Basic PySpark operations on a DataFrame

Prerequisite:

Create a free Community Edition account for practice:

https://community.cloud.databricks.com/


Step 1: Upload the file to DBFS  [DBFS is the Databricks File System]
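
Once the upload finishes, you can confirm the file is in DBFS from a notebook cell. A minimal sketch, assuming the UI placed the file under the default /FileStore/tables folder:

# List the uploaded files; dbutils is available in every Databricks notebook
display(dbutils.fs.ls("/FileStore/tables"))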


Step 2: Spark code for reading CSV data into a DataFrame.


file_location = "/FileStore/tables/data.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)


Output:
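
Because infer_schema was set to "false", every column in df is loaded as a string; that is why sal is cast to an integer in the next step. A quick way to verify the loaded types:

# Inspect the schema; with inferSchema disabled, all columns are strings
df.printSchema()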


Step 3: Casting and Aggregation:

# Cast sal to an integer column, then aggregate: sum of sal per city
from pyspark.sql.functions import col

df2 = df.withColumn('new_sal', col('sal').cast('integer'))
df2.groupBy('city').sum('new_sal').show()
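
A small variant: sum('new_sal') names the result column sum(new_sal) automatically. If you prefer an explicit column name, a sketch using agg with an alias (total_sal is just an illustrative name):

from pyspark.sql.functions import sum as sum_

# Same aggregation, with an explicit name for the sum column
df2.groupBy('city').agg(sum_('new_sal').alias('total_sal')).show()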


Step 4: Filter and Save operations:

# Filter operation: keep only the rows where city is Pune
df3 = df2.filter(col('city') == 'Pune')
df3.show()

# Write the DataFrame out as a table
df3.write.saveAsTable("my_test_table")
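
Note: saveAsTable raises an error if my_test_table already exists, so re-running the cell fails. A common fix is to set the write mode to "overwrite":

# Overwrite the existing table so the cell can be re-run safely
df3.write.mode("overwrite").saveAsTable("my_test_table")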



The table is created in the default schema:
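
To double-check, list the tables in the default database and read the saved table back; a minimal sketch:

# List tables in the default database, then read the table back
spark.sql("SHOW TABLES IN default").show()
spark.table("my_test_table").show()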