Prerequisite:
Create a free Community Edition account for practice:
https://community.cloud.databricks.com/
Step 1: Upload the file to DBFS (the Databricks File System)
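Once the file is uploaded through the UI, a quick notebook cell confirms it landed in DBFS (a minimal sanity check; the path assumes the default upload target, which matches the file_location used in Step 2):
# List the default upload directory; data.csv should appear here
display(dbutils.fs.ls("/FileStore/tables/"))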
Step 2: Spark code for reading the CSV data into a DataFrame
file_location = "/FileStore/tables/data.csv"
file_type = "csv"
# CSV options
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)
display(df)
Output:
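Because inferSchema is set to "false", every column in df is read as a string. printSchema makes this visible and explains why the cast in Step 3 is needed:
# All columns are string type since schema inference was disabled
df.printSchema()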
Step 3: Casting and Aggregation
# Apply a cast, then aggregate: sum of sal per city
from pyspark.sql.functions import col
# sal is read as a string (inferSchema is "false"); cast it before summing
df2 = df.withColumn('new_sal', col('sal').cast('integer'))
df2.groupBy('city').sum('new_sal').show()
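The generated column name, sum(new_sal), is awkward to reference downstream. An equivalent aggregation using agg with an alias produces a readable column and allows ordering by it (a sketch; total_sal is an illustrative name, not from the original):
from pyspark.sql.functions import sum as sum_
# Same aggregation, but with a named result column, sorted descending
df2.groupBy('city') \
    .agg(sum_('new_sal').alias('total_sal')) \
    .orderBy('total_sal', ascending=False) \
    .show()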
Step 4: Filter and Save Operations
# Filter operation: keep only the rows where city is 'Pune'
df3 = df2.filter(col('city') == 'Pune')
df3.show()
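Filter conditions compose with & (and) and | (or), with each condition in parentheses (a sketch; the salary threshold of 50000 is purely illustrative):
# Keep Pune rows whose casted salary exceeds a threshold
df2.filter((col('city') == 'Pune') & (col('new_sal') > 50000)).show()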
# Persist the filtered DataFrame as a managed table
df3.write.saveAsTable("my_test_table")
The table is created in the default schema.
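To confirm the save worked, the table can be read back by name, either with Spark SQL or the DataFrame API (a minimal sketch using the table name from Step 4):
# Query the saved table with SQL ...
spark.sql("SELECT * FROM my_test_table").show()
# ... or load it back as a DataFrame
spark.table("my_test_table").show()
Note that re-running saveAsTable raises an error if the table already exists; df3.write.mode("overwrite").saveAsTable("my_test_table") replaces it instead.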