Basic Commands of Pyspark, Python + Spark
Apache Spark originally supported the Scala programming language, but due to the popularity of Python, a tool was developed to use Spark from Python; that tool is called PySpark. With the release of Spark 2.0, it became much easier to work with Spark. Here we will see the basics of PySpark, i.e. working with Spark using Python.
Assuming that Spark is available from a Jupyter Notebook, the first thing we need to do is import and create a Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('data').getOrCreate()
A session named spark is created above. Next, we will read a CSV file in the following way
df = spark.read.csv('customers.csv', inferSchema=True, header=True)
There are many other formats available for reading a file, such as table, text, JSON, etc. For now, we have successfully loaded the customers.csv file and created a data frame df.
In order to view the contents of the data frame, we use the show method
df.show()
We can even pass the number of lines we wish to return
df.show(10)
The customers.csv file looks like this; it is a simple trading file that is freely available to download.
Interestingly, if you know how to work with SQL, you can analyze this data using SQL syntax without even learning the data frame syntax. Here is one example: let's say we want to see all the records for which the closing price is less than or equal to sixty. First, we create a temporary view of the data frame, and then we query it using SQL
df.createOrReplaceTempView('data')
result = spark.sql('select * from data where Close <= 60')
result.show()
If we want to see the details of the data frame, we can use the following methods
df.printSchema()
Output:
root
|-- Date: string (nullable = true)
|-- Open: string (nullable = true)
|-- High: string (nullable = true)
|-- Low: string (nullable = true)
|-- Close: string (nullable = true)
|-- Volume: string (nullable = true)
|-- Adj Close: string (nullable = true)
To get the list of columns
df.columns
['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']
Adding a new column: let's say we want to add a new column called double_close that contains the value 2 * closing price. We can do this in the following manner
df.withColumn('double_close', df['Close']*2)
Group by is pretty easy in PySpark; here is a small example of grouping records by the Date column.
df.groupBy('Date')
When we apply groupBy to a data frame, it returns a GroupedData object, which can be used for finding aggregations like mean, max, etc.
df.groupBy('Date').mean()
df.groupBy('Date').sum()
df.groupBy('Date').max()
df.groupBy('Date').min()
Now, let's say we want to find the maximum volume. We can achieve this using the following
df.agg({'Volume': 'max'}).show()
In a similar way, we can use min, mean, sum, etc. as aggregate functions
df.agg({'Volume': 'mean'}).show()
These are the basics of PySpark, and with them you are ready to start working with it. If you liked this content, please upvote. Thanks!