Basic Commands of Pyspark, Python + Spark

Sumit Dubey
2 min read · Dec 5, 2020

Apache Spark originally supported the Scala programming language, but due to the popularity of Python, a tool was developed to use Python with Spark; that tool is called PySpark. With the release of Spark 2.0, it became much easier to work with Spark. Here we will cover the basics of PySpark, i.e. working with Spark using Python.

Assuming that Spark is installed and accessible from a Jupyter Notebook, the first thing we need to do is import and create a Spark session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('data').getOrCreate()

A session named spark is created above. Next, we will read a CSV file in the following way

df = spark.read.csv('customers.csv', inferSchema=True, header=True)

There are readers for several other formats as well, such as JSON, Parquet, plain text, and tables. For now, we have successfully loaded the customers.csv file and created a data frame df.
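For reference, the other readers follow the same pattern. A minimal sketch is below; the file names are only illustrative, not files used in this article:

# hypothetical file names, shown only to illustrate the reader API
df_json = spark.read.json('customers.json')
df_parquet = spark.read.parquet('customers.parquet')
df_text = spark.read.text('customers.txt')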

In order to view the contents of the data frame, we use the show method:

df.show()

We can even pass the number of rows we wish to display:

df.show(10)

The customers.csv file is a simple trading data set that is freely available to download; it contains the columns Date, Open, High, Low, Close, Volume, and Adj Close.

Interestingly, if you know how to work with SQL, you can analyze this data using SQL syntax without even learning the data frame syntax. Here is one example: let's say we want to see all the records for which the closing price is less than or equal to sixty. First, we create a temporary view of the data frame and then query it using SQL:

df.createOrReplaceTempView('data')

result = spark.sql('select * from data where Close <= 60')

result.show()
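For comparison, the same filter can be written directly with the data frame API. This is a minimal sketch using the Close column from the schema shown below:

df.filter(df['Close'] <= 60).show()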

If we want to see the details of the data frame we can use the following methods

df.printSchema()

Output:

root
|-- Date: string (nullable = true)
|-- Open: string (nullable = true)
|-- High: string (nullable = true)
|-- Low: string (nullable = true)
|-- Close: string (nullable = true)
|-- Volume: string (nullable = true)
|-- Adj Close: string (nullable = true)
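Besides printSchema, a quick statistical summary of the data frame is available through the standard describe method. A small sketch, using column names from the schema above:

df.describe().show()

# or restrict the summary to specific columns
df.describe('Close', 'Volume').show()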

To get the list of columns

df.columns

['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

Adding a new column: let's say we want to add a new column called double_close, which would contain the value 2 * closing price. We can do this in the following manner:

df.withColumn('double_close', df['Close'] * 2)
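Note that withColumn does not modify df in place; it returns a new data frame. A minimal sketch of capturing and inspecting the result (column names taken from the schema above):

# withColumn returns a new DataFrame; assign it to keep the new column
df2 = df.withColumn('double_close', df['Close'] * 2)
df2.select('Close', 'double_close').show(5)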

Group by is pretty easy in PySpark; here is a small example of grouping records by the Date column.

df.groupBy('Date')

When we apply the groupBy function to a data frame, it returns a grouped-data object, which can then be used for aggregations like mean, max, etc.

df.groupBy('Date').mean()

df.groupBy('Date').sum()

df.groupBy('Date').max()

df.groupBy('Date').min()
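These calls aggregate every numeric column at once. If you only care about particular columns, you can pass their names instead; this sketch assumes inferSchema gave Close and Volume numeric types:

# assumes Close and Volume were inferred as numeric columns
df.groupBy('Date').max('Close', 'Volume').show()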

Now, let's say we want to find the max of Volume. We can achieve this using the following:

df.agg({'Volume': 'max'}).show()

In a similar way, we can use min, mean, sum, etc. as aggregate functions:

df.agg({'Volume': 'mean'}).show()
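An equivalent, slightly more flexible style (not used above, but standard PySpark) is to import the built-in aggregate functions and combine several of them in one call. Column names are again taken from the schema above:

from pyspark.sql import functions as F

# compute multiple aggregates in a single pass
df.agg(F.max('Volume'), F.mean('Close')).show()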

These are the basics of PySpark, and with them you are good to start working with it. If you liked this content, please upvote. Thanks!
