Optimize Conversion between PySpark and Pandas DataFrames

PySpark and Pandas are two open-source Python libraries for handling and analyzing data. Pandas works with in-memory data on a single machine, while PySpark, the Python API for Apache Spark, processes data in a distributed fashion across a cluster.
Conversion between PySpark and Pandas DataFrames
In this article, we will look at how to convert a PySpark DataFrame into a Pandas DataFrame and vice versa, and how to speed up the conversion with PyArrow. Both conversions can be done easily in PySpark.
Converting Pandas DataFrame into a PySpark DataFrame
Here, we'll convert a Pandas DataFrame into a PySpark DataFrame. First, we import the PySpark and Pandas libraries and start a Spark session. Then we create a Pandas DataFrame and convert it into a PySpark DataFrame by passing it to the createDataFrame() method. In the example below, the resulting PySpark DataFrame is stored in the same variable that held the Pandas DataFrame.
Example:
Python3
# importing pandas and PySpark libraries
import pandas as pd
from pyspark.sql import SparkSession

# initializing the PySpark session
spark = SparkSession.builder.getOrCreate()

# creating a pandas DataFrame
df = pd.DataFrame({
    'Cardinal': [1, 2, 3],
    'Ordinal': ['First', 'Second', 'Third']
})

# converting the pandas DataFrame into a PySpark DataFrame
df = spark.createDataFrame(df)

# printing the first two rows
df.show(2)
Output:
+--------+-------+
|Cardinal|Ordinal|
+--------+-------+
|       1|  First|
|       2| Second|
+--------+-------+
only showing top 2 rows
If you would like to keep using the Pandas DataFrame afterwards, store the converted PySpark DataFrame in a separate variable instead of overwriting the original.
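For instance, here is a minimal sketch (with illustrative variable names) that keeps both DataFrames available at the same time:
Python3
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# the original pandas DataFrame stays untouched
pandas_df = pd.DataFrame({
    'Cardinal': [1, 2, 3],
    'Ordinal': ['First', 'Second', 'Third']
})

# the converted copy lives in its own variable
spark_df = spark.createDataFrame(pandas_df)

print(type(pandas_df))  # <class 'pandas.core.frame.DataFrame'>
print(type(spark_df))   # <class 'pyspark.sql.dataframe.DataFrame'>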
Converting PySpark DataFrame into a Pandas DataFrame
Now we will convert a PySpark DataFrame into a Pandas DataFrame. The steps are the same as before, except that this time we call the DataFrame's toPandas() method to perform the conversion.
Syntax of the toPandas() method:
spark_DataFrame.toPandas()
Example:
Python3
# importing SparkSession and Row from PySpark
from pyspark.sql import SparkSession, Row

# initializing the PySpark session
spark = SparkSession.builder.getOrCreate()

# creating a PySpark DataFrame
spark_df = spark.createDataFrame([
    Row(Cardinal=1, Ordinal='First'),
    Row(Cardinal=2, Ordinal='Second'),
    Row(Cardinal=3, Ordinal='Third')
])

# converting spark_df into a pandas DataFrame
pandas_df = spark_df.toPandas()

pandas_df.head()
Output:
   Cardinal Ordinal
0         1   First
1         2  Second
2         3   Third
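One caveat is worth keeping in mind: toPandas() collects the entire DataFrame into the driver's memory, so it can fail on very large datasets. A common safeguard, sketched below with an illustrative row cap and sample fraction, is to limit or sample the data before converting:
Python3
# assuming spark_df is a (potentially large) PySpark DataFrame,
# bring back only a bounded number of rows to the driver
preview_df = spark_df.limit(1000).toPandas()

# or convert a random sample instead of the full dataset
sample_df = spark_df.sample(fraction=0.1).toPandas()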
Now let's measure how long the above conversion takes without any optimization.
Python3
%%time
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# creating a session in PySpark
spark = SparkSession.builder.getOrCreate()

# creating a 10 x 10 PySpark DataFrame of random integers
spark_df = spark.createDataFrame(
    pd.DataFrame(np.random.randint(1, 101, size=(10, 10))))

# converting it back to a pandas DataFrame
spark_df.toPandas()
Output:
3.17 s
Now let's enable PyArrow and see how long the same conversion takes.
Python3
%%time
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# creating a session in PySpark
spark = SparkSession.builder.getOrCreate()

# creating a 10 x 10 PySpark DataFrame of random integers
spark_df = spark.createDataFrame(
    pd.DataFrame(np.random.randint(1, 101, size=(10, 10))))

# enabling PyArrow-based conversion
spark.conf.set('spark.sql.execution.arrow.enabled', 'true')
spark_df.toPandas()
Output:
460 ms
Here we can see that the time required to convert the PySpark DataFrame into a Pandas DataFrame drops drastically once the Arrow-based conversion path is enabled.
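Note that in Spark 3.x the spark.sql.execution.arrow.enabled option is deprecated in favor of spark.sql.execution.arrow.pyspark.enabled; the old name still works but emits a deprecation warning. A short sketch for newer Spark versions:
Python3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark 3.x name of the Arrow optimization switch
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')

# optionally fall back to the non-Arrow path when Arrow
# cannot handle a particular data type
spark.conf.set('spark.sql.execution.arrow.pyspark.fallback.enabled', 'true')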