Optimize Conversion between PySpark and Pandas DataFrames

PySpark and Pandas are two open-source Python libraries for handling and analyzing data. Pandas works with in-memory data on a single machine, while PySpark, the Python API for Apache Spark, processes data in a distributed fashion across a cluster.
Conversion between PySpark and Pandas DataFrames
In this article, we will look at how to convert a PySpark DataFrame into a Pandas DataFrame and vice versa, and how to speed up the conversion with PyArrow. Both conversions can be done easily in PySpark.
Converting Pandas DataFrame into a PySpark DataFrame
Here, we'll convert a Pandas DataFrame into a PySpark DataFrame. First, we import the PySpark and Pandas libraries and start a Spark session. Then we create a Pandas DataFrame and convert it into a PySpark DataFrame by passing it to the createDataFrame() method. In the example below, the resulting PySpark DataFrame is stored in the same variable that held the Pandas DataFrame.
Example:
Python3
# importing pandas and PySpark libraries
import pandas as pd
from pyspark.sql import SparkSession

# initializing the PySpark session
spark = SparkSession.builder.getOrCreate()

# creating a pandas DataFrame
df = pd.DataFrame({
    'Cardinal': [1, 2, 3],
    'Ordinal': ['First', 'Second', 'Third']
})

# converting the pandas DataFrame into a PySpark DataFrame
df = spark.createDataFrame(df)

# printing the first two rows
df.show(2)
Output:
+--------+-------+
|Cardinal|Ordinal|
+--------+-------+
|       1|  First|
|       2| Second|
+--------+-------+
only showing top 2 rows
If you would like to keep using the Pandas DataFrame afterwards, store the converted PySpark DataFrame in a separate variable instead of overwriting the original.
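For instance, here is a minimal sketch (with illustrative variable names) that keeps both DataFrames available at the same time:
Python3
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# the original pandas DataFrame stays untouched
pandas_df = pd.DataFrame({
    'Cardinal': [1, 2, 3],
    'Ordinal': ['First', 'Second', 'Third']
})

# the converted copy lives in its own variable
spark_df = spark.createDataFrame(pandas_df)

print(type(pandas_df))  # <class 'pandas.core.frame.DataFrame'>
print(type(spark_df))   # <class 'pyspark.sql.dataframe.DataFrame'>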
Converting PySpark DataFrame into a Pandas DataFrame
Now we will convert a PySpark DataFrame into a Pandas DataFrame. The steps are the same as before, except that this time we call the DataFrame's toPandas() method to perform the conversion.
Syntax of the toPandas() method:
spark_DataFrame.toPandas()
Example:
Python3
# importing SparkSession and Row from PySpark
from pyspark.sql import SparkSession, Row

# initializing the PySpark session
spark = SparkSession.builder.getOrCreate()

# creating a PySpark DataFrame
spark_df = spark.createDataFrame([
    Row(Cardinal=1, Ordinal='First'),
    Row(Cardinal=2, Ordinal='Second'),
    Row(Cardinal=3, Ordinal='Third')
])

# converting spark_df into a pandas DataFrame
pandas_df = spark_df.toPandas()

pandas_df.head()
Output:
   Cardinal Ordinal
0         1   First
1         2  Second
2         3   Third
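One caveat is worth keeping in mind: toPandas() collects the entire DataFrame into the driver's memory, so it can fail on very large datasets. A common safeguard, sketched below with an illustrative row cap and sample fraction, is to limit or sample the data before converting:
Python3
# assuming spark_df is a (potentially large) PySpark DataFrame,
# bring back only a bounded number of rows to the driver
preview_df = spark_df.limit(1000).toPandas()

# or convert a random sample instead of the full dataset
sample_df = spark_df.sample(fraction=0.1).toPandas()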
Now let's measure how long the above conversion takes without any optimization.
Python3
%%time
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# creating a session in PySpark
spark = SparkSession.builder.getOrCreate()

# creating a 10 x 10 PySpark DataFrame of random integers
spark_df = spark.createDataFrame(
    pd.DataFrame(np.random.randint(1, 101, size=(10, 10))))

# converting it back to a pandas DataFrame
spark_df.toPandas()
Output:
3.17 s
Now let's enable PyArrow and see how long the same conversion takes.
Python3
%%time
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# creating a session in PySpark
spark = SparkSession.builder.getOrCreate()

# creating a 10 x 10 PySpark DataFrame of random integers
spark_df = spark.createDataFrame(
    pd.DataFrame(np.random.randint(1, 101, size=(10, 10))))

# enabling PyArrow-based conversion
spark.conf.set('spark.sql.execution.arrow.enabled', 'true')
spark_df.toPandas()
Output:
460 ms
Here we can see that the time required to convert the PySpark DataFrame into a Pandas DataFrame drops drastically once the Arrow-based conversion path is enabled.
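Note that in Spark 3.x the spark.sql.execution.arrow.enabled option is deprecated in favor of spark.sql.execution.arrow.pyspark.enabled; the old name still works but emits a deprecation warning. A short sketch for newer Spark versions:
Python3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark 3.x name of the Arrow optimization switch
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')

# optionally fall back to the non-Arrow path when Arrow
# cannot handle a particular data type
spark.conf.set('spark.sql.execution.arrow.pyspark.fallback.enabled', 'true')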