Spark Trim String Column on DataFrame

In this article, we will see how to remove white spaces from a DataFrame string column in PySpark. We will perform the same operation as SQL's trim() (which removes leading and trailing white spaces) directly in PySpark.
PySpark Trim String Column on DataFrame
Below are the ways by which we can trim a string column on a DataFrame in PySpark:
- Using withColumn with rtrim()
- Using withColumn with trim()
- Using select()
- Using SQL Expression
- Using PySpark trim(), rtrim(), ltrim()
PySpark Trim using withColumn() with rtrim()
In this example, we are using withColumn() with rtrim() to remove trailing (right) white spaces.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, ltrim, rtrim, col

# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()

# Define your data
data = [(1, "ABC "), (2, " DEF"), (3, " GHI ")]

# Create a DataFrame
df = spark.createDataFrame(data, ["col1", "col2"])

# Show the initial DataFrame
df.show()

# Using withColumn with rtrim() to remove trailing white spaces
df = df.withColumn("col2", rtrim(col("col2")))
df.show()

# Stop the Spark session
spark.stop()
Output:

+----+-----+
|col1| col2|
+----+-----+
|   1| ABC |
|   2|  DEF|
|   3| GHI |
+----+-----+

+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
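Note that rtrim() only strips trailing white spaces, so a value such as " DEF" keeps its leading space; the right-aligned show() output just makes that hard to see. Below is a minimal sketch (not part of the original example, reusing the same sample data) that confirms this by comparing string lengths with length():
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import length, rtrim

spark = SparkSession.builder.appName("RtrimLengthCheck").getOrCreate()
df = spark.createDataFrame([(1, "ABC "), (2, " DEF"), (3, " GHI ")], ["col1", "col2"])

# Compare lengths before and after rtrim(): only strings with trailing
# spaces get shorter, so " DEF" keeps its length (the leading space stays)
df.select(
    "col2",
    length("col2").alias("len_before"),
    length(rtrim("col2")).alias("len_after"),
).show()

spark.stop()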
PySpark Trim using withColumn() with trim()
In this example, we are using withColumn() with trim() to remove both leading and trailing white spaces.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, ltrim, rtrim, col

# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()

# Define your data
data = [(1, "ABC "), (2, " DEF"), (3, " GHI ")]

# Create a DataFrame
df = spark.createDataFrame(data, ["col1", "col2"])

# Show the initial DataFrame
df.show()

# Using withColumn with trim() to remove white spaces
df = df.withColumn("col2", trim(col("col2")))
df.show()

# Stop the Spark session
spark.stop()
Output:

+----+-----+
|col1| col2|
+----+-----+
|   1| ABC |
|   2|  DEF|
|   3| GHI |
+----+-----+

+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
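trim() removes ordinary space characters from both ends of the string. If the column may also contain tabs or newlines, one common alternative (not shown in the original example) is regexp_replace() with an anchored whitespace pattern. A minimal sketch, assuming sample data that mixes spaces, tabs, and newlines:
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, col

spark = SparkSession.builder.appName("RegexTrim").getOrCreate()
df = spark.createDataFrame([(1, "ABC\t"), (2, "\n DEF"), (3, " GHI ")], ["col1", "col2"])

# Remove any leading/trailing whitespace (spaces, tabs, newlines) with a regex
df = df.withColumn("col2", regexp_replace(col("col2"), r"^\s+|\s+$", ""))
df.show()

spark.stop()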
PySpark Trim using select() to Remove White Spaces
In this example, we are using select() with trim() and alias() to trim the string column in PySpark.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, ltrim, rtrim, col

# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()

# Define your data
data = [(1, "ABC "), (2, " DEF"), (3, " GHI ")]

# Create a DataFrame
df = spark.createDataFrame(data, ["col1", "col2"])

# Show the initial DataFrame
df.show()

# Using select to remove white spaces
df = df.select(col("col1"), trim(col("col2")).alias("col2"))
df.show()

# Stop the Spark session
spark.stop()
Output:

+----+-----+
|col1| col2|
+----+-----+
|   1| ABC |
|   2|  DEF|
|   3| GHI |
+----+-----+

+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
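A closely related option, not covered in the example above, is selectExpr(), which lets the trim be written as a SQL fragment instead of chaining trim() and alias(). A minimal sketch with the same sample data:
Python3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SelectExprTrim").getOrCreate()
df = spark.createDataFrame([(1, "ABC "), (2, " DEF"), (3, " GHI ")], ["col1", "col2"])

# Same result as select() with trim() and alias(), written as a SQL fragment
df.selectExpr("col1", "trim(col2) AS col2").show()

spark.stop()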
Using SQL Expression
In this example, we register the DataFrame as a temporary view and use a SQL expression with TRIM() to remove the white spaces.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, ltrim, rtrim, col

# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()

# Define your data
data = [(1, "ABC "), (2, " DEF"), (3, " GHI ")]

# Create a DataFrame
df = spark.createDataFrame(data, ["col1", "col2"])

# Show the initial DataFrame
df.show()

# Using a SQL expression to remove white spaces
df.createOrReplaceTempView("TAB")
spark.sql("SELECT col1, TRIM(col2) AS col2 FROM TAB").show()

# Stop the Spark session
spark.stop()
Output:

+----+-----+
|col1| col2|
+----+-----+
|   1| ABC |
|   2|  DEF|
|   3| GHI |
+----+-----+

+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
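Spark SQL's TRIM also accepts LEADING, TRAILING, and BOTH modifiers, which mirror ltrim(), rtrim(), and trim(). A minimal sketch (not part of the original example) showing all three on the same temporary view:
Python3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SqlTrimModes").getOrCreate()
df = spark.createDataFrame([(1, "ABC "), (2, " DEF"), (3, " GHI ")], ["col1", "col2"])
df.createOrReplaceTempView("TAB")

# LEADING / TRAILING / BOTH mirror ltrim(), rtrim() and trim()
spark.sql("""
    SELECT col1,
           TRIM(LEADING ' ' FROM col2)  AS left_trimmed,
           TRIM(TRAILING ' ' FROM col2) AS right_trimmed,
           TRIM(BOTH ' ' FROM col2)     AS both_trimmed
    FROM TAB
""").show()

spark.stop()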
Using PySpark trim(), rtrim(), ltrim()
In PySpark, we can easily remove leading and trailing white spaces using the pyspark.sql.functions.trim() function. We use ltrim() to remove only leading (left) white spaces and rtrim() to remove only trailing (right) white spaces. The example below applies all of these, along with select() and a SQL expression, to the same DataFrame.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, ltrim, rtrim

# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()

data = [(1, "ABC "), (2, " DEF"), (3, " GHI ")]
df = spark.createDataFrame(data=data, schema=["col1", "col2"])
df.show()

# using withColumn and trim()
df.withColumn("col2", trim("col2")).show()

# using ltrim()
df.withColumn("col2", ltrim("col2")).show()

# using rtrim()
df.withColumn("col2", rtrim("col2")).show()

# Using select
df.select("col1", trim("col2").alias('col2')).show()

# Using SQL Expression
df.createOrReplaceTempView("TAB")
spark.sql("select col1, trim(col2) as col2 from TAB").show()

# Stop the Spark session
spark.stop()
Output:

+----+-----+
|col1| col2|
+----+-----+
|   1| ABC |
|   2|  DEF|
|   3| GHI |
+----+-----+

+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+

+----+----+
|col1|col2|
+----+----+
|   1|ABC |
|   2| DEF|
|   3|GHI |
+----+----+

+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+

+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+

+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
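Real DataFrames often have more than one string column. The loop below is a minimal sketch (the multi-column sample data and the loop are not part of the original example) that applies trim() to every column of StringType:
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("TrimAllStringColumns").getOrCreate()
df = spark.createDataFrame(
    [(1, "ABC ", " x "), (2, " DEF", " y "), (3, " GHI ", " z ")],
    ["col1", "col2", "col3"],
)

# Apply trim() to every column whose data type is StringType
for field in df.schema.fields:
    if isinstance(field.dataType, StringType):
        df = df.withColumn(field.name, trim(col(field.name)))

df.show()
spark.stop()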



