How to compute Row Size in Complex Spark DataFrame?
from pyspark.sql.functions import struct, collect_set, length, desc, col, to_json

df = spark.read.parquet('xxxxx')

# Collect all rows that share a primary key into a single array of structs
cdf = df.groupBy("primaykey").agg(collect_set(struct([df[x] for x in df.columns])).alias("Allcontent")).select("primaykey", "Allcontent")

# Serialize each group to JSON and measure its string length
cdf = cdf.withColumn('all', to_json('Allcontent'))
cdf = cdf.withColumn('len_all', length('all'))

# Show the largest groups first
display(cdf.orderBy(col('len_all').desc()).select('primaykey', 'len_all'))
There is likely a better way to do this. I ran into an error where a data load from the lake into Cosmos DB was failing because a record exceeded the 2 MB document size limit. Grouping by primary key, converting each group to JSON, and computing its size seemed like a quick and appropriate way to find the offending records. Comment below if this can be done better.
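If the goal is specifically to catch the groups that would break the 2 MB limit, a filter can flag them directly instead of scanning the ordered output by eye. This is only a sketch under the same assumptions as above (a "primaykey" grouping column and a 2 MB per-document limit); note that length() counts characters, so for multi-byte data the actual byte size may be somewhat larger.

from pyspark.sql.functions import struct, collect_set, length, col, to_json

COSMOS_LIMIT = 2 * 1024 * 1024  # assumed 2 MB per-document limit

# Keep only the groups whose JSON representation exceeds the limit
oversized = (
    df.groupBy("primaykey")
      .agg(collect_set(struct([df[x] for x in df.columns])).alias("Allcontent"))
      .withColumn("len_all", length(to_json("Allcontent")))
      .filter(col("len_all") > COSMOS_LIMIT)
)
oversized.select("primaykey", "len_all").show()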
Feedback, positive or negative, is welcome.