How to Compute Row Size in a Complex Spark DataFrame?



from pyspark.sql.functions import struct, collect_set, length, col, to_json

# Read the source data from the lake (path elided)
df = spark.read.parquet('xxxxx')
# Collect every column of each row into a struct, grouped by the primary key
cdf = df.groupBy("primaykey").agg(collect_set(struct([df[x] for x in df.columns])).alias("Allcontent")).select("primaykey", "Allcontent")
# Serialize the grouped content to JSON and measure its length in characters
cdf = cdf.withColumn('all', to_json('Allcontent'))
cdf = cdf.withColumn('len_all', length('all'))
# Show the largest records first
display(cdf.orderBy(col('len_all').desc()).select('primaykey', 'len_all'))
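For reference, the same measurement can also be taken per row without the groupBy. This is only a sketch of that variant, assuming the same column names as above (including primaykey) and that JSON character length is a close enough proxy for row size:

# A per-row variant (a sketch, not from the original post): serialize each row
# to JSON and measure its length, assuming one DataFrame row per record.
from pyspark.sql.functions import struct, to_json, length, col

row_sizes = df.withColumn('json_len', length(to_json(struct(*[df[c] for c in df.columns]))))
display(row_sizes.orderBy(col('json_len').desc()).select('primaykey', 'json_len'))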

There may well be a better way of doing this. I was running into an error where a data load from the lake to Cosmos DB was failing because a record exceeded the 2 MB size limit. Converting the content to JSON, grouped by the primary key, and computing its length seemed like an appropriate and easy way to find the oversized records. Comment below if this can be done better.
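If it helps, here is a minimal sketch that flags the over-limit keys directly. It assumes the len_all column computed above and treats 2 MB as 2 * 1024 * 1024 characters, which is only a rough proxy for the stored Cosmos DB document size:

# Flag the primary keys whose serialized content would exceed the 2 MB limit
# (assumes character length roughly approximates the stored document size).
from pyspark.sql.functions import col

LIMIT_BYTES = 2 * 1024 * 1024

too_big = cdf.filter(col('len_all') > LIMIT_BYTES)
display(too_big.select('primaykey', 'len_all').orderBy(col('len_all').desc()))

Any keys returned here are the records to split, trim, or exclude before loading into Cosmos DB.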
