Parquet is a popular open-source file format for storing large, complex datasets. It originated as a project in the Hadoop ecosystem. Data is stored in a flat columnar format, which is generally more efficient to query than row-based formats such as CSV.
Previously, we had to use ADLS (Azure Data Lake Storage) to load Parquet files into Power BI: before we could connect to any Parquet file, we first had to store it in Azure Data Lake.
How to load a Parquet File in Power BI? In the Power BI February 2021 update, a Parquet file connector was added to Power BI Desktop. Earlier there were two options: one was to store the file in Azure Data Lake and then connect the Data Lake to Power BI using the Azure Data Lake connector; the second was to use Power BI Dataflow, where a Parquet file connector was already available.
Why use the Parquet File Format?
The Parquet file format has become popular in recent years, mainly in the Hadoop ecosystem where big, complex data needs to be processed. It's important to understand why we should use the Parquet file format and what advantages it has over other formats such as CSV, TSV, and Avro.
Some of the benefits of the Parquet file format are:
- Columnar format
- Self-describing (the schema travels with the data)
What is the difference between Parquet and CSV?
The major difference between the Parquet and CSV file formats is that data in Parquet is stored in a columnar format, while data in CSV is stored in a row-based (record-oriented) format.
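To make the distinction concrete, here is a minimal Python sketch (illustrative only; the records and field names are invented) showing the same data laid out row-wise, as CSV does, versus column-wise, as Parquet does:

```python
# The same three records, stored two ways.
rows = [
    {"city": "Pune", "year": 2021, "sales": 120},
    {"city": "Delhi", "year": 2021, "sales": 340},
    {"city": "Mumbai", "year": 2021, "sales": 275},
]

# Row-based layout (CSV-like): all values of one record sit together.
# To read just "sales", you still touch every field of every record.
total = sum(r["sales"] for r in rows)

# Columnar layout (Parquet-like): all values of one column sit together.
columns = {
    "city": ["Pune", "Delhi", "Mumbai"],
    "year": [2021, 2021, 2021],
    "sales": [120, 340, 275],
}

# Reading just "sales" now touches only that column's contiguous values,
# which is why columnar formats scan far less data for analytical queries.
total_columnar = sum(columns["sales"])

print(total, total_columnar)  # both are 735
```

The same answer comes out either way; the difference is how much unrelated data a query has to pass over to get there.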
Let's see the difference between CSV and Parquet with the help of an example. The table below shows the results of converting a CSV file to Parquet in Amazon S3.
| Dataset | Size on Amazon S3 | Query Runtime | Data Scanned | Cost |
| --- | --- | --- | --- | --- |
| Data stored as CSV file | 1 TB | 236 seconds | 1.5 TB | $5.75 |
| Data stored in Apache Parquet format | 130 GB | 6.78 seconds | 2.51 GB | $0.01 |
| Savings | 87% less when using Parquet | 34x faster | 99% less data scanned | 99.7% savings |
As the table shows, there is a great advantage to using the Parquet file format when handling big data: storage size is reduced by 87% compared to the CSV file, query runtime is 34x faster, 99% less data is scanned, and cost drops by 99.7%.
What is the difference between Avro and Parquet file format?
Both Avro and Parquet are file formats widely used in the Hadoop ecosystem. Avro is a row-based storage format, while Parquet is a column-based format.
If there is a scenario where all the fields in a row need to be scanned, Avro is the best choice. Key points about the Avro file format:
- Since it's a row-based format, it's better to use when all the fields are being accessed.
- Files support block compression and are splittable.
- Can be used with streaming data.
- Suitable for write-intensive operations.
If there are many columns but your use case requires only a subset of them, storing the data in a Parquet file is the best choice.
- Since it’s a column-based format, it’s better to use when you only need to access specific fields
- Each data file contains the values for a set of rows
- Can’t be written from streaming data since it needs to wait for blocks to get finished. However, this will work using micro-batch (eg Apache Spark).
- Suitable for data exploration — read-intensive, complex or analytical querying, low latency data.
Having gone through the article, you should now understand the importance of the Parquet file format in the Hadoop ecosystem. Earlier, we had to use Power BI Dataflow to connect a Parquet file to Power BI, which requires a premium subscription, or store the Parquet file in Azure Data Lake and then connect to ADLS to access it. With the recent update, we can connect to a Parquet file directly using the Parquet file connector available in Power BI Desktop.
I hope you found this article informative. Please let me know your views in the comment box below.