File formats I've worked with
Files are everywhere in data work: most commonly, we use them to store data and program configurations. There are many file formats in this field, so here are the major ones I have worked with. Let's start.
CSV stands for "Comma-Separated Values" (wiki). I'm fairly sure this is the format most of us know best, because it is the most basic one we have to deal with.
The characteristics of CSV are straightforward yet a little fuzzy:
- It has a header line as the first row (sometimes not).
- It requires a separator, usually a comma (sometimes a pipe).
- Every row, including the header, must split into the same number of values on the separator. The schema must be consistent.
- If a value contains the separator symbol, it must be wrapped in double quotes ("..."), or the file can't be parsed at that line due to an inconsistent schema.
Still, this is a base format we have to work with, thanks to its readability and how easy it is to update values by hand.
I would love to recommend an extension for working with CSV files in VS Code: Rainbow CSV. It helps us verify the schema, columns, and values of the opened CSV file quite well. A good use of it is to align columns and hover the mouse over a value to see which column it belongs to.
Besides, here is an example of Python code for writing a dict into a CSV file; the easiest way is the built-in csv module. And when I read a CSV file, I prefer pandas.
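A minimal sketch of both steps (the file name and records are my own placeholders): `csv.DictWriter` for writing, pandas for reading back.

```python
import csv

import pandas as pd

records = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
]

# Write a list of dicts with the built-in csv module.
with open("people.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name"])
    writer.writeheader()       # header row first
    writer.writerows(records)  # then the data rows

# Read it back with pandas as a DataFrame.
df = pd.read_csv("people.csv")
print(df.shape)  # (2, 2)
```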
A JSON file holds a single top-level object, and that object can contain any nested sub-objects as key-value pairs. Like this sample below.
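A small sample of my own (all values are placeholders) showing nesting:

```json
{
  "id": 1,
  "name": "Alice",
  "address": {
    "city": "Bangkok",
    "zip": "10110"
  },
  "tags": ["reader", "writer"]
}
```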
We can write a dict object into a JSON file using the built-in json module like this. When it comes to reading a JSON file, I use either the json module or pandas.
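Here is a sketch of both directions with the json module (file name and data are my own placeholders):

```python
import json

data = {"id": 1, "name": "Alice", "tags": ["reader", "writer"]}

# Write a dict into a JSON file.
with open("data.json", "w") as f:
    json.dump(data, f, indent=2)

# Read it back.
with open("data.json") as f:
    loaded = json.load(f)

print(loaded == data)  # True
```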
JSONL is a kind of JSON, and the name stands for "JSON Lines". It is also supported by BigQuery integrations.
The big difference between JSON and JSONL is that JSONL contains one JSON object per line. This suits transactional JSON payloads that need to be loaded into OLAP (Online Analytical Processing) databases such as BigQuery.
For more information, please visit the official website at https://jsonlines.org
And we can easily write a JSONL file using the sample below. Similar to plain JSON files, I use pandas to read a JSONL file as well.
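A sketch with made-up event records: write one JSON object per line, then read them back with pandas (`lines=True` tells `read_json` to expect JSON Lines).

```python
import json

import pandas as pd

records = [
    {"event": "click", "user": 1},
    {"event": "view", "user": 2},
]

# JSONL: one JSON object per line.
with open("events.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# pandas reads JSON Lines when lines=True.
df = pd.read_json("events.jsonl", lines=True)
print(len(df))  # 2
```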
Parquet is designed by Apache (Apache Parquet). I haven't used it much in the past, but I can recommend it when you have a very large file that would not be a good fit to import/export as CSV, JSON, or even JSONL above.
The figure below shows the size of the same contents in different formats. We can see the Parquet file is just 4 MB, while the CSV is larger by 21 MB and the JSON is 59 MB.
Parquet splits the contents into parts. This is an example of the contents when we read them with Python.
Parquet cannot be read normally with basic text editors because it is a binary format, not a text file. I recommend reading it with Python or another language.
This example shows how we can write data into a Parquet file, and we can read it back with the same module.
YAML originally stood for "Yet Another Markup Language", though it is now a recursive acronym for "YAML Ain't Markup Language" (wiki). I don't use it to store data, but configurations instead.
Its characteristic is key-value pairing, like JSON, and sometimes JSON can do the same job, but YAML works better thanks to its flexibility and rich features.
YAML stands out from JSON for several reasons.
- YAML supports comments. I love this part the most.
- YAML needs no unnecessary brackets.
- YAML can work with variables (anchors and aliases). JSON cannot.
This is a YAML example.
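Here is a small made-up config of my own that shows all three points: comments, no brackets, and anchor/alias reuse.

```yaml
# Comments are allowed in YAML -- my favorite part.
app:
  name: demo        # placeholder values
  debug: true
defaults: &db       # define an anchor once...
  host: localhost
  port: 5432
database:
  <<: *db           # ...and reuse it like a variable
  name: mydb
```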
One thing: a YAML file can end with either .yaml or .yml, so we need to double-check the file name and the extension, or we will hit a "File not found" error. I made that mistake many times in the beginning.
I personally write YAML by hand, since I only need it for configuration files. But we can also write it from Python like this.
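A sketch with a placeholder config of my own, using the PyYAML package (installed separately with `pip install pyyaml`):

```python
import yaml  # PyYAML

config = {"app": {"name": "demo", "debug": True}}

# safe_dump writes plain Python dicts as YAML;
# sort_keys=False keeps the original key order.
with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)

with open("config.yaml") as f:
    print(f.read())
```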
And we can use the yaml module (PyYAML) to read a YAML file very easily.
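Reading is one call with `yaml.safe_load` (the YAML text here is a placeholder of my own):

```python
import yaml  # PyYAML

text = """
app:
  name: demo   # comments are simply ignored by the parser
  debug: true
"""

# safe_load parses YAML into plain Python objects.
config = yaml.safe_load(text)
print(config["app"]["debug"])  # True
```

Note that `safe_load` is preferred over plain `load`, since it refuses to construct arbitrary Python objects from untrusted input.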
All the source code shown here can be found in my repo below.
These are the major file formats I usually work with. How about yours?