File formats I've worked with
Files are essential when we talk about data. Most commonly, a file is where we keep data and program configurations.
There are various file formats in this field of work, and here I am going to describe the major ones I have worked with.
Let’s start.
CSV
CSV stands for “Comma-Separated Values” (wiki). I am fairly sure most of us are familiar with it, because it is the most basic file format we have to deal with.
The characteristics of CSV are straightforward yet loosely defined.
- It usually has a header line in the first row (sometimes not).
- It needs a separator, normally a comma (sometimes a pipe | or a semicolon ;).
- Every line in the file, including the header, must have the same number of values between separators. The schema must be consistent.
- If a value contains the separator symbol, it must be enclosed in double quotes "", or the file can’t be read at that line because of the inconsistent schema.
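To make the quoting rule concrete, here is a small sketch using Python's built-in csv module. The sample rows are made up for illustration. It compares the number of fields per line when the quoted comma is respected with what a naive split would see if the quotes were missing.

```python
import csv
import io

# Hypothetical sample: the last row has a comma inside a quoted value.
raw = 'id,name,city\n1,Alice,Bangkok\n2,"Smith, Bob",London\n'

# csv.reader keeps the quoted comma inside the value, so every row has 3 fields.
rows = list(csv.reader(io.StringIO(raw)))
field_counts = [len(row) for row in rows]

# Without the quotes, a naive split finds 4 fields on that line and the schema breaks.
naive_counts = [len(line.split(",")) for line in raw.replace('"', "").splitlines()]

print(field_counts)
print(naive_counts)
```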
However, this is a base format we have to work with, because it is readable and its values are easy to update.
I recommend the Rainbow CSV extension when working with CSV files in VSCode. It helps us check the schema, columns, and values of the opened CSV file reasonably well.
A good use of it is to align the columns and to hover the mouse over a value to see which column it belongs to.
Besides, when we want to write a dict into a CSV file with Python, the easiest way is to use the csv module.
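A minimal sketch of that idea, assuming each dict is one row and its keys are the column names. The records and the file name people.csv are made up for illustration. Note how the writer adds the double quotes by itself when a value contains a comma.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical records: each dict is one row, the keys are the column names.
records = [
    {"id": 1, "name": "Alice", "note": "likes tea"},
    {"id": 2, "name": "Bob", "note": "tea, coffee"},  # value contains the separator
]

# Write into a temporary folder so the sketch does not touch real files.
csv_path = Path(tempfile.mkdtemp()) / "people.csv"

# newline="" lets the csv module control the line endings itself.
with csv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "note"])
    writer.writeheader()
    writer.writerows(records)

csv_text = csv_path.read_text(encoding="utf-8")
print(csv_text)
```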
And when I read a CSV file, I prefer the pandas module.
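Reading could look like the sketch below, assuming pandas is installed. The CSV text is a made-up sample and is read from memory here so the snippet runs by itself. With a real file we would pass the file path to read_csv instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a path such as "people.csv".
csv_text = 'id,name,note\n1,Alice,likes tea\n2,Bob,"tea, coffee"\n'

# read_csv uses the first row as the header and handles quoted values.
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)
print(df)
```

For a pipe or semicolon separated file, the sep parameter of read_csv can be set, for example sep="|".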
JSON
JSON stands for “JavaScript Object Notation” (json.org). This format is popular for many purposes because its pattern is intuitive and self-describing.
A JSON file typically holds a single top-level object, and that object can contain nested objects as key-value pairs.
We can write a dict object into a JSON file using the json module.
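Here is a minimal sketch of writing a dict with json.dump. The person dict and the file name person.json are made up for illustration. It also shows a nested object and a list inside the top-level object.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical data: a top-level object with a list and a nested object.
person = {
    "name": "Alice",
    "languages": ["Python", "SQL"],
    "address": {"city": "Bangkok", "zip": "10110"},
}

json_path = Path(tempfile.mkdtemp()) / "person.json"

# indent=2 makes the file easier to read; ensure_ascii=False keeps non-ASCII text as-is.
with json_path.open("w", encoding="utf-8") as f:
    json.dump(person, f, indent=2, ensure_ascii=False)

print(json_path.read_text(encoding="utf-8"))
```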
When it comes to reading a JSON file, I use either json or pandas.
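A small sketch of the json way. The file content is a made-up sample that the snippet creates first so it can run on its own.

```python
import json
import tempfile
from pathlib import Path

# Create a hypothetical JSON file to read back.
json_path = Path(tempfile.mkdtemp()) / "person.json"
json_path.write_text('{"name": "Alice", "address": {"city": "Bangkok"}}', encoding="utf-8")

# json.load turns the JSON object into a Python dict.
with json_path.open(encoding="utf-8") as f:
    loaded = json.load(f)

# Nested objects become nested dicts.
city = loaded["address"]["city"]
print(city)
```

When the JSON is table-like, such as a list of records, pandas.read_json can load it straight into a DataFrame.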
JSONL
JSONL is a variant of JSON, and the name stands for “JSON Lines”. It is also supported in BigQuery integration.
The big difference between JSON and JSONL is that JSONL contains one JSON object per line. This suits JSON payloads that arrive as transactions and need to be loaded into OLAP (Online Analytical Processing) databases such as BigQuery.
For more information, please visit the official website at https://jsonlines.org
And we can easily write a JSONL file in Python.
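One possible way is to dump each record with json.dumps and end it with a newline. The events list and the file name orders.jsonl below are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical transaction payloads; each one becomes one line in the file.
events = [
    {"order_id": 1, "amount": 120.5},
    {"order_id": 2, "amount": 80.0},
]

jsonl_path = Path(tempfile.mkdtemp()) / "orders.jsonl"

with jsonl_path.open("w", encoding="utf-8") as f:
    for event in events:
        # json.dumps without indent stays on a single line, as JSONL requires.
        f.write(json.dumps(event) + "\n")

jsonl_lines = jsonl_path.read_text(encoding="utf-8").splitlines()
print(jsonl_lines)
```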
Similar to JSON files, I use json and pandas to read a JSONL file as well.
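With json, reading is a loop over the lines, as in this sketch. The file content is a made-up sample that the snippet creates first.

```python
import json
import tempfile
from pathlib import Path

# Create a hypothetical JSONL file with two records.
jsonl_path = Path(tempfile.mkdtemp()) / "orders.jsonl"
jsonl_path.write_text(
    '{"order_id": 1, "amount": 120.5}\n{"order_id": 2, "amount": 80.0}\n',
    encoding="utf-8",
)

# Parse every non-empty line as its own JSON object.
with jsonl_path.open(encoding="utf-8") as f:
    events = [json.loads(line) for line in f if line.strip()]

total = sum(event["amount"] for event in events)
print(total)
```

The pandas route is pandas.read_json with lines=True, which returns one DataFrame row per line.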
Parquet
Parquet is an Apache project (Apache Parquet). I haven’t used it much in the past, but I can recommend it when you have a very large file that wouldn’t be a good idea to import/export as CSV, JSON, or even JSONL above.
The figure below shows the size of the same contents in different formats.
We can see that the Parquet file is just 4 MB, while the CSV file is larger by 21 MB and the JSON file is 59 MB.
Parquet organizes the contents into parts (row groups and column chunks), and we can see this structure when we read a file with Python.
Parquet cannot be read with basic text editors because it’s a binary file, not a text file. I recommend writing code in Python or another language to read it.
We can write data into a Parquet file from Python.
And we can read it back with the same module.
YAML
YAML originally stood for “Yet Another Markup Language” and now officially stands for “YAML Ain’t Markup Language” (wiki). I don’t use it to store data but to store configurations instead.
Like JSON, it is built on key-value pairs, and sometimes we can use JSON for the same task, but YAML works better thanks to its flexibility and rich features.
YAML stands out from JSON for several reasons.
- YAML supports comments. I love this part the most.
- YAML doesn’t need unnecessary brackets.
- YAML can work with variables (anchors and aliases that let us reuse a value). JSON cannot.
This is a YAML example.
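The sketch below holds a small YAML document in a Python string and parses it with PyYAML (the yaml module, assumed to be installed). All keys and values are made up for illustration. It shows the three points above: comments, no brackets, and variables through an anchor (&defaults) and aliases (*defaults) combined with the merge key <<.

```python
import yaml  # PyYAML, assumed to be installed

# Hypothetical configuration, written as YAML text.
yaml_text = """
# Comments like this are allowed in YAML.
defaults: &defaults
  retries: 3
  timeout_seconds: 30

staging:
  # Reuse everything stored under the anchor.
  <<: *defaults
  host: staging.example.com

production:
  <<: *defaults
  host: prod.example.com
  # A key set here overrides the reused value.
  retries: 5
"""

# safe_load parses the text into plain Python dicts and resolves the aliases.
config = yaml.safe_load(yaml_text)

print(config["staging"])
print(config["production"])
```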
One thing: a YAML file can end in .yaml or .yml, so we need to double-check the file name and the extension, or we will run into a “File not found” error. I made this mistake many times in the beginning.
I personally write YAML by hand because I use it for configuration files. But we can also write it using Python.
And we can use the yaml module to read a YAML file just as easily.
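A minimal sketch of both directions, assuming PyYAML is installed. The settings dict and the file name settings.yaml are made up for illustration.

```python
import tempfile
from pathlib import Path

import yaml  # PyYAML, assumed to be installed

# Hypothetical configuration to store.
settings = {
    "app": {"name": "demo", "debug": False},
    "paths": ["/data/in", "/data/out"],
}

yaml_path = Path(tempfile.mkdtemp()) / "settings.yaml"

# sort_keys=False keeps the keys in the order of the dict.
with yaml_path.open("w", encoding="utf-8") as f:
    yaml.safe_dump(settings, f, sort_keys=False)

# safe_load reads the file back into plain Python objects.
with yaml_path.open(encoding="utf-8") as f:
    loaded = yaml.safe_load(f)

print(yaml_path.read_text(encoding="utf-8"))
```

I use safe_dump and safe_load here because they only handle plain data types, which is what a configuration file normally needs.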
Repo
All the source code in this post can be found in my repo below.
These are the major file formats I usually work with. How about yours?