Tabs vs. spaces in byte size
While during my daily housekeeping, I got intrigued by how much of a the difference would a tab-based repository be in size in comparison to a space-based version. To check the difference, I’ve used the Linux kernel for comparing the size when converting all tabs to spaces and Python’s Pytest library when converting all spaces to tabs.
Using Python and C repositories was ideal, given Linux uses 8 tabs by default and the Python community generally sticks with 4 spaces.
Tab vs spaces #
The preference of tab vs. spaces in a codebase is a topic for another day. Today’s topic is how many bytes a tab differs from a space? A tab can be visually seen as a 4 or 8 space character, but is it more significant in size?
We can display the contexts of a text in hex and see how tabs and spaces are represented.
echo "s: t:\t" | hexdump -C
00000000 73 3a 20 20 20 20 74 3a 09 0a |s: t:..|
0000000a
The hex value 20
is the ASCII value for the space character, and the value
09
is the ASCI horizontal tab (HT
). We know that tab characters use less
space than the space characters when used as tab indentation (" "
vs.
\t
). We can also ensure by using wc
with the -c
flag for byte count.
echo " " | wc -c
echo " " | wc -c
2
3
echo "\t" | wc -c
echo "\t\t" | wc -c
2
3
So technically, using tabs in a codebase is much cheaper in storage.
Converting the repositories #
I’ve used expand
and unexpand
from coreutils
to convert spaces to tabs and
tabs to spaces. To speed things up when running against all files, I use fd
and pipe it against my script that converts the file’s tab to spaces and
vice-versa.
#!/bin/env bash
# replace-ts
# for tabs -> spaces, run with:
# fd -e c -e h -x bash -c 'replace-ts s {}'
#
# for spaces -> tabs, run with:
# fd -e c -e h -x bash -c 'replace-ts t {}'
set -euo pipefail
# tabs or spaces?
argtype=$1
# filename
argfilename=$2
if [[ -d $argfilename ]]; then
exit 0
fi
# tabs -> spaces
cksum=$(cksum $argfilename | cut -f1 -d ' ')
# temporary filename is filename.cksum
filename="$(basename $argfilename).$cksum"
if [[ $argtype == "s" ]]; then
expand -t 4 "$argfilename" > "/tmp/$filename"
mv "/tmp/$filename" "$2"
else
unexpand -t 4 "$argfilename" > "/tmp/$filename"
mv "/tmp/$filename" "$2"
fi
To convert all tabs to spaces, I’ve ran:
fd ~/workspace/linux -e c -e h -x bash -c "../replace-tabs s {}"
Or to convert all spaces to tabs:
fd ~/workspace/linux -e c -e h -x bash -c "../replace-tabs t {}"
Linux kernel #
Upstream #
The current (52d543b54) Linux repository has 1.5G in total; let’s see how it increases or decreases with spaces.
du -sch ~/workspace/linux/ | tail -1 | cut -f1 -d "t"
du -sch ~/workspace/linux/**/*.c | tail -1 | cut -f1 -d "t"
du -sch ~/workspace/linux/**/*.h | tail -1 | cut -f1 -d "t"
du -sch ~/workspace/linux/**/*.S | tail -1 | cut -f1 -d "t"
1.5G
627M
485M
12M
With spaces #
It seams we have an increase of size in the c
, h
and S
files. There was
also have an increase of 1G when running against all files – that’s 1GB extra
in the repository!
du -sch ~/workspace/linux/ | tail -1 | cut -f1 -d "t"
du -sch ~/workspace/linux/**/*.c | tail -1 | cut -f1 -d "t"
du -sch ~/workspace/linux/**/*.h | tail -1 | cut -f1 -d "t"
du -sch ~/workspace/linux/**/*.S | tail -1 | cut -f1 -d "t"
1.6G
699M
496M
14M
Pytest #
Upstream #
By default, Pytest uses spaces. The current (176d2d7b) Pytest repository has 38M in size and 3.2M in Python code.
du -sch ~/workspace/pytest/ | tail -1 | cut -f1 -d "t"
du -sch ~/workspace/pytest/**/*.py | tail -1 | cut -f1 -d "t"
38M
3.2M
With tabs #
And this is how much space it takes with tabs. We got a decrease of 1MB when converting all spaces to tabs. Not much, but interesting.
du -sch ~/workspace/pytest/ | tail -1 | cut -f1 -d "t"
du -sch ~/workspace/pytest/**/*.py | tail -1 | cut -f1 -d "t"
37M
2.8M